Linking Death Registrations to the 2016 Census

Details the methodology used in the 2016 Death Registrations to Census linkage project, and the quality and outcomes of the linked dataset

Introduction

This information paper describes the background and rationale for the 2016 Death Registrations to Census linkage project (previously referred to as the Indigenous Mortality Project). This project involved linking the Census with death registrations to examine differences in the reporting of Indigenous status across the two datasets in order to apply adjustment factors to mortality and life-expectancy estimates.

The 2016 Death Registrations to Census project involved linking twelve months of post-Census deaths data to the 2016 Census. Specifically, the deaths data span 9 August 2016 to 28 September 2017; the range is slightly longer than 12 months to allow time for all relevant deaths to be registered and processed. The project attempted to link 177,380 death registration records to 22,485,854 Census records, which produced 159,657 links, a linkage rate of 90%.

The aims of this project were to:

  • assist in understanding the differences in recording of Indigenous status between death registrations and Census data; and
  • assess the under-identification of Aboriginal and Torres Strait Islander deaths in death registrations records.


The 2016 Death Registrations to Census linkage project expanded on the methods used in the 2011 iteration of the same project. The main enhancements implemented for the 2016 project included:

  • use of non-sequential probabilistic linking of 2016 Census data to death records (as opposed to sequential probabilistic linking used in 2011);
  • use of alternative address information from Census and Death registrations to improve linkage of records between datasets;
  • improved name repair processes, where the rarity of a name was used in evaluating the quality of links established; and
  • an enhanced clerical review strategy resulting in higher quality links.

Background

The Australian Bureau of Statistics has been funded as part of the Council of Australian Governments (COAG) Closing the Gap initiative and given a mandate to deliver information to improve the measurement of Aboriginal and Torres Strait Islander life expectancy. The intent of the funding was to ensure the ABS has the capability to deliver high quality linked data for statistical use. Through this investment in data linking capability, the project enables reporting against the COAG target to close the life expectancy gap within a generation.

The 2016 Death Registrations to Census linkage project enables an estimate of the under-identification of Indigenous status in death registrations to be produced. This allows for adjustments to the registered data when compiling the Life Tables for Aboriginal and Torres Strait Islander Australians, 2015-17 (cat. no. 3302.0.55.003) released on 29 November 2018.

2.1 Collection of Death registrations data for Aboriginal and Torres Strait Islander people

Death registrations data from the State and Territory Registries of Births, Deaths and Marriages are used by the ABS to produce estimates of Aboriginal and Torres Strait Islander deaths. The information relates to all registered deaths including those referred to a Coroner. While there is some variation in practice among the jurisdictions, information supplied on both the Death Registration form and the Medical Certificate of Cause of Death (completed by medical practitioners) has been used where available to derive Indigenous status. Estimates of Aboriginal and Torres Strait Islander deaths are used as an input for calculating Aboriginal and Torres Strait Islander population and life expectancy estimates.

2.2 Collection of Census data for Aboriginal and Torres Strait Islander people

The Census is usually completed by a responsible adult answering for themselves or on behalf of another person present in the dwelling on Census night. In the standard Census form, Indigenous status is reported by the person completing the form and in some instances may not be answered. By contrast, Interviewer Household Forms are used in remote Aboriginal and Torres Strait Islander communities and in some urban areas. These forms are completed by a trained interviewer, who is recruited from the local community wherever possible. For further information on how the 2016 Census was undertaken please refer to Census of Population and Housing: Understanding the Census and Census Data, Australia, 2016 (cat. no. 2900.0).

2.3 Census 2016 data quality

In June 2017, the Report on the Quality of 2016 Census Data was released by the Census Independent Assurance Panel. The Panel determined that the 2016 Census data is of a comparable quality to previous Censuses, is useful and useable, and will support the same variety of uses of Census data as was the case for previous Censuses.

The report included a broad assessment of the key linking variables used in the Death Registrations to Census project, including name and date of birth. Although the quality of these variables was high for the 2016 Census, there was a decrease in the quality of this information relative to the 2011 Census. The Report noted the following:

  • a substantial increase in the non-response rate for date of birth, from 10% in 2011 to 19% in 2016;
  • an increase in non-response for first name, from 49,000 persons in 2011 to 209,000 persons in 2016; and
  • an increase in non-response for surname, from 127,000 persons in 2011 to 274,000 persons in 2016.


For further information on the quality of particular Census variables, please refer to Census of Population and Housing: Understanding the Census and Census Data, Australia, 2016 (cat. no. 2900.0).

Data linking methodology

The data linking methodology used in the 2016 Death Registrations to Census linkage project can be generalised into the following steps:

  • data standardisation
  • data preparation
  • record pair comparison
  • decision model.
     

3.1 Enhancements to linking methodology

This project benefited from advances in methodology and technological resources to deliver improved integrated statistical outputs including:

  • Use of a combination of deterministic and probabilistic linkage techniques;
  • Adoption of a non-sequential approach to probabilistic linking designed to link a high quality dataset that is representative of the Australian population. The sequential approach used for the 2011 linkage removed accepted links after each probabilistic pass; in comparison, a non-sequential approach allowed for:
    • All records to be given an opportunity to link in every probabilistic pass;
    • All possible links from all passes assessed together to identify the best quality links overall for the dataset;
    • Prevention of poorer quality (and potentially inaccurate) links from earlier passes being accepted, where a higher quality link could be found in a later pass;
    • Quality of links could be assessed consistently across probabilistic passes;
  • Use of alternative address information from Census and Death registrations to improve linkage of records between datasets;
  • Improved name repair processes, where the rarity of a name was used in evaluating the quality of links established;
  • An enhanced clerical review strategy resulting in higher quality links; and
  • Potential links that were rejected after review were assessed for alternative links on each record in the rejected linked pair.
     

3.2 Data standardisation

Before records on two datasets are compared, the contents of each need to be as consistent as possible to facilitate comparison. This process is known as 'standardisation' and includes a number of steps such as verification, recoding and re-formatting variables, and parsing text variables (i.e. separating text variables into their components). Additionally, some variables such as name may require substantial repair prior to standardisation.

Some variables differ between the two datasets in a predictable way, and an adjustment is required to account for this variance. Variables may also be recoded or aggregated in order to obtain a more robust form of the variable. Standardisation takes place in conjunction with a broader evaluation of the dataset, in which potential linking variables are identified.

The standardisation procedure for the Death Registrations to Census linkage project involved coding imputed and invalid values for selected variables to a common missing value. These variables included name, address, day of birth, month of birth, year of birth, age, sex, year of arrival and marital status. Standardisation for hierarchical variables involved collapsing at higher levels of aggregation to allow for potential differences in the recording and coding of the variable. This was done to improve the quality of the linkage data for the purpose of increasing the likelihood that a link would be made. An example of this is country of birth. On the Death registration record a person may have been coded to 'Northern Europe' (two digit level of country of birth), while on the 2016 Census they may have reported a specific country such as 'England' or 'Norway' (four digit level of country of birth). If left in its original state, a comparison between 'Northern Europe' and 'England' would not agree, even though one is a sub-category of the other. To account for this all 2016 Census country of birth responses were coded to the two digit level to allow for accurate comparison.
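
The collapsing step described above can be illustrated with a short sketch. The code values and the MISSING sentinel are hypothetical and are not taken from the ABS classification; the sketch only assumes that the first two digits of a four digit code identify its parent group.

```python
# Hypothetical sketch: collapse hierarchical codes to their two digit parent
# group and map invalid values to one common missing value.
MISSING = None  # assumed sentinel for "no usable value"


def collapse_to_two_digit(code):
    """Return the two digit parent group of a hierarchical code.

    Accepts two digit or four digit codes given as strings; anything else
    is treated as invalid and coded to MISSING.
    """
    if code is None:
        return MISSING
    code = str(code).strip()
    if not code.isdigit() or len(code) not in (2, 4):
        return MISSING
    return code[:2]


# Illustrative codes only: "24" stands for a broad region and "2401" for a
# specific country inside that region.
death_record_cob = "24"
census_cob = "2401"

# After collapsing, the two values can be compared directly.
agree = collapse_to_two_digit(death_record_cob) == collapse_to_two_digit(census_cob)
```

Without the collapsing step the raw values "24" and "2401" would be treated as a disagreement, even though one is a sub-category of the other.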

First name and surname

In the 2011 Death Registrations to Census project, Census name data was subjected to an automated repair process. Both first names and surnames were compared against corresponding master name indexes, with names being repaired when a suitably close match to a value on the index was found. The name repair process was repeated for the 2016 Census data, with the addition of a number of enhancements. These enhancements optimised both the number and accuracy of names repaired, and included the following:

  • expanded first name index, enabling a larger amount of names to be repaired;
  • identification and removal of values that were not considered to be a valid name;
  • use of age and country of birth information to assist with repairing first names;
  • use of family structure to assist with repairing surnames;
  • customised automatic repair processes based on response type (i.e. online forms vs. paper forms);
  • a manual coding process for paper form responses that could not be sufficiently repaired through automatic means (note that manual repair was performed only for a subset of records in 2011); and
  • use of additional repaired name options when more than one close match for a name value was found on the index.
     

After repair, first names were then standardised by being compared against a nickname concordance, ensuring that different variations would be grouped into a common name for the purposes of linkage. The standardisation of the same name value may also vary depending on the reported gender. For example, the name 'Jess' for a female may be standardised to 'Jessica' whereas it may be standardised to 'Jesse' for a male. Any first names that could not be matched to a nickname retained their original form.
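
A nickname concordance of this kind can be sketched as a lookup keyed on name and sex. The entries, the sex codes and the function name below are illustrative assumptions and do not reproduce the concordance used in the project.

```python
# Hypothetical nickname concordance keyed by (name, sex); the entries are
# illustrative only.
NICKNAME_CONCORDANCE = {
    ("JESS", "F"): "JESSICA",
    ("JESS", "M"): "JESSE",
    ("BILL", "M"): "WILLIAM",
}


def standardise_first_name(name, sex):
    """Map a repaired first name to its standard form, if one is listed."""
    cleaned = name.strip().upper()
    # Names with no concordance entry keep their cleaned form.
    return NICKNAME_CONCORDANCE.get((cleaned, sex), cleaned)
```

For example, standardise_first_name("Jess", "F") and standardise_first_name("Jess", "M") return different standard names, while a name with no entry is returned unchanged apart from upper-casing.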

Name data on the death registrations were of considerably better quality than those on the Census, and as such were not required to go through a repair process. However, the remainder of the first name standardisation process for death registrations was consistent with the Census.

Name information from both death registrations and 2016 Census was anonymised prior to being joined with other variables for linking.

To assist with linkage using name data, flags were created to define whether a name was common or uncommon. These name flags were used during linkage to identify how frequently a name appeared in the two datasets being linked, and influenced the assessment of the quality of links that agreed on name. For example, some links may match on names that are common (e.g. 'John Smith'), whereas others may match on name values that are rare. Assuming that agreement on all other variables is equal, the links that agree on rare name values are more likely to be 'true', as it is less likely that two different people with the same rare name have been linked. Therefore these links could be deemed as being of higher quality than links that agree on common name values.
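
One simple way to derive such flags is to count how often each name value occurs across the records being linked. The sketch below is an assumption about how this could be done; the threshold min_count and the sample names are invented for illustration, and in practice the values would be anonymised.

```python
from collections import Counter


def flag_common_names(names, min_count=3):
    """Return {name: bool}, where True marks a 'common' name.

    min_count is an assumed threshold chosen for illustration only.
    """
    counts = Counter(names)
    return {name: count >= min_count for name, count in counts.items()}


# Toy pool of standardised names drawn from both datasets.
pool = ["JOHN SMITH"] * 4 + ["ZEPHANIAH QUILL"]
flags = flag_common_names(pool)
```

A link that agrees on a name flagged as uncommon could then be given more credit than one that agrees on a name flagged as common.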

Geography / address

Linking was conducted based on the usual residential address of Census records and death registrations. Death registrations where only a residential title was supplied (e.g. nursing home, hospital, etc.) underwent additional repair.

In addition to address repair, the following standardisation techniques were applied:

  • Imputed mesh block geography was removed for linking purposes though it was retained on the analytical file;
  • Invalid postcodes were removed; and
  • Invalid data such as foreign characters were removed from house number.

Personal characteristics

A number of standardisation processes were undertaken on other key linking variables including:

  • Invalid dates of birth were removed;
  • Imputed instances of sex, marital status and age were removed;
  • Persons aged 15 and under that were originally coded as married or previously married had their marital status removed; and
  • Country of birth was coded from the four digit level to the two digit level, for example 'Western Europe' rather than 'Austria' and 'Germany', to improve chances of linkage.
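
The first three rules in the list above can be sketched as a single function applied to each record. The field names, the '_imputed' flags and the marital status labels are hypothetical; the sketch also assumes that an invalid date of birth is removed as a whole rather than component by component.

```python
from datetime import date

MISSING = None  # assumed common missing value
# Illustrative labels for "married or previously married".
EVER_MARRIED = {"married", "separated", "divorced", "widowed"}


def is_valid_dob(day, month, year):
    """True when day, month and year form a real calendar date."""
    try:
        date(year, month, day)
        return True
    except (TypeError, ValueError):
        return False


def standardise_person(record):
    """Return a copy of `record` (a dict) with the rules applied."""
    out = dict(record)
    # Rule 1: remove invalid dates of birth.
    if not is_valid_dob(out.get("day"), out.get("month"), out.get("year")):
        out["day"] = out["month"] = out["year"] = MISSING
    # Rule 2: remove imputed sex, marital status and age.
    for field in ("sex", "marital_status", "age"):
        if out.get(field + "_imputed"):
            out[field] = MISSING
    # Rule 3: remove marital status for persons aged 15 and under who were
    # coded as married or previously married.
    age = out.get("age")
    if age is not None and age <= 15 and out.get("marital_status") in EVER_MARRIED:
        out["marital_status"] = MISSING
    return out
```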
     

3.3 Data preparation

An additional data preparation technique was used in this linkage for Census records where multiple responses had been provided for key linking variables. A record may have had multiple responses for a single linking variable in the following situations:

  • a name that required repair had more than one possible repaired name value; or
  • the respondent reported different locations for address of usual residence and enumerated address (2016 Census records only).
     

The process for allowing the use of multiple responses for a linking variable involved restructuring the data for affected records; multiple rows were created for the affected record, with the number of rows generated equal to the number of different combinations that could be created from the linkage information. This is demonstrated in Tables 1A and 1B below. A respondent with two different anonymised first name values and two different mesh blocks would have four permuted rows generated. Meanwhile, the information that only had one stated value (in this example surname and date of birth) was duplicated across all of the generated rows. Structuring the data in this manner allowed for all combinations of a respondent's linkage information to be considered in a highly efficient manner while increasing the likelihood of finding the true link for the record.

Table 1A - Example of data restructure, original record

| Person ID | Anonymised First Name 1 | Anonymised First Name 2 | Anonymised Surname 1 | Anonymised Surname 2 | Mesh Block 1 | Mesh Block 2 | Date of Birth |
|---|---|---|---|---|---|---|---|
| 1 | 1234 | 5678 | 9876 | -- | 12345670000 | 98765430000 | 09/08/2016 |

Table 1B - Example of data restructure, restructured record

| Person ID | Anonymised First Name | Anonymised Surname | Mesh Block | Date of birth |
|---|---|---|---|---|
| 1 | 1234 | 9876 | 12345670000 | 09/08/2016 |
| 1 | 1234 | 9876 | 98765430000 | 09/08/2016 |
| 1 | 5678 | 9876 | 12345670000 | 09/08/2016 |
| 1 | 5678 | 9876 | 98765430000 | 09/08/2016 |
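
The restructure shown in Tables 1A and 1B amounts to taking every combination of the multi-valued fields. A minimal sketch follows, using Python's itertools.product; the record layout (the 'multi' and 'single' keys) is an assumption made for the example and is not the structure used in the project.

```python
from itertools import product


def restructure(record):
    """Expand a record with multi-valued fields into one row per combination."""
    keys = list(record["multi"].keys())
    # Drop missing alternatives (None) so they do not generate extra rows;
    # a field with no stated value at all stays missing on every row.
    value_lists = [
        [v for v in record["multi"][k] if v is not None] or [None] for k in keys
    ]
    rows = []
    for combo in product(*value_lists):
        row = {"person_id": record["person_id"]}
        row.update(zip(keys, combo))
        row.update(record["single"])  # single-valued fields repeat on every row
        rows.append(row)
    return rows


# The record from Table 1A: two first names, one surname, two mesh blocks.
original = {
    "person_id": 1,
    "multi": {
        "first_name": ["1234", "5678"],
        "surname": ["9876", None],
        "mesh_block": ["12345670000", "98765430000"],
    },
    "single": {"date_of_birth": "09/08/2016"},
}
rows = restructure(original)  # four rows, in the same order as Table 1B
```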

3.4 Record pair comparison

Death registrations data and the 2016 Census were brought together using a combination of deterministic and probabilistic data linkage techniques. Deterministic linkage methods were initially used to identify matches that could be used as part of a training dataset for the creation of m and u probabilities for probabilistic linking (see Section 3.4.2 Probabilistic Linking for further information). Probabilistic linking was then used to link records that would be accepted for the final linked file.

The two datasets were linked in a way that was independent of reported Indigenous status so that any future analysis (including use in compiling the Life Tables for Aboriginal and Torres Strait Islander Australians - 2015-17 (cat. no. 3302.0.55.003)) would not be affected by bias introduced in the linking process. For this reason, Indigenous status was not used as a linking variable.

3.4.1 Deterministic linking

Deterministic data linkage, also known as rule-based linkage, involves assigning record pairs (i.e. potential links) across two datasets that match exactly or closely on common variables. This type of linkage is most applicable where the records from different sources consistently report sufficient information to efficiently identify links. It is less applicable in instances where there are problems with data quality or where there are limited characteristics.

Initially, a deterministic linkage method was used to identify links to create a training dataset that could be used to inform the creation of m and u probabilities. This involved using selected personal and demographic characteristics (first name (anonymised), surname (anonymised), sex, date of birth/age, geography, year of arrival, marital status and country of birth) to identify record pairs.

3.4.2 Probabilistic linking

Probabilistic linking allows links to be assigned in spite of missing or inconsistent information, providing there is enough agreement on other variables to offset any disagreement. In probabilistic data linkage, records from two datasets are compared and brought together using several variables common to each dataset (Fellegi & Sunter, 1969).

A key feature of the methodology is the ability to handle a variety of linking variables and record comparison methods to produce a single numerical measure of how well two particular records match, referred to as the 'linkage weight'. This allows ranking of all possible links and optimal assignment of the link or non-link status (Solon and Bishop, 2009).

Blocking variables

In probabilistic linkage, record pairs (consisting of one record from each file) can be compared to see whether they are likely to be a match, i.e. belong to the same person. However, if the files are even moderately large, comparing every record on File A with every record on File B is computationally infeasible. Blocking reduces the number of comparisons by only comparing record pairs where matches are likely to be found – namely, records which agree on a set of blocking variables. Blocking variables are selected based on their reliability and discriminatory power. For instance, sex is partially useful as it is typically well reported; however, it is minimally informative as it only divides datasets into two blocks, and therefore does not sufficiently reduce the computational intensity of larger linkages. Accordingly, it is generally not used alone but rather in conjunction with other variables.

Comparing only records that agree on one particular set of blocking variables means a record will not be compared with its match if it has missing, invalid or legitimately different information on a blocking variable. To mitigate this, the linking process is repeated a number of times ('passes'), using a range of different blocking strategies. For example, on the first pass, a block using a fine level of geography (mesh block) was used to capture the majority of Death registrations that had matching information with their corresponding 2016 Census record. The second pass blocked on repaired surname and sex, which allowed for mesh block to disagree but potentially link on street address information. The blocking variables used for each pass are outlined in Section 3.4.3 Blocking and Linking Strategy.
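
The effect of running several blocking passes can be sketched as follows. The record layout, field names and the two example passes are illustrative assumptions; the sketch also assumes that a record with a missing blocking value is not compared in that pass.

```python
from collections import defaultdict


def candidate_pairs(file_a, file_b, blocking_keys):
    """Yield (a, b) record pairs that agree on every blocking variable."""
    index = defaultdict(list)
    for b in file_b:
        key = tuple(b.get(k) for k in blocking_keys)
        if None not in key:  # a missing blocking value cannot form a block
            index[key].append(b)
    for a in file_a:
        key = tuple(a.get(k) for k in blocking_keys)
        if None not in key:
            for b in index.get(key, []):
                yield a, b


deaths = [
    {"id": "D1", "mesh_block": "100", "surname": "9876", "sex": "F"},
    {"id": "D2", "mesh_block": None, "surname": "5555", "sex": "M"},
]
census = [
    {"id": "C1", "mesh_block": "100", "surname": "9876", "sex": "F"},
    {"id": "C2", "mesh_block": "200", "surname": "5555", "sex": "M"},
]

# Two example passes: a fine geography block, then surname and sex.
passes = [("mesh_block",), ("surname", "sex")]
pairs_by_pass = [
    [(a["id"], b["id"]) for a, b in candidate_pairs(deaths, census, keys)]
    for keys in passes
]
```

In this toy data, record D2 has no mesh block and is never compared in the first pass, but it is paired with C2 in the second pass, which is the situation the repeated passes are designed to recover.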

Linking variables

Within a blocking pass, records on the two files which agree on the specified blocking variables are compared on a set of linking variables. Each linking variable has associated field weights, which are calculated prior to comparison. Field weights indicate the amount of information (agreement, disagreement, or missing values) a linking variable provides about whether or not the records belong to the same person (match status). Field weights are based on two probabilities associated with each linking variable: first, the probability that the field values agree given that the two records belong to the same person (match); and second, the probability that the field values agree given the two records belong to different persons (non-match). These are called m and u probabilities (or match and non-match probabilities) and are defined as:

                                      m = P(fields agree | records belong to the same person)

                                      u = P(fields agree | records belong to different people)

Given that the m and u probabilities require knowledge of the true match status of record pairs, they cannot be known exactly, but rather must be estimated. The ABS calculated the m and u probabilities based on the training dataset, under the assumption that each deterministic link on the dataset was a match. The deterministic links used in this phase included (1) the highest quality links accepted in the deterministic linking passes, and (2) additional slightly lower quality links expected to be confirmed as accurate in the probabilistic linking phase. This method estimated the likelihood that a record would have a match by taking deaths and net overseas migration into account when estimating the m and u probabilities. This method also generated probabilities for disagreement, which can be referred to as md and ud probabilities:

                                      md = P(fields disagree | records belong to the same person)

                                      ud = P(fields disagree | records belong to different people)

Note that m and u probabilities were calculated separately for each pass, as the probabilities depend upon the characteristics of the pass' blocking variables. For example, the m probability for country of birth when blocking on mesh block will be different to the m probability for country of birth when blocking on sex.

Match (m) and non-match (u) probabilities are then converted to agreement and disagreement field weights. They are as follows:

                                          Agree = log2(m/u)

                                          Disagree = log2(md/ud)

These equations give rise to a number of intuitive properties of the Fellegi–Sunter framework (Fellegi & Sunter, 1969). First, in practice, agreement weights are always positive and disagreement weights are always negative. Second, the magnitude of the agreement weight is driven primarily by the likelihood of chance agreement. That is, a low probability of two random people agreeing on a variable (for example, date of birth) will result in a large agreement weight being applied when two records do agree.

The magnitude of the disagreement weight is driven by the stability and reliability of a variable. That is, if a variable is well reported and stable over time (for example, sex) then disagreement on the variable will yield a large negative weight. For each record pair comparison, the field weights from each linking variable are summed to form an overall record pair comparison weight or 'linkage weight'.
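
The weight calculation above can be sketched in a few lines. The probabilities below are invented for illustration and are not the values estimated for the project; the treatment of a missing comparison as contributing zero weight is also an assumption of this sketch.

```python
import math


def field_weights(m, u, md, ud):
    """Return (agreement weight, disagreement weight) for one linking variable."""
    return math.log2(m / u), math.log2(md / ud)


def linkage_weight(agreements, weights):
    """Sum field weights over variables; None marks a missing comparison."""
    total = 0.0
    for field, agrees in agreements.items():
        if agrees is None:
            continue  # assumed: a missing comparison contributes zero weight
        agree_w, disagree_w = weights[field]
        total += agree_w if agrees else disagree_w
    return total


# Made-up probabilities, for illustration only.
weights = {
    "date_of_birth": field_weights(m=0.95, u=0.01, md=0.05, ud=0.99),
    "sex": field_weights(m=0.99, u=0.50, md=0.01, ud=0.50),
}
total = linkage_weight({"date_of_birth": True, "sex": False}, weights)
```

With these inputs, agreement on date of birth earns a much larger positive weight than agreement on sex would, because chance agreement on date of birth is rare, while disagreement on sex carries a large negative weight because sex is reliably reported.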

Before calculating m and u probabilities for some variables it is first necessary to define what constitutes agreement. Typical comparison functions used in the linkage include:

  • Exact match (e.g. Sex). Agreement occurs only when the two variable values are identical. This criterion is used for most linking variables;
  • Numeric difference (e.g. Age). A pair may be defined to agree if their variable values differ by an amount less than or equal to a specified maximum difference; and
  • Approximate string comparison (e.g. First name). Two strings may be said to agree in spite of a certain proportion of missing, differing, or transposed characters, allowing for misspellings, transcriptions of poor handwriting, etc. Approximate string comparators, such as the Winkler comparator, allow for partial agreement if the strings being compared are similar but do not exactly match, and can be used to ensure that both identical and similar string pairs are defined to agree.
     

For further details on comparison functions used for probabilistic linkage, see Christen & Churches (2005).
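
The numeric difference and approximate string comparisons can be sketched as below. The Jaro-Winkler code follows the standard textbook definition with a prefix scale of 0.1 and a maximum prefix of four characters; treating a 'W85' requirement as a score of at least 0.85 is an assumption of this sketch, as are the function names.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0 to 1."""
    if not s1 or not s2:
        return 0.0
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        # Look for the same character within the allowed window of s2.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


def strings_agree(s1, s2, threshold=0.85):
    """Approximate string agreement at an assumed score threshold."""
    return jaro_winkler(s1, s2) >= threshold


def within_tolerance(x, y, max_diff):
    """Numeric agreement when the values differ by at most max_diff."""
    return abs(x - y) <= max_diff
```

For example, strings_agree("MARTHA", "MARHTA") is true because the two strings differ only by a transposition, and within_tolerance(84, 85, 1) is true for an age reported one year apart.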

Near or partial agreement may also be factored into the linking process through calculation of m and u probabilities accounting for such agreement. For example, a person’s age on equivalent records will frequently be an exact match, and the m and u probabilities are calculated based on this definition. During linkage, however, a partial agreement weight was given for age within a one or two year difference to cater for persons who may have incorrectly reported age for a variety of reasons.

3.4.3 Blocking and linking strategy

The linking strategy employed for the 2016 Death Registrations to Census project builds on the 2011 linking strategy, using developments in linking methodology, software and available data to improve the approach. For further details on the 2011 linkage refer to Information Paper: Death registrations to Census linkage project - Methodology and Quality Assessment - 2011-12 (cat. no. 3302.0.55.004).

Table 2 displays the blocking and linking variables applied in this linking project for each pass.

Table 2 - Blocking and linking variables, by pass number

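| PASS NUMBER (a)(b)(c) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| ANONYMISED NAME | | | | | | | | | |
| First Name - Cleaned | W85 | | W85 | | W85 | | | W85 | |
| First Name - Repaired (Common) | | L | | L | | L | | | L |
| First Name - Repaired (Uncommon) | | L | | L | | L | | | L |
| First Name - Standardised | | | | | | | B | | |
| Surname - Cleaned | W85 | | | | W85 | | | W85 | |
| Surname - Repaired | | L | B | B | | L | B | | L |
| ADDRESS INFORMATION | | | | | | | | | |
| Street Number | | | L | L | L | L | | L | L |
| Street Name | | | W90 | W90 | W90 | W90 | | W90 | W90 |
| Suburb | | | W90 | W90 | W90 | W90 | | | |
| Postcode | | | | | | | | B | B |
| Mesh Block | B | B | | | | | | | |
| PERSONAL INFORMATION | | | | | | | | | |
| Day Of Birth | L | +/- 2 | L | | B | B | L | +/- 2 | +/- 2 |
| Month Of Birth | L | L | L | L | B | B | L | L | L |
| Year Of Birth | | | | | B | B | | | |
| Age | +/- 1 | +/- 1 | +/- 1 | +/- 1 | | | +/- 1 | +/- 2 | +/- 2 |
| Sex | L | L | B | B | L | L | L | B | B |
| Country Of Birth | L | L | L | L | L | L | L | L | L |
| Year Of Arrival | +/- 1 | +/- 1 | +/- 1 | +/- 1 | +/- 1 | +/- 1 | +/- 1 | +/- 2 | +/- 2 |
| Marital Status | L | L | L | L | L | L | L | L | L |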
a. W - Winkler comparator and the required Winkler score
b. B - blocking variable
c. L - linking variable
 

3.5 Decision model

In probabilistic linking, once record pairs are generated and weighted, a decision algorithm determines whether the record pair is linked, not linked, or requires further consideration as a possible link. The generation of record pairs from probabilistic linking can result in the records on one dataset linking to multiple records on the other, resulting in a file of ‘many-to-many’ links. The first phase of the decision process involves assigning a record to its best possible pairing. This process is known as one-to-one assignment. Ideally (and often true in practice) each record has a single, unique best pairing, which is its true match.

The 2011 Death Registrations to Census project used an auction algorithm to assign probabilistic links optimally from the pool of all possible links. The auction algorithm maximises the sum of all the record pair comparison weights through alternative assignment choices, such that if a record A1 on File A links well to records B1 and B2 on File B, but record A2 links well to B2 only, the auction algorithm will assign A1 to B1 and A2 to B2, to maximise the overall comparison weights for all record pairs.

For the 2016 project, a change was made to the assignment algorithm. Using the previous example, A1 may still link to B1, but A2 would only link to B2 if it was considered a better quality link than A1 to B2. This change ensured that links would only be assigned when they are the absolute best option for both records in the link, which subsequently improved the quality of the links output at this phase. The modified algorithm was also far more efficient than the auction method, with the assignment process completed in a matter of minutes compared to several hours or days when using the auction algorithm.

An additional change made for the linkage was that the one-to-one assignment was generated using the combined many-to-many results from all passes in the linkage (i.e. non-sequential approach), rather than running the assignment over the results from each pass individually and accepting links before moving to the next pass (sequential approach). This allowed the best links from all passes to be obtained from a single assignment procedure.
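
The 2016 assignment approach described above can be read as a 'mutual best' rule applied to the pooled results of all passes. The sketch below is one interpretation of that description and not the ABS implementation; keeping the highest weight when a pair appears in several passes, and ignoring ties, are assumptions made for the example.

```python
def mutual_best_links(candidates):
    """Keep a pair only when it is the top-weighted option for both records.

    `candidates` is an iterable of (file_a_id, file_b_id, weight) tuples
    pooled from every pass. Ties are not handled specially in this sketch.
    """
    # A pair found in several passes keeps its highest weight (assumed rule).
    pooled = {}
    for a_id, b_id, weight in candidates:
        key = (a_id, b_id)
        pooled[key] = max(weight, pooled.get(key, float("-inf")))

    best_for_a, best_for_b = {}, {}
    for (a_id, b_id), weight in pooled.items():
        if weight > best_for_a.get(a_id, (None, float("-inf")))[1]:
            best_for_a[a_id] = (b_id, weight)
        if weight > best_for_b.get(b_id, (None, float("-inf")))[1]:
            best_for_b[b_id] = (a_id, weight)

    # Accept a link only if each record is the other's best option.
    return {
        (a_id, b_id): weight
        for a_id, (b_id, weight) in best_for_a.items()
        if best_for_b[b_id][0] == a_id
    }


# The A1/A2/B1/B2 example from the text, with invented weights.
pass_1 = [("A1", "B1", 20.0), ("A1", "B2", 18.0)]
pass_2 = [("A2", "B2", 15.0), ("A1", "B1", 17.5)]
links = mutual_best_links(pass_1 + pass_2)
```

Here A1 is linked to B1, while A2 is left unlinked because B2's best option is A1 rather than A2. An auction-style assignment would instead have linked A2 to B2 to maximise the total weight.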

The second phase of the probabilistic decision rule stage takes the output of one-to-one assignment and decides which pairs should be retained as links, and which pairs should be rejected as non-links. The simplest decision rule uses a single ‘cut-off’ point, where all record pairs with a linkage weight at or above the cut-off are assigned as links, and all those pairs with a linkage weight below the cut-off are assigned as non-links. A more sophisticated decision rule was used in the 2016 Death Registrations to Census linkage project, employing lower and upper cut-offs. Record pairs with a weight at or above the upper cut-off were declared links while those with a weight below the lower cut-off were declared non-links. In order to establish the upper and lower cut-off values, a sample of the record pairs identified by the assignment algorithm was clerically reviewed. The upper cut-off was then set at a weight value such that no false links had been detected above the cut-off in the sample. The record pairs with weights between the upper and lower cut-offs were clerically reviewed to determine which links to retain for the final linked dataset.
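
The two cut-off decision rule can be sketched as follows. The cut-off values and sample data are invented, and the helper that derives the upper cut-off is only one reading of the rule that no false links were detected above the cut-off in the reviewed sample.

```python
def classify(weight, lower_cutoff, upper_cutoff):
    """Assign a record pair to link, non-link or clerical review."""
    if weight >= upper_cutoff:
        return "link"
    if weight < lower_cutoff:
        return "non-link"
    return "clerical review"


def upper_cutoff_from_sample(sample):
    """Lowest sampled weight that lies above every false link in the sample.

    `sample` is a list of (weight, is_true_link) pairs from clerical review.
    Returns None when no sampled weight lies above the highest false link.
    """
    highest_false = max((w for w, ok in sample if not ok), default=float("-inf"))
    return min((w for w, ok in sample if w > highest_false), default=None)


# Invented review sample: the only false link has a weight of 12.
sample = [(10.0, True), (12.0, False), (14.0, True), (16.0, True)]
upper = upper_cutoff_from_sample(sample)
decisions = [classify(w, lower_cutoff=8.0, upper_cutoff=upper) for w in (5.0, 9.0, 14.0)]
```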

3.5.1 Clerical review of record pairs

Each record pair was manually inspected to resolve its match status (i.e. if the link was 'true' or 'false'). As part of this process, a clerical reviewer was often able to use information which could not be captured in the automated comparison process, but could be identified by the reviewer, such as common transcription errors (e.g. 1 mistaken as 7) or transposed information, such as the day of birth reported as the month or vice versa.

In addition to the linking variables, supplementary information was also used to confirm a link as true. This included:

  • non-linking variables such as ancestry, occupation, schooling and qualification; and
  • reviewing the dates of birth and country of birth of parents (when available) for child records that had been linked.
     

These supplementary variables helped to inform difficult decisions, especially on record pairs belonging to children, allowing for greater insight into whether a record pair was an actual match or just contained similar demographic and personal characteristics for two different individuals. 

Clerical review was performed on 62,115 links, resulting in the confirmation of 35,820 matches. Initially, reviewers assessed the 'best' option for a link, that is, where Death registrations were matched to Census records based on the greatest level of agreement on linking variables. However, for Death registrations where the best option was rejected, subsequent clerical review also assessed the second-best and, if relevant, third-best option. This was further supplemented by a specific investigation into Aboriginal and Torres Strait Islander links. Following the inspection of first, second and third options for Death registrations, reviewers also assessed the remaining potential links identified as Aboriginal and Torres Strait Islander on either Census or Death registrations datasets. By this late stage of clerical review, fewer than 10% of those potential links had an agreement weight comparable to the other accepted links.

While the 2011 project applied high standards for precision, the 2016 linkage placed an even greater emphasis on ensuring as many links as possible in the final set of results were 'true' (i.e. the linked records do in fact belong to the same individual). This was achieved through the following processes:

  • A more extensive sampling review process. About 17,000 links were sampled in 2016 to assess the precision of the 2016 links, compared to approximately 3,000 links in 2011. The sample size was much larger to ensure adequate agreement patterns were found in the links sampled. This involved sampling at least 5% of links from each individual weight value generated in the linkage run. All of the sampled 2016 links were reviewed twice by different clerical review staff to ensure reliability of the precision estimates deduced from sampling. For more information refer to 3.5.2 Quality assurance of clerically reviewed record pairs; and
  • A more conservative approach to confirming links in clerical review. Links were only confirmed when there was a high degree of confidence that the link was true. Staff were instructed to reject links where a '50/50' decision had to be made (i.e. there were valid reasons to both confirm and deny the link). Emphasis was given to finding sufficient agreement in key linkage fields (i.e. name, address, date of birth) to confirm links. Staff were instructed to reject links where two or more key linking fields did not agree, and there was no available evidence that explained the fields' disagreement (e.g. examining the Census form to find typographical errors in the data).
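The sampling rule in the first point above (at least 5% of links from each individual weight value) amounts to stratified sampling by weight. A minimal sketch follows, assuming links are available as `(link_id, weight)` pairs; the function name, the data layout and the fixed random seed are assumptions made for illustration, not details of the project.

```python
import math
import random
from collections import defaultdict


def sample_by_weight(links, fraction=0.05, seed=0):
    """Draw at least `fraction` of the links from every distinct
    weight value, with a minimum of one link per value."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for link_id, weight in links:
        strata[weight].append(link_id)
    sample = []
    for weight in sorted(strata):
        ids = strata[weight]
        # Round up so that small strata are never under-sampled.
        n = max(1, math.ceil(fraction * len(ids)))
        sample.extend(rng.sample(ids, n))
    return sample
```

Sampling within each weight value, rather than across all links at once, ensures that rarely occurring weights are still represented in the reviewed sample.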
     

3.5.2 Quality assurance of clerically reviewed record pairs

Clerical review relies upon judgment by a well-trained individual. Therefore, while efforts are taken to minimise the risk, it is possible for a link to be incorrectly assigned as a match or non-match.

Quality assurance (QA) techniques were applied to clerical review to assess the accuracy of the clerical review decisions. The QA process involved having a sample of the clerical record pairs reviewed a second time by a different reviewer. If the decision for a record pair made by the QA reviewer conflicted with the decision made in the original clerical review, this was identified as an 'adjudication' pair. Adjudication results were used to update the original decisions made on clerically reviewed links.

Performing QA on clerically reviewed record pairs enabled a basic measure of quality, referred to as a 'clerical review consistency rate' (CR), to be obtained. This rate is based on the number of adjudication pairs as a proportion of the total number of record pairs that were quality assured: the fewer adjudication pairs, the higher the consistency rate. Note that the CR is not strictly an estimate of clerical review accuracy; rather, it is a measure of the level of consistency with which different coders applied decisions to record pairs. The QA results were not used to supplement the final linked results. The quality assurance process produced a clerical review consistency rate of 95%, indicating the clerical review process was of high quality.
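A minimal sketch of how such a rate could be computed is shown below. It assumes the reported rate is the share of quality-assured pairs on which the two reviewers agreed (that is, one minus the adjudication share), which is an interpretation consistent with a high value indicating high quality. The function name and the dictionary layout are hypothetical.

```python
def consistency_rate(original, qa):
    """Compare QA decisions with the original clerical decisions.

    original, qa: dicts mapping a record-pair id to True (match) or
    False (non-match). Only pairs present in `qa` are assessed.
    Returns (rate, adjudication_pairs).
    """
    if not qa:
        raise ValueError("no quality-assured pairs supplied")
    adjudication = [pair for pair, decision in qa.items()
                    if original[pair] != decision]
    rate = 1 - len(adjudication) / len(qa)
    return rate, adjudication
```

The list of adjudication pairs is returned alongside the rate so that conflicting decisions can be passed to an adjudicator.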

Linkage results

Of the 177,380 death records, 159,657 (90.0%) records were linked to one of 22,485,854 eligible Census records. Of the 3,246 Aboriginal and Torres Strait Islander death records, 2,315 (71.3%) were linked.
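The linkage rates quoted above follow directly from the counts. The short sketch below reproduces them; the function name is illustrative only.

```python
def linkage_rate(linked: int, total: int) -> float:
    """Percentage of death records linked to a Census record."""
    return 100 * linked / total


# Counts as reported in this paper.
all_deaths = linkage_rate(159_657, 177_380)
indigenous_deaths = linkage_rate(2_315, 3_246)
unlinked_deaths = 177_380 - 159_657
```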

Examination of the characteristics of the links identified and the unlinked records can be found in 4.3 Characteristics of linked and unlinked Death registrations.

4.1 Linkage accuracy

The following quality measures were calculated for the linkage and indicate a good level of overall quality:

  • The linkage rate, 90%, being the proportion of Death registrations linked to a 2016 Census record; and
  • The estimated proportion of correctly linked records, otherwise referred to as 'linkage precision'.
     

4.2 Linkage precision

Not all record pairs assigned as links in a data linkage process are a true match, that is, a record pair belonging to the same individual. While the methodology is designed to ensure that the vast majority of links are true, some are actually false, i.e. the records in the link belong to different people rather than the same person. The linkage strategy used for the project was designed to ensure a high level of accuracy. Accordingly, the strategy was restrictive and conservative.

One of the key measures of linkage quality is the proportion of links in the dataset that are false. The number of false links can be estimated by clerically reviewing a sample of links, or by using modelling techniques. Once an estimate of the number of false links is obtained, a 'precision' can be calculated. The precision is an estimate of the proportion of links that are matches (i.e. belonging to the same entity).

                                         Precision = (Total links - False link estimate)/Total links

Once the precision of the dataset is estimated, the false link rate is easily calculated.

                                          False link rate = 1 - Precision
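The two formulas above translate directly into code. In the sketch below the function names are illustrative, and any non-zero false link estimate passed in would be a hypothetical input rather than a figure from this project.

```python
def precision(total_links: int, false_link_estimate: float) -> float:
    """Precision = (Total links - False link estimate) / Total links."""
    return (total_links - false_link_estimate) / total_links


def false_link_rate(precision_value: float) -> float:
    """False link rate = 1 - Precision."""
    return 1 - precision_value


# With no estimated false links, precision is 1 (i.e. 100%).
p = precision(159_657, 0)
```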

The estimated link precision of the 2016 Death Registrations to Census linkage dataset is 100% as the decision model did not allow for any false links. As previously discussed, the upper cut-off was set such that it was estimated there were no false links above the cut-off while the clerical review process only accepted links for which there was sufficient evidence to support them being accurate matches. In reality, there will be a small number of false links due to a slight degree of inconsistent decisions between clerical reviewers. While the number of false links is not able to be quantified precisely, the proportion is expected to be very small.

4.3 Characteristics of linked and unlinked Death registrations

Table 3 - Census and Death registrations, Australia

| Description | Records |
|---|---|
| Number | |
| Census records eligible for linking(a) | 22 485 854 |
| Aboriginal and Torres Strait Islander Census records | 649 171 |
| Records on death file(b) | 177 380 |
| Death records linked | 159 657 |
| Death records not linked | 17 723 |
| Aboriginal and Torres Strait Islander records on death file(c) | 3 246 |
| Aboriginal and Torres Strait Islander records linked(c) | 2 315 |
| Percent | |
| All death records linked | 90.0 |
| Aboriginal and Torres Strait Islander death records linked | 71.3 |

a. Excludes residents temporarily overseas on Census night, imputed records and Census net undercount adjustment.
b. Deaths which occurred between 09 Aug 2016 and 28 Sep 2017.
c. According to Indigenous status reported on death registration form.
 

The number and percentage of death records linked to Census records by selected characteristics of deceased persons are presented in Table 4. A slightly higher linkage rate was achieved for females (91.4%) than for males (88.6%). The linkage rate varied considerably by age, being lowest for deceased persons aged 0-14 years (63.4%). This may be due to the comparatively high Census undercount rate in this age group. The linkage rate was highest for deceased persons aged 75 years and over (92.9%).

Table 4 - Death registrations linked to Census records by selected characteristics, Australia

| Reported characteristics in death registration | Total death records (no.) | Linked records (no.) | Linked records (%) |
|---|---|---|---|
| Sex | | | |
| Males | 91 143 | 80 796 | 88.6 |
| Females | 86 237 | 78 861 | 91.4 |
| Age (years) | | | |
| 0-14 | 686 | 435 | 63.4 |
| 15-24 | 1 219 | 868 | 71.2 |
| 25-44 | 5 704 | 3 974 | 69.7 |
| 45-64 | 22 543 | 18 582 | 82.4 |
| 65-74 | 27 979 | 25 066 | 89.6 |
| 75 and over | 119 247 | 110 730 | 92.9 |
| Indigenous status | | | |
| Aboriginal and Torres Strait Islander | 3 246 | 2 315 | 71.3 |
| Non-Indigenous | 173 186 | 156 546 | 90.4 |
| Not stated | 948 | 796 | 84.0 |
| State of usual residence | | | |
| New South Wales | 59 887 | 54 077 | 90.3 |
| Victoria | 43 130 | 38 915 | 90.2 |
| Queensland | 34 017 | 30 463 | 89.6 |
| South Australia | 15 349 | 14 045 | 91.5 |
| Western Australia | 16 269 | 14 439 | 88.8 |
| Tasmania | 5 315 | 4 832 | 90.9 |
| Northern Territory | 1 077 | 785 | 72.9 |
| Australian Capital Territory | 2 294 | 2 066 | 90.1 |
| Marital status | | | |
| Never married | 18 547 | 14 774 | 79.7 |
| Married | 70 298 | 64 811 | 92.2 |
| Widowed | 64 004 | 59 249 | 92.6 |
| Divorced | 18 268 | 15 807 | 86.5 |
| Separated | 152 | 123 | 80.9 |
| Not applicable (<15 years) | 6 111 | 4 893 | 80.1 |
| Elapsed time between Census and death | | | |
| Within 6 months of Census | 80 234 | 71 238 | 88.8 |
| Beyond 6 months of Census | 97 146 | 88 419 | 91.0 |

The linkage success varied by state of usual residence as reported on the death registration form. Rates were highest for South Australia (91.5%) and lowest for the Northern Territory (72.9%). All other states and territories had linkage rates between 88.8% and 90.9%. The low linkage rate for the Northern Territory reflects comparatively low linkage rates for both the Aboriginal and Torres Strait Islander and non-Indigenous populations. The linkage rate was similar for married and widowed persons (92.2% and 92.6% respectively). The linkage rate was lower for deaths which occurred within six months of the Census (88.8%) than for those which occurred more than six months after the Census (91.0%).

The linkage success also varied by Indigenous status recorded on the death registration form. People of non-Indigenous origin on the death registration form had a considerably higher linkage success (90.4%) compared with people of Aboriginal and Torres Strait Islander origin (71.3%). A stricter approach to clerical review in the 2016 linkage resulted in a lower, but more accurate, linkage rate than in 2010-2012.

Table 5 - Death registrations linked to Census records by state of usual residence and Indigenous status, Australia

| State of usual residence | Aboriginal and Torres Strait Islander | Non-Indigenous | Not stated(a) |
|---|---|---|---|
| New South Wales | 676 | 52 415 | 474 |
| Victoria | 145 | 38 689 | 115 |
| Queensland | 663 | 29 997 | 66 |
| South Australia | 143 | 13 887 | 15 |
| Western Australia | 354 | 13 987 | 98 |
| Tasmania | 46 | 4 787 | - |
| Northern Territory | 272 | 501 | - |
| Australian Capital Territory | 16 | 2 283 | 24 |
| Total | 2 315 | 156 546 | 796 |

a. Small cell counts have been suppressed to preserve confidentiality.
 

4.4 Reasons for unlinked Death registrations

There were two main reasons why death registrations were not linked to a Census record:

  1. Records belonging to the same individual were present in the Death registration and Census datasets but these records failed to be linked because they contained missing or inconsistent information; or
  2. A link was not possible because there was no Census record corresponding to the death registration as the person was missed from the Census. Proximity of death to Census night is a significant factor in the ability for a link to be achieved.
     

Missing and/or inconsistent information

The quality of a data linkage project is significantly dependent on the quality of three key sources of information, these being name, address and date of birth. When all three sources of information are of very high quality on the linking datasets, identifying true links becomes less complicated, resulting in a high quality outcome for the linkage.

In some cases, the true match was present in the pool of all record pairs but it was not identified because there was a high level of inconsistency between information on the Death registration and the 2016 Census record, or key linking fields were missing from one or both datasets. The reasons for the match being missed can be categorised into the following groups:

  • the missing or inconsistent information did not allow the record pair to be compared in the same blocking categories and could not be linked;
  • the record pair did not contain enough unique common information to distinguish the match from other potential record pairs;
  • the record pair was linked, but was attributed a low link weight as it contained substantial missing or inconsistent information and was positioned below the cut-off identified in sample clerical review; or
  • the record pair was subjected to clerical review, but the high level of inconsistency prevented it from being deemed a true link.
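The first of these groups can be illustrated with a simplified blocking sketch: record pairs are only compared when they share the same values on the blocking fields, so a record with a missing or inconsistent blocking value never meets its true match in that pass. The field names (`yob`, `sex`, `id`) and function names below are hypothetical and are not the blocking variables used in the project.

```python
from collections import defaultdict


def blocking_key(record, fields):
    """Return the tuple of blocking values, or None if any is missing."""
    values = tuple(record.get(f) for f in fields)
    if any(v is None or v == "" for v in values):
        return None
    return values


def candidate_pairs(deaths, census, fields):
    """Pair death and Census records that share a blocking key."""
    index = defaultdict(list)
    for c in census:
        key = blocking_key(c, fields)
        if key is not None:
            index[key].append(c["id"])
    pairs = []
    for d in deaths:
        key = blocking_key(d, fields)
        if key is not None:
            pairs.extend((d["id"], cid) for cid in index.get(key, []))
    return pairs
```

In practice, several blocking passes on different fields are commonly used so that a record missed in one pass can still be compared in another.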


Inconsistent Census information may be recorded due to a range of factors, including:

  • transcription errors in the Census, where the wrong category is selected or the information is transposed, such as the day the person was born being reported in the month field instead of in the day field;
  • data capture errors, where the Census form is scanned using Optical Character Recognition (OCR) software and certain characters may be misclassified, such as a 1 captured as a 7 or a 3 as an 8;
  • reporting errors, where information is given for the wrong member of the household (e.g. person 1's information is reported for person 3) or where the person completing the Census form for a household guesses or estimates information about a fellow household member; or
  • information that was not stated by the respondent and has been imputed as part of Census processing (such as age or sex). These imputed values were set to missing for linking but are included in the analytical dataset.
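The data capture errors listed above can be checked for mechanically. The sketch below treats two digit strings as compatible when they differ only at positions where one of the confusable pairs named in the text (1/7 and 3/8) occurs. The set of pairs and the function name are assumptions for illustration; real OCR confusion patterns are broader.

```python
# Confusable digit pairs taken from the examples in the text only.
OCR_CONFUSIONS = {frozenset("17"), frozenset("38")}


def ocr_compatible(a: str, b: str) -> bool:
    """True when equal-length strings agree at every position, or
    disagree only where a known confusable pair is involved."""
    if len(a) != len(b):
        return False
    return all(
        x == y or frozenset((x, y)) in OCR_CONFUSIONS
        for x, y in zip(a, b)
    )
```

For instance, a year of birth captured as `1981` would be treated as compatible with `1931`, whereas `1921` would not.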


Accurate address coding was crucial in narrowing the search and differentiating between true and false links. However, the nature of the data did not always allow linkage on address or geography, as people may have changed address after Census night and prior to their death (e.g. by moving to a nursing home). In such cases, the address recorded on the death record may not have been captured on the Census.

No Census record

A person may have had no 2016 Census record because they were not in scope of the Census due to absence from Australia, or died in the period around Census night, or they may have been missed in the 2016 Census. 

Due to the size and complexity of the Census, it is inevitable that some people are missed and some are counted more than once. It is for this reason that the Census Post Enumeration Survey (PES) is run shortly after each Census, to provide an independent measure of Census coverage. The PES determines how many people should have been counted in the Census, how many were missed (undercount), and how many were counted more than once (overcount). It also provides information on the characteristics of those in the population who have been under- or overcounted.

The net undercount rate for the 2016 Census was 1%, with a higher rate for Aboriginal and Torres Strait Islander people (17.5%) than for the non-Indigenous population (6.6%). For more information please refer to Census of Population and Housing - Details of Overcount and Undercount, Australia 2016 (cat. no. 2940.0).

In a small number of cases, the absence of a Census form could be the result of the person being overseas at the time of the Census but subsequently dying in Australia and the death registered during the linkage reference period.

Timing of death registration

Because a person near death generally has reduced capacity to complete a Census form, deaths occurring closest to Census night are more difficult to link. That is, deaths in the Census enumeration month are more likely to lack a corresponding Census record, as the Census was not completed for the individual with sufficient information for linking (note that the Census counts people in institutions such as hospitals, but may not collect all information required for linking). In 2011, 10% of deaths in August were unlinked, compared to a range of 6-8% for other months. Due perhaps to an extended enumeration period in 2016, the rate of unlinked records in the month closest to Census night more than doubled, to 23% (see Table 6).

Rolling enumeration procedures for the 2016 Census in remote Aboriginal and Torres Strait Islander communities may have increased the likelihood of an equivalent Census record not existing for deaths of members of these communities occurring around the time of the Census. Rolling enumeration involves conducting the Census over an extended period of four weeks. In these instances it is possible that a resident who moved during the enumeration period may have been missed and therefore a corresponding Census record would not exist, or they may have passed away after Census Night (9 August 2016) but before Census enumeration was conducted in their residential area.

Table 6 - Death registrations linked to Census records by month and year of death, Australia

| Year of death | Month of death | Total death registrations (no.) | Linked death registrations (no.) | Linked death registrations (%) |
|---|---|---|---|---|
| 2016 | August | 4 220 | 3 250 | 77.0 |
| 2016 | September | 14 054 | 12 160 | 86.5 |
| 2016 | October | 12 775 | 11 508 | 90.1 |
| 2016 | November | 13 753 | 12 437 | 90.4 |
| 2016 | December | 11 398 | 10 285 | 90.2 |
| 2017 | January | 12 703 | 11 522 | 90.7 |
| 2017 | February | 11 843 | 10 718 | 90.5 |
| 2017 | March | 13 454 | 12 240 | 91.0 |
| 2017 | April | 10 331 | 9 370 | 90.7 |
| 2017 | May | 14 766 | 13 407 | 90.8 |
| 2017 | June | 13 906 | 12 636 | 90.9 |
| 2017 | July | 13 043 | 11 879 | 91.1 |
| 2017 | August | 16 053 | 14 579 | 90.8 |
| 2017 | September | 15 081 | 13 666 | 90.6 |
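The unlinked rate for each month can be derived from the Table 6 counts as the complement of the linkage rate. The sketch below does this for all fourteen months; the variable and function names are illustrative.

```python
# (month of death, total death registrations, linked death registrations)
TABLE_6 = [
    ("2016-08", 4220, 3250), ("2016-09", 14054, 12160),
    ("2016-10", 12775, 11508), ("2016-11", 13753, 12437),
    ("2016-12", 11398, 10285), ("2017-01", 12703, 11522),
    ("2017-02", 11843, 10718), ("2017-03", 13454, 12240),
    ("2017-04", 10331, 9370), ("2017-05", 14766, 13407),
    ("2017-06", 13906, 12636), ("2017-07", 13043, 11879),
    ("2017-08", 16053, 14579), ("2017-09", 15081, 13666),
]


def unlinked_rate(total: int, linked: int) -> float:
    """Percentage of death registrations without a Census link."""
    return 100 * (total - linked) / total


unlinked_by_month = {
    month: round(unlinked_rate(total, linked), 1)
    for month, total, linked in TABLE_6
}
# Month with the highest share of unlinked death registrations.
worst_month = max(unlinked_by_month, key=unlinked_by_month.get)
```

The monthly totals and linked counts sum to the overall figures of 177,380 and 159,657 reported in Table 3, which provides a simple consistency check on the transcription.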

References

Australian Bureau of Statistics:

(2018) Life Tables for Aboriginal and Torres Strait Islander Australians - 2015-17, cat. no. 3302.0.55.003.

(2016) Australian Statistical Geography Standard (ASGS): Volume 1 - Main Structure and Greater Capital City Statistical Areas, July 2016, cat. no. 1270.0.55.001.

(2016) Census of Population and Housing - Details of Overcount and Undercount, Australia 2016, cat. no. 2940.0.

(2016) Research Paper: Death Registrations to Census Linkage Project - A Linked Dataset for Analysis, Mar 2016, cat. no. 1351.0.55.058.

(2016) Understanding the Census and Census Data, Australia, 2016, cat. no. 2900.0.

(2013) Death registrations to Census linkage project - Key Findings for Aboriginal and Torres Strait Islander peoples, 2011-12, cat. no. 3302.0.55.005.

(2013) Information Paper: Death registrations to Census linkage project - Methodology and Quality Assessment - 2011-12, cat. no. 3302.0.55.004.

(2008) Information Paper: Census Data Enhancement - Indigenous Mortality Quality Study - 2006-07, cat. no. 4723.0.

Chipperfield, J, Hansen, N, & Rossiter, P (2018) "Estimating Precision and Recall for Deterministic and Probabilistic Record Linkage", International Statistical Review.

Christen, P & Churches, T (2005) Febrl 0.3 Documentation, (last viewed on 05 December 2018).

Fellegi, I & Sunter, A (1969) “A Theory for Record Linkage”, Journal of the American Statistical Association, 64(328), pp. 1183–1210.

Harding, S, Jackson Pulver, L, McDonald, P, Morrison, P, Trewin, D, & Voss, A (2017). Report on the quality of 2016 Census data, (last viewed on 05 December 2018).

History of changes

10/12/2018 - Information paper formerly known and released under the title "Information Paper: Death registrations to Census linkage project - Methodology and Quality Assessment" for the 2011-12 reference period.