Deterministic linking and linkage keys
Creating a linkage key
An example of deterministic linking using a linkage key
Deterministic linking involves the exact matching of information on different records across the datasets being combined for a linking project.
The simplest form of deterministic linking uses a unique identifier, such as an Australian Business Number or a social security number, to determine if the records refer to the same entity. (An entity may be a person, household, organisation or locality.)
This is also called ‘exact linking’ because the identifier either matches or does not match. As a result, if the unique identifier contains any errors, the match will not be found, because the identifiers must be identical in all the datasets being linked.
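As a minimal sketch of exact linking on a unique identifier, the Python below pairs records from two small datasets only when their identifier values are identical. The function name `link_exact`, the field names and the identifier values are illustrative assumptions (the values are made-up placeholders, not real Australian Business Numbers).

```python
def link_exact(dataset_a, dataset_b, id_field):
    """Pair records whose identifier values are exactly equal."""
    # Index dataset B by identifier so each lookup is a direct comparison.
    index_b = {}
    for record in dataset_b:
        index_b.setdefault(record[id_field], []).append(record)

    links = []
    for record_a in dataset_a:
        for record_b in index_b.get(record_a[id_field], []):
            links.append((record_a, record_b))
    return links


# Hypothetical records; the "abn" values are placeholders.
businesses_a = [
    {"abn": "11111111111", "employees": 12},
    {"abn": "22222222222", "employees": 40},
]
businesses_b = [
    {"abn": "11111111111", "turnover": 500000},
    {"abn": "22222222223", "turnover": 900000},  # last digit mistyped
]

links = link_exact(businesses_a, businesses_b, "abn")
print(len(links))  # 1
```

Only the first business is linked. The second pair differs by a single digit, so exact linking treats the two records as different entities.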
If the unique identifiers may be unreliable for linking purposes, several other options are available. One is to use probabilistic linking instead (see Sheet 4).
Other possible approaches include variations of deterministic linking, such as ‘stepwise deterministic linking’ and ‘rules-based linking’. These techniques use other information on the records to overcome deficiencies in the quality of the unique identifier. For more information, see the references on this sheet.
If a unique identifier is not available, or is not of sufficient quality, it is possible to create a proxy, often referred to as a linkage key.
A linkage key is a code created using a combination of identifying information on each record, such as name, address and date of birth (see Table 1 for an example).
The linkage key usually replaces identifiers on the linked record. If the records in the linked dataset are de-identified (by removing name and address), this helps to protect the identity of the people or organisations in the new dataset.
However, this does not necessarily ensure privacy protection. Even without name and address it may still be possible to recognise a person or organisation, through a set of unusual characteristics in the linked dataset. For example, small-area data (e.g., a suburb) showing a 17-year-old widow with four children could be recognisable to someone living in that area. Therefore, further confidentiality techniques need to be applied before releasing the data.
The Confidentiality Series provides more information on privacy and confidentiality.
Table 1 shows how a 12-character key might be built using:
• the second, third and fifth letters of a person’s last name
• the second and third letters of a person’s first name
• the second, fourth, sixth and seventh digits of a person’s date of birth (DD/MM/YYYY)
• gender (male is 1 and female is 2)
• the second and third digits of the postcode.
Table 1: Example of creating a linkage key
As with the unique identifier, if there is an error or missing information on the records, the linkage key may not match exactly and therefore the records will not be linked.
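The key construction listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names, the decision to raise an error for values that are too short, and the date of birth and postcode used in the example are all hypothetical. The date of birth and postcode were chosen only so that the result reproduces the key MIHOH2597162 used later in this sheet.

```python
GENDER_CODES = {"male": "1", "female": "2"}


def pick(text, positions):
    """Return the characters at the given 1-based positions."""
    # Assumption: a value that is too short is rejected rather than padded.
    if len(text) < max(positions):
        raise ValueError(f"{text!r} is too short for positions {positions}")
    return "".join(text[p - 1] for p in positions)


def build_linkage_key(last_name, first_name, date_of_birth, gender, postcode):
    """Build a 12-character linkage key following the rules listed above."""
    dob_digits = date_of_birth.replace("/", "")  # DD/MM/YYYY -> DDMMYYYY
    return (
        pick(last_name.upper(), (2, 3, 5))     # 3 letters from the last name
        + pick(first_name.upper(), (2, 3))     # 2 letters from the first name
        + pick(dob_digits, (2, 4, 6, 7))       # 4 digits from the date of birth
        + GENDER_CODES[gender]                 # 1 digit for gender
        + pick(postcode, (2, 3))               # 2 digits from the postcode
    )


# Hypothetical details for John Smith.
key = build_linkage_key("Smith", "John", "12/05/1975", "male", "2620")
print(key)  # MIHOH2597162

# A single misspelling in the last name produces a different key.
typo_key = build_linkage_key("Smyth", "John", "12/05/1975", "male", "2620")
print(typo_key)  # MYHOH2597162
```

The second call shows why an error on either record prevents a link: the two keys differ in one character, and exact matching has no tolerance for that difference.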
Because linkage keys are built from identifying information, someone who knows how a key is constructed could rebuild it from a person’s details and so identify that person in the dataset. Therefore, encryption of the key is recommended as an additional safety measure to reduce the risk of identification or re-identification.
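One way to protect the key is sketched below. Note that it uses a keyed hash (HMAC-SHA256 from the Python standard library) rather than reversible encryption, which is a substitution made for this illustration and not a method prescribed by this sheet. The function name `protect_key` and the secret value are assumptions; in practice the secret would need to be generated and stored securely and shared only by the parties performing the linkage.

```python
import hashlib
import hmac


def protect_key(linkage_key, secret):
    """Return a keyed hash of the linkage key as a hexadecimal string."""
    return hmac.new(secret, linkage_key.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder secret for illustration only.
secret = b"replace-with-a-securely-stored-secret"

protected_a = protect_key("MIHOH2597162", secret)
protected_b = protect_key("MIHOH2597162", secret)
protected_other = protect_key("MYHOH2597162", secret)

print(protected_a == protected_b)      # True
print(protected_a == protected_other)  # False
```

Because the same input and secret always give the same output, records can still be matched exactly on the protected value, while someone without the secret cannot rebuild it from a person's details.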
Stage 1: Assigning linkage keys to all records within datasets A and B.
This example uses a linkage key (based on Table 1) for a project looking at educational attainment and earnings, by age and sex.
Stage 2: Extracting the content data and unique identifier.
In this example, the linkage key (MIHOH2597162) identifies the records that refer to the same person, in this case John Smith.
Stage 3: Merging to create a linked record.
Using the linkage key, only the information required for the project (highest educational attainment, income, sex and age) is extracted from each record and merged into a new linked record (now known by the identifier MIHOH2597162) in the new dataset (see Diagram 1).
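The three stages can be sketched in Python as follows. All record values, field names and function names are hypothetical, and the split of variables between datasets A and B is an assumption made for the illustration. Stage 1 is represented by records that already carry a linkage key.

```python
# Stage 1 (assumed already done): every record carries a linkage key.
dataset_a = [
    {"linkage_key": "MIHOH2597162", "name": "John Smith", "sex": "male",
     "age": 40, "highest_educational_attainment": "Bachelor degree"},
]
dataset_b = [
    {"linkage_key": "MIHOH2597162", "name": "John Smith", "income": 65000},
    {"linkage_key": "ABCDE1234299", "name": "Another Person", "income": 48000},
]

CONTENT_A = ("highest_educational_attainment", "sex", "age")
CONTENT_B = ("income",)


def extract(records, content_fields):
    """Stage 2: keep only the linkage key and the required content fields."""
    # Assumption: each linkage key occurs at most once per dataset.
    return {
        record["linkage_key"]: {field: record[field] for field in content_fields}
        for record in records
    }


def merge(extract_a, extract_b):
    """Stage 3: merge the extracts that share a linkage key."""
    linked = []
    for key in sorted(extract_a.keys() & extract_b.keys()):
        linked.append({"linkage_key": key, **extract_a[key], **extract_b[key]})
    return linked


linked = merge(extract(dataset_a, CONTENT_A), extract(dataset_b, CONTENT_B))
print(linked)
```

The linked record contains the linkage key and the four project variables but not the name, and the record in dataset B with no counterpart in dataset A is left unlinked.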