A method for Linking Multiple De-identified Datasets

Andrew Waugh
David Rowley
Auren Clarke


National Statistics Institutes have been exploring the value of using administrative data. The Administrative Data Team within the Scotland’s Census 2021 Programme is exploring how administrative datasets can be brought together to support the census and produce alternative population estimates.

We are developing methods to link de-identified administrative datasets, drawing on existing methods.

Our method uses hashed linking variables, derived from name, address, date of birth and gender. One linking variable is a corrected name, produced by comparing each name with every name in a reference set and scoring the difference. The scoring algorithm we developed considers transpositions, deletions, insertions, substitutions and moves, and is sensitive to the particular letters involved.
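
As an illustration only, the name correction step might be sketched as below. This is not the authors' scoring algorithm: it uses a standard optimal string alignment distance (insertions, deletions, substitutions and adjacent transpositions) and does not model moves. The letter-pair table `SIMILAR`, the threshold `max_distance`, the function names and the use of a keyed HMAC-SHA256 hash are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical letter pairs that are cheaper to substitute (assumption, not from the source).
SIMILAR = {frozenset("AO"), frozenset("EI"), frozenset("MN"), frozenset("SZ"), frozenset("CK")}


def sub_cost(a, b):
    """Substitution cost that depends on the particular letters involved."""
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in SIMILAR else 1.0


def name_distance(s, t):
    """Optimal string alignment distance with letter-sensitive substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)
    for j in range(cols):
        d[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                               # deletion
                d[i][j - 1] + 1.0,                               # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution or match
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)    # adjacent transposition
    return d[-1][-1]


def correct_name(name, reference, max_distance=1.0):
    """Return the closest reference name, or the cleaned input if none is close enough.

    Assumes `reference` is a non-empty collection of upper-case names.
    """
    name = name.strip().upper()
    best = min(reference, key=lambda r: (name_distance(name, r), r))
    return best if name_distance(name, best) <= max_distance else name


def hash_value(value, key):
    """Keyed hash of a linking variable (HMAC-SHA256 is an assumption)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

In this sketch, `hash_value(correct_name("andrwe", ["ANDREA", "ANDREW"]), key)` would yield the same hashed linking variable as a correctly spelled "ANDREW", so the two records could still agree on that variable after de-identification.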

Linking variables are combined at run time to produce thousands of matchkeys, allowing more matches to be linked deterministically using hashed data. Overall link strength scores are calculated as a combination of:

  • Penalties associated with the matchkey, based on the linking variables used, and

  • Similarity on dates of birth, measured at run time using weighted Bloom Filters.
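
The two components above can be sketched as follows, under stated assumptions. Matchkeys are built here by hashing every combination of at least `min_vars` already-hashed linking variables, with a penalty for each variable left out; dates of birth are encoded as Bloom filters in which the year, month and day are weighted by the number of hash functions they receive, and compared with a Dice coefficient. The variable names, penalty values, filter size, weights and the final `link_strength` formula are hypothetical and are not taken from the method described.

```python
import hashlib
from itertools import combinations

# Hypothetical penalties for omitting a linking variable from a matchkey.
OMISSION_PENALTY = {"forename": 3.0, "surname": 3.0, "dob": 4.0, "postcode": 2.0, "gender": 0.5}


def matchkeys(record, min_vars=3):
    """Yield (matchkey, penalty) for each combination of at least min_vars variables.

    `record` maps variable names to values that are assumed to be hashed already.
    """
    names = sorted(record)
    for size in range(len(names), min_vars - 1, -1):
        for combo in combinations(names, size):
            text = "|".join(f"{v}={record[v]}" for v in combo)
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            penalty = sum(OMISSION_PENALTY[v] for v in names if v not in combo)
            yield key, penalty


def dob_bloom(dob, size=256):
    """Encode an ISO date (YYYY-MM-DD) as a set of Bloom filter bit positions.

    Weighting is done by giving the year more hash functions than month or day.
    """
    year, month, day = dob.split("-")
    bits = set()
    for label, value, n_hashes in (("Y", year, 4), ("M", month, 2), ("D", day, 2)):
        for seed in range(n_hashes):
            digest = hashlib.sha256(f"{seed}:{label}:{value}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def link_strength(penalty, bloom_a, bloom_b, dob_weight=5.0):
    """Hypothetical combination: weighted date-of-birth similarity minus matchkey penalty."""
    return dob_weight * dice(bloom_a, bloom_b) - penalty
```

Two records link deterministically when they share at least one matchkey; in this sketch the shared matchkey with the lowest penalty would be the one passed to `link_strength`.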

We concatenate all the datasets and link the resulting dataset to itself. This allows simultaneous linking across all datasets and resolution of duplicate records within each dataset, but can produce complex patterns of links. By treating the records and links as a graph, we allocate records to unique individuals through a vertex colouring algorithm on the complement of each connected component. Link strength is used to prioritise allocation.
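
The allocation step can be illustrated with a minimal sketch. In the complement of a component, two records are adjacent when they are not linked, so a proper colouring of the complement places only mutually linked records in the same colour class, and each class can be read as one individual. The greedy heuristic below, the ordering by strongest link and the function name `allocate` are assumptions for illustration, not the authors' algorithm.

```python
from collections import defaultdict


def allocate(records, links):
    """Assign each record an individual id.

    `links` is an iterable of (record_a, record_b, strength) tuples.
    """
    adj = defaultdict(dict)
    for a, b, strength in links:
        adj[a][b] = strength
        adj[b][a] = strength

    # Find connected components with a depth-first search.
    seen, components = set(), []
    for r in records:
        if r in seen:
            continue
        seen.add(r)
        stack, comp = [r], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)

    person_of, next_id = {}, 0
    for comp in components:
        classes = []  # colour classes of the complement graph
        # Visit records with the strongest links first.
        order = sorted(comp, key=lambda v: (-max(adj[v].values(), default=0.0), v))
        for v in order:
            best, best_score = None, float("-inf")
            for cls in classes:
                # v may join a class only if it is linked to every member,
                # i.e. it has no complement edge into the class.
                if all(u in adj[v] for u in cls):
                    score = sum(adj[v][u] for u in cls)
                    if score > best_score:
                        best, best_score = cls, score
            if best is None:
                best = []
                classes.append(best)
            best.append(v)
        for cls in classes:
            for v in cls:
                person_of[v] = next_id
            next_id += 1
    return person_of
```

For a chain of links A-B (0.9), B-C (0.4), C-D (0.8), this sketch groups A with B and C with D: the weak B-C link is not enough to merge the groups, because A is not linked to C.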

Clerical review of the links made found that those with stronger scores were more likely to be judged a match.

This linking method is being used and tested further in linking administrative datasets for population estimates. We also plan to use it for several linking tasks in the processing of Scotland’s Census 2021.

Article Details

How to Cite
Waugh, A., Rowley, D. and Clarke, A. (2018) “A method for Linking Multiple De-identified Datasets”, International Journal of Population Data Science, 3(2). doi: 10.23889/ijpds.v3i2.494.