Data Linkage of Hashed Data: Derive and Conquer


Josie Plachta
Charlie Tomlin
Rachel Shipsey


Linking hashed datasets is much more difficult than linking in-the-clear data. Hashing prevents the use of matching tools that cope with messy data, such as ‘contained-within’ functions and edit distance metrics. Hashing of sensitive data received from third parties is becoming more common because of increased data security concerns. Institutions need to be ready to link hashed data with high accuracy; otherwise the quality of outputs from these linked datasets will suffer.
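To illustrate why hashing defeats these tools, the following minimal sketch hashes two similar surnames. The salted SHA-256 scheme, the salt value, and the function names are assumptions made for illustration only; the abstract does not say which hashing method the data suppliers use.

```python
import hashlib


def hash_value(value: str, salt: str = "example-salt") -> str:
    """Return a salted SHA-256 hex digest of a normalised string.

    The salt and normalisation rules are hypothetical, not taken from the article.
    """
    normalised = value.strip().upper()
    return hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()


def differing_fraction(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


clear_short, clear_long = "JOHN", "JOHNSON"
hashed_short, hashed_long = hash_value(clear_short), hash_value(clear_long)

# In the clear, a 'contained-within' check succeeds.
contained_in_clear = clear_short in clear_long

# After hashing, both digests are 64 characters long, so containment
# can only hold if the digests are identical, which they are not.
contained_when_hashed = hashed_short in hashed_long

print(contained_in_clear, contained_when_hashed)
print(differing_fraction(hash_value("JOHNSON"), hash_value("JOHNSTON")))
```

Because a one-character change in the input produces an unrelated digest, only exact agreement between hashed values carries any information, which is what motivates matching on derived variables.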

Objectives and Approach
We designed an innovative matching method, Derive and Conquer (D&C). Instead of matching on the full variables, we match on derived variables that contain substrings or patterns of the full variable (e.g. Soundex or the first 4 characters of a string). However, using many combinations of these derived variables would require thousands of traditional matchkeys to be programmed, run, and reviewed. Instead, D&C runs matchkeys on a derived agreement variable, which amalgamates the information stored in multiple derived variables into one value and so reduces the number of matchkeys to a manageable amount. D&C runs on distributed computing systems using PySpark to link datasets containing millions of records in a timely manner.
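The idea can be sketched in a few lines of plain Python (used here rather than PySpark so the example is self-contained). Everything below is an illustrative assumption, not the authors' implementation: the choice of derived variables, the salted SHA-256 hashing, the 0 to 3 agreement scale, and the example matchkey are all hypothetical.

```python
import hashlib

SALT = "example-salt"  # hypothetical; the real hashing scheme is not described

# Standard American Soundex letter codes.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Four-character Soundex code, or an empty string if there are no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # vowels reset the previous code
    return (out + "000")[:4]


def hashed(value: str) -> str:
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()


def derive(name: str) -> dict:
    """Derive variables in the clear, then hash each one before sharing."""
    clean = "".join(c for c in name.upper() if c.isalpha())
    return {"full": hashed(clean),
            "first4": hashed(clean[:4]),
            "soundex": hashed(soundex(clean))}


def agreement(a: dict, b: dict) -> int:
    """Collapse several derived-variable comparisons into a single value."""
    if a["full"] == b["full"]:
        return 3  # exact agreement on the full variable
    same_first4 = a["first4"] == b["first4"]
    same_soundex = a["soundex"] == b["soundex"]
    if same_first4 and same_soundex:
        return 2
    if same_first4 or same_soundex:
        return 1
    return 0


def example_matchkey(forename_level: int, surname_level: int) -> bool:
    """One hypothetical matchkey, expressed on agreement values only."""
    return forename_level >= 2 and surname_level >= 1


forename_level = agreement(derive("Jonathan"), derive("Jonathon"))
surname_level = agreement(derive("Smith"), derive("Smyth"))
print(forename_level, surname_level,
      example_matchkey(forename_level, surname_level))
```

In this sketch a single rule on agreement values stands in for every explicit combination of derived variables that would reach the same level, which is one way the number of matchkeys could be kept manageable.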

D&C was developed using in-the-clear UK Census and health records, where it produced results comparable to the in-the-clear gold standard. It is currently being tested on hashed data to link UK tax and benefits data to UK health records. 66.4 million records were declared matched, a realistic match rate for the UK population. Research into the linkage quality is ongoing to produce estimates of the bias in the linkage and of its precision and recall. We look forward to presenting these results at the conference in October, and they will be used to improve D&C.

Conclusion / Implications
Using these derived variables, we have been able to overcome the challenge of matching massive hashed datasets with a realistic match rate and in a realistic time frame.


How to Cite
Plachta, J., Tomlin, C. and Shipsey, R. (2020) “Data Linkage of Hashed Data: Derive and Conquer”, International Journal of Population Data Science, 5(5). doi: 10.23889/ijpds.v5i5.1447.