Linkage of national clinical datasets without patient identifiers using probabilistic methods.

Main Article Content

Helen Blake
Linda Sharples
Katie Harron
Jan van der Meulen
Kate Walker


To develop a step-by-step process for probabilistic linkage of national clinical and administrative datasets without personal information, providing guidance on selecting variables for linkage, estimating match weights, and choosing the probabilistic linkage threshold. To validate this process against deterministic linkage using patient identifiers.

We undertook probabilistic linkage without personal information using electronic health records from the National Bowel Cancer Audit (NBOCA) and Hospital Episode Statistics (HES) databases for bowel cancer patients undergoing emergency surgery in England. We selected linkage variables based on completeness, and ability to discriminate between matches and non-matches, assessed using a novel score derived from m-probabilities and u-probabilities. Taking deterministic linkage using patient identifiers as the reference-standard, we calculated sensitivity and specificity of probabilistic linkage, plotted a Receiver Operating Characteristic curve across alternative thresholds of match weights, and compared patient characteristics and estimates from fitted regression models between linkage methods.

When considering the ability to discriminate between matches and non-matches, patient and administrative variables tended to discriminate better than clinical variables. 81.4% of NBOCA records were linked to HES using probabilistic linkage, versus 82.8% using deterministic linkage. Most NBOCA records were linked to HES using both methods (8,427/10,566). Probabilistic linkage had over 96% sensitivity and 90% specificity compared to deterministic linkage using patient identifiers.  Patients that linked deterministically, but not probabilistically, were younger and more likely to have emergency admission, but otherwise had similar characteristics. Regression models for mortality and length of hospital stay according to patient and tumour characteristics were not sensitive to the linkage approach.

Probabilistic linkage without personal information can be used as an alternative to deterministic linkage using patient identifiers, or as a method for enhancing deterministic linkage. It allows analysts outside highly secure data environments to undertake linkage while minimising costs and delays, protecting data security, and maintaining linkage quality.

Article Details

How to Cite
Blake, H., Sharples, L., Harron, K., van der Meulen, J. and Walker, K. (2022) “Linkage of national clinical datasets without patient identifiers using probabilistic methods”., International Journal of Population Data Science, 7(3). doi: 10.23889/ijpds.v7i3.2067.

Most read articles by the same author(s)

1 2 3 > >>