Researchers from University College London have developed and demonstrated a new method for creating synthetic identifier datasets to assist the development and evaluation of data linkage methods. They wanted to see if replicating the relationships between personal attributes and identifiers would improve the usefulness of the synthetic data for assessing linkage errors. 

Evaluating linkages between datasets and developing improved linkage algorithms while protecting individuals' privacy can be challenging for researchers. One way to overcome this challenge is to use synthetic personal identifiers, which mimic real identifiers without revealing any personal information. This is not straightforward: patterns of errors and missingness in personal identifiers are known to differ across population groups, which can lead to bias and inequitable linkage outcomes. To help researchers decide how best to minimise biases in their linkage strategies, the team needed to replicate these errors realistically.

The team propose a step-by-step framework to generate synthetic identifiers. In collaboration with a large UK cohort study called the Avon Longitudinal Study of Parents and Children, they created synthetic versions of the identifiers. These synthetic identifiers are described as “high fidelity”, meaning that their relationship with other variables in the data, such as maternal age and ethnicity, is preserved. The researchers introduced different ways of mimicking how mistakes might occur when recording people’s names in data records, such as typos and surname changes after marriage. They then tested how well the synthetic identifier data could assess the quality of linkage (in terms of false matches and missed matches) by comparing it with the original data.
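As a rough illustration of the error-mimicking step, the Python sketch below applies typos and surname changes to a record at rates that depend on a personal attribute. It is not the authors' implementation: the function names, the `group` field, the surname pool and every rate shown are hypothetical and chosen only for demonstration.

```python
import random
import string


def introduce_typo(name, rng):
    """Replace one randomly chosen character with a different lowercase letter."""
    if not name:
        return name
    pos = rng.randrange(len(name))
    # Exclude the current character so the name is guaranteed to change.
    choices = [c for c in string.ascii_lowercase if c != name[pos].lower()]
    return name[:pos] + rng.choice(choices) + name[pos + 1:]


def corrupt_record(record, typo_rates, surname_change_rates, surname_pool, rng):
    """Return a copy of `record` with errors applied at group-specific rates.

    `record["group"]` is a hypothetical attribute (for example a maternal
    age band) used to look up the error probabilities for that record.
    """
    out = dict(record)
    group = record["group"]
    # Surname change, e.g. after marriage: draw a different surname from the pool.
    if rng.random() < surname_change_rates[group]:
        candidates = [s for s in surname_pool if s != out["surname"]]
        out["surname"] = rng.choice(candidates)
    # Typographical error in the forename.
    if rng.random() < typo_rates[group]:
        out["forename"] = introduce_typo(out["forename"], rng)
    return out


# Illustrative values only; none of these come from the study.
rng = random.Random(42)
typo_rates = {"under_25": 0.10, "25_plus": 0.03}
surname_change_rates = {"under_25": 0.20, "25_plus": 0.05}
surname_pool = ["Smith", "Jones", "Taylor", "Brown"]

record = {"forename": "Anna", "surname": "Smith", "group": "under_25"}
print(corrupt_record(record, typo_rates, surname_change_rates, surname_pool, rng))
```

Tying the error probabilities to an attribute is what lets the synthetic errors fall unevenly across groups, rather than being spread uniformly over all records.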

The study found that the synthetic data accurately estimated linkage quality: the rates of missed matches and false matches differed only slightly from those obtained with the original data. Incorporating associations between identifier errors and other variables made linkages based on synthetic identifiers more similar to linkages based on the original identifiers. This research highlights the need to encourage data owners and researchers to work together to improve their ability to evaluate linkage methods and biases.
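To make the two quality measures concrete, the sketch below computes a missed-match rate and a false-match rate from a set of linked record pairs and a set of known true pairs. The definitions used here (missed matches as a share of true pairs, false matches as a share of linked pairs) are common conventions assumed for illustration; the paper may define its measures differently, and the function name and example data are hypothetical.

```python
def linkage_error_rates(linked_pairs, true_pairs):
    """Return (missed_match_rate, false_match_rate) for a linkage result.

    Assumed definitions:
      missed-match rate = true pairs that were not linked / all true pairs
      false-match rate  = linked pairs that are not true  / all linked pairs
    """
    linked, truth = set(linked_pairs), set(true_pairs)
    missed = truth - linked
    false = linked - truth
    missed_rate = len(missed) / len(truth) if truth else 0.0
    false_rate = len(false) / len(linked) if linked else 0.0
    return missed_rate, false_rate


# Toy example: pairs are (id in file A, id in file B).
true_pairs = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
linked_pairs = {(1, "a"), (2, "b"), (3, "x")}

missed_rate, false_rate = linkage_error_rates(linked_pairs, true_pairs)
print(f"missed-match rate: {missed_rate:.2f}, false-match rate: {false_rate:.2f}")
```

Running the same calculation on a linkage of the original identifiers and on a linkage of the synthetic identifiers, then comparing the two pairs of rates, is one simple way to check how closely the synthetic data reproduces linkage quality.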

Joseph Lam, lead author and PhD student at UCL, adds, “There is no easy way to evaluate linkage biases, but we know that marginalised groups are often the ones most likely to be missed or falsely matched. It is both a methodological and ethical imperative to understand how these groups are represented in linkage studies while safeguarding their privacy.”



Joseph Lam, PhD Student/Research Assistant, Great Ormond Street Institute of Child Health, University College London




X: @Jo_Lam_


Lam, J., Boyd, A., Linacre, R., Blackburn, R. and Harron, K. (2024) “Generating synthetic identifiers to support development and evaluation of data linkage methods”, International Journal of Population Data Science, 9(1). doi: 10.23889/ijpds.v9i1.2389.