National mortality registers are essential for medical research, and most nations therefore operate one. Germany has no such register because of its administrative structure and data protection legislation. We demonstrate that a national mortality register is technically feasible under these constraints by using privacy-preserving record linkage (PPRL).
Objectives and Approach
Obtaining legal permission to operate a national mortality register for research will be easier if the linkage can be done with PPRL, that is, without revealing personal identifiers. To estimate the precision and recall of different encodings, we used two settings: (1) matching a local mortality registry (n = 14,003) with mortality data from a university hospital (n = 2,466); (2) matching 1 million simulated records from a national database of names with a corrupted subset. The second setting corresponds to matching all deceased persons with the deceased persons in the largest federal state (n = 205,000).
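As a minimal sketch of how precision and recall can be computed for a linkage result, assuming the true match status of every pair is known (as it is for simulated data): the function name `precision_recall` and the toy pairs are illustrative and not taken from the study.

```python
def precision_recall(declared_links, true_matches):
    """Return (precision, recall) for a set of declared links.

    precision = correct links / all declared links
    recall    = correct links / all true matches
    """
    declared_links, true_matches = set(declared_links), set(true_matches)
    correct = len(declared_links & true_matches)
    precision = correct / len(declared_links) if declared_links else 0.0
    recall = correct / len(true_matches) if true_matches else 0.0
    return precision, recall


# Hypothetical toy data: pairs of (register id, hospital id).
true_matches = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
declared_links = {(1, "a"), (2, "b"), (3, "x")}

precision, recall = precision_recall(declared_links, true_matches)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Two of the three declared links are correct, and two of the four true matches are found, so the two measures differ even on this small example.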
Linkage results for clear-text identifiers show very high recall and precision. Bloom filter-based encodings yield comparable results: neither precision nor recall declines by more than 2%. Phonetic codes yield high precision but low recall. Some variants of Bloom filter-based encodings yield better results than probabilistic linkage on clear-text identifiers. This is mainly due to a rarely mentioned detail: using different passwords for different identifiers in the same Bloom filter. Implementation details of Bloom filters are therefore more important than commonly thought. Overall, we recommend salted Bloom filter-based methods with different passwords for different identifiers to increase security and to prevent all known attacks on identifier encodings.
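The idea of one Bloom filter per record with a different password per identifier can be sketched as follows. This is an illustration under stated assumptions, not the encoding used in the study: the keyed hash (HMAC-SHA256), the bigram padding, the filter length of 1,000 bits, the 10 hash functions per bigram, and all names and keys are hypothetical choices. Record-specific salting is not shown.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a padded, lower-cased string into its set of 2-grams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode_record(fields, keys, size=1000, num_hashes=10):
    """Encode all identifiers of one record into a single Bloom filter.

    The filter is represented as the set of bit positions that are 1.
    Each identifier is hashed with its own secret key, so the same
    bigram in two different identifiers sets different bits.
    """
    bits = set()
    for name, value in fields.items():
        for gram in bigrams(value):
            for i in range(num_hashes):
                message = f"{i}:{gram}".encode("utf-8")
                digest = hmac.new(keys[name], message, hashlib.sha256).digest()
                bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of set bits."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Hypothetical keys ("passwords"), one per identifier.
keys = {"first": b"secret-for-first-name", "last": b"secret-for-last-name"}

enc_a = encode_record({"first": "Katharina", "last": "Meier"}, keys)
enc_b = encode_record({"first": "Katarina", "last": "Meier"}, keys)
enc_c = encode_record({"first": "Wolfgang", "last": "Schulze"}, keys)

print(round(dice(enc_a, enc_b), 2), round(dice(enc_a, enc_c), 2))
```

In this sketch a spelling variant of the same name keeps most bits in common, while an unrelated record shares only bits that collide by chance, which is what allows error-tolerant matching without exchanging clear-text identifiers.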
Although most PPRL techniques would yield acceptable results in the given setting of a national register, salted Bloom filter encodings are more secure against attacks while still showing high precision and recall. We therefore consider a national mortality register that uses only encrypted identifiers of deceased persons to be feasible.
Linking medical records across multiple healthcare settings is critical to delivering high-quality medical care and conducting valid research. Linkage is very challenging in countries without national identifiers, and remains an issue elsewhere because of imperfect recording and transfer of identifiers into databases. Researchers need to understand linkage software behavior.
Objectives and Approach
Our objective is to compare four record linkage software packages and various algorithms using real data: independent inpatient (n = 69,523) and outpatient (n = 176,154) datasets from a major medical system without a universal identifier. We conducted 30 trials, varying the software package (LinkPlus, LinXmart/CUPLE [Curtin University], Merge ToolBox and the R RecordLinkage package), the algorithm (deterministic or probabilistic; exact or inexact string matching) and the variables used for matching (either first name, last name and gender, or these three plus full date of birth), with year of birth as the blocking variable. We evaluated performance using the weights assigned to each of the 132 million record pairs.
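The role of the blocking variable and of exact comparison can be sketched in a few lines. The records, field names and helper functions below are hypothetical and do not reproduce any of the four packages; the sketch only shows that pairs are formed within the same year of birth and that each pair is then summarized by an agreement pattern.

```python
from collections import defaultdict
from itertools import product

FIELDS = ("first", "last", "gender")


def block_by(records, key):
    """Group records by the value of the blocking variable."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def candidate_pairs(left, right, key="yob"):
    """Yield only those pairs that share the blocking value."""
    left_blocks, right_blocks = block_by(left, key), block_by(right, key)
    for value in left_blocks.keys() & right_blocks.keys():
        yield from product(left_blocks[value], right_blocks[value])


def agreement_pattern(a, b, fields=FIELDS):
    """Exact comparison: 1 if a field agrees, 0 otherwise."""
    return tuple(int(a[f] == b[f]) for f in fields)


# Hypothetical toy records standing in for the two datasets.
inpatient = [
    {"id": "I1", "first": "anna", "last": "schmidt", "gender": "f", "yob": 1950},
    {"id": "I2", "first": "peter", "last": "maier", "gender": "m", "yob": 1950},
    {"id": "I3", "first": "lena", "last": "koch", "gender": "f", "yob": 1972},
]
outpatient = [
    {"id": "O1", "first": "anna", "last": "schmidt", "gender": "f", "yob": 1950},
    {"id": "O2", "first": "peter", "last": "meier", "gender": "m", "yob": 1950},
    {"id": "O3", "first": "lena", "last": "koch", "gender": "f", "yob": 1980},
]

pairs = list(candidate_pairs(inpatient, outpatient))
patterns = {(a["id"], b["id"]): agreement_pattern(a, b) for a, b in pairs}
print(len(pairs), patterns[("I2", "O2")])
```

Blocking reduces the nine possible pairs to four here. It also means that records whose year of birth was recorded differently, such as I3 and O3, are never compared.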
Despite substantial similarity, the packages and algorithms did not behave identically. The number of distinct weights assigned to the compared pairs ranged over trials from 4 to 7,925,493, leading to different decisions when matches are declared as those pairs whose weights exceed a threshold. In all trials with exact string matching and three matching variables, 30,805 pairs received the maximum weight; with four matching variables, 30,536 pairs did. However, software and algorithms varied in assigning the 2nd, 3rd, and 4th rank weights. For example, some algorithms assigned a higher weight to pairs matching on first name and gender but not on last name, and others to pairs matching on first name and last name but not on gender. The ordering of the weights also reflected differences in the treatment of missing values.
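One way such differences in rank order can arise is sketched below with Fellegi-Sunter style weights, in which each field contributes log2(m/u) on agreement and log2((1-m)/(1-u)) on disagreement. The m- and u-probabilities in both settings are invented for illustration and are not estimates from the study or defaults of any package.

```python
from itertools import product
from math import log2

FIELDS = ("first", "last", "gender")


def pattern_weight(pattern, m, u):
    """Sum of per-field agreement or disagreement weights."""
    total = 0.0
    for field, agrees in zip(FIELDS, pattern):
        if agrees:
            total += log2(m[field] / u[field])
        else:
            total += log2((1 - m[field]) / (1 - u[field]))
    return total


def ranked_patterns(m, u):
    """All agreement patterns, ordered from highest to lowest weight."""
    patterns = product((1, 0), repeat=len(FIELDS))
    return sorted(patterns, key=lambda p: pattern_weight(p, m, u), reverse=True)


# Hypothetical setting A: last name is recorded reliably and is discriminating.
m_a = {"first": 0.90, "last": 0.95, "gender": 0.99}
u_a = {"first": 0.02, "last": 0.01, "gender": 0.50}

# Hypothetical setting B: last names often differ between true matches,
# while gender is almost never wrong.
m_b = {"first": 0.90, "last": 0.70, "gender": 0.999}
u_b = {"first": 0.02, "last": 0.05, "gender": 0.50}

rank_a = ranked_patterns(m_a, u_a)
rank_b = ranked_patterns(m_b, u_b)
print(rank_a[:3])
print(rank_b[:3])
```

Both settings put full agreement first, but they disagree on whether (first, last) or (first, gender) agreement comes next. With three exactly compared variables there are at most 2**3 = 8 distinct weights; inexact string comparison produces many more.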
Unlike previous linkage work, we have analyzed weights assigned to all possible record pairs in each trial, which allows us to describe exactly what match decisions are possible. Because weights are not comparable across trials, focusing on ranks is crucial. Software documentation does not yield ready insight into differences.
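Comparing trials by rank rather than by raw weight can be sketched as follows. The dense-ranking helper, the pair labels and the weights are hypothetical; the two trials deliberately use different weight scales to show why raw values cannot be compared directly.

```python
def dense_ranks(weights):
    """Map each pair to the rank of its weight (1 = highest weight).

    Pairs with equal weights share a rank, and ranks have no gaps.
    """
    ordered = sorted(set(weights.values()), reverse=True)
    rank_of = {weight: i + 1 for i, weight in enumerate(ordered)}
    return {pair: rank_of[weight] for pair, weight in weights.items()}


# Hypothetical weights for the same four record pairs in two trials.
trial_x = {"p1": 21.4, "p2": 13.0, "p3": 13.0, "p4": -4.2}  # log-scale weights
trial_y = {"p1": 0.99, "p2": 0.40, "p3": 0.75, "p4": 0.10}  # scores in [0, 1]

ranks_x = dense_ranks(trial_x)
ranks_y = dense_ranks(trial_y)
differing = sorted(p for p in ranks_x if ranks_x[p] != ranks_y[p])
print(ranks_x, ranks_y, differing)
```

In this toy case trial X cannot separate p2 from p3 with any threshold, whereas trial Y can, which is the kind of difference that only becomes visible once all pairs are ranked.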
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.