Evaluating hardening techniques against cryptanalysis attacks on Bloom filter encodings for record linkage
Due to privacy concerns, personal identifiers used for linking data often have to be encoded (masked) before being linked across organisations. Bloom filter (BF) encoding is a popular privacy technique that is now employed in real-world linkage applications. Recent research has, however, shown that BFs are vulnerable to cryptanalysis attacks.
Objectives and Approach
Attacks on BFs either exploit the fact that encoding frequent plain-text values (such as common names) results in correspondingly frequent BFs, or they apply pattern mining to identify co-occurring BF bit positions that correspond to frequent encoded q-grams (sub-strings). In this study we empirically evaluated the privacy of individuals encoded in BFs against two recent cryptanalysis attack methods by Christen et al. (2017/2018). We used two snapshots of the North Carolina Voter Registration database for our evaluation, where pairs of records corresponding to the same voter (with name or address variations) resulted in files with 222,251 BFs and 224,061 plain-text records, respectively.
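The frequency-based idea described above can be illustrated with a minimal Python sketch. It is a simplified rank alignment, not the method of Christen et al.; the function name `frequency_align`, the `top_k` parameter, and the toy bit strings and names are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def frequency_align(encoded_values, plaintext_values, top_k=3):
    """Guess encoding-to-plain-text pairs by aligning frequency ranks.

    Simplified illustration (an assumption, not a published attack):
    the k most frequent encodings are paired, in rank order, with the
    k most frequent plain-text values from a public reference list.
    """
    enc_ranked = Counter(encoded_values).most_common(top_k)
    plain_ranked = Counter(plaintext_values).most_common(top_k)
    return {enc: plain for (enc, _), (plain, _) in zip(enc_ranked, plain_ranked)}

# Toy data (hypothetical): encodings seen by the attacker and a public name list.
encoded = ["1011"] * 5 + ["0110"] * 3 + ["1100"] * 1
plain = ["smith"] * 6 + ["jones"] * 2 + ["taylor"] * 1

guesses = frequency_align(encoded, plain)
```

The alignment only works when some values occur noticeably more often than others; if every encoding is unique, all frequency ranks tie and the pairing carries no information.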
We encoded between two and four of the fields (first name, last name, street, and city) into one BF per record. For combinations of three and four fields, all plain-text values and BFs were unique, challenging any frequency-based attack. To harden the BFs, we applied several suggested methods (balancing, random hashing, XOR, BLIP, and salting).
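As a rough illustration of the encoding step, the following Python sketch hashes the q-grams of several fields into one bit list and then applies XOR folding as one example of hardening. The filter length, number of hash functions, q-gram length, and the choice of SHA-1 and SHA-256 combined by double hashing are assumptions made for this sketch; the study's actual parameters are not stated here.

```python
import hashlib

def qgrams(value, q=2):
    """Return the set of overlapping q-grams of a lower-cased string (no padding)."""
    value = value.lower()
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def bloom_encode(fields, length=64, num_hashes=4, q=2):
    """Encode the q-grams of all given fields into a single Bloom filter.

    Assumed scheme: double hashing, position_i = (h1 + i * h2) mod length.
    """
    bits = [0] * length
    for field in fields:
        for gram in qgrams(field, q):
            h1 = int(hashlib.sha1(gram.encode()).hexdigest(), 16)
            h2 = int(hashlib.sha256(gram.encode()).hexdigest(), 16)
            for i in range(num_hashes):
                bits[(h1 + i * h2) % length] = 1
    return bits

def xor_fold(bits):
    """Harden a filter of even length by XOR-ing its first half with its second half."""
    half = len(bits) // 2
    return [a ^ b for a, b in zip(bits[:half], bits[half:])]

# Hypothetical record with two fields (first name, last name).
bf = bloom_encode(["john", "smith"])
folded = xor_fold(bf)
```

Folding halves the filter length, so each bit in the folded filter no longer maps back to a single original bit position.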
Without any hardening applied, up to 20.7% and 5% of plain-text values were correctly re-identified as 1-to-1 matches by the pattern-mining and frequency-based attack methods, respectively. No more than 5% correct 1-to-1 re-identification matches were achieved with the frequency-based attack on BFs encoding two fields when balancing, random hashing, or XOR folding was applied, while the pattern-mining based attack did not achieve any correct re-identifications for any hardening technique.
Given that BF encoding is now being employed in real-world linkage applications, it is important to study the limits of this privacy technique. Our experimental evaluation shows that although basic BFs without any hardening technique are susceptible to cryptanalysis attacks, some hardening techniques are able to protect BFs against these attacks.
The International Classification of Diseases (ICD) is used globally for coding morbidity statistics. However, its use, as well as the training provided to individuals assigning codes, varies greatly across countries.
Objectives and Approach
The goal is to understand the quality of coder training worldwide. After an in-depth review of the grey and academic literature, an online survey was created to poll the 194 World Health Organization (WHO) member countries. Questions focused on hospital data collection systems and the training provided to coding professionals. The survey was distributed to potential participants who meet the specific criteria, as well as to organizations specialized in the topic, such as WHO-CC (WHO Collaborating Centers) and IFHIMA (International Federation of Health Information Management Associations), to be forwarded to their representatives. Answers will be analyzed using descriptive statistics.
This ongoing project aims to capture responses from as many countries as possible. Thus far, data from 45 respondents in 20 different countries have been collected. Initial results reveal worldwide use of ICD, with variations in the maximum allowable coding fields for diagnoses and interventions. Coding specialists are the main personnel assigning codes, followed by physicians. Although minimum training is not mandatory in all countries (Sweden, Italy, Germany and Thailand), in those where it is, a college/university degree is the most common requirement. Coding certificates most frequently entail passing a certification exam. Continuing education for coders is offered in all countries except one (Nigeria). Once more information is available, countries will be ranked and those showing better performance will be highlighted.
These survey data will establish the current state of ICD use and coding training internationally, which will ultimately be valuable to the WHO for the promotion of ICD and the rollout of ICD-11, for better international comparisons of health data, and for further research on how to improve ICD coding.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.