Creating A Validation Dataset to Test Record Linkage Algorithms in A Province-Wide South African Health Information Exchange

Themba Mutemaringa
Alexa Heekes
Mariette Smith
Nicki Tiffin
Andrew Boulle


Increasing use of digital medical records creates disparate data resources for the same health care client population, and harnessing the benefits of real-time health data requires effective data linkage. A South African Health Information Exchange (HIE) collates and links routine health data from multiple sources, running daily updates through an automated extract, transform, load (ETL) process. Many existing deterministic and probabilistic algorithms link person-level data using demographic identifiers, and these can be combined in an optimised methodological pipeline. The performance of such pipelines must be validated against known matched pairs. The HIE uses current algorithms for record linkage, but methods that rely on similar spelling, name frequency and phonetic matching have been optimised for non-African names and are less effective for African names.
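Validation against known matched pairs can be illustrated with a minimal sketch. The function name `evaluate_linkage` and the record identifiers are hypothetical and are not taken from the HIE. The sketch assumes that the output of a pipeline can be expressed as a collection of record-identifier pairs, and it reports precision, recall and F1 against the known pairs.

```python
def evaluate_linkage(predicted_pairs, true_pairs):
    """Score predicted record links against known matched pairs."""
    # Treat each pair as unordered, so (a, b) and (b, a) are the same link.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}

    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented identifiers, for illustration only.
predicted = [("A1", "B1"), ("B2", "A2"), ("A3", "B9")]
known = [("A1", "B1"), ("A2", "B2"), ("A4", "B4")]
scores = evaluate_linkage(predicted, known)
```

Here two of the three predicted links are correct and two of the three known pairs are recovered, so precision and recall are both two thirds.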

We assessed common problems arising in the linkage process in the HIE, using this information to compile a curated representative African validation database for optimising existing and new linkage pipelines.

Using current linkage algorithms, we identified the proportion of duplicates over the last five years, which declined from 25% in 2015 and stabilised at 10% by 2019. Common causes of duplicates across the whole database include mismatches in first name (37%), surname (17%), date of birth (13%), sex (8%) and South African Identification Number (0.2%). Complications from newborn naming and records of twins affect >8% of all records, and temporary health identifiers assigned at birth, during emergency response, and during poor connectivity of facilities to the provincial patient master index affect 2% of records.
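A tally of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the method used in the HIE: the field names, the normalisation (trimming whitespace and ignoring case) and the toy records are all invented for illustration. Each duplicate pair is compared field by field, and a pair can contribute to more than one field.

```python
from collections import Counter

# Hypothetical field names; the real HIE schema is not given in the source.
FIELDS = ("first_name", "surname", "date_of_birth", "sex", "national_id")


def mismatched_fields(rec_a, rec_b, fields=FIELDS):
    """List the fields on which two records of the same person disagree."""
    differing = []
    for field in fields:
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value is not counted as a mismatch here
        if str(a).strip().casefold() != str(b).strip().casefold():
            differing.append(field)
    return differing


def mismatch_rates(pairs, fields=FIELDS):
    """Proportion of duplicate pairs that disagree on each field."""
    counts = Counter()
    for rec_a, rec_b in pairs:
        counts.update(mismatched_fields(rec_a, rec_b, fields))
    total = len(pairs)
    return {f: (counts[f] / total if total else 0.0) for f in fields}


# Invented toy records, for illustration only.
duplicate_pairs = [
    ({"first_name": "Nomsa", "surname": "Dlamini", "date_of_birth": "1990-01-05", "sex": "F"},
     {"first_name": "Nomsa ", "surname": "Dhlamini", "date_of_birth": "1990-01-05", "sex": "F"}),
    ({"first_name": "Thabo", "surname": "Nkosi", "date_of_birth": "1985-03-12", "sex": "M",
      "national_id": "0000000000000"},
     {"first_name": "Tabo", "surname": "Nkosi", "date_of_birth": "1985-12-03", "sex": "M"}),
]
rates = mismatch_rates(duplicate_pairs)
```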

Based on these data, we have constructed a South African-specific, representative validation dataset containing linkage pairs that represent: placeholder phrases for newborns prior to naming (e.g. “baby of”); language variations; twins; character insertions, substitutions and omissions in names with similar spellings; frequencies of names in the general population; and similar-sounding names.
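The character-level variations listed above can be illustrated with a short sketch. This is not the procedure used to build the validation dataset, which was curated from observed linkage problems. The function name `single_edit_variants`, the lowercase ASCII alphabet and the example name are assumptions for illustration. The function enumerates every string that is one omission, substitution or insertion away from a given name.

```python
import string


def single_edit_variants(name, alphabet=string.ascii_lowercase):
    """Return all strings one character edit away from `name`, by edit type."""
    name = name.lower()
    positions = range(len(name))

    # Drop one character at each position.
    omissions = {name[:i] + name[i + 1:] for i in positions}

    # Replace one character with a different letter.
    substitutions = {
        name[:i] + c + name[i + 1:]
        for i in positions
        for c in alphabet
        if c != name[i]
    }

    # Insert one letter at each possible position, including both ends.
    insertions = {
        name[:i] + c + name[i:]
        for i in range(len(name) + 1)
        for c in alphabet
    }

    return {
        "omission": omissions,
        "substitution": substitutions,
        "insertion": insertions,
    }


# Illustrative name only.
variants = single_edit_variants("Sipho")
```

Pairing a name with one of its variants gives a known matched pair against which a spelling-similarity method can be tested. A fuller generator would also need to cover the non-character cases listed above, such as placeholder phrases and twins.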

How to Cite
Mutemaringa, T., Heekes, A., Smith, M., Tiffin, N. and Boulle, A. (2020) “Creating A Validation Dataset to Test Record Linkage Algorithms in A Province-Wide South African Health Information Exchange”, International Journal of Population Data Science, 5(5). doi: 10.23889/ijpds.v5i5.1625.
