Modernization of Record Linkage at ICES
Abstract
Introduction
Probabilistic record linkage of large databases requires substantial time and resources, which makes it costly. The process is also subject to error, particularly during manual grey area resolution of uncertain matched pairs.
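As a rough illustration of where the grey area comes from, the sketch below scores a candidate pair with Fellegi-Sunter style agreement weights and classifies it against two thresholds. The m/u probabilities, the threshold values and the function names are assumptions chosen for the example, not values or code used at ICES or by any of the packages evaluated.

```python
import math

# Hypothetical (m, u) probabilities per linkage variable (illustrative only):
# m = P(field agrees | true match), u = P(field agrees | non-match).
M_U = {
    "first_name": (0.95, 0.01),
    "last_name": (0.96, 0.005),
    "dob": (0.98, 0.001),
    "gender": (0.99, 0.5),
}


def match_weight(agreements):
    """Sum log2 agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=15.0):
    """Assumed thresholds: pairs scoring between them need manual review."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "grey area"


all_agree = {"first_name": True, "last_name": True, "dob": True, "gender": True}
dob_differs = dict(all_agree, dob=False)
print(classify(match_weight(all_agree)))
print(classify(match_weight(dob_differs)))
```

Pairs that fall between the two thresholds are the ones an analyst has to review by hand, which is where both the time cost and the risk of error accumulate.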
Objectives and Approach
The objective of this study, which used a semi-experimental design, was to compare the accuracy and efficiency of different record linkage approaches. Four different record linkage software packages were selected: AutoMatch, G-Link, SAS Data Quality (DataFlux) and LinxMart. A large data set with all required linkage variables (e.g., first and last name, date of birth and gender) and a common unique identifier with the ICES linkage spine (registry) was chosen to represent our ground truth. Four non-overlapping cohorts were randomly selected from this data source, representing small (n=10,000), medium (n=250,000) and large (n=5,000,000) data sets. Simulated errors were inserted into each cohort to represent a real linkage scenario.
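The abstract does not describe how the simulated errors were generated. The following is a minimal sketch of one possible approach, assuming records are dictionaries with hypothetical field names and that a single error type (transposing two adjacent characters) is applied at a chosen rate. The unique identifier is left untouched so that it can still serve as ground truth.

```python
import random


def transpose_adjacent(value, rng):
    """Swap two neighbouring characters, a common keying error."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def inject_errors(records, error_rate, seed=0,
                  fields=("first_name", "last_name", "dob")):
    """Return copies of the records, corrupting one field in a share of them.

    The field names and the single error type are assumptions for this sketch.
    """
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    corrupted = []
    for record in records:
        copy = dict(record)  # never modify the source record
        if rng.random() < error_rate:
            field = rng.choice(fields)
            copy[field] = transpose_adjacent(copy[field], rng)
        corrupted.append(copy)
    return corrupted


cohort = [
    {"id": 1, "first_name": "Maria", "last_name": "Chen", "dob": "1980-03-12"},
    {"id": 2, "first_name": "David", "last_name": "Okafor", "dob": "1975-11-02"},
]
noisy = inject_errors(cohort, error_rate=1.0, seed=1)
```

A realistic simulation would likely mix several error types (missing values, nicknames, date component swaps) at rates informed by the errors observed in real source data.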
The smallest cohort was used to run a complete record linkage for each software package. Where the software allowed for manual grey area resolution, linkage was replicated by two different linkage analysts who were blinded to the simulated errors included in the data set. The time spent by each analyst on processing, programming and manual grey area resolution was recorded. The larger cohorts were used to measure the accuracy and processing time of each software package. To analyse possible errors, detailed output from each software package was generated to compare accepted and rejected pairs with our ground truth.
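The comparison of accepted pairs with the ground truth can be sketched as follows. The abstract does not state which accuracy measures were used, so precision, recall and F1 are assumptions here, as are the function name and the representation of a pair as a (cohort record ID, spine ID) tuple.

```python
def linkage_accuracy(accepted_pairs, true_pairs):
    """Compare the pairs a package accepted with the known true pairs."""
    accepted = set(accepted_pairs)
    truth = set(true_pairs)
    true_pos = len(accepted & truth)   # correctly accepted links
    false_pos = len(accepted - truth)  # accepted but not true matches
    false_neg = len(truth - accepted)  # true matches that were rejected or missed
    precision = true_pos / len(accepted) if accepted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


accepted = [(1, "A"), (2, "B"), (3, "X")]
truth = [(1, "A"), (2, "B"), (4, "D")]
result = linkage_accuracy(accepted, truth)
```

Listing the false positive and false negative pairs themselves, rather than only counting them, would support the kind of detailed error analysis described above.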
Results
This project is still ongoing. Evaluation of AutoMatch, G-Link and SAS Data Quality has largely been completed. The remaining analyses will be completed by August 2020.
Conclusion / Implications
The outcome of this project can inform the record linkage strategy at organizations and data centres such as ICES and help identify more efficient methods that preserve an acceptable level of accuracy for their needs.