Using Logistic Regression to Estimate the False Positive Rate in The IDI (SoLinks)

Anna Lin
Soon Song
Nancy Wang


Stats NZ’s Integrated Data Infrastructure (IDI) is a linked longitudinal database combining administrative and survey data. Previously, false positive linkages (FP) in the IDI were assessed by clerical review of a sample of linked records, which was time-consuming and prone to inconsistency.

Objectives and Approach
A modelled approach, ‘SoLinks’, has been developed to automate FP estimation for the IDI. It uses a logistic regression model to calculate the probability that a given link is a true match. The model is based on the agreement types defined for four key linking variables: first name, last name, sex, and date of birth. Certain link types that we believe to be high-quality true matches are exempt from the model and are assigned pre-defined probabilities instead. The model parameters were estimated from training data built from the outcomes of the clerical review process over several years.
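A minimal sketch of this kind of model is given below. It is not the SoLinks implementation: the training rows are invented, each linking variable is reduced to a binary agree/disagree indicator (the real agreement types may be finer), and the names `TRAINING`, `fit`, `predict`, `link_probability` and the value of `EXEMPT_PROBABILITY` are all hypothetical. The fit uses plain gradient ascent with a small L2 penalty, chosen only to keep the example dependency-free.

```python
import math

# Hypothetical clerical-review outcomes. Each row is
# (first_name_agrees, last_name_agrees, sex_agrees, dob_agrees, is_true_match).
# The values are made up for illustration only.
TRAINING = [
    (1, 1, 1, 1, 1),
    (1, 1, 1, 1, 1),
    (1, 1, 1, 0, 1),
    (1, 0, 1, 1, 1),
    (0, 1, 1, 1, 1),
    (1, 1, 0, 1, 1),
    (1, 0, 1, 0, 0),
    (0, 1, 1, 0, 0),
    (0, 0, 1, 1, 0),
    (0, 0, 1, 0, 0),
    (1, 1, 1, 0, 0),
    (0, 1, 0, 1, 0),
]

# Assumed pre-defined probability for exempted link types (hypothetical value).
EXEMPT_PROBABILITY = 0.99


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Probability of a true match: sigmoid(b0 + sum(b_k * x_k))."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def fit(rows, steps=5000, rate=0.1, l2=0.01):
    """Batch gradient ascent on the L2-penalised mean log-likelihood."""
    weights = [0.0] * 5  # intercept plus one coefficient per linking variable
    for _ in range(steps):
        grad = [0.0] * 5
        for *features, label in rows:
            error = label - predict(weights, features)
            grad[0] += error
            for k, x in enumerate(features, start=1):
                grad[k] += error * x
        weights = [w + rate * (g / len(rows) - l2 * w)
                   for w, g in zip(weights, grad)]
    return weights


def link_probability(weights, features, exempt=False):
    """Exempted link types bypass the model and get the pre-defined value."""
    return EXEMPT_PROBABILITY if exempt else predict(weights, features)


weights = fit(TRAINING)
p_full = link_probability(weights, (1, 1, 1, 1))   # all four variables agree
p_weak = link_probability(weights, (0, 0, 1, 0))   # only sex agrees
p_exempt = link_probability(weights, (0, 0, 1, 0), exempt=True)
```

In this toy data a link agreeing on all four variables receives a higher modelled probability than one agreeing on sex alone, while an exempted link receives the fixed value regardless of its agreement pattern.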

We have compared the FP rates estimated through clerical review with those estimated by the SoLinks model. Some SoLinks estimates fall outside the 95% confidence intervals of the clerically reviewed ones. This may be because the pre-defined probabilities assigned to the exempted link types are too high.
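The comparison described above can be sketched as a check of whether a modelled FP rate lies inside the 95% confidence interval around a clerical-review estimate. The abstract does not say which interval was used; the Wilson score interval below is an assumption, and the function names and example figures are hypothetical.

```python
import math


def wilson_interval(false_positives, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = false_positives / sample_size
    denom = 1 + z ** 2 / sample_size
    centre = (p + z ** 2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2)
    ) / denom
    return centre - half, centre + half


def within_interval(model_rate, false_positives, sample_size):
    """True if the modelled FP rate falls inside the clerical-review interval."""
    low, high = wilson_interval(false_positives, sample_size)
    return low <= model_rate <= high


# Hypothetical review: 12 false positives found in a sample of 1000 links.
agrees = within_interval(0.012, 12, 1000)
disagrees = within_interval(0.005, 12, 1000)
```

A modelled rate that is pulled down by overly optimistic pre-defined probabilities, as in the second call, would fall below the lower bound of the interval.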

The automation of FP checking has saved analyst time and resources. The modelled FP estimates have been more stable over time than those from the previous clerical reviews. Because the model estimates the probability of a true match at the individual link level, we may provide this probability to researchers so that they can calculate linkage quality indicators for their research populations.
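One way a researcher might turn link-level probabilities into a population-level indicator is to average the complement of each probability, since the expected number of false positives among a set of links is the sum of (1 - p) over those links. This is an illustrative assumption, not a description of how Stats NZ computes its indicators; `estimated_fp_rate` and the probabilities shown are hypothetical.

```python
def estimated_fp_rate(match_probabilities):
    """Expected share of false positive links in a research population.

    Each input is the modelled probability that a link is a true match,
    so (1 - p) is the probability that the link is a false positive.
    """
    if not match_probabilities:
        raise ValueError("at least one link probability is required")
    expected_false_positives = sum(1.0 - p for p in match_probabilities)
    return expected_false_positives / len(match_probabilities)


# Hypothetical link probabilities for a small research population.
population = [0.99, 0.95, 0.90, 0.80]
rate = estimated_fp_rate(population)
```

Restricting the list to the links in a particular sub-population gives an indicator specific to that group.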

How to Cite
Lin, A., Song, S. and Wang, N. (2020) “Using Logistic Regression to Estimate the False Positive Rate in The IDI (SoLinks)”, International Journal of Population Data Science, 5(5). doi: 10.23889/ijpds.v5i5.1484.