Reusable, set-based selection algorithm for matched control groups
IJPDS (2017) Issue 1, Vol 1:373, Proceedings of the IPDLN Conference (August 2016)


Daniel Thayer
John W Gregory
Liv Kosnes
Damon Berridge
Martin L Heaven
David V Ford
Keith Lloyd
Ann John
Published online: Apr 19, 2017


The wealth of data available in linked administrative datasets offers great potential for research, but researchers face methodological and computational challenges in data preparation, due to the size and complexity of the data. The creation of matched control groups in the Secure Anonymised Information Linkage (SAIL) Databank illustrates this point: SAIL contains multiple health datasets describing millions of individuals in Wales. The volume of data creates the potential for more precise matching, but only if an appropriate algorithm can be applied. We aimed to create such an algorithm for reuse by many research projects.

We developed set-based code in SQL that efficiently selects matches from millions of potential combinations in a relational database environment. It is parameterized to allow different matching criteria to be employed as needed, including follow-up time around an index event. A combinatorial optimisation problem occurs when a potential control could match more than one subject, which we solved by ranking potential match pairs first by subject with the fewest potential matches, then by closeness of the match.
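The ranking rule described above can be illustrated with a minimal sketch. This is not the authors' SQL implementation: it is a procedural Python approximation of the ordering idea only, and the function name `greedy_match`, the tuple layout `(subject, control, distance)`, and the parameter `controls_per_subject` are all hypothetical names introduced for illustration.

```python
def greedy_match(pairs, controls_per_subject):
    """Assign controls to subjects without reusing any control.

    pairs: iterable of (subject_id, control_id, distance) tuples, one per
           potential match; a smaller distance means a closer match.
    """
    pairs = list(pairs)

    # Count how many potential controls each subject has.
    n_candidates = {}
    for subject, _control, _distance in pairs:
        n_candidates[subject] = n_candidates.get(subject, 0) + 1

    # Rank pairs: subjects with the fewest potential matches first,
    # then by closeness; the ids only break ties deterministically.
    ranked = sorted(pairs, key=lambda p: (n_candidates[p[0]], p[2], p[0], p[1]))

    used_controls = set()
    matches = {}
    for subject, control, _distance in ranked:
        if control in used_controls:
            continue  # this control is already assigned to another subject
        if len(matches.get(subject, [])) >= controls_per_subject:
            continue  # this subject already has its full set of controls
        matches.setdefault(subject, []).append(control)
        used_controls.add(control)
    return matches


# Illustrative data: control c1 could match either subject, but s2 has no
# other option, so s2 is served first and s1 falls back to c2.
example_pairs = [("s1", "c1", 0), ("s1", "c2", 1), ("s2", "c1", 0)]
result = greedy_match(example_pairs, controls_per_subject=1)
```

Ranking by closeness alone would give c1 to s1 and leave s2 unmatched; putting the most constrained subject first avoids that in this small case. A greedy ordering of this kind is a heuristic and is not guaranteed to find the best possible assignment in general.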

One example of the algorithm’s use was the Suicide Information Database Cymru, an electronic case-control study on suicide in Wales between 2003 and 2011. Subjects who had a cause of death recorded as self-harm were each matched to twenty controls who were alive at the subject’s date of death and had the same gender and a similar week of birth. The rate of matching success was greater than 99.9%, with all subjects but one matching the full twenty controls. More than 99.99% of the matched controls had a week of birth identical to that of the subject. The second example was a matched cohort study looking at hospital admissions and type 1 diabetes, using the Brecon register of childhood diabetes in Wales, with matching based on week of birth within two weeks, gender, county of residence, deprivation quintile, and residence in Wales at time of diagnosis. This study had a matching rate of 98.9%; 97.5% of subjects matched to five controls, and 69.8% of matches had the same week of birth.

This algorithm provides good matching performance while executing efficiently and scalably on large datasets. Its implementation as reusable code will facilitate more efficient, high-quality research in SAIL. Instead of spending many hours developing a custom solution, analysts can execute parameterized code in a few minutes. We hope it will also prove useful beyond SAIL.


Our first objective is to investigate how between-subject differences in the likelihood of record linkage consent and in record linkability determine the composition of linkable business datasets, and hence the risk of bias in estimates derived from them. The utility of datasets linking information from multiple sources is compromised by such non-linkage biases, but both components of the linkage process have rarely been considered together. Our second objective is to introduce methods for evaluating non-linkage bias risks in datasets. Such evaluations can inform the choice of linkage method and the assessment of the validity of linked dataset findings. Previous work, often lacking information on non-linked subjects for covariates from outside the sample dataset, tends to use overall linkage rates as quality measures; in the related area of survey non-response, however, correlations between analogous response rates and non-response biases are weak.


We use the UK 2010 Small Business Survey (SBS) dataset. If a survey subject consents to record linkage, an attempt is made to append their Inter-Departmental Business Register (IDBR) identifier (if one exists), enabling linkage to other surveys and data sources. Given this, we evaluate bias risks arising from variation in subject linkage consent and identifier appendability, as well as their product, overall linkability, using representativeness indicators developed to evaluate survey non-response bias risks. These measure risks in terms of sample-subset similarity (representativeness) given an attribute covariate set obtained from the sample dataset, based on variation in subject inclusion propensities estimated by logistic regression, and are decomposable to assess correlates of inclusion propensity variation. Specifically, we use the CV (the standard deviation of inclusion propensities divided by their mean), computed given nine attribute covariates describing business demography and perceived performance.
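As a minimal sketch of the CV defined above, the following computes the standard deviation of a set of inclusion propensities divided by their mean. The propensities here are made-up illustrative values rather than SBS estimates, the function name `propensity_cv` is hypothetical, and the use of the population standard deviation with no survey weighting is an assumption, since the text does not specify either.

```python
from statistics import mean, pstdev


def propensity_cv(propensities):
    """Coefficient of variation of inclusion propensities.

    Assumes unweighted propensities and the population standard deviation;
    both choices are illustrative assumptions.
    """
    return pstdev(propensities) / mean(propensities)


# Identical propensities mean the linkable subset mirrors the sample on the
# chosen covariates, so the CV is zero.
cv_uniform = propensity_cv([0.5, 0.5, 0.5])

# More variation in propensities gives a larger CV, indicating a less
# representative subset and a higher risk of non-linkage bias.
cv_varied = propensity_cv([0.2, 0.4, 0.6, 0.8])
```

In practice the propensities would come from a fitted model such as the logistic regression mentioned above; they are passed in directly here to keep the sketch self-contained.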


We give full details in our presentation. Briefly, overall CVs suggest the linkable dataset exhibits substantial non-representativeness and non-linkage bias risk. Decompositions suggest that the main impacts on the linkable dataset are under-representation of very small businesses (those with low turnovers, few employees and/or unincorporated status), which are both less likely to consent and less likely to have an identifier appended, and under-representation of businesses unable or refusing to respond to survey items, which are less likely to consent.


Our analyses provide evidence of non-linkage bias risks in linked SBS datasets caused by under-representation of several sample subgroups. Each is explicable given known IDBR under-coverage or knowledge of business response processes. We also conclude that representativeness indicators are an easily applied method by which such risks can be evaluated.
