Breaking the problem into pieces: pre-clustering on-the-fly with the NGLMS IJPDS (2017) Issue 1, Vol 1:251 Proceedings of the IPDLN Conference (August 2016)


James Farrow
Published online: Apr 18, 2017


The SA.NT DataLink Next Generation Linkage Management System (NGLMS) stores linked data in the form of a graph (in the computer science sense) comprising nodes (records) and edges (record relationships or similarities). This permits efficient pre-clustering techniques based on transitive closure to form groups of records which relate to the same individual (or other selection criteria).

Only information known (or at least highly likely) to be relevant is extracted from the graph as superclusters. This operation is computationally inexpensive when the underlying information is stored as a graph and can typically be done on-the-fly. More computationally intensive analysis and/or further clustering may then be performed on this smaller subgraph. Canopy clustering and blocking, both used to reduce pairwise comparisons, are expressions of the same type of approach.
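The transitive-closure step above can be sketched as finding connected components over record nodes and similarity edges. This is a minimal illustration using a union-find structure, not the NGLMS implementation; the record IDs and edge list are invented for the example.

```python
# Sketch: pre-clustering linked records by transitive closure, assuming the
# graph is available as node IDs plus a list of similarity edges.

def superclusters(nodes, edges):
    """Group records into superclusters (connected components) via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Path compression keeps repeated lookups near O(1) amortised.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in edges:
        union(a, b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())

# Records 1-2-3 are linked transitively through pairwise edges; 4-5 form
# a second group even though 1 and 3 were never directly compared.
groups = superclusters([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Because each union/find operation is near constant time, the whole pass is effectively linear in the number of edges, which is what makes on-the-fly extraction feasible.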

Subclusters based on transitive closure are typically inexpensive enough to compute that they are extracted from the NGLMS on demand during manual clerical review; there is no need to pre-calculate them. Once extracted, further analysis is undertaken on these smaller data groupings for visualisation and presentation for review and quality analysis. More computationally expensive techniques can be used at this point to prepare data for visualisation or to provide hints to manual reviewers.

Extracting high-recall groups of data records for review, but providing them to reviewers further grouped into high-precision groups as the result of a second pass, has reduced the time taken for clerical reviewers at SA.NT DataLink to manually review a group by 30–40%. The reviewers are able to manipulate whole groups of related records at once rather than individual records.

Pre-clustering reduces the computational cost associated with higher-order clustering and analysis algorithms. Algorithms which scale as n² (or worse) are typical in comparison scenarios. By breaking the problem into pieces the computational cost can be reduced, typically in proportion to the number of pieces the problem can be broken into. This cost reduction can make techniques possible which would otherwise be computationally prohibitive.
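The proportional saving claimed above follows from simple arithmetic on pairwise comparisons: n records require n(n−1)/2 comparisons, while k equal blocks of n/k records require only k·(n/k)(n/k−1)/2, roughly a k-fold reduction. The figures below are illustrative, not from the paper.

```python
# Cost of all-pairs comparison with and without pre-clustering into k
# equal blocks (assuming records split evenly and no cross-block pairs).

def pairwise(n):
    """Number of pairwise comparisons among n records."""
    return n * (n - 1) // 2

n, k = 100_000, 100
whole = pairwise(n)               # comparisons with no pre-clustering
blocked = k * pairwise(n // k)    # comparisons within k equal blocks

print(whole // blocked)  # prints 100, i.e. a k-fold reduction
```

In practice blocks are uneven and a few large blocks dominate the cost, so the realised saving is usually somewhat less than k.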


There is increased interest in identifying strategies that reduce health inequities. With this focus, population health scientists have applied equity measures first developed in other disciplines to health equity research. The objective of this study is to illustrate the application of these measures in research using linkable administrative databases. This presentation will provide a brief description of some commonly used equity measures and issues investigators face when applying them in their own health equity research.


Analyses focused on children born in Manitoba, 1984 to 2014. We used linkable administrative data from health, social services, and education to develop indicators of health and the social determinants of health. Income data from the Canadian Census were used to stratify children by socioeconomic status. Our study considered the distribution of several child outcomes: breastfeeding initiation, mortality, complete immunization rates at age 2, Grade 9 completion, and high school completion. We examined several measures often used to capture income-related health inequities: rate ratios and rate differences comparing children from high-income neighbourhoods with children from low-income neighbourhoods; the concentration index which quantifies the equity in the distribution of outcomes across the entire socioeconomic gradient; and the relative and absolute indices of inequality which compare the most advantaged individuals with the least advantaged individuals in the population while accounting for the distribution of health across the population.
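Two of the measures named above can be sketched briefly: the rate ratio compares outcome rates in the highest- and lowest-income groups, while the concentration index summarises how an outcome is distributed across the whole income gradient (here via the covariance formulation C = 2·cov(y, r)/mean(y), with r the fractional income rank). The data below are made up purely for illustration; they are not the Manitoba study data.

```python
# Hedged sketch of two income-related equity measures on toy data.

def rate_ratio(rate_high_income, rate_low_income):
    """Ratio of outcome rates in high- vs low-income groups."""
    return rate_high_income / rate_low_income

def concentration_index(outcome, income):
    """Concentration index: covariance form, ranking individuals by income."""
    n = len(outcome)
    # Sort outcomes poorest-first and assign fractional ranks (i + 0.5) / n.
    y = [o for _, o in sorted(zip(income, outcome))]
    ranks = [(i + 0.5) / n for i in range(n)]
    mu = sum(y) / n
    cov = sum((yi - mu) * (ri - 0.5) for yi, ri in zip(y, ranks)) / n
    return 2 * cov / mu

# Binary outcome (e.g. immunized by age 2) held only by the two richest
# individuals -> index is positive (outcome concentrated among the rich).
ci = concentration_index([0, 0, 1, 1], income=[10, 20, 30, 40])
rr = rate_ratio(0.95, 0.80)
```

A concentration index of 0 indicates an outcome spread evenly across the income distribution; positive values indicate concentration among the better-off, negative among the worse-off.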


When these measures are applied to health equity, they can be affected by factors not initially considered by investigators. Concentration measures are often applied to dichotomized health outcomes, and the prevalence of the outcome can affect the degree of inequity that is possible, with highly prevalent outcomes showing very little divergence from the line of equality. Comparing concentration measures to the inequality indices can produce contradictory and seemingly incompatible results. Sample selection that alters the distribution of income relative to the population can also change the apparent equity of health outcomes. These issues are compounded when monitoring changes in health equity over time.


Summary measures of equity can be useful but come with limitations that need to be considered when interpreting and applying study findings. We offer some suggestions to consider when applying these measures in health equity research.
