An International Cross-cohort Harmonization and Data Integration Initiative towards Achieving Statistical Power and Meaningful Results

IJPDS (2017) Issue 1, Vol 1:362, Proceedings of the IPDLN Conference (August 2016)

Tanya Flanagan
Isabel Fortier
Mélanie Fon Sing
Celine Moore
Published online: Apr 19, 2017


The complex interaction between lifestyle, behaviours, genetic factors and the social and physical environment has a fundamental role in modulating the risk and/or progression of health outcomes, especially cancer. Addressing this complexity requires access to large-scale cohorts that involve hundreds of thousands of participants and collect comprehensive information. In practice, however, attaining adequate statistical power presents a major challenge.

Retrospective data harmonization and integration across multiple cohort studies has been shown to be an effective analytical approach to attaining statistical power, with the potential to support population health research and policy-related questions and to improve our understanding of the complex factors affecting health outcomes.

Large cohorts with at least 50,000 participants, initiated in countries all over the world and focused on innovative research on cancer and other chronic diseases, were invited to participate in this retrospective data harmonization initiative. Cohorts shared comprehensive metadata on their study content and design. Almost 150 variables, selected for their relevance to a generic set of information useful for a broad range of research questions, were assessed for their harmonization potential and made available in an online searchable study catalogue. Lastly, a proof-of-concept analysis was conducted on the retrospectively harmonized data to investigate methods for analyzing individual patient data from multiple studies, using the determinants of age at menopause as the research question.

Eight cohorts from multiple countries shared comprehensive metadata on their study content and design, together representing over 2 million study participants. Of the almost 150 potential variables, the majority were harmonizable for co-analysis. The proof-of-concept research question, applied to these variables, generated results that are widely supported by other research on this topic in the literature. This work demonstrates the value of retrospective data harmonization and integration as an effective analytical approach to attaining statistical power.

The searchable study catalogue, available online for researchers to use in their own international research projects, offers an innovative tool for potential co-analysis of similar measures collected by separate cohort studies.

Retrospective harmonization offers an innovative approach to optimize use of existing research data with increased statistical power.


Rheumatic heart disease remains a major public health concern in developing countries. Motivated by the lack of up-to-date epidemiologic data from endemic settings, we sought to quantify the morbidity and mortality attributable to the condition in Fiji, a middle-income country where a high prevalence has consistently been reported. Having resolved to undertake the analysis using the existing routine clinical and administrative data at our disposal, we first set out to develop a data linkage procedure robust to the inherent limitations of data from low-resource settings.


Records were available from four sources: an electronic patient information system, a database of death certificates, a disease control register, and echocardiography clinic registers. All referred to 2008-2012.

Throughout the design and calibration process we used 1,406 known duplications in the patient information system, from which we calculated sensitivity and specificity. After cleaning, standardisation and preliminary blocking, we categorised identifiers including names, dates and demographics into agreement, partial agreement, disagreement or missing, accounting for issues such as out-of-order or misspelt names. After concentrating true matches by further blocking, we estimated match and nonmatch probabilities using expectation maximisation under the Fellegi-Sunter model of record linkage. We then derived the posterior match probability, taking into consideration the size of the block and prior information about the probability that a match is present given the demographics of the individual concerned. In its final configuration, with record pairs considered a match if they achieved a posterior probability of over 50%, our procedure identified the known duplications with a sensitivity of 91.4% and a specificity of 99.9%.
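The estimation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary agree/disagree comparisons rather than the four levels used in the study, assumes the identifiers are conditionally independent given match status, uses a single prior `p` with no adjustment for block size or demographics, and runs on made-up toy data. The names `match_posterior`, `fit_em` and `toy_pairs` are hypothetical.

```python
def match_posterior(gamma, p, m, u):
    """Posterior probability that a record pair is a true match.

    gamma: tuple of 0/1 agreement indicators, one per identifier.
    p: prior probability that a candidate pair is a match.
    m[j], u[j]: P(field j agrees | match) and P(field j agrees | nonmatch).
    Fields are assumed conditionally independent given match status.
    """
    like_match, like_non = p, 1.0 - p
    for g, mj, uj in zip(gamma, m, u):
        like_match *= mj if g else 1.0 - mj
        like_non *= uj if g else 1.0 - uj
    return like_match / (like_match + like_non)


def fit_em(patterns, n_iter=100, eps=1e-6):
    """Estimate p, m and u from unlabelled agreement patterns by EM."""
    k, n = len(patterns[0]), len(patterns)
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # arbitrary starting values

    def clamp(x):
        # Keep probabilities strictly inside (0, 1) to avoid division by zero.
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior match probability for every pair.
        w = [match_posterior(g, p, m, u) for g in patterns]
        total = sum(w)
        # M-step: re-estimate the prior and the per-field probabilities.
        p = clamp(total / n)
        for j in range(k):
            agree_match = sum(wi for wi, g in zip(w, patterns) if g[j])
            agree_non = sum(1.0 - wi for wi, g in zip(w, patterns) if g[j])
            m[j] = clamp(agree_match / total)
            u[j] = clamp(agree_non / (n - total))
    return p, m, u


# Toy comparison vectors for three identifiers (invented, not study data).
toy_pairs = ([(1, 1, 1)] * 180 + [(1, 1, 0)] * 20
             + [(0, 0, 0)] * 780 + [(0, 0, 1)] * 20)
p, m, u = fit_em(toy_pairs)

# Classify a pair as a match when its posterior exceeds 50%.
is_match = match_posterior((1, 1, 0), p, m, u) > 0.5
```

In a fuller version, the prior `p` would vary by pair to reflect block size and demographic information, and each identifier would carry separate probabilities for partial agreement and missing values.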


Having identified 2,619 cases from the 1,773,999 records available, we used the linked data to make population-based estimates of prevalence using capture-recapture analyses and of cause-specific mortality using relative survival methods, the first such estimates for a developing country. Moreover, in sensitivity analyses, we found that changing the posterior probability threshold above which record pairs were considered a match had limited impact on the results.
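The abstract does not state which capture-recapture estimator was used, and with four data sources a multi-source model is more likely. As a simplified illustration of the underlying idea only, the sketch below applies Chapman's two-source estimator, which infers the total number of cases from the overlap between two incomplete lists. The function name `chapman_estimate` and the counts are hypothetical and are not taken from the study.

```python
def chapman_estimate(n1, n2, both):
    """Chapman's two-source capture-recapture estimate of total cases.

    n1, n2: number of cases found in source 1 and in source 2.
    both: number of cases found in both sources (identified by linkage).
    Returns (estimated total, estimated variance). Assumes the two sources
    capture cases independently and that linkage is error-free.
    """
    n_hat = (n1 + 1) * (n2 + 1) / (both + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - both) * (n2 - both)
           / ((both + 1) ** 2 * (both + 2)))
    return n_hat, var


# Hypothetical counts for illustration: 99 cases in one register,
# 49 in another, 9 appearing in both.
n_hat, var = chapman_estimate(99, 49, 9)
```

Because the estimate depends directly on the overlap count, linkage errors feed straight into it, which is why sensitivity to the match threshold is worth checking.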


Although data linkage is widely used for epidemiologic research in high-income settings, its application to developing countries has been limited. We developed and validated a data linkage procedure that can be used to turn largely unstudied routine clinical and administrative data into robust estimates of disease burden. With the growing availability of computerized data, we propose that our approach has strong potential to assist the production of disease burden statistics in developing countries where civil registration systems are weak.
