Linkage of administrative data for universal state education and National Health Service (NHS) hospital care would enable research into the inter-relationships between education and health for all children in England.
We aim to describe the linkage process and evaluate the quality of linkage of four one-year birth cohorts within the National Pupil Database (NPD) and Hospital Episode Statistics (HES).
We used multi-step deterministic linkage algorithms to link longitudinal records from state schools to the chronology of records in the NHS Personal Demographics Service (PDS; linkage stage 1), and HES (linkage stage 2). We calculated linkage rates and compared pupil characteristics in linked and unlinked samples for each stage of linkage and each cohort (1990/91, 1996/97, 1999/00, and 2004/05).
Of the 2,287,671 pupil records, 2,174,601 (95%) linked to HES. Linkage rates improved over time (92% in 1990/91 to 99% in 2004/05). Ethnic minority pupils and those living in more deprived areas were less likely to be matched to hospital records, but differences in pupil characteristics between linked and unlinked samples were moderate to small.
We linked nearly all pupils to at least one hospital record. The high coverage of the linkage represents a unique opportunity for wide-scale analyses across the domains of health and education. However, missed links disproportionately affected ethnic minorities or those living in the poorest neighbourhoods: selection bias could be mitigated by increasing the quality and completeness of identifiers recorded in administrative data or the application of statistical methods that account for missed links.
• Longitudinal administrative records for all children attending state school and acute hospital services in England have been used for research for more than two decades, but lack of a shared unique identifier has limited scope for linkage between these databases.
• We applied multi-step deterministic linkage algorithms to four one-year cohorts of children born between 1 September and 31 August in 1990/91, 1996/97, 1999/00 and 2004/05. In stage 1, full names, date of birth, and postcode histories from education data in the National Pupil Database were linked to the NHS Personal Demographics Service. In stage 2, NHS number, postcode, date of birth and sex were linked to hospital records in Hospital Episode Statistics.
• Between 92% and 99% of school pupils linked to at least one hospital record. Ethnic minority pupils and pupils who were living in the most deprived areas were least likely to link. Ethnic minority pupils were less likely than white children to link at the first step in both algorithms.
• Bias due to linkage errors could lead to an underestimate of the health needs in disadvantaged groups. Improved data quality, more sensitive linkage algorithms, and/or statistical methods that account for missed links in analyses, should be considered to reduce linkage bias.
Administrative data have been routinely collected for more than two decades in England from schools and hospitals by the Department for Education (DfE) and National Health Service (NHS) Digital respectively [1, 2]. These data collections have been used to monitor service provision and costs, and longitudinal linkage has made them powerful resources for national research [3–7]. Despite evidence from other countries of the value of linking education and health data to inform policy and practice [8–14], these databases have not previously been linked for children in England because they do not share a unique identifier. Linkage between these datasets can only be done using confidential, personal identifiers such as full names, postcodes, date of birth and sex, thereby creating technical and governance challenges.
Linkage error could significantly undermine the real-world benefits for policy if certain groups, such as those with a foreign name structure, are less likely to link than others. For example, missed links could lead to undercounting of adverse health or education outcomes for these groups, and in turn, under-provision of services. Evidence on linkage error can help data providers to improve the quality of identifiers or to develop more effective linkage algorithms. Evidence on differences in the characteristics between groups who link or not can be used by researchers to account for linkage bias in analyses.
We describe the methods used to link education data from the National Pupil Database (NPD) to hospital data for children in England (Hospital Episode Statistics; HES) [1, 2]. Our goal was to create de-identified, linked cohorts of pupils’ longitudinal records of education and hospital events over the childhood years. We also evaluated associations between child characteristics and linkage error in order to understand the implications of these errors for analysis. Our evaluation is based on 2.2 million children in England born in four one-year cohorts in 1990/91, 1996/97, 1999/00 and 2004/05. These cohorts reflect age and time periods when identifier quality, and hence linkage quality, is likely to differ due to data collection and system changes. This paper is relevant to users of The Education and Child Health Insights from Linked Data (ECHILD) database, which will be available from Spring 2022 and combines education, social care and hospital data for all children in England born from 1995 [1, 2, 17]. The findings are also relevant more generally to data linkages that lack a unique, high-quality identifier.
Study design and population
Governance permissions and data flows for the linkage followed the separation principle, whereby identifiers such as names and postcodes were kept separate at all times from attribute data (records from school or hospital records). Figure 1 shows the flow of identifiers and a pseudo-identifier (the anonymised Pupil Matching Reference, aPMR) from the Department for Education to NHS Digital. Separately, education attribute data flowed from the Department for Education to the Office for National Statistics Secure Research Service (ONS SRS). A two-stage linkage process was used to link NPD to HES. Stage 1 linked NPD to the Personal Demographics Service (PDS), which contains all individuals with an NHS number, and stage 2 linked NPD-PDS linked data to HES. At the first stage of linkage (step C in Figure 1), NHS Digital linkers had access only to the identifiers (date of birth, sex, and histories of forenames, surnames and postcodes) but no attribute data. At the second stage of linkage (step D), NHS Digital used the NHS number, date of birth, sex and postcode to link to HES data. The linkage step, pseudonymised HESID and anonymised PMR were transferred (step E) and merged with a University College London (UCL) held extract of HES within the UCL Data Safe Haven (DSH) (step F). Linked HES-PMR records were ultimately transferred to the ONS SRS (step G).
The study population consisted of four cohorts of children born between 1 September and 31 August in the academic years of 1990/91, 1996/97, 1999/00, and 2004/05 (Figure 2). These cohorts were defined separately in NPD and HES, so that linkage created three comparison groups for each of the four cohorts: linked NPD-HES, unlinked NPD, and unlinked HES records. We compared pupil characteristics in the linked and unlinked NPD cohorts at each stage of each linkage process. We used NPD as the inception cohort, as state school is a universal service attended at some point in the school years by at least 95% of all children [2, 18]. On the other hand, not all children attend hospital, unless they were young enough for their birth to be recorded in HES (1997 onwards).
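The cohort definition above can be sketched in a few lines. This is an illustrative helper only, assuming that a date of birth is assigned to the academic year running from 1 September to 31 August; the function and constant names are hypothetical and not part of the linkage software described in the paper.

```python
from datetime import date

# The four study cohorts named in the text.
STUDY_COHORTS = {"1990/91", "1996/97", "1999/00", "2004/05"}


def academic_year_of_birth(dob: date) -> str:
    """Return the academic-year label (e.g. '1990/91') for a date of birth.

    Assumes the academic year runs from 1 September to 31 August.
    """
    start = dob.year if dob.month >= 9 else dob.year - 1
    return f"{start}/{(start + 1) % 100:02d}"


def in_study_population(dob: date) -> bool:
    """True if the date of birth falls in one of the four study cohorts."""
    return academic_year_of_birth(dob) in STUDY_COHORTS
```

For example, a child born on 31 August 1991 falls in the 1990/91 cohort, whereas a child born one day later belongs to 1991/92 and is outside the study population.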
Figure 2 shows that whether a pupil is expected to link to a HES record or not is affected by the start date of the PDS, the NPD and the subsets of HES data. Pupils born in 1990/91 were expected to have the lowest proportion of records in NPD that linked to HES (i.e. linkage rate). These children only appeared in NPD at the first school census collection in 2001/02 at age 10. Their names and postcodes captured each year in NPD from 2001/02 until leaving state school in 2009/10 or earlier, would be linked to names and postcode details recorded prospectively from General Practitioner (GP) registrations and hospital contacts on the PDS from 2004 onwards. These children could link to HES admission records from 1997 onwards (age 6 years), outpatients from age 12, or accident and emergency department from age 16.
Whilst it was expected that most children would have contact with hospital at some point during childhood or adolescence, we did not anticipate complete overlap between the two datasets. We expected children born in 2004/05 to have the best linkage rates of the four cohorts (and for linkage quality to remain constant or improve for subsequent cohorts). Firstly, 97% of children born in England would be expected to have their birth recorded in HES and in PDS. Secondly, their linkage to subsequent health records should be more accurate than earlier cohorts due to immediate allocation by midwives of NHS numbers to babies at birth, a process introduced at the end of 2002.
The data sources are described in detail in the Supplementary Appendices 2 and 3.
National pupil database (NPD)
NPD contains pupil-level information on all children and adolescents attending state-funded schools in England, capturing information on attainment tests, absences, exclusions and alternative provision (details in Supplementary Figure 1 of Supplementary Appendix 2). The school census collects information each term on pupils enrolled and updates of the pupil’s name, address, and postcode. We used identifiers recorded in the Spring census (submitted in February) for linkage as this is the definitive entry for the year (i.e. for school year 2001/2). Pupil records are linked across years and between NPD modules using a pseudo-identifier called the anonymised Pupil Matching Reference (aPMR).
Hospital episode statistics (HES)
HES is an episode-level administrative database that covers all admissions (day case and overnight) to National Health Service (NHS) hospitals in England, as well as all accident and emergency attendances (from 2007/8) and outpatient appointments (from 2003/4). From January 1998 onwards, HES has been routinely linked to ONS death registration records. Supplementary Table 1 in Supplementary Appendix 3 describes data availability in HES. For researchers using de-identified attribute data from HES, episodes of care relating to a patient can be linked over time or between datasets using a study-specific pseudonymised patient identifier generated by NHS Digital, the HESID.
Personal demographics service (PDS)
PDS is a national electronic database that contains the chronology of demographic information, including sex, name and address, for all individuals in England with an NHS number. Introduced in June 2004 as part of the National Programme for IT, the PDS was developed to integrate management of patient demographic information across NHS services in England. PDS replaced the NHS Central Register (CHRIS); the demographic functions of the National Health Applications and Infrastructure Services (NHAIS); the NHS Strategic Tracing Service (NSTS); and the NHS Number for Babies (NN4B). Current identifiers from these databases were transferred into PDS in 2004. The patient demographic details held on the PDS can be updated by NHS care providers when a person uses an NHS service, including GP surgeries, inpatient or outpatient appointments [24, 25]. The accuracy and quality of PDS data are assured by staff at the PDS National Back Office (NBO) in NHS Digital.
Figure 1 shows two stages of linkage. Stage 1 involved transfer of a linkage file containing full name and postcode histories and other identifiers (Table 1) from the Department for Education to NHS Digital for linkage to the PDS. Extracts from NPD and PDS listed multiple identifiers for each individual together with the date interval when the identifier was recorded (details in Supplementary Appendix 4). To link the NPD linkage file and PDS, we relied on a deterministic linkage algorithm comprising 8 steps, shown in Table 2. These steps were designed to identify records that have high levels of agreement across names, date of birth, sex and postcode, and to resolve inconsistencies between records belonging to the same pupil.
|Linkage identifiers||Data sources|
|Date of birth (e.g. 23/02/1988)||✓||✓||✓|
|Residence postcodes dates**||✓||✓||✓|
|Anonymised Pupil Matching Reference (aPMR)||✓|
|Step||First name||Surname||Date of birth||Sex||Postcode*|
|3||1st character||Characters 1–3||Exact||Exact||Exact|
|4||1st character||Characters 1–3||Exact||Exact|
|8||1st character||Characters 1–3||Exact||Exact|
Besides considering the 8 steps in Table 2, a further restriction was that a linked pair of records needed to have identifiers within the same academic year in PDS and in NPD (details in supplementary Appendix 4). All eight steps of the algorithm were run for each school year (September to August) ordered from 2004/05 to 2016/17 for all pupils. In order to allow for multiple links with the highest level of agreement between NPD and PDS, step 1 was repeated (details in Supplementary Appendix 4). For all other steps, a pupil was removed from the linking pool (i.e. all records for that pupil were excluded from subsequent linking steps) once a linkage was identified.
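The step-wise logic described above can be sketched as follows. This is a simplified illustration and not the NHS Digital implementation. It assumes one record per person rather than identifier histories, and it omits the same-academic-year restriction and the repeated step 1. It implements only the exact match (step 1) and the relaxed name match shown as step 3 in Table 2. It accepts a link only when exactly one candidate agrees, which is a further assumption. All field and function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
KeyFn = Callable[[Record], Tuple[str, ...]]


def _common(r: Record) -> Tuple[str, ...]:
    # Date of birth, sex and postcode must agree exactly in both steps shown.
    return (r["dob"], r["sex"], r["postcode"].replace(" ", "").upper())


def exact_key(r: Record) -> Tuple[str, ...]:
    # Step 1: full first name and surname agree exactly.
    return (r["first_name"].strip().lower(), r["surname"].strip().lower()) + _common(r)


def relaxed_name_key(r: Record) -> Tuple[str, ...]:
    # Step 3 in Table 2: first character of first name, characters 1-3 of surname.
    return (r["first_name"].strip().lower()[:1], r["surname"].strip().lower()[:3]) + _common(r)


# Illustrative subset of the steps; the full algorithm has eight.
STEPS: List[Tuple[int, KeyFn]] = [(1, exact_key), (3, relaxed_name_key)]


def link_in_steps(pupils: List[Record], pds: List[Record]) -> Dict[str, Tuple[str, int]]:
    """Return {pupil_id: (pds_id, step)}, trying the strictest rule first."""
    links: Dict[str, Tuple[str, int]] = {}
    pool = list(pupils)
    for step, key_fn in STEPS:
        index: Dict[Tuple[str, ...], List[str]] = {}
        for rec in pds:
            index.setdefault(key_fn(rec), []).append(rec["pds_id"])
        remaining = []
        for pupil in pool:
            candidates = index.get(key_fn(pupil), [])
            if len(candidates) == 1:
                links[pupil["pupil_id"]] = (candidates[0], step)
            else:
                remaining.append(pupil)  # stays in the pool for the next step
        pool = remaining  # linked pupils are removed from the linking pool
    return links
```

Recording the step alongside each link mirrors the record-level step indicators that the study used to evaluate linkage quality.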
Stage 2 involved linking the PDS table of identifiers for children linked to NPD with HES, using the NHS Digital internal 7-step algorithm (Table 3). The bridging files resulting from this linkage did not contain any identifiable data (such as name or date of birth) and contained all possible linkage pairs (linked and unlinked) resulting from linkage stages 1 and 2. For each of the four cohorts, the files contained: the pseudonymised HESIDs of all individuals in HES with a birth date in the relevant cohort; for those that linked to NPD, the anonymised PMR; two record-level indicators identifying the step at which the record linked in linkage stages 1 and 2; and a variable indicating the specific cohort.
|Step||NHS number||Date of birth||Sex||Postcode*|
|Where NHS number does not contradict the match and date of birth is not 1 January|
|Where date of birth is not 1 January|
Figure 1 shows the transfer of pseudonymised HES attribute data (admitted patient care, accident and emergency, outpatient), together with the linkage bridging file of all possible linkage pairs, to the ONS SRS. Similarly, the Department for Education transferred NPD attribute data extracts containing the anonymised PMR to the ONS SRS.
The final phase of the process was to merge NPD and HES attribute data, using the bridging file obtained from stage 2 of the linkage. This was done by an Accredited Researcher (NL) in the ONS SRS. There were minor differences in HESIDs transferred by NHS Digital to UCL and those held by UCL as the NHS Digital HES data is continually updated, whereas UCL holds a static subset of the NHS Digital HES data (e.g. that is limited by age).
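The merge described above amounts to a join of the two attribute datasets through the bridging file. The sketch below is a minimal illustration, assuming in-memory lists of dictionaries with hypothetical field names (`apmr`, `hesid`); it is not the code run in the ONS SRS. Pairs whose HESID is absent from the static HES extract simply produce no merged rows, which is one way the small mismatch between the two HES versions could show up.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_via_bridge(npd_rows: List[Row], hes_rows: List[Row], bridge_rows: List[Row]) -> List[Row]:
    """Join NPD and HES attribute rows using aPMR-HESID pairs from a bridging file."""
    # Keep each linked aPMR-HESID pair once; unlinked rows (a missing id) are skipped.
    pairs = {
        (b["apmr"], b["hesid"])
        for b in bridge_rows
        if b.get("apmr") and b.get("hesid")
    }

    npd_by_apmr: Dict[object, List[Row]] = {}
    for row in npd_rows:
        npd_by_apmr.setdefault(row["apmr"], []).append(row)

    hes_by_id: Dict[object, List[Row]] = {}
    for row in hes_rows:
        hes_by_id.setdefault(row["hesid"], []).append(row)

    merged: List[Row] = []
    for apmr, hesid in sorted(pairs):
        for n in npd_by_apmr.get(apmr, []):
            for h in hes_by_id.get(hesid, []):  # empty if HESID is not in the extract
                merged.append({**n, **h})
    return merged
```

Because the bridging file carries only pseudonymised identifiers, a join of this kind never needs names, dates of birth or postcodes, in keeping with the separation principle.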
Evaluation of linkage quality
Among pupils who linked to a HES record, we calculated the distribution linked at each step for linkage stages 1 and 2, according to region, ethnic group, decile of deprivation, measured by income deprivation affecting children index (IDACI), and cohort year. We calculated the overall linkage rate as the percentage of pupils in the NPD who linked to any HES record for each of the four cohorts.
To evaluate potential bias resulting from missed matches, we compared characteristics of pupils in NPD who were linked to HES records with pupils in NPD who were not linked to HES [15, 28]. Unlinked pupils could include pupils who never attended hospital or missed matches of pupils who did attend hospital. We used standardized differences (mean difference in standard deviation units) as these are thought to be more informative to detect potential biases than P-values in large samples [28, 29]. Standardized differences were calculated using the ‘stddiff’ command in Stata for the following variables: sex; ethnic group; region of pupil’s residence; IDACI Deciles; age at start of the first academic year; whether a child receives Special Education Need (SEN) provision (recorded in NPD as receiving Action, Action Plus or Support (AAP/S) and having a statement of SEN or an Education Health & Care Plan (S/EHCP)); and persistent authorized annual absence rate for all academic years available defined as whether a child was absent in 10% or more of academic sessions (see Supplementary Appendix 5 for recording of variables).
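To make the measure concrete, the sketch below computes a standardized difference for a single binary characteristic using the usual formula for two proportions: the absolute difference divided by the square root of the average of the two binomial variances. It is an illustration in Python rather than the Stata command used in the study, and it does not cover multi-category variables such as region or ethnic group, for which a multivariate generalisation is needed. The function name is hypothetical.

```python
from math import sqrt


def std_diff_binary(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Standardized difference between two proportions (binary characteristic)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Average of the two binomial variances, then square root.
    pooled_sd = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / 2)
    if pooled_sd == 0:
        return 0.0  # both groups are entirely 0% or 100%
    return abs(p_a - p_b) / pooled_sd


# Male pupils in the 1990/91 cohort (counts from Table 4):
# 27,334 of 47,934 non-linked and 285,716 of 565,798 linked.
d_male = std_diff_binary(27_334, 47_934, 285_716, 565_798)
```

With these counts the result is roughly 0.13, in line with the value reported for sex in that cohort, although the published figure also treats missing sex as a category.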
Multivariable logistic analysis was used to evaluate linkage from NPD to HES using the following demographic characteristics: sex, ethnicity, region of residence and IDACI Deciles.
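As a reference point for reading odds ratios, the sketch below computes an unadjusted odds ratio of linkage with a Wald 95% confidence interval from a 2×2 table. It is not the multivariable model used in the study, which adjusts for sex, ethnicity, region and IDACI decile simultaneously, so the result should not be expected to match the adjusted estimates. The function name is hypothetical, and the counts in the example are taken from Table 4.

```python
from math import exp, log, sqrt
from typing import Tuple


def odds_ratio_linkage(
    linked_grp: int, unlinked_grp: int, linked_ref: int, unlinked_ref: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Unadjusted odds ratio of linkage (group vs reference) with a Wald interval."""
    or_ = (linked_grp / unlinked_grp) / (linked_ref / unlinked_ref)
    # Standard error of log(OR) from the four cell counts.
    se = sqrt(1 / linked_grp + 1 / unlinked_grp + 1 / linked_ref + 1 / unlinked_ref)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)


# Asian vs white pupils, 2004/05 cohort (linked and non-linked counts from Table 4).
or_asian, ci_low, ci_high = odds_ratio_linkage(57_790, 1_207, 439_397, 5_255)
```

An odds ratio below 1, with an interval that excludes 1, indicates lower odds of linkage in the comparison group than in the reference group before any adjustment.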
The bridging file produced by NHS Digital included 2,289,587 records with all possible linkage results. From this file, 41 duplicates were excluded since the same aPMR-HESID pairs linked in two different academic years. The second bridging file that included only the modified linkage step 1 of linkage stage 1 (i.e., where multiple links were allowed for each NPD record) contained 2,093,787 records, of which only 8,858 records were new linkage results. By combining both files, we linked an additional 4,059 (0.18%) aPMR-HESID pairs.
The final bridging file contains 2,294,369 records, corresponding to 2,287,671 pupils that were used in the linkage quality analysis (Figure 3). Of the 2,287,671 pupil records in the four cohorts, 2,174,601 (95%) linked to a HES record. As expected, linkage rates increased as we moved from pupils born in academic year 1990/91 (92%) to those born in 2004/05 (99%). Results for each linkage stage show that 30,323 (1.3%) of pupils’ records were not linked in stage 1, 61,223 (2.7%) records were not linked in stage 2, and a further 21,524 (0.9%) were not merged with the UCL extract. An improvement of linkage was observed over time. For example, in the cohort born in 1990/91 3.3% of records were not linked in stage 1, whereas only 1.1% of records were not linked in the cohort born in 2004/05.
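The counts reported above reconcile: subtracting the records lost at each stage from the total gives the number linked to HES. The short check below only restates that arithmetic; the variable names are illustrative.

```python
total_pupils = 2_287_671

# Records lost at each point in the process, as reported in the text.
not_linked_stage1 = 30_323
not_linked_stage2 = 61_223
not_merged_ucl_extract = 21_524

lost = not_linked_stage1 + not_linked_stage2 + not_merged_ucl_extract
linked_to_hes = total_pupils - lost
linkage_rate_pct = 100 * linked_to_hes / total_pupils

# Share of all pupil records lost at each point, in percent.
loss_pct = {
    "stage 1": 100 * not_linked_stage1 / total_pupils,
    "stage 2": 100 * not_linked_stage2 / total_pupils,
    "UCL merge": 100 * not_merged_ucl_extract / total_pupils,
}
```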
Distribution of pupil characteristics in linked records
At stage 1, between 91% and 95% of pupils linked at the first step of the 8-step algorithm, i.e. exact linkage by first name, surname, date of birth, sex and postcode (Table 2; Supplementary Appendix 6). However, evaluation by ethnic group showed that the additional steps in this algorithm, i.e. from 2-8, captured a greater percentage of ethnic minority groups (11.8% of minority ethnic groups versus 4.2% of white ethnic group).
A considerable percentage of records were linked in years after the first available Spring census (Figure 4). For example, 12% and 21% of records of pupils born in academic years 1990/91 and 1996/97 respectively, were matched after 2004/05, their first available Spring census when it was possible to link to PDS. Similarly, in academic years 1999/00 and 2004/05, 16% and 9% of pupils were matched after their academic Year 1, their second available Spring census. For pupils born in academic year 1999/00 or after, the majority of records were linked in the first two academic years. In particular, 50% of records in cohort 1999/00 and 51% in 2004/05 were linked in Year 1, while 34% and 40% were linked in reception year (Supplementary Appendix 6).
Linkage at stage 2, from PDS to HES using the NHS Digital internal 7-step algorithm (Table 3) showed a similar pattern to linkage at stage 1. Of the 2,202,823 pairs in NPD linked at stage 2, 81% (n=1,791,480) were linked at step 1 and 18% at step 2 (n=386,579) (Supplementary Table 7.1 in Supplementary Appendix 6). Pupils from ethnic minorities were disproportionately linked at steps 2-8. For example, around 20% of pupils categorized in Black and Chinese ethnic groups were linked at step 2, compared to 17% of white pupils that linked at this step. Of steps 3-8 of the algorithm, step 6 was particularly important for the linkage of ethnic minority groups, linking between 0.7%-1.7% of ethnic minority records (see Supplementary Appendix 6 for more details).
Linkage rates by demographic characteristics of pupils
Pupils who linked to HES after both linkage stages and who were merged with HES attribute data comprise the matched dataset used for all subsequent analyses. Linkage rate by region, ethnic group, sex and IDACI deciles are shown in the Supplementary Appendix 7. We found that linkage rates improved over time for all these variables. However, ethnic minorities and pupils living in more deprived areas were less likely to match to HES. The linkage rate for white pupils improved from 94.6% in the 1990/91 cohort to 98.9% in the 2004/05 cohort. In contrast, for ethnic minority pupils in the same cohorts the linkage rate rose from 92.4% to 97.7%, respectively. We found a similar pattern by IDACI deciles. Linkage rates by region provide evidence that London has consistently lower linkage rates than the rest of the country.
Comparing characteristics of linked and unlinked pupils
Differences in the distribution of sociodemographic and educational characteristics of pupils recorded in NPD who linked or not to HES are shown in Table 4 (and Supplementary Table 9.1–9.4 in Supplementary Appendix 8). Overall, relatively low standardized differences are observed across all variables providing evidence of small or moderate differences between linked and unlinked groups. We considered standardized differences of 0.2, 0.5 and 0.8 as small, moderate and large, respectively [28, 34]. The largest differences were for the AAP/S and persistent authorized absence rate in cohort 1996/97 with values of 0.44 and 0.42. The mean standardized difference across cohort for region and ethnic groups was 0.25 and 0.24 whereas for sex and IDACI deciles was 0.13 and 0.17 (Table 4).
|Cohort 1990/91||Cohort 1996/97|
|Non-linked (n = 47,934) (%)||Linked (n = 565,798) (%)||Stand. Diff.||Non-linked (n = 35,299) (%)||Linked (n = 536,619) (%)||Stand. Diff.|
|London||7,729 (16.1)||68,073 (12.0)||0.191||6,243 (17.7)||71,652 (13.4)||0.247|
|South East||8,000 (16.7)||81,806 (14.5)||5,961 (16.9)||75,452 (14.1)|
|South West||4,217 (8.8)||52,018 (9.2)||3,021 (8.6)||50,302 (9.4)|
|West Midlands||4,915 (10.3)||63,013 (11.1)||3,392 (9.6)||60,027 (11.2)|
|North West||6,200 (12.9)||83,376 (14.7)||3,630 (10.3)||77,805 (14.5)|
|North East||1,567 (3.3)||29,318 (5.2)||1,025 (2.9)||27,374 (5.1)|
|Yorkshire and The Humber||3,885 (8.1)||57,539 (10.2)||2,908 (8.2)||54,564 (10.2)|
|East Midlands||3,535 (7.4)||47,096 (8.3)||2,769 (7.8)||42,187 (7.9)|
|East of England||5,541 (11.6)||59,686 (10.5)||4,525 (12.8)||54,424 (10.1)|
|Wales||28 (0.1)||38 (0.0)||*||*|
|Missing||2,317 (4.8)||23,835 (4.2)||1,818 (5.2)||22,794 (4.2)|
|White||27,692 (57.8)||488,330 (86.3)||0.160||24,452 (69.3)||453,764 (84.6)||0.159|
|Asian||2,541 (5.3)||33,024 (5.8)||2,584 (7.3)||37,654 (7.0)|
|Black||1,507 (3.1)||17,047 (3.0)||1,429 (4.0)||19,228 (3.6)|
|Chinese||278 (0.6)||1,384 (0.2)||213 (0.6)||1,439 (0.3)|
|Other ethnic group||498 (1.0)||3,627 (0.6)||626 (1.8)||3,951 (0.7)|
|Mixed||834 (1.7)||13,808 (2.4)||1,278 (3.6)||19,286 (3.6)|
|Missing||14,584 (30.4)||8,578 (1.5)||4,717 (13.4)||1,297 (0.2)|
|Male||27,334 (57.0)||285,716 (50.5)||0.131||17,014 (48.2)||275,479 (51.3)||0.062|
|Female||20,543 (42.9)||279,520 (49.4)||18,268 (51.8)||261,094 (48.7)|
|Missing||57 (0.1)||562 (0.1)||17 (0.0)||46 (0.0)|
|1 (deprived)||7,306 (15.2)||54,336 (9.6)||0.242||4,866 (13.8)||50,540 (9.4)||0.218|
|2||6,001 (12.5)||55,606 (9.8)||4,247 (12.0)||51,132 (9.5)|
|3||5,414 (11.3)||56,149 (9.9)||3,811 (10.8)||51,662 (9.6)|
|4||4,941 (10.3)||56,600 (10.0)||3,738 (10.6)||51,725 (9.6)|
|5||4,611 (9.6)||56,620 (10.0)||3,444 (9.8)||52,336 (9.8)|
|6||4,255 (8.9)||56,927 (10.1)||3,310 (9.4)||52,503 (9.8)|
|7||3,854 (8.0)||56,891 (10.1)||2,936 (8.3)||53,336 (9.9)|
|8||3,685 (7.7)||56,122 (9.9)||2,914 (8.3)||54,281 (10.1)|
|9||3,514 (7.3)||54,875 (9.7)||2,851 (8.1)||55,791 (10.4)|
|10 (affluent)||3,630 (7.6)||54,286 (9.6)||2,701 (7.7)||56,355 (10.5)|
|Missing||723 (1.5)||7,386 (1.3)||481 (1.4)||6,958 (1.3)|
|Cohort 1999/00||Cohort 2004/05|
|Non-linked (n = 22,185) (%)||Linked (n = 507,725) (%)||Stand. Diff.||Non-linked (n = 8,477) (%)||Linked (n = 570,332) (%)||Stand. Diff.|
|London||4,303 (19.4)||71,001 (14.0)||0.31||1,590 (18.8)||83,817 (14.7)||0.237|
|South East||3,881 (17.5)||74,189 (14.6)||1,353 (16.0)||83,748 (14.7)|
|South West||1,364 (6.1)||45,672 (9.0)||504 (5.9)||49,993 (8.8)|
|West Midlands||2,274 (10.3)||55,174 (10.9)||759 (9.0)||60,358 (10.6)|
|North West||2,036 (9.2)||70,533 (13.9)||986 (11.6)||76,373 (13.4)|
|North East||585 (2.6)||24,497 (4.8)||197 (2.3)||26,007 (4.6)|
|Yorkshire and The Humber||1,502 (6.8)||49,701 (9.8)||671 (7.9)||56,330 (9.9)|
|East Midlands||1,786 (8.1)||40,944 (8.1)||689 (8.1)||45,255 (7.9)|
|East of England||3,119 (14.1)||52,238 (10.3)||1,040 (12.3)||57,545 (10.1)|
|Missing||1,327 (6.0)||23,720 (4.7)||685 (8.1)||30,840 (5.4)|
|White||15,692 (70.7)||415,660 (81.9)||0.281||5,255 (62.0)||439,397 (77.0)||0.358|
|Asian||2,581 (11.6)||43,061 (8.5)||1,207 (14.2)||57,790 (10.1)|
|Black||1,735 (7.8)||21,528 (4.2)||696 (8.2)||31,656 (5.6)|
|Chinese||172 (0.8)||1,530 (0.3)||89 (1.0)||2,038 (0.4)|
|Other ethnic group||700 (3.2)||5,146 (1.0)||486 (5.7)||8,375 (1.5)|
|Mixed||1,178 (5.3)||20,177 (4.0)||575 (6.8)||29,871 (5.2)|
|Missing||127 (0.6)||623 (0.1)||169 (2.0)||1,205 (0.2)|
|Male||9,717 (43.8)||261,398 (51.5)||0.153||3,660 (43.2)||292,784 (51.3)||0.166|
|Female||12,445 (56.1)||246,116 (48.5)||4,814 (56.8)||277,508 (48.7)|
|Missing||23 (0.1)||211 (0.0)||0 (0.0)||43 (0.0)|
|1 (deprived)||2,863 (12.9)||49,733 (9.8)||0.142||909 (10.7)||53,590 (9.4)||0.07|
|2||2,487 (11.2)||49,457 (9.7)||855 (10.1)||53,748 (9.4)|
|3||2,257 (10.2)||49,130 (9.7)||849 (10.0)||54,246 (9.5)|
|4||2,263 (10.2)||49,153 (9.7)||750 (8.8)||54,250 (9.5)|
|5||2,139 (9.6)||49,450 (9.7)||812 (9.6)||55,571 (9.7)|
|6||2,056 (9.3)||49,965 (9.8)||840 (9.9)||56,601 (9.9)|
|7||1,980 (8.9)||50,467 (9.9)||844 (10.0)||57,776 (10.1)|
|8||2,077 (9.4)||51,321 (10.1)||885 (10.4)||58,854 (10.3)|
|9||1,972 (8.9)||52,884 (10.4)||858 (10.1)||61,048 (10.7)|
|10 (affluent)||1,953 (8.8)||53,904 (10.6)||821 (9.7)||62,514 (11.0)|
|Missing||138 (0.6)||2,261 (0.4)||54 (0.6)||2,134 (0.4)|
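The mean standardized differences quoted in the text before Table 4 can be reproduced from the Stand. Diff. column, taking one value per cohort for each characteristic. The snippet below is a simple arithmetic check; the grouping of values by characteristic follows the order of the row blocks in the table (region, ethnic group, sex, IDACI decile), and the variable names are illustrative.

```python
# Standardized differences from Table 4, ordered by cohort:
# 1990/91, 1996/97, 1999/00, 2004/05.
std_diffs = {
    "region": [0.191, 0.247, 0.31, 0.237],
    "ethnic group": [0.160, 0.159, 0.281, 0.358],
    "sex": [0.131, 0.062, 0.153, 0.166],
    "IDACI decile": [0.242, 0.218, 0.142, 0.07],
}

# Mean across the four cohorts, rounded to two decimal places.
mean_std_diff = {
    name: round(sum(values) / len(values), 2)
    for name, values in std_diffs.items()
}
```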
Evaluation of linkage from NPD to HES
Table 5 shows the results of multivariable logistic models displaying adjusted Odds Ratios (OR) for linkage to HES. Unadjusted models are also shown in Supplementary Appendix 9. An OR below 1 indicates lower odds of linkage to HES compared with the reference category. Consistent with linkage rate estimates, we found differences across ethnic groups, deprivation and region. Across all cohorts, we found that relative to pupils of white ethnicity, pupils of ethnic minorities including Asian, Black, Chinese, Mixed and Any other ethnic group were less likely to be matched. For example, the odds of linkage to HES were lower for pupils in Asian ethnic groups than for white pupils (e.g. 1990/91: Adjusted OR 0.69, 95% CI 0.66 to 0.72, p < 0.01; 2004/05: Adjusted OR 0.51, 95% CI 0.47 to 0.54, p < 0.01). Relative to male pupils, with the exception of pupils born in academic year 1990/91, female pupils were less likely to be matched (e.g. 2004/05: Adjusted OR 0.72, 95% CI 0.69 to 0.75, p < 0.01). Compared to pupils in the fifth IDACI decile, pupils living in the most deprived areas were less likely to be matched, whereas pupils living in the most affluent areas were more likely to be matched. Similarly, results for the region of pupil residence show differences in linkage success.
|Characteristics from NPD||Cohort 1990/91||Cohort 1996/97|
|aOR||Conf. Int.||aOR||Conf. Int.|
|Any other ethnic group||0.42||[0.38,0.46]**||0.32||[0.30,0.35]**|
|Yorkshire and The Humber||1.34||[1.28,1.40]**||1.42||[1.35,1.49]**|
|East of England||1.14||[1.09,1.19]**||1.00||[0.95,1.04]|
|Characteristics from NPD||Cohort 1999/00||Cohort 2004/05|
|aOR||Conf. Int.||aOR||Conf. Int.|
|Any other ethnic group||0.26||[0.24,0.28]**||0.18||[0.17,0.20]**|
|Yorkshire and The Humber||1.61||[1.51,1.72]**||1.23||[1.12,1.35]**|
|East of England||0.86||[0.81,0.90]**||0.83||[0.76,0.90]**|
This study is the first to link administrative records from schools and hospitals for all children and adolescents attending state-funded schools in England for four 1-year birth cohorts (~2.2 million children). It builds upon previous studies that have demonstrated the public benefit and challenges for data sharing across educational and health services for specific subgroups [8, 13, 35, 36], and in other countries [9–14]. We evaluated two deterministic algorithms implemented by NHS Digital and found that although linkage rates were high and improved over time, pupils from ethnic minority groups or living in areas of high deprivation were disproportionately less likely to match to HES.
Our finding that the linkage rate was 99% for the youngest cohort is encouraging for future studies using multi-step deterministic algorithms in England. This linkage rate is similar to those reported by studies in Scotland, Wales and Australia that used probabilistic linkage methods [11, 13, 14, 37–39]. For instance, linkage rates for the annual Scottish Government pupil census linked to the community health index database ranged between 86.3% and 95%, while two other Scottish studies found linkage rates of 99.7% and 81.8%.
We found that between 2.3% and 7.6% of ethnic minority pupils were not linked to health records. Ethnic differences reported in previous linkage success reflect differences in the quality of registration of Chinese, Asian and Hispanic names [8, 27, 28]. The differences in linkage rates by ethnic minority in linkage steps that relaxed the requirement to agree on exact full name suggest that inconsistencies in forenames and surnames explain the lower linkage rates for ethnic minority pupils. Residential instability may also be relevant: lower rates of linkage for pupils from ethnic minorities at steps 1 and 2 between PDS and HES (i.e. stage 2), could be due to poor recording of postcode, as reported in other studies [40, 41]. It is also estimated that 20% of children aged 0 to 15 years are born outside the UK, which may have a differential impact on linkage success. Additional steps in the deterministic algorithm that incorporate phonetic systems codes for other languages [43, 44], or methods that discriminate partial agreements in string comparisons [45–48], or probabilistic linkage methods could be used to further improve linkage rates for ethnic minorities [40, 48].
We found that pupils living in more deprived neighbourhoods were less likely to link to health records than pupils living in more affluent areas. Previous studies have suggested that families from more affluent areas are more likely to comply with the administrative process. However, pupils living in London were less likely to link to HES records than in other regions, even after accounting for sociodemographic characteristics. This difference may reflect higher rates of international emigration from London, less use of health services, differential use of private health services, or poorer quality of identifiers in London.
Improvements in the quality of recording of identifiers in schools and health data systems likely account for improved linkage rates over time. Changes in health systems governing collection of patient identifiers, such as the implementation of the NHS Numbers for Babies (NN4B) service on 29th October 2002, the introduction of the Registration ONline system (RON) on 1st July 2009, and the correction of a postcode extraction error by NHS Digital on 1st April 2013, have been shown to improve the completeness of identifiers used in the linkage. Retrospective correction of this extraction error and re-linkage by NHS Digital of birth episodes to subsequent HES records would be expected to improve linkage to NPD in earlier years.
Strengths and limitations
Our study demonstrates very high linkage rates between educational and HES records for pupils attending state schools in England. The governance for this project addressed the challenges of cross-sectoral linkage between health and educational institutions in England whilst avoiding disclosure during the linkage process. Use of multiple steps at each stage of linkage, and of identifiers recorded over multiple years for each child, were critical to achieving high linkage rates. Preliminary findings indicate that two-thirds of the linked HES records related to at least one admission, excluding the birth admission (to be reported elsewhere). The linkage algorithms used for this project are currently being used to link educational and health records for all pupils in England born in academic years 1995/96 onwards and will be relevant for other studies linking data to HES or NPD (or both).
Linking educational data with hospital and death records creates new possibilities for studying a wide spectrum of policy-relevant questions. For example, the availability of data across the child life course could enable studies into the impact of health on education and education on health. Linked data for all children will be made available for applications for research from government and academia in 2021 [49, 50].
Record-level indicators of the linkage process (i.e. variables indicating the step in our rule-based linkage algorithms at which a pair of records linked) were shared by NHS Digital to enable us to evaluate linkage biases. We used this information to demonstrate the value of later steps in the algorithm for linking pupils from ethnic minorities and deprived areas. However, we did not have information on country of birth, and so could not assess whether linkage rates were lower for children who were born outside England. Future studies should consider sharing information about the completeness or quality of the identifiers to establish whether changes in data entry systems could address missed links in these more vulnerable groups.
System changes in administrative data were both a limitation and an advantage: they improved identifier and linkage quality and added data collections from both services, but they can also introduce variation in linkage error over time. For instance, patients with fewer contacts with health services, or more mobile populations, could have out-of-date residential information in PDS, disproportionately affecting linkage quality. Analysts need to consider this when investigating trends.
A further limitation is that, since no gold-standard dataset defining true match status was available, we could not derive standard measures of linkage quality (sensitivity/recall, false match rate and positive predictive value/precision). Approaches for estimating rates of false matches in further linkage between HES and NPD could be applied, for example by applying the linkage algorithms to a set of ‘negative controls’ (i.e. NPD records for which we are certain there should be no link in HES, or vice versa) and counting how many records were erroneously linked [51, 52]. This would allow estimation of false match rates, but would not identify which records were falsely matched. Existing ‘gold-standard’ data for health records in England for specific subpopulations also have the potential to be used in future evaluations of linkage quality. Future studies could develop representative gold-standard data using known links from UK cohort studies, such as the Millennium Cohort Study or Next Steps, to allow linkage error to be fully measured [54, 55].
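The negative-control approach described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure used by NHS Digital: the record fields, the single deterministic rule in exact_link, and the example data are all invented for the purpose of the example.

```python
from typing import Callable, Iterable, Mapping

Record = Mapping[str, str]


def exact_link(a: Record, b: Record) -> bool:
    """Illustrative deterministic rule: agree on date of birth, sex and postcode."""
    return all(a[key] == b[key] for key in ("dob", "sex", "postcode"))


def negative_control_link_rate(controls: Iterable[Record],
                               targets: Iterable[Record],
                               link: Callable[[Record, Record], bool] = exact_link) -> float:
    """Proportion of negative controls that link to at least one target record.

    Every control is assumed to have no true match among the targets,
    so each link counted here is a false match.
    """
    controls = list(controls)
    targets = list(targets)
    if not controls:
        raise ValueError("at least one negative control record is required")
    erroneous = sum(
        1 for control in controls
        if any(link(control, target) for target in targets)
    )
    return erroneous / len(controls)


# Invented target (hospital-like) records.
hospital_records = [
    {"dob": "1999-11-02", "sex": "F", "postcode": "LS1 4AP"},
    {"dob": "2000-03-15", "sex": "M", "postcode": "M1 1AE"},
    {"dob": "2000-06-30", "sex": "F", "postcode": "B1 1BB"},
]

# Invented negative controls: records known to have no true match in the
# target data. The second shares identifiers with a target purely by chance.
negative_controls = [
    {"dob": "1999-09-12", "sex": "M", "postcode": "NE1 7RU"},
    {"dob": "2000-03-15", "sex": "M", "postcode": "M1 1AE"},
    {"dob": "2000-01-20", "sex": "F", "postcode": "CF10 1EP"},
    {"dob": "2000-08-08", "sex": "M", "postcode": "S1 2HE"},
]

rate = negative_control_link_rate(negative_controls, hospital_records)
```

In this toy example one of the four negative controls links, giving a rate of 0.25. As noted above, such a figure estimates how often false matches occur but does not indicate which of the real links are false.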
We created a de-identified linked database that brings together data from the Department for Education (education and social care) and hospitalisation data for all children – the ECHILD Database. This resource will be made available for approved researchers later in 2021 for purposes that benefit health, wellbeing, education and the provision of health or social care. The ECHILD dataset will enable a step change in the scale and depth of research into the inter-relationships between health, education and social care across the life course, and how services across England vary in their responses.
Our linkage created a de-identified bridging file that combines pseudo-identifiers from education (anonymised Pupil Matching Reference) and HES. This bridging file can be used by the data providers to link to further datasets for approved studies, without the need to link real-world identifiers such as names and postcodes. As the data systems for capturing identifiers change, as is currently happening at NHS Digital, our evaluation of linkage success will need to be repeated and linkage metrics provided to researchers.
Researchers addressing questions relating to ethnic minority or deprived groups need to consider whether to adjust for missing data among these groups due to missed links. Statistical techniques include weighting or imputation, depending on the research objectives.
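One simple form of weighting is inverse probability weighting, in which each linked record is weighted by the reciprocal of the linkage rate in its stratum. The sketch below is a hypothetical illustration, not an analysis from this study: the stratifying variable, the record layout and the data are invented, and the approach assumes that, within each stratum, missed links are unrelated to the outcome of interest.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def linkage_weights(records: Iterable[Mapping],
                    strata_keys: Sequence[str]) -> Dict[Tuple, float]:
    """Inverse of the observed linkage rate in each stratum.

    Each record needs a boolean 'linked' field plus the stratifying fields.
    Strata with no linked records are omitted, because there is nothing to
    up-weight; they need separate handling (for example, coarser strata).
    """
    totals, linked = Counter(), Counter()
    for record in records:
        stratum = tuple(record[key] for key in strata_keys)
        totals[stratum] += 1
        if record["linked"]:
            linked[stratum] += 1
    return {stratum: n / linked[stratum]
            for stratum, n in totals.items() if linked[stratum] > 0}


# Invented cohort: linkage is less complete in group "A" than in group "B".
cohort = [
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": False},
    {"group": "B", "linked": True},
    {"group": "B", "linked": True},
]

weights = linkage_weights(cohort, strata_keys=["group"])

# Applying the weights to linked records restores the full cohort size.
weighted_total = sum(weights[(r["group"],)] for r in cohort if r["linked"])
```

Here linked pupils in group "A" each receive a weight of 4/3 and those in group "B" a weight of 1, so the weighted linked sample has the same composition as the full cohort. Imputation-based approaches may be preferable when missed links depend on variables that are only partially observed.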
We found high linkage rates between administrative education and hospital data for pupils in four cohorts born between academic years 1990/91 and 2004/05 in England. Linkage rates improved over time, but ethnic minorities and pupils living in deprived neighbourhoods were disproportionately affected by linkage error. Evidence from comparing linked and unlinked populations provides measures that can be used to take into account potential biases due to linkage error.
The data underlying this article cannot be shared publicly due to data sharing agreements with NHS Digital and Department for Education.
Conflict of interest statement
Research ethics approval was granted (project ID 232547, REC reference 17/LO/1494) and data sharing agreements are in place with NHS Digital (NIC-27404) and the Department for Education (DR150701.02). The Confidentiality Advisory Group confirmed that this research is exempt from review (reference 15/CAG/0004) because it only uses pseudonymised NHS data.
This research benefits from and contributes to the NIHR Children and Families Policy Research Unit, but was not commissioned by the National Institute for Health Research (NIHR) Policy Research Programme. We are grateful to Gary Connell (Department for Education), Garry Coleman (NHS Digital) and their teams for supporting this work. We thank the ECHILD team: Dr. David Etoori, Dr. Louise Mc Grath-Lone, Matthew Lilliman and Dr. Erin Walker.
This work was supported by ESRC via the Administrative Data Research UK through the Strategic Hub [grant number ES/V000977/1]; the Administrative Data Research Centre for England; the NIHR Great Ormond Street Hospital Biomedical Research Centre and the Health Data Research UK [grant number LOND1], which is funded by the UK Medical Research Council and eight other funders; Wellcome Trust [grant number 212953/Z/18/Z to KH]; and UKRI Innovation Fellowship funded by the Medical Research Council [grant number MR/S003797/1 to RB].
Herbert A, Wijlaars L, Zylbersztejn A, Cromwell D, Hardelid P. Data Resource Profile: Hospital Episode Statistics Admitted Patient Care (HES APC). Int J Epidemiol. 2017;46(4):1093-i. https://doi.org/10.1093/ije/dyx015
Jay MA, Grath-Lone LM, Gilbert R. Data Resource: the National Pupil Database (NPD). International Journal of Population Data Science. 2019;4(1). https://doi.org/10.23889/ijpds.v4i1.1101
Crawford C, Dearden L, Greaves E. The drivers of month-of-birth differences in children’s cognitive and non-cognitive skills. J R Stat Soc Ser A Stat Soc. 2014;177(4):829–60. https://doi.org/10.1111/rssa.12071
Zylbersztejn A, Gilbert R, Hjern A, Wijlaars L, Hardelid P. Child mortality in England compared with Sweden: a birth cohort study. Lancet. 2018;391(10134):2008-18. https://doi.org/10.1016/S0140-6736(18)30670-6
Herbert A, Gilbert R, Gonzalez-Izquierdo A, Li L. Violence, self-harm and drug or alcohol misuse in adolescents admitted to hospitals in England for injury: a retrospective cohort study. BMJ Open. 2015;5(2):e006079. https://doi.org/10.1136/bmjopen-2014-006079
Coathup V, Boyle E, Carson C, Johnson S, Kurinzcuk JJ, Macfarlane A, et al. Gestational age and hospital admissions during childhood: population based, record linkage study in England (TIGAR study). BMJ. 2020;371:m4075. https://doi.org/10.1136/bmj.m4075
Harron K, Gilbert R, Fagg J, Guttmann A, van der Meulen J. Associations between pre-pregnancy psychosocial risk factors and infant outcomes: a population-based cohort study in England. Lancet Public Health. 2021;6(2):e97–e105. https://doi.org/10.1016/S2468-2667(20)30210-3
Downs JM, Ford T, Stewart R, Epstein S, Shetty H, Little R, et al. An approach to linking education, social care and electronic health records for children and young people in South London: a linkage study of child and adolescent mental health service data. BMJ Open. 2019;9(1):e024355. https://doi.org/10.1136/bmjopen-2018-024355
Jones KH, Ford DV, Thompson S. A Profile of the SAIL Databank on the UK Secure Research Platform. International journal of population data science. 2019;4(2). https://doi.org/10.23889/ijpds.v4i2.1134
University of Adelaide: School of Public Health; 2016.
MacKay DF, Smith GCS, Dobbie R, Pell JP. Gestational Age at Delivery and Special Educational Need: Retrospective Cohort Study of 407,503 Schoolchildren. PLOS Medicine. 2010;7(6):e1000289. https://doi.org/10.1371/journal.pmed.1000289
Maret-Ouda J, Tao W, Wahlin K, Lagergren J. Nordic registry-based cohort studies: Possibilities and pitfalls when combining Nordic registry data. Scand J Public Health. 2017;45(17_suppl):14–9. https://doi.org/10.1177/1403494817702336
Stewart CH, Dundas R, Leyland AH. The Scottish school leavers cohort: linkage of education data to routinely collected records for mortality, hospital discharge and offspring birth characteristics. BMJ Open. 2017;7(7). https://doi.org/10.1136/bmjopen-2016-015027
Wood R, Clark D, King A, Mackay D, Pell J. Novel cross-sectoral linkage of routine health and education data at an all-Scotland level: a feasibility study. The Lancet. 2013;382:S10. https://doi.org/10.1016/S0140-6736(13)62435-6
Doidge JC, Harron KL. Reflections on modern methods: linkage error bias. Int J Epidemiol. 2019;48(6):2050–60. https://doi.org/10.1093/ije/dyz203
Gilbert R, Lafferty R, Hagger-Johnson G, Harron K, Zhang L-C, Smith P, et al. GUILD: GUidance for Information about Linking Data sets. J Public Health (Oxf). 2018;40(1):191–8. https://doi.org/10.1093/pubmed/fdx037
ECHILD. The Education and Child Health Insights from Linked Data 2021 [Available from: https://www.ucl.ac.uk/child-health/research/population-policy-and-practice-research-and-teaching-department/cenb-clinical-20.
Societies LCfLaLCiKEa. Institute of education;.
Harron K, Gilbert R, Cromwell D, van der Meulen J. Linking Data for Mothers and Babies in De-Identified Electronic Health Data. PLoS One. 2016;11(10):e0164667. https://doi.org/10.1371/journal.pone.0164667
Zylbersztejn AG, Ruth; Hardelid, Pia. Impact of changes to data collection on a national birth cohort from administrative health records in England. PLoS One. 2020.
NHS Digital. A Guide to Linked Mortality Data from Hospital Episode Statistics and the Office for National Statistics. 2015 [Available from: https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/linked-hes-ons-mortality-data#ons-mortality-datafrom:.
(HSCIC) THaSCIC. Methodology for creation of the HES Patient ID (HESID) 2014 [Available from: https://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/1370/HES-Hospital-Episode-Statistics-Replacement-of-the-HES-patient-ID/pdf/HESID_Methodology.pdf/.
The Stationery Office Limited: House of Commons, Committee of Public Accounts; 2007.
Statistics OfN. Personal Demographics Service data Office for National Statistics2020 [Available from: https://www.ons.gov.uk/census/censustransformationprogramme/administrativedatacensusproject/datasourceoverviews/personaldemographicsservicedata.
Population Health Sciences, Bristol Medical School, University of Bristol: Cohort & Longitudinal Studies Enhancement Resources (CLOSER); 2018.
NHS Digital. Personal Demographics Service fair processing 2020 [Available from: https://digital.nhs.uk/services/demographics/personal-demographics-service-fair-processing#: :text=The%20Personal%20Demographics%20Service%20(PDS,(known%20as%20demographic%20information).
Wiley: Methodological Developments in Data Linkage; 2016.
Harron KL, Doidge JC, Knight HE, Gilbert RE, Goldstein H, Cromwell DA, et al. A guide to evaluating linkage quality for the analysis of linked data. Int J Epidemiol. 2017;46(5):1699–710. https://doi.org/10.1093/ije/dyx177
Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–107. https://doi.org/10.1002/sim.3697
Jay MA, Gilbert R. Special educational needs, social care and health. Arch Dis Child. 2020. https://doi.org/10.1136/archdischild-2019-317985
Economics BCDo. Statistical Software Components;.
Metadata. Office for National Statistics.: The Office for National Statistics (ONS); 2015.
Zylbersztejn A, Gilbert R, Hardelid P. Developing a national birth cohort for child health research using a hospital admissions database in England: The impact of changes to data collection practices. PLOS ONE. 2020;15(12):e0243843. https://doi.org/10.1371/journal.pone.0243843
Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–107. https://doi.org/10.1002/sim.3697
Fleming M, Fitton CA, Steiner MFC, McLay JS, Clark D, King A, et al. Educational and health outcomes of children and adolescents receiving antidepressant medication: Scotland-wide retrospective record linkage cohort study of 766 237 schoolchildren. Int J Epidemiol. 2020;49(4):1380–91. https://doi.org/10.1093/ije/dyaa002
Fleming M, Fitton CA, Steiner MFC, McLay JS, Clark D, King A, et al. Educational and health outcomes of children treated for asthma: Scotland-wide record linkage study of 683 716 children. European Respiratory Journal. 2019;54(3). https://doi.org/10.1183/13993003.02309-2018
Holman CDAJ, Bass AJ, Rouse IL, Hobbs MST. Population-based linkage of health records in Western Australia: development of a health services research linked database. Australian and New Zealand Journal of Public Health. 1999;23(5):453–9. https://doi.org/10.1111/j.1467-842X.1999.tb01297.x
Jones KH, Ford DV, Jones C, Dsilva R, Thompson S, Brooks CJ, et al. A case study of the Secure Anonymous Information Linkage (SAIL) Gateway: A privacy-protecting remote access system for health-related research and evaluation. Journal of Biomedical Informatics. 2014;50:196–204. https://doi.org/10.1016/j.jbi.2014.01.003
Wellcome Trust. Public health research data forum. enabling data linkage to maximise the value of public health research data: full report. 2015 [Available from: https://wellcome.ac.uk/sites/default/files/enabling-data-linkage-to-maximise-value-of-public-health-research-data-phrdf-mar15.pdf.
Hagger-Johnson G, Harron K, Goldstein H, Aldridge R, Gilbert R. Probabilistic linkage to enhance deterministic algorithms and reduce data linkage errors in hospital administrative data. J Innov Health Inform. 2017;24(2):891. https://doi.org/10.14236/jhi.v24i2.891
Roberts E, Doidge JC, Harron KL, Hotopf M, Knight J, White M, et al. National administrative record linkage between specialist community drug and alcohol treatment data (the National Drug Treatment Monitoring System (NDTMS)) and inpatient hospitalisation data (Hospital Episode Statistics (HES)) in England: design, method and evaluation. BMJ Open. 2020;10(11):e043540. https://doi.org/10.1136/bmjopen-2020-043540
University of Oxford: The Migration Observatory; 2014.
Zahoransky D, Polášek I. Text Search of Surnames in Some Slavic and Other Morphologically Rich Languages Using Rule Based Phonetic Algorithms. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2015;23(3):553–63. https://doi.org/10.1109/TASLP.2015.2393393
ADRN Publication UoE. The Administrative Data Research Network: Better Knowledge Better Society; 2016.
Gong J, Wang L, Oard D. Matching person names through name transformation. Proceeding of the 18th ACM Conference on Information and Knowledge Management. 2009:1875–8.
Newcombe HB, Fair ME, Lalonde P. Discriminating powers of partial agreements of names for linking personal records. Part II: The empirical test. Methods Inf Med. 1989;28(2):92–6.
Newcombe HB, Fair ME, Lalonde P. Discriminating powers of partial agreements of names for linking personal records. Part I: The logical basis. Methods Inf Med. 1989;28(2):86–91.
Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence; 2012.
UK A. ECHILD: Linking children’s health and education data for England 2021 [Available from: https://www.adruk.org/our-work/browse-all-projects/echild-linking-childrens-health-and-education-data-for-england-142/.
London UC. Education and Child Health Insights from Linked Data 2021 [Available from: https://www.ucl.ac.uk/child-health/research/population-policy-and-practice-research-and-teaching-department/cenb-clinical-20.
Harron K, Doidge JC, Goldstein H. Assessing data linkage quality in cohort studies. Annals of Human Biology. 2020;47(2):218–26. https://doi.org/10.1080/03014460.2020.1742379
Government Analysis Function & Office for National Statistics; 2020.
Harron K, Wade A, Gilbert R, Muller-Pebody B, Goldstein H. Evaluating bias due to data linkage error in electronic healthcare records. BMC Medical Research Methodology. 2014;14(36). https://doi.org/10.1186/1471-2288-14-36
England: Next Steps: Linked Health Administrative Datasets (Hospital Episode Statistics); 2017.
Hockley C, Quigley M, Hughes G, Calderwood L, Joshi H, Davidson L. Linking Millennium Cohort data to birth registration and hospital episode records. Paediatric and perinatal epidemiology. 2008;22:99–109. https://doi.org/10.1111/j.1365-3016.2007.00902.x
NHS Digital. Hospital Episode Statistics data changes in 2021 2021 [Available from: https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics/hospital-episode-statistics-data-changes-in-2021.
Goldstein H, Harron K, Wade A. The analysis of record-linked data using multiple imputation with data value priors. Stat Med. 2012;31(28):3481–93. https://doi.org/10.1002/sim.5508
This work is licensed under a Creative Commons Attribution 4.0 International License.