Generating synthetic identifiers to support development and evaluation of data linkage methods
Abstract
Introduction
Careful development and evaluation of data linkage methods is limited by researcher access to personal identifiers. One solution is to generate synthetic identifiers, which do not pose equivalent privacy concerns, but can form a 'gold-standard' linkage algorithm training dataset. Such data could help inform choices about appropriate linkage strategies in different settings.
Objectives
We aimed to develop and demonstrate a framework for generating synthetic identifier datasets to support development and evaluation of data linkage methods. We evaluated whether replicating associations between attributes and identifiers improved the utility of the synthetic data for assessing linkage error.
Methods
We determined the steps required to generate synthetic identifiers that replicate the properties of real-world data collection. We then generated synthetic versions of a large UK cohort study (the Avon Longitudinal Study of Parents and Children; ALSPAC), according to the quality and completeness of identifiers recorded over several waves of the cohort. We evaluated the utility of the synthetic identifier data in terms of assessing linkage quality (false matches and missed matches).
Results
Comparing data from two collection points in ALSPAC, we found within-person disagreement in identifiers (differences in recording due to both natural change and non-valid entries) in 18% of surnames and 12% of forenames. Rates of disagreement varied by maternal age and ethnic group. Synthetic data provided accurate estimates of linkage quality metrics compared with the original data (within 0.13-0.55% for missed matches and 0.00-0.04% for false matches). Incorporating associations between identifier errors and maternal age/ethnicity improved synthetic data utility.
Conclusions
We show that replicating dependencies between attribute values (e.g. ethnicity), values of identifiers (e.g. name), identifier disagreements (e.g. missing values, errors or changes over time), and their patterns and distribution structure enables generation of realistic synthetic data that can be used for robust evaluation of linkage methods.
Introduction
Data linkage facilitates the combination of detailed information on individuals captured in disparate data sources, without the need for new data collection. Linkage is increasingly used as an efficient approach, particularly with existing administrative datasets, and has great potential for social good. Access to identifiable information is crucial when linking multiple datasets, as linkage depends on either the availability of unique identifiers (e.g. a social security number) or a set of individually non-unique variables such as name, sex and date of birth, which in combination can identify an individual. This is the case both for linkage using identifiers in their natural form and for privacy-preserving techniques that mask the identifiers in some form. The level of completeness, uniqueness and accuracy of identifiers recorded in administrative data poses a challenge for linkage, particularly when linking across multiple sectors in countries where unique citizen identifiers are unavailable [1]. Careful development and evaluation of linkage methods is therefore required in order to achieve high quality linkage and robust results [2, 3]. However, methodological development has been constrained by confidentiality concerns and legislative restrictions governing access to personal information for research.
In practice, access to identifiers is usually limited to either the data owners or trusted third parties, who may be unwilling or unable to make use of these identifiers for methodological purposes. Conversely, analysts will typically only have access to the de-identified linked data, with limited information about any uncertainty in linkage, or information with which to assess the quality of linkage [4]. This separation limits opportunities for the development of advanced linkage methods and the assessment of computational performance and/or linkage quality, as the researchers cannot access the identifiable data needed for these evaluations [5]. Even when it is possible to access identifiers, lack of a “ground truth” makes it difficult to evaluate different linkage strategies, as there is no gold standard against which results can be compared.
One solution to this problem is to generate synthetic datasets of identifiers that mimic the characteristics of real identifiers (and so can be used for methodological work) but that do not pose any confidentiality issues [6, 7]. Such synthetic datasets of identifiers would include a “ground truth” to enable evaluation of different linkage methods. Synthetic data generators have been developed in the context of providing realistic research datasets, where the aim is to mimic the underlying statistical properties of the original data whilst minimising disclosure risk [8, 9]. A summary of existing synthetic data generation methods is described by Kokosi [9]. Synthetic data that retain the relationships between variables in the original data can provide an accurate representation of the original data and be used for a range of purposes, including evaluation of different methodological approaches [10]. However, these approaches have mainly been developed in the context of ‘attribute’ data, i.e. variables typically used within an analysis (e.g. social or health status, occupation). In the context of developing linkage methods, we are concerned with ‘identifier’ data, i.e. variables used for linkage but not necessarily for analysis (e.g. postcode, name). In some cases, there is overlap between the two: date/year of birth and sex can be both attribute variables and personal identifiers.
In most applications of synthetic data, retaining the relationships between different variables helps to replicate the underlying structure of the data and enables users to test and evaluate different methodological approaches. When generating synthetic identifier data, there is also a need to ensure that the data retain dependencies between variables (for example, name might be associated with date of birth). However, there are a number of reasons why an alternative approach to generating synthetic data is required to address the idiosyncrasy of identifiers. Firstly, identifier variables do not always follow standard statistical distributions. Secondly, identifiers are affected by specific types of recording errors and changes that occur within and between datasets, and over time. We refer to these disagreements as ‘errors’, whilst recognising that in some cases these will be genuine changes (e.g. address change due to migration or surname change following marriage) rather than errors in recording. Such errors are often related to attribute variables (e.g. names may be more often misspelt for particular ethnic groups; address changes are associated with age and changes in socio-economic and potentially health status). There may also be interdependencies between identifier errors, e.g. if name and address change at the same time due to divorce. Accurate replication of identifier errors and their dependencies on attribute variables is important, since these dependencies are directly related to the impact that linkage errors have on analysis [11]. Therefore, these errors and dependencies should be replicated within any synthetic identifier datasets that are used to test linkage methods, so that an assessment of bias resulting from linkage can be conducted [12]. 
Existing datasets generated to facilitate the development and testing of data linkage algorithms have typically not focussed on preserving these dependencies, and have not been evaluated in terms of their utility for testing linkage algorithms. There is therefore a scientific requirement to develop more robust and realistic synthetic identifier datasets [13].
This paper presents a framework for generating synthetic identifier data that could be used by data owners to enable researchers to develop and test linkage methods in different settings. Use of these data could help overcome the limited capacity for linkage methodology development by providing wider access to realistic identifier data, without disclosure risk, and with a ground truth against which linkage quality can be assessed. In Section 1, we describe a motivating scenario and outline how the steps needed to generate synthetic data can be implemented. Importantly, we consider the need to preserve the dependencies between identifier values, identifier errors, and attribute values. In Section 2, we evaluate the use of synthetic data for assessing linkage quality, based on an exemplar of longitudinal linkage within a large UK cohort study (the Avon Longitudinal Study of Parents and Children; ALSPAC).
Section 1: A framework for generating synthetic identifier data
Motivating scenario
Our motivating scenario is one in which we aim to conduct performance comparisons between different linkage approaches, in a secure manner with low ethico-legal barriers and no intrusion into personal privacy. The aim of such performance comparisons is to optimise linkage algorithms that would then be applied to specific, real-world linkage projects. We assume that those commissioning the linkage (e.g. a researcher) will not have access to identifiers and that the linkage will be conducted by a trusted third party or data owner. We examine the utility of synthetic datasets that are sufficiently similar to the real data to support such linkage performance comparisons, whilst not intruding on personal privacy. Such datasets could be useful for the development of linkage methods in two settings: 1) by data owners who have access to identifiers but where there are restrictions around using these identifiers for methodological development rather than business-as-usual linkage; and 2) by researchers or research infrastructure providers (such as Trusted Research Environments) who cannot access identifiers but for whom synthetic data would be useful for understanding the implications of different linkage methods on their outputs.
Types of variables
First, we distinguish between two types of variables: identifiers and attributes.
- i. Identifier variables (e.g. name, NHS number, postcode, sex, date of birth) that are used within linkage but not necessarily the analysis (though some, e.g. sex, are also attribute variables). Some of these variables may be related to the values of other identifiers and/or attribute variables (e.g. values of name might be associated with sex and ethnicity). Presence of errors in one identifier might be related to errors in other identifiers (e.g. if name is mistyped, it might be more likely that date of birth is also recorded with error).
- ii. Attribute variables (e.g. ethnicity) that are used within analysis but not necessarily the linkage. Some of these variables may be associated with patterns in identifier values and/or identifier errors (e.g. ethnicity might be associated with values and also errors in name).
We make this distinction because under our motivating scenario, we are mostly interested in generating identifier variables. However, to ensure that the synthetic data are realistic, we need to consider i) the dependencies between identifier values and attributes, and ii) how errors in identifiers are distributed in relation to attribute variables. We often find that errors in linkage (and by implication, in identifiers) are related to differences in the underlying data quality for particular subgroups or to particular events and circumstances. For example, family name may have a higher probability of being typed incorrectly for individuals from minority ethnic groups compared to a majority ethnic group, given that the family name may be unfamiliar to the operative recording the data, or that there are cultural differences in the length or structural complexity of names. Linkage may be less likely to be successful for individuals following family separation, given the tendency for this to result in changes in both address and names. This can therefore lead to dependencies between errors in identifiers – i.e. a change in name may be more likely for individuals who have also changed address. Evidence from the literature suggests that age, ethnic group, sex, deprivation and measures of health and social status may all be related to the risk of linkage error [14, 15]. In order to generate a dataset that is realistic for testing linkage methods, it is therefore crucial to consider whether identifier errors are likely to be related to attribute values. An example of the possible dependencies between identifier values, attribute values, and identifier errors is presented in Figure 1.
Steps in the process of generating synthetic identifier data
This section outlines five steps that are required to generate a realistic set of synthetic identifiers. In summary, the objective of these steps is to generate a ‘gold-standard’ dataset, i.e. the correct identifiers recorded in the absence of errors or changes over time (Steps 1–3). We then need to customise types and patterns of errors to be introduced to the gold-standard dataset (Step 4). Finally, we create multiple versions of the corrupted data (Step 5). A workflow for this process is presented in Figure 2.
Step 1: Elicit Information
Data linkers or data owners should elicit information on the set of identifiers in each file that are available for linkage, the rates of missingness (percentage of records with missing values for each identifier) and the characteristics of these identifiers (e.g. the range of dates of birth, the percentage of records that have a unique name).
They also need to elicit information about likely rates of errors, types of errors and their patterns of co-occurrence in identifiers, and how these errors are associated with attribute variables. In practice, information on errors may be difficult to obtain, and may need to be based on knowledge about identifier errors (or linkage errors) from other similar data sources or the literature [16]. For the purposes of this paper, we use information on the rates, types and distribution of identifier errors based on analysis of data collected over different waves of the Avon Longitudinal Study of Parents and Children (ALSPAC) birth cohort study (see Section 2 for details on ALSPAC) [17, 18].
Information should be obtained on the number of records in each dataset and the joint distribution of key attribute variables (e.g. age and ethnicity), and whether individuals are likely to be recorded multiple times within a dataset (e.g., as they would in hospital admission records).
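As an illustration of the summary statistics elicited in this step, the sketch below computes missingness and uniqueness rates for a handful of invented records; the field names and values are hypothetical illustrations, not ALSPAC data.

```python
# Sketch of Step 1: summarising identifier completeness and uniqueness.
# Records and field names are invented for illustration.
from collections import Counter

records = [
    {"forename": "Anna", "surname": "Smith", "dob": "1991-05-02"},
    {"forename": "Anna", "surname": None,    "dob": "1991-05-02"},
    {"forename": "Ben",  "surname": "Jones", "dob": None},
    {"forename": "Cara", "surname": "Jones", "dob": "1992-01-14"},
]

def missingness(records, field):
    """Percentage of records with a missing value for `field`."""
    n_missing = sum(1 for r in records if r[field] in (None, ""))
    return 100.0 * n_missing / len(records)

def pct_unique(records, field):
    """Percentage of non-missing values that occur exactly once."""
    values = [r[field] for r in records if r[field] not in (None, "")]
    counts = Counter(values)
    return 100.0 * sum(1 for v in values if counts[v] == 1) / len(values)

print(missingness(records, "surname"))  # 25.0
print(pct_unique(records, "forename"))  # 50.0
```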
Step 2: Generate attribute variables
With access to the gold-standard identifiers and attribute data, we can then use the Synthpop package in R to synthesise attribute variables (such as age or ethnicity) [19]. Synthpop uses a series of conditional models based on the original data to sequentially predict and impute values of each variable in the synthetic data. This process preserves inter-dependency between variables by utilising classification and regression tree models. The output of this step is a ‘gold-standard’ synthetic dataset of attribute variables replicating those found in the original data.
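Synthpop itself is an R package, but the underlying idea of sequential conditional synthesis can be sketched as follows. This illustrative Python sketch samples each variable from its empirical distribution conditional on previously synthesised variables; the age bands and ethnic groups are invented, and Synthpop actually fits classification and regression tree models rather than the simple empirical sampling shown here.

```python
# Illustrative sketch of sequential conditional synthesis (the idea
# behind Synthpop), not Synthpop's API. Data are hypothetical.
import random
from collections import defaultdict

original = [
    ("25-29", "White"), ("25-29", "White"), ("30-34", "White"),
    ("30-34", "Asian"), ("25-29", "Asian"), ("30-34", "White"),
]

random.seed(42)

# Variable 1 (maternal age band): sample from its marginal distribution.
ages = [age for age, _ in original]

# Variable 2 (ethnic group): sample conditional on the synthesised age
# band, preserving the age-ethnicity dependency in the original data.
eth_given_age = defaultdict(list)
for age, eth in original:
    eth_given_age[age].append(eth)

synthetic = []
for _ in range(len(original)):
    age = random.choice(ages)
    eth = random.choice(eth_given_age[age])
    synthetic.append((age, eth))

print(synthetic)
```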
Step 3: Generate identifiers
Two types of identifiers can be generated: those that are dependent on attribute variables, and those that are independent. Independent identifiers, e.g. NHS number (or social security number, etc.) can be generated according to predefined rules. Identifiers that are dependent on attribute variables will be generated according to the attribute values generated in Step 2. For example, date of birth can be generated according to the distribution of age. Table 1 describes how different types of identifiers might be generated, according to whether or not they are dependent on attribute variables.
Table 1: How different types of identifiers can be generated, according to whether they depend on attribute variables.

Identifiers that are independent of attribute variables

Date of birth and sex
Generation process: Given aggregate information on the distribution of these identifiers, or of elements of these identifiers (e.g. year of birth) within the original data, values can be sampled directly from the relevant distributions. In some cases, we might want to reflect dependency between records: for example, there will be a minimum distance between the date of birth of a baby and that of their mother. In other cases, we might want to allow date of birth to depend on attribute variables, such as place of residence.
Example: Date of birth can be generated from the distribution of year of birth, by assuming a uniform distribution over all possible (or eligible) dates within each year. We could also allow for variations according to day of the week or month of the year (e.g. those born on 31st December of any year may have a higher probability of being recorded as born on 1st January of the next year, rather than another random date).

Unique identifiers
Generation process: Values of unique identifiers such as a social security number or NHS number can be randomly generated following defined rules. The assumption that unique identifiers are independent does not hold where another identifier is embedded within the unique identifier (e.g. the Community Health Index number in Scotland, which is derived from date of birth and sex).
Example: NHS number is assigned at birth in England and is unrelated to any other personal information [22]. It comprises ten digits, of which the majority are random numbers and the tenth is a check digit to confirm validity: it can therefore be generated using a simple algorithm. If there are multiple unique identifiers (e.g. NHS number and hospital number), these can be generated independently.

Other identifiers
Generation process: Personal identifiers such as email addresses, telephone numbers, and social media handles can be generated according to rules. In some cases, we might also want to allow these identifiers to depend on attribute variables: e.g. generating random telephone numbers based on the country and area of residence, or generating random email addresses based on names, date and country of birth.
Example: Fake Mail Generator (https://fakedetail.com/fake-mail-generator) allows the generation of random email addresses given real domains. We can also allow these identifiers to depend on other identifiers or attributes: for example, Fake Number (https://fakenumber.org/united-kingdom) can generate random telephone numbers based on the country and area of residence.

Identifiers that are dependent on attribute variables

Names
Generation process: First names may be related to age, ethnicity, sex and geography; surnames may also be related to ethnicity. Frequency look-up tables provide a useful tool for sampling names and mapping them to predictor attribute variables. Names can be directly sampled from such frequency tables, and can be allowed to depend on attribute variables such as sex and ethnicity, where these are available.
Example: The Office for National Statistics (ONS) publishes the rank and count of baby names in England and Wales every year, which can be used as the forename frequency table for the England and Wales population [23]. National Records of Scotland also publishes popular baby forenames by year of birth and gender [24]. Another example is data on forename, gender and ethnicity extracted from the US census and implemented in the R package ‘randomNames’ [25]. Similar frequency tables, including for surnames, are published in many countries [26].

Addresses
Generation process: Addresses may be related to personal social status, income and ethnic background [27]. For example, in 2018, 41% of residents in the London borough of Tower Hamlets were of Asian ethnic background, compared with 5% in the borough of Bromley.
Example: To represent these dependencies in synthetic data, we can start by sampling postcodes from a relevant list. Levels of deprivation can then be assigned to each postcode using the English indices of deprivation (Index of Multiple Deprivation; IMD), and ethnic group distributions can be assigned using ethnic group statistics by geography [28, 29]. Given information on the distribution of ethnic group in the original data, addresses can then be sampled from a frequency table.

Indirect identifiers
Generation process: Other non-traditional identifiers used for linkage might include ‘indirect’ identifiers such as clinical variables or dates [30]. Given sufficient aggregate data on the distributions of these variables, and assumptions about their dependence on attribute variables, these could be generated in a similar way to date of birth and sex (i.e. according to specified distributions).
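As an illustration of the ‘simple algorithm’ mentioned for NHS numbers, the sketch below generates ten-digit numbers whose final digit satisfies a Modulus 11 checksum, which is our understanding of the published NHS number validation rule; this is an illustrative sketch rather than production validation code.

```python
# Sketch: generating synthetic NHS-style numbers with a Modulus 11
# check digit. Assumes the published rule: weight the first nine
# digits by 10..2, take the sum mod 11, and derive the check digit.
import random

def nhs_check_digit(nine_digits):
    """Modulus 11 check digit for 9 digits (None if no valid digit exists)."""
    weights = range(10, 1, -1)  # 10, 9, ..., 2
    total = sum(d * w for d, w in zip(nine_digits, weights))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    return None if check == 10 else check

def synthetic_nhs_number(rng):
    """Generate a random 10-digit string that passes the checksum."""
    while True:
        nine = [rng.randint(0, 9) for _ in range(9)]
        check = nhs_check_digit(nine)
        if check is not None:  # a check digit of 10 is invalid: redraw
            return "".join(map(str, nine)) + str(check)

rng = random.Random(0)
print(synthetic_nhs_number(rng))  # a valid 10-digit synthetic number
```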
Identifiers that are dependent on attribute variables and have high cardinality, such as names, are more challenging to synthesise. There is no existing library that readily generates names and maintains dependencies with other variables. Generation of names should consider the following factors:
- Privacy and Disclosure Risk: ensuring none of the unique forename-surname combinations in the original data appear in the synthesised data
- Uniqueness
- Frequency: common names in the original data should be replaced with common synthetic names. For example, “John” (White, male, common) in the ALSPAC forenames or surnames could be replaced with “Peter” (White, male, common) from the name dictionary/look-up table.
- Sharing of surnames between siblings, between parents and children, and between partners.
It is helpful to consider the frequency or uniqueness of different identifier values, as well as their distribution with respect to attribute variables. For example, a male has a higher probability than a female of having a forename of ‘Patrick’, and the distribution or uniqueness of names may vary according to ethnic group, levels of deprivation and by age (reflecting changing fashions for names).
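The frequency-table sampling described above can be sketched as follows; the names, counts and attribute categories are invented for illustration and are not drawn from the ONS or census tables cited.

```python
# Sketch of sampling forenames from a frequency look-up table keyed
# by sex and ethnic group. Names and counts are hypothetical.
import random

name_freq = {
    ("M", "White"): [("John", 500), ("Peter", 300), ("Patrick", 200)],
    ("F", "White"): [("Mary", 450), ("Susan", 350), ("Patricia", 200)],
    ("M", "Asian"): [("Mohammed", 400), ("Ali", 350), ("Imran", 250)],
}

def sample_forename(sex, ethnicity, rng):
    """Draw a forename with probability proportional to its frequency."""
    names, counts = zip(*name_freq[(sex, ethnicity)])
    return rng.choices(names, weights=counts, k=1)[0]

rng = random.Random(1)
print(sample_forename("M", "White", rng))
```

In a full implementation, the table would also be keyed by year of birth and deprivation, reflecting the changing fashions for names noted above.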
Step 4: Data corruption
Table 2 provides a summary of different types of errors that may be found in real data and can be introduced during the synthetic data generation.
Table 2: Types of errors found in real data that can be introduced during synthetic data generation.

Typographic error (string variables)
Description:
• Occurs during manual typing, e.g. a receptionist types a patient’s information for a general practitioner appointment booking.
• Depending on the keyboard layout, characters may be substituted with neighbouring keyboard characters, e.g. ‘s’ instead of ‘d’.
• New characters or spaces may be accidentally inserted into a field, characters may be omitted from a field, or character positions may be transposed.
• Errors may result from hitting a key twice, letting the eyes move faster than the hand, or misreading [31].
Manifestations:
• Typographical errors are more likely to occur in the middle or towards the end of a word, and in longer words [32, 33].
• Over 80% of typographical errors are single instances of substitution, insertion, deletion or transposition [31].
• The likelihood of substituting neighbouring characters differs according to keyboard layout as well as personal typing habit, e.g. it is more likely that ‘d’ is replaced with ‘s’ than with ‘x’ [34].

Phonetic error (string variables – particularly name)
Description:
• Occurs during dictation, where letters may be substituted with letters that are phonetically the same but orthographically incorrect for the intended word, e.g. when a receptionist records information given by a patient, they may mishear information due to the accent of the patient or the pronunciation of similar words or characters, such as ‘F’ instead of ‘Ph’ [34].
Manifestations:
• Information on phonetic errors can be derived from phonetic algorithms, which apply a range of rules and exceptions to encode words by their pronunciation instead of their spelling. These algorithms have been widely used in applications such as spell checkers and search engines, and have been transformed into look-up tables and rules to group similar-sounding words together [35].
• Soundex is one of the most widely known phonetic algorithms for Anglo-Saxon surname encoding [36]. Extensions to Soundex overcome limitations in recognising different languages and dialects that may have different pronunciations for the same names [37].

Optical Character Recognition error (OCR, any identifier)
Description:
• OCR systems are used to process scanned handwritten documents into electronic versions.
• OCR errors occur when the system fails to distinguish two characters that have similar shapes, such as ‘l’ and ‘1’ or ‘m’ and ‘rn’.
Manifestations:
• Error rates in OCR systems can be high if the scanned documents are poorly handwritten, in bad physical condition or have a complex layout [38].
• Error rates in OCR systems are affected by configuration settings (such as where the threshold is set for manual review).
• Look-up tables are available that provide around 80 pairs of OCR errors where letters, digits, symbols and combinations of these appear similar.

Naming convention inconsistencies
Description:
• Some people have two first names (with or without a hyphen), or middle names that are used as first names or vice versa.
• Double-barrelled surnames may be recorded differently in different datasets (e.g. with or without hyphens) and may include abbreviations (e.g. Saint John as St. John).
• First names and surnames may be swapped.
• Migrant groups might ‘adopt’ localised versions of names.
• Nicknames and diminutives might be provided.
Manifestations:
• Look-up tables of common name variants are available. The software ‘Febrl’ provides around 350 rules and name variants (e.g. ‘Edward’ for ‘Ted’, ‘Edwin’ and ‘Edwards’) [13]. Databases of common English diminutives of formal given names are available on Wiktionary.
• Tables of common surnames with different Romanised representations of the same character are available on Wiktionary (Mandarin Chinese, Cantonese, Hakka, Korean, Vietnamese, Japanese).

Date errors (date of birth or other date identifiers)
Description:
• Format differences between countries or people. In the UK, people usually record dates of birth in Day-Month-Year format, while in the US the Month-Day-Year format is more common, and in China Year-Month-Day.
• Default/generic values. Some systems have a default value for the date of birth, resulting in people with missing date of birth automatically being given a default date.
• Accidental input of ‘today’s date’.

Changes over time (e.g. name, sex, postcode)
Description:
• Postcode changes occur as people move and if addresses are not updated on a system (e.g. postcodes in healthcare data might only be updated when a patient registers with a new general practitioner, which might be some time after an address change).
• Children may have multiple genuine postcodes if they have more than one residence, e.g. mother’s or father’s address.
• Surnames may change following marriage or divorce; recorded sex may change over time.
• Postcodes change over time for the same property to reflect changes in the postal system.
Manifestations:
• In the UK, evidence suggests that 40% of children move home in the first 5 years of life; 5% move 3 or more times within this period [39].

Unique identifier errors
Description:
• Checksums or other validation methods may be used to prevent invalid identifiers from being recorded.
• Intentional use of another person’s identifier may lead to errors.
• Changes to unique identifiers may occur over time, and some identifiers might be reused, resulting in multiple individuals with the same identifier [40].
• Individuals may be issued multiple unique IDs (e.g. a pupil moving from one school to another).
Manifestations:
• Accurate recording of unique identifiers that depend on interactions with services may be related to how different individuals access those services. For example, completeness of NHS number is often lower for young males [41].
Data corruption is split into the following steps:
Firstly, error rates, types and co-occurrence patterns are defined and pre-specified.
Secondly, for each row of synthetic data, a corrupted version is generated. There are several approaches available for this data corruption. One approach is to generate multiple rows of corrupted data capturing all combinations of expected errors and patterns. This method retains all pre-specified error type combinations but could be computationally expensive for large datasets. Alternatively, the Splink synthetic data corruptor adopts a likelihood-based approach to introducing errors, generating multiple rows of corrupted data probabilistically [20]. In Splink’s synthetic data corruptor, a baseline probability is assigned to each type of error, and a multiplier is applied based on attribute variables. For example, following a Zipf distribution, up to 20 rows with varying error types and combinations can be generated for each row of data [21]. This method is less computationally expensive and can introduce some dependency between errors and attribute variables. However, it does not necessarily capture all pre-specified error type combinations and co-occurrence patterns.
The final stage is to draw samples from the corrupted data that satisfy the pre-specified error types, co-occurrence patterns, and error-attribute characteristics.
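A minimal sketch of this kind of probabilistic corruption is shown below. It applies a single error type (neighbouring-key substitution) to surnames, with a baseline probability scaled by an attribute-dependent multiplier; the keyboard map, probabilities and multipliers are invented, and this is not the Splink corruptor’s actual interface.

```python
# Sketch of probabilistic data corruption: baseline error probability
# times an attribute-dependent multiplier. All values are illustrative.
import random

KEY_NEIGHBOURS = {"a": "sq", "s": "adw", "d": "sfe", "e": "wrd"}  # partial QWERTY map

def typo(name, rng):
    """Substitute one character with a neighbouring key, where one is known."""
    chars = list(name.lower())
    positions = [i for i, c in enumerate(chars) if c in KEY_NEIGHBOURS]
    if not positions:
        return name
    i = rng.choice(positions)
    chars[i] = rng.choice(KEY_NEIGHBOURS[chars[i]])
    return "".join(chars)

def corrupt_record(record, rng, base_p=0.1, multipliers=None):
    """Corrupt the surname with probability base_p scaled by an
    (invented) ethnicity-dependent multiplier."""
    multipliers = multipliers or {"White": 1.0, "Asian": 1.8}
    out = dict(record)
    p = base_p * multipliers.get(record["ethnicity"], 1.0)
    if rng.random() < p:
        out["surname"] = typo(record["surname"], rng)
    return out

rng = random.Random(7)
gold = {"surname": "edwards", "ethnicity": "Asian"}
versions = [corrupt_record(gold, rng) for _ in range(5)]
print(versions)
```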
Step 5: Generate linkage files
Since the error selection in Step 4 is probabilistic, we can generate multiple sets of corrupted data files by repeating the step. This gives us several (e.g. 5) different corrupted versions of the same gold-standard file, which represent multiple versions of a ‘linkage’ file. Generating multiple versions of the linkage file is appropriate as it reflects the uncertainty in the process of replicating the original data, in line with the logic of using multiple imputation to model uncertainty.
Section 2: Evaluating synthetic data
Motivating scenario
The following section describes an evaluation of the utility of the data we have generated under our framework. We use an exemplar of data linkage within the ALSPAC birth cohort. In ALSPAC, identifiers for each participant were recorded at multiple time points or data collection waves. For the purposes of evaluating the synthetic data, we used data from a gold-standard list of identifiers held within the ALSPAC administrative database (called ARCADIA, see Appendix Table 1), which contains the ‘live’ best understanding of participants’ current details, and raw records from one data collection wave (the Child Health Database; CHDB), collected when participants were aged 6 years. A unique ALSPAC ID identifies the same individual within ARCADIA and CHDB, but the identifiers collected in each dataset differ. This gives us a gold-standard database that can be used to assess how well synthetic data perform at evaluating different linkage approaches.
We first generate synthetic versions of the identifier data held within ALSPAC, creating a number of ‘linkage files’ to represent ARCADIA and CHDB. Next, we link the synthetic versions of ARCADIA with synthetic versions of CHDB, and derive metrics of linkage quality. Finally, we compare the linkage quality metrics derived from the synthetic data to the metrics derived from the gold-standard ALSPAC data.
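Because the synthetic data include a ground truth (the shared ID), linkage quality metrics can be computed directly; a minimal sketch with invented record-pair IDs:

```python
# Sketch of deriving linkage quality metrics when ground truth is
# available: missed matches (true pairs not linked) and false matches
# (linked pairs that are not true pairs). Pair IDs are illustrative.
true_pairs = {("A1", "C1"), ("A2", "C2"), ("A3", "C3"), ("A4", "C4")}
linked_pairs = {("A1", "C1"), ("A2", "C2"), ("A3", "C9")}

missed = true_pairs - linked_pairs          # true matches the algorithm failed to make
false_matches = linked_pairs - true_pairs   # links made between different individuals

missed_match_rate = len(missed) / len(true_pairs)
false_match_rate = len(false_matches) / len(linked_pairs)
print(missed_match_rate, false_match_rate)  # 0.5 and 1/3
```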
Source data
The Avon Longitudinal Study of Parents and Children (ALSPAC) is a prospective population-based study [17, 18]. Initial recruitment of pregnant women took place in 1990-1992 and the health and development of the children from these pregnancies and their family members have been followed ever since. For this study, we focus on the original parents/carers (Generation 0, G0) and the index children (Generation 1, G1). ALSPAC recruited 14,541 pregnancies by women (G0) who were resident in and around the City of Bristol (South West UK) with expected dates of delivery 1st April 1991 to 31st December 1992. Of these initial pregnancies, there were a total of 14,676 foetuses, resulting in 14,062 live births and 13,988 children who were alive at 1 year of age. The eligible sampling frame was constructed retrospectively using linked recruitment and health service records. Additional offspring that were eligible to enrol in the study have been welcomed through major recruitment drives at the ages of 7 and 18 years; and through opportunistic contacts since the age of 7. A total of 913 additional G1 participants have been enrolled in the study since the age of 7 years with 195 of these joining since the age of 18. This additional enrolment provides a baseline sample of 14,901 G1 participants who were alive at 1 year of age.
Linkage methods
Our aim was to determine whether we could use synthetic data to evaluate the quality of different linkage algorithms. Therefore, we used three different linkage strategies to link data for 13,281 individuals in ARCADIA who also had a record in CHDB. We conducted the linkage based on child’s forename, surname, date of birth and gender, plus mother’s surname, using the following methods, with further details in Appendix 4. Linkage strategies were compared and probabilistic linkage thresholds were chosen to align with the deterministic linkage model, to enable a fair comparison. We estimated false match and missed match rates for each method.
- Deterministic linkage. We classified records as belonging to the same individual if at least 4 of the 5 identifiers matched exactly.
- Probabilistic linkage with similarity scores. We calculated probabilistic match weights for agreement/disagreement using the Fellegi-Sunter approach [42]. To allow for typographical errors in names, we calculated probabilistic match weights using the Jaro-Winkler similarity score [33]. Similarity scores were categorised as little agreement (scores below 0.8), moderate agreement (0.8 to below 1), or full agreement (a score of exactly 1). Linkages were accepted at or above a match weight threshold of 3.
- Probabilistic linkage with similarity scores and term frequency adjustments for forenames and surnames. Building on method 2, we accounted for name frequencies by proportionally adjusting u-probabilities for agreement or disagreement on less common names. Linkages were accepted at or above a match weight threshold of 2.
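For illustration, the weight calculation underlying methods 2 and 3 can be sketched in a few lines. The Python fragment below (the study's own pipeline is not in Python, and the m- and u-probabilities shown are invented for illustration) computes Fellegi-Sunter log-likelihood weights for five identifiers and the three Jaro-Winkler agreement bands used in method 2:

```python
import math

def field_weight(agrees: bool, m: float, u: float) -> float:
    """Fellegi-Sunter log2 likelihood-ratio weight for one identifier:
    log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

def similarity_band(score: float) -> str:
    """Categorise a Jaro-Winkler similarity score as in method 2."""
    if score == 1:
        return "full"
    return "moderate" if score >= 0.8 else "little"

# Hypothetical m/u probabilities for the five identifiers used in the study;
# here every field agrees except surname.
comparisons = [
    ("forename",         True,  0.95, 0.010),
    ("surname",          False, 0.90, 0.005),
    ("date of birth",    True,  0.99, 0.003),
    ("gender",           True,  0.98, 0.500),
    ("mother's surname", True,  0.92, 0.005),
]
total = sum(field_weight(agrees, m, u) for _, agrees, m, u in comparisons)
accepted = total >= 3  # weight threshold used in method 2
```

Despite the surname disagreement, the strong agreement on the remaining fields keeps the summed weight above the acceptance threshold, which is exactly the behaviour that lets probabilistic linkage recover matches that the 4-of-5 deterministic rule also accepts.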
Generating synthetic ALSPAC data
In order to generate realistic synthetic data, we first needed to understand the levels of errors observed in identifiers within ALSPAC. Since we had access to the gold-standard ALSPAC data, we could directly estimate the error rates for each identifier (see Appendix 1, Appendix Tables 2–4).
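Estimating an identifier's error rate amounts to counting, across individuals present in both sources, how often the recorded values disagree. A minimal sketch, with a hypothetical record layout (records keyed by study ID):

```python
def disagreement_rate(gold: dict, wave: dict, field: str) -> float:
    """Proportion of individuals present in both sources whose value for
    `field` differs between the gold-standard list and the wave record."""
    shared = gold.keys() & wave.keys()
    differing = sum(1 for pid in shared if gold[pid][field] != wave[pid][field])
    return differing / len(shared)

# Toy example: one of three shared individuals changed surname between waves.
gold = {1: {"surname": "SMITH"}, 2: {"surname": "JONES"}, 3: {"surname": "PATEL"}}
wave = {1: {"surname": "SMITH"}, 2: {"surname": "JONES-LEE"}, 3: {"surname": "PATEL"}}
rate = disagreement_rate(gold, wave, "surname")
```

The same count, stratified by maternal age band or ethnic group, yields the attribute-dependent rates used in the corruption scenarios below.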
We generated synthetic data to replicate the two ALSPAC datasets described above (ARCADIA and CHDB). Using the synthpop package in R, we generated a ‘gold-standard’ dataset of identifier and attribute variables (apart from forenames and surnames) to replicate ARCADIA [19]. The dataset contained 13,281 records and was generated using sequential regression modelling based on the original ALSPAC data, using date of birth, gender, maternal age category, ethnic group, and quintile of the Index of Multiple Deprivation (Appendix 3, Appendix Tables 5, 6). Given the small number of people with non-white ethnicity, not all combinations of maternal age and ethnicity exist in the original data. We used a rejection sampling mechanism to ensure that the synthesised dataset did not contain combinations of attribute variables that did not appear in the original study [43]. The detailed methodology used to synthesise attributes, identifiers and names is described in Appendices 2 and 3.
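The rejection-sampling step can be illustrated as follows. The generator here is a stand-in (the real synthesis used synthpop's sequential regression models in R), and the attribute combinations are hypothetical; the only point is that draws whose (maternal age, ethnic group) combination never occurs in the original data are discarded and redrawn:

```python
import random

def sample_valid(draw, observed_combos, n, rng):
    """Redraw synthetic records until n of them have an (age band, ethnic
    group) combination that appears in the original data."""
    kept = []
    while len(kept) < n:
        rec = draw(rng)
        if (rec["age_band"], rec["ethnicity"]) in observed_combos:
            kept.append(rec)
    return kept

def draw(rng):
    """Stand-in generator: draws combinations, some of them unobserved."""
    return {"age_band": rng.choice(["<20", "20-29", "30-39", "40+"]),
            "ethnicity": rng.choice(["White", "Black", "Asian", "Other"])}

# Hypothetical set of combinations present in the original study data.
observed = {("<20", "White"), ("20-29", "White"), ("20-29", "Asian"),
            ("30-39", "White"), ("30-39", "Black"), ("40+", "White")}
records = sample_valid(draw, observed, 100, random.Random(0))
```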
Data corruption
Four different data corruption approaches were used to examine how results were affected by differences in the types, co-occurrences and dependencies of errors that were introduced to the synthetic data:
- Error types: We varied whether or not the synthetic data had the same types of errors as original data.
- Error field co-occurrence pattern: We varied whether or not the synthetic data had the same pattern of error co-occurrence at the field level (e.g. 5% of errors co-occur in G1 forename and G1 surname).
- Error type co-occurrence pattern: We varied whether the synthetic data had the same pattern of co-occurrence of errors at both field and type level. For example, 5% of errors co-occurred in G1 forename and G1 surname; of these co-occurring errors, 30% were random name replacements and 70% were forename variant errors combined with random surname replacements.
- Error-attribute variable dependency: We varied whether or not the error rates were dependent on attribute variables (in our case, maternal age and ethnic group).
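A corruption function combining these ingredients (error type, field co-occurrence, and an attribute-dependent rate) might look like the sketch below. The field names, rates, surname pool and co-occurrence rule are all illustrative rather than the study's actual corruption code:

```python
import random

SURNAME_POOL = ["SMITH", "JONES", "TAYLOR", "BROWN", "WILSON"]

def corrupt(record, rng, dependent_rates=None, cooccur_prob=0.0):
    """Corrupt one synthetic record.

    dependent_rates: optional map from (age band, ethnic group) to a per-field
    error rate; falls back to a flat rate when absent (as in scenarios 3/4).
    cooccur_prob: chance that a G1 surname error also hits the G0 surname
    (field co-occurrence)."""
    rec = dict(record)
    key = (rec["age_band"], rec["ethnicity"])
    rate = dependent_rates.get(key, 0.05) if dependent_rates else 0.05
    if rng.random() < rate:
        rec["g1_surname"] = rng.choice(SURNAME_POOL)   # random replacement
        if rng.random() < cooccur_prob:                # co-occurring error
            rec["g0_surname"] = rng.choice(SURNAME_POOL)
    return rec

rng = random.Random(1)
base = {"age_band": "<20", "ethnicity": "Other",
        "g1_surname": "ORIGINAL", "g0_surname": "ORIGINAL"}
corrupted = [corrupt(base, rng, {("<20", "Other"): 0.5}, cooccur_prob=0.5)
             for _ in range(1000)]
```

By construction, a G0 surname error here only ever occurs alongside a G1 surname error, which is how a co-occurrence pattern is imposed.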
Generating linkage files
Under each of the four scenarios below, we created five synthetic datasets to examine the differential impact of error distribution and characteristics on linkage.
- Scenario 1: Error rates were based on known values derived directly from the original data source. We specified the error rate for each identifier. We allowed identifier error rates to vary according to maternal age and ethnic group. Identifier errors were of the same types as in the original (e.g. 95% surname errors were random replacements). We used the same error co-occurrence patterns as the original data.
- Scenario 2: Error rates were assumed to be unknown but were assumed to be dependent on maternal age and ethnic group. Identifier errors were restricted to random replacements. We did not allow errors to co-occur in this scenario.
- Scenario 3: Error rates were assumed to be unknown and were assumed to be independent of attribute characteristics (i.e. constant across maternal age and ethnicity). Identifier errors were of the same types as in the original data but the error co-occurrence pattern was assumed to be unknown.
- Scenario 4: Error rates were assumed to be unknown but were assumed to be independent of attribute characteristics. In this scenario, we assumed that identifier error rates were constant across maternal age and ethnicity. Identifier error types were randomly assigned. We did not allow errors to co-occur in this scenario.
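Since the four scenarios differ only in which properties of the original errors are retained, they can be summarised as a small configuration table. A sketch, with hypothetical flag names:

```python
# Which properties of the original identifier errors each scenario retains.
SCENARIOS = {
    1: {"error_rates": "known",     "attribute_dependent": True,
        "error_types": "original",                 "cooccurrence": "original"},
    2: {"error_rates": "estimated", "attribute_dependent": True,
        "error_types": "random replacement only",  "cooccurrence": "none"},
    3: {"error_rates": "estimated", "attribute_dependent": False,
        "error_types": "original",                 "cooccurrence": "assumed"},
    4: {"error_rates": "estimated", "attribute_dependent": False,
        "error_types": "random",                   "cooccurrence": "none"},
}
```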
Deriving linkage quality metrics
Using the three linkage methods described, we linked the five synthetic gold-standard datasets to each of the corrupted synthetic datasets in the four scenarios. Since we had generated these data ourselves, we knew the true match status of each record pair. We were therefore able to evaluate the quality of each linkage method by deriving the rates of missed matches (true matches that were not made) and false matches (records that were linked to the wrong individual) for each linkage method. Estimates were averaged over the five synthetic datasets. We then compared these results with linkage error rates derived from the original source data.
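With true match status known, both error rates reduce to set operations over record-ID pairs. A sketch, assuming the usual denominators (missed matches as a share of all true matches; false matches as a share of all accepted links), which the text does not spell out:

```python
def linkage_quality(links: set, truth: set) -> dict:
    """links and truth are sets of (gold_id, wave_id) record pairs;
    truth is the known set of true matches."""
    missed = truth - links   # true matches the algorithm failed to make
    false_ = links - truth   # accepted links joining the wrong people
    return {"missed_match_rate": len(missed) / len(truth),
            "false_match_rate": len(false_) / len(links)}

# Toy example: 4 true matches; the algorithm makes 3 links, one of them wrong.
truth = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
links = {(1, "a"), (2, "b"), (3, "x")}
q = linkage_quality(links, truth)   # misses (3,"c") and (4,"d"); (3,"x") is false
```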
Results
Linkage results
There were 13,281 records that linked between the ARCADIA and CHDB datasets based on the gold-standard ALSPAC data. Using deterministic linkage, 12,673 individuals were linked. The number of linked records ranged from 12,920 with probabilistic linkage using similarity scores for comparing names, to 12,962 with probabilistic linkage using term frequency adjustments for comparing names (Table 4). Rates of errors (both missed matches and false matches) were lower using probabilistic compared with deterministic linkage, and lowest with the addition of term frequency adjustment. All results presented for the synthetic linkages were averaged over 5 synthetic datasets.
| Scenario | Error co-occurrence patterns (% of all records) | Attribute | G1 surname^ (% error) | G1 forename* (% error) | G0 surname^ (% error) |
|---|---|---|---|---|---|
| Scenario 1: Known error rates | G0 surname & G1 surname & G1 forename (0.30%); G1 surname & G1 forename (0.43%); G1 surname & G0 surname (1.90%); G1 forename & G0 surname (1.90%) | Maternal age: <20 | 11.1 | 6.6 | 36.7 |
| | | Maternal age: 20-29 | 6.0 | 11.5 | 19.7 |
| | | Maternal age: 30-39 | 4.2 | 14.5 | 12.0 |
| | | Maternal age: 40+ | 5.8 | 15.6 | 13.6 |
| | | Maternal age: missing | 7.5 | 11.5 | 16.5 |
| | | Ethnic group: White | 5.6 | 13.0 | 17.5 |
| | | Ethnic group: Black | 8.8 | 13.6 | 22.4 |
| | | Ethnic group: Asian | 1.9 | 13.3 | 8.6 |
| | | Ethnic group: Other | 12.2 | 14.9 | 18.9 |
| | | Ethnic group: missing | 5.7 | 9.2 | 17.7 |
| Scenario 2: Estimated error rates# | No co-occurring errors | Maternal age: <20 | 7.2 | 13.1 | 4.7 |
| | | Maternal age: 20-29 | 4.6 | 9.4 | 5.7 |
| | | Maternal age: 30-39 | 4.0 | 8.4 | 6.2 |
| | | Maternal age: 40+ | 6.1 | 5.7 | 6.6 |
| | | Maternal age: missing | 6.1 | 7.7 | 5.4 |
| | | Ethnic group: White | 4.4 | 9.2 | 5.9 |
| | | Ethnic group: Black | 7.2 | 14.3 | 9.2 |
| | | Ethnic group: Asian | 4.1 | 6.8 | 7.4 |
| | | Ethnic group: Other | 16.4 | 10.5 | 7.3 |
| | | Ethnic group: missing | 5.0 | 8.5 | 5.2 |
| Scenario 3: Independent error rates | G0 surname & G1 surname (2.30%); G1 forename & G1 surname (1.00%); G1 forename & G0 surname (0.75%) | All records | 5.0 | 10.0 | 15.0 |
| Scenario 4: Independent error rates | No co-occurring errors | All records | 5.0 | 10.0 | 15.0 |
| Dataset | Metric | Deterministic linkage | Probabilistic linkage with similarity scores | Probabilistic linkage with similarity scores and term frequency adjustment |
|---|---|---|---|---|
| Original data | n linked records | 12,673 | 12,920 | 12,962 |
| | Missed match rate* | 4.59% | 2.61% | 2.40% |
| | False match rate* | 0.23% | 0.12% | 0.05% |
| Synthetic data (scenario 1): known error rates, dependent on attributes, original error co-occurrence | n linked records | 12,656 | 12,718 | 12,712 |
| | Missed match rate | 4.72% | 4.26% | 4.29% |
| | False match rate | 0.29% | 0.32% | 0.16% |
| Synthetic data (scenario 2): guessed error rates, dependent on attributes, no error co-occurrence | n linked records | 13,274 | 13,276 | 13,279 |
| | Missed match rate | 0.05% | 0.04% | 0.02% |
| | False match rate | 0.09% | 0.12% | 0.07% |
| Synthetic data (scenario 3): guessed error rates, independent of attributes, assumed pattern of error co-occurrence | n linked records | 12,746 | 12,817 | 12,809 |
| | Missed match rate | 4.04% | 3.50% | 3.56% |
| | False match rate | 0.25% | 0.24% | 0.13% |
| Synthetic data (scenario 4): guessed error rates, independent of attributes, no error co-occurrence | n linked records | 13,266 | 13,277 | 13,279 |
| | Missed match rate | 0.12% | 0.03% | 0.02% |
| | False match rate | 0.10% | 0.10% | 0.04% |
Linkage quality metrics
All of the synthetic datasets broadly replicated the pattern seen in the original data linkage, i.e. rates of false matches were lower than rates of missed matches, and probabilistic linkage with similarity scores and term frequency adjustment had the best performance (Table 4). Scenarios 1 and 3 resulted in linkage error rates comparable to the original linkage, and successfully demonstrated that probabilistic linkage was able to reduce both false matches and missed matches compared with deterministic linkage.
Across the deterministic linkages, scenarios 1 and 3 produced linkage error rates closest to the original, with absolute differences of 0.13–0.55% for missed matches and 0.00–0.04% for false matches. Linkages for scenarios 2 and 4 deviated further from the original, with differences of 4.05–4.12% for missed matches and 0.15–0.16% for false matches.
In scenario 1, linkage error rates were slightly over-estimated in the synthetic data, by 0.55–2.02% for missed matches and 0.04–0.11% for false matches (Table 4). The difference in estimation was similar in scenario 3, at 0.13–1.26% for missed matches and 0.00-0.12% for false matches.
In scenario 2, linkage error rates were under-estimated in the synthetic data, by 2.20–4.12% for missed matches and 0.00–0.16% for false matches (Table 4). Under-estimation was found to a similar extent in scenario 4, by 2.21–4.05% for missed matches and 0.01–0.15% for false matches.
Missed match and false match characteristics
In the original linkage, of the 346 true matches missed by probabilistic linkage with similarity scores, 65.0% agreed on forename, date of birth and gender but disagreed on surname and mother’s surname. These missed matches affected people of different ethnicities and genders similarly, and affected younger mothers more than older mothers. Use of term frequency adjustment further reduced missed matches to 319. These missed matches appeared to correspond to cases in which both the mother and the child changed their surname between data collection waves (rather than being due to typographical errors). The second most common missed match pattern occurred for records with disagreement on forename and mother’s surname, accounting for 13.6% of missed matches in probabilistic linkage with similarity scores and 13.8% with term frequency adjustments. These missed matches appeared to correspond to mothers changing their surnames, and children providing alternative names or derivatives at different data collection waves.
Missed match rates were comparable to the original linkage in scenarios 1 and 3. The disagreement pattern of missed matches were also similar to the original linkage (Appendix Table 8).
Compared with the original linkage, missed matches in scenario 3 had similar distributions of disagreement patterns. With probabilistic linkage with similarity scores, 61% of missed matches disagreed on surname and mother’s surname; with term frequency adjustment, 58% of missed matches did so. In both methods, 19.3% of missed matches disagreed on forename and mother’s surname.
Compared with the original linkage, missed matches in scenario 1 had a lower proportion of disagreements on surname and mother’s surname (42.3% for probabilistic linkage and 40.2% with term frequency adjustments), and a higher proportion of disagreements on forename and mother’s surname (37.9% for probabilistic linkage and 37.8% with term frequency adjustments).
False match rates were low in both the original and synthetic linkages. The higher rate of false matches with deterministic linkage was predominantly explained by the 54% of record pairs that agreed on surname (both mother and child), sex and date of birth, but disagreed on forename. This was followed by the 39% of false matches in pairs that disagreed on date of birth only (Appendix Table 9). The deterministic linkages using synthetic datasets replicated similar patterns and proportions of false matches, with 60.0% disagreeing only on forename and 34.9% disagreeing only on date of birth. As date of birth was recorded with high accuracy in the ALSPAC data, these pairs were (correctly) not accepted as links by the probabilistic strategies, since disagreement on date of birth conferred a large penalty to the match weight.
Scenario 3 produced linkages most similar to the original in terms of missed match rates, false match rates, and the characteristics of missed matches. This demonstrates that replicating error types and co-occurrence patterns (even if the co-occurrence patterns are estimated), without incorporating dependencies between errors and attributes, is sufficient to produce realistic synthetic data. Further incorporating information about dependencies between errors and attributes (with true error rates) and error co-occurrence patterns (scenario 1) did not produce substantially more realistic linkages.
Conversely, retaining the dependency between errors and attributes without incorporating error types and co-occurrence (scenario 2) performed similarly to assuming that identifier error rates were independent of attributes (scenario 4).
Discussion
We provide a generalisable and open-source framework for generating synthetic identifier data that can be used to facilitate development and evaluation of improved linkage methodologies. We show how this framework can be implemented and provide a means of producing corrupted datasets that can be used for linkage development and a complete ‘gold standard’ file that can be used for linkage validation. We generated synthetic ALSPAC identifier datasets, which are freely available for legitimate users on request to the authors: the intention is that these data, with known characteristics, can be used for the development and comparative benchmarking of different linkage approaches.
Our framework builds on previous methodological work aiming to generate synthetic identifier data for use in data linkage [3, 13]. We extended previous work by overcoming the assumption of independence of identifier errors through explicitly incorporating the associations between identifier errors and attribute variables. If accurate information on the joint distribution of identifiers and identifier errors were available, there would be no need to include information on their dependencies with attributes. However, evaluating linkage quality according to attributes such as age, sex and ethnicity is convenient and intuitive, and knowledge of how linkage errors are typically distributed amongst these subgroups can be easily incorporated into synthetic data generators [44]. Our findings comparing linkage quality metrics for synthetic data generated under different scenarios highlight that preserving error types and co-occurrence patterns is vital for generating a dataset that accurately represents real-world data and that can be meaningfully used to evaluate linkage algorithms, and is useful when incorporating the dependencies between identifier errors and attributes is not easily achievable [16]. This framework can be used to test linkages between more than 2 datasets.
The strengths of our study include the use of gold-standard data from a large cohort study to assess the performance of synthetic data for deriving linkage quality metrics. We compared a range of linkage methods and different scenarios under which the synthetic data were generated. It is likely that the errors observed in these data are representative of those occurring in other administrative and research datasets. We acknowledge that the exemplar ALSPAC dataset is predominantly of a White UK population, and recommend that other cultural, geographic and time-point specific alternatives are generated in order to avoid any unintended bias in linkage algorithm development (i.e. to factor in error patterns that exist yet were not observed in the ALSPAC data). However, synthetic data generators such as this one should give users the ability to alter the identifier error rates, types and co-occurrence patterns according to their particular data context. This allows any uncertainty to be explored, by using a range of error rates and patterns to investigate how results may vary. This could be used to help inform choice of linkage strategy: for example, it could tell us that a simple deterministic approach might generate results of sufficiently high quality if identifier error rates are low, whilst a more sophisticated and resource-intensive probabilistic approach might be more suitable in settings where identifier error rates are high. It could also point to possible improvements in algorithms: in our example, all three linkage algorithms failed to identify true matches where there was a disagreement on surname and mother’s surname; better handling of name-specific characteristics, such as double-barrelled surnames, could go some way to mitigating this problem. Synthetic data could also be used to provide a plausible range of linkage error rates that are likely to arise, given different assumptions about the levels of identifier errors.
Under these assumptions, researchers can explore the sensitivity of their linkage approach by assessing the impact of including or excluding certain error-prone identifiers on linkage rates. This is particularly relevant for longitudinal population data, where variation in identifier errors over time is more readily observable, allowing researchers to determine which datasets the original data could best be linked with. Researchers can then use different methods to account for linkage error rates within analysis, e.g. quantitative bias analysis to explore the extent to which results of analyses might be affected by linkage error [45]. Synthetic identifier data would be particularly useful for evaluating the quality of privacy-preserving linkage techniques, where access to identifiers in the clear is not permitted. Currently, access to real data is needed to generate synthetic identifier data. Alternative approaches, such as estimating parameters from existing publications, could provide information sufficient to assess linkage quality to a certain extent (as in scenario 3, where error rates for each variable were educated guesses). However, this approach might be blind to error characteristics, error co-occurrence patterns, and error inter-dependencies that may underlie specific data sources. As these synthetic data would be used to evaluate the validity and utility of the linkages, using mis-specified models, or multiple proposed synthetic models, would create challenges for data governance. Our proposed framework, while requiring greater involvement of the data owners, has the advantage of giving more control to data owners and presents a more pragmatic approach to drive change.
Limitations of our study are that we only had one gold-standard dataset with which to evaluate the performance of the synthetic data, and therefore our testing of dependencies is based on information about a specific population group; further evaluations should be conducted on other datasets with varying proportions of missingness in identifiers. Our name generation mechanism takes advantage of the small sample size and low cardinality of name distributions in ALSPAC (4,000–7,000 distinct forename and surname terms); replication using the same method would require a more diverse name dictionary. The key advantage of generating realistic names with name dictionaries (versus string or number sequences) is the potential to better reflect dimensions of name characteristics that are not easily metricised and may be associated with error distributions by attributes. The current name generation mechanism did not fully preserve name clusters and name-specific characteristics, such as word length, hyphens or number of terms per name [46]. Our framework could be extended in several ways, including by adding additional variable types, error types and error co-occurrence patterns, by allowing the generation of data at the household level, or for multiple generations to capture between-record dependencies. More sophisticated synthetic identifier data might include more nuanced errors (e.g. specifying the most likely letter transpositions based on keyboard strokes, or introducing date-specific errors such as recording today’s date). However, these nuances would only be required if the linkage algorithm being evaluated was tailored towards resolving these specific sorts of errors. A further problem is assessing how accurate the error types and co-occurrence patterns have to be for the generated synthetic data to be considered similar enough to reliably test the proposed linkage methods.
Our study offers an approach to start investigating this question more systematically, by contrasting multiple data corruption scenarios. Further investigation in this direction would allow us to be more confident in our comparisons.
Our framework provides a novel and generalisable mechanism for developing and benchmarking record linkage algorithms, which is protective of public privacy and avoids assumptions that errors in personal identifiers are independent of the other identifiers and attribute data. Our findings show that replicating dependencies between attribute values (e.g. ethnicity), values of identifiers (e.g. name), and errors in identifiers (e.g. missing values, typographical errors or changes over time) and their patterns enables generation of realistic synthetic data that can be used to evaluate different linkage methods.
Conflicts of Interest
The authors declare there is no conflict of interest.
Acknowledgements
Harvey Goldstein had a key role in developing this study, but sadly died prior to publication. We are very grateful to his input to this work. We would also like to thank Haoyuan Zhang for his early input to this work.
We are extremely grateful to all the families who took part in this study, the midwives for their help in recruiting them, and the whole ALSPAC team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists and nurses. We particularly thank Mark Mumme for his time in producing extracts of data for this study. We thank Ruth Gilbert and James Doidge for their input to and feedback on this work.
Ethics
Ethical approval for the ALSPAC cohort study was obtained from the ALSPAC Ethics and Law Committee (a University of Bristol Faculty Ethics Committee) and NHS Local Research Ethics Committee(s). Informed consent for the use of data collected via questionnaires and clinics was obtained from participants following the recommendations of the ALSPAC Ethics and Law Committee at the time. This study was approved through the ALSPAC data access mechanism (ALSPAC Reference: B3002, https://proposals.epi.bristol.ac.uk/?q=node/127384). Please note that the study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/our-data).
Funding
The UK Medical Research Council and Wellcome (Grant ref: 217065/Z/19/Z) and the University of Bristol provide core support for ALSPAC. This publication is the work of the authors and Katie Harron and Andy Boyd will serve as guarantors for the contents of this paper. This research was funded in whole, or in part, by the Wellcome Trust [212953/Z/18/Z]. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Data availability
Synthetic ALSPAC data can be downloaded from the UCL Research Data Repository (doi:10.5522/04/25921408). Original ALSPAC data can be requested via the ALSPAC website.
References
-
Harron K, Dibben C, Boyd J, Hjern A, Azimaee M, Barreto ML, et al. Challenges in administrative data linkage for research. Big Data & Society. 2017 Dec;4(2):205395171774567. 10.1177/2053951717745678
10.1177/2053951717745678 -
Jorm L. Routinely collected data as a strategic resource for research: priorities for methods and workforce. Public Health Res Pr [Internet]. 2015 [cited 2023 Dec 7];25(4). Available from: http://www.phrp.com.au/issues/september-2015-volume-25-issue-4/routinely-collected-data-as-a-strategic-resource-for-research-priorities-for-methods-and-workforce/ 10.17061/phrp2541540
10.17061/phrp2541540 -
Christen P, Vatsalan D. Flexible and extensible generation and corruption of personal data. In: Proceedings of the 22nd ACM international conference on Conference on information & knowledge management - CIKM ’13 [Internet]. San Francisco, California, USA: ACM Press; 2013 [cited 2023 Dec 7]. p. 1165–8. Available from: http://dl.acm.org/citation.cfm?doid=2505515.2507815. 10.1145/2505515.2507815
10.1145/2505515.2507815 -
Kelman CW, Bass AJ, Holman CDJ. Research use of linked health data–a best practice protocol. Aust N Z J Public Health. 2002;26(3):251–5.
-
Harron K, Wade A, Muller-Pebody B, Goldstein H, Gilbert R. Opening the black box of record linkage. J Epidemiol Community Health. 2012 Dec;66(12):1198. 10.1136/jech-2012-201376
10.1136/jech-2012-201376 -
Christen P. Probabilistic Data Generation for Deduplication and Data Linkage. In: Gallagher M, Hogan JP, Maire F, editors. Intelligent Data Engineering and Automated Learning - IDEAL 2005 [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2005 [cited 2023 Dec 7]. p. 109–16. (Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, et al., editors. Lecture Notes in Computer Science; vol. 3578). Available from: http://link.springer.com/10.1007/11508069_15. 10.1007/11508069_15
10.1007/11508069_15 -
Ferrante A, Boyd J. A transparent and transportable methodology for evaluating Data Linkage software. J Biomed Inform. 2012 Feb;45(1):165–72. 10.1016/j.jbi.2011.10.006
10.1016/j.jbi.2011.10.006 -
Nowok B, Raab GM, Dibben C. Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R 1. Statistical Journal of the IAOS. 2017 Jan 1;33(3):785–96. 10.3233/SJI-150153
10.3233/SJI-150153 -
Kokosi T, De Stavola B, Mitra R, Frayling L, Doherty A, Dove I, et al. An overview on synthetic administrative data for research. IJPDS [Internet]. 2022 May 23 [cited 2022 Jul 7];7(1). Available from: https://ijpds.org/article/view/1727. 10.23889/ijpds.v7i1.1727
10.23889/ijpds.v7i1.1727 -
Raghunathan TE. Annual Review of Statistics and Its Application Synthetic Data. Annual Review of Statistics and Its Application. 2021;8(1):129–40. 10.1146/annurev-statistics-040720-031848
10.1146/annurev-statistics-040720-031848 -
Doidge JC, Harron KL. Reflections on modern methods: linkage error bias. International Journal of Epidemiology. 2019 Dec 1;48(6):2050–60. 10.1093/ije/dyz203
10.1093/ije/dyz203 -
Harron KL, Doidge JC, Knight HE, Gilbert RE, Goldstein H, Cromwell DA, et al. A guide to evaluating linkage quality for the analysis of linked data. International Journal of Epidemiology. 2017 Oct 1;46(5):1699–710. 10.1093/ije/dyx177
10.1093/ije/dyx177 -
Christen P, Pudjijono A. Accurate Synthetic Generation of Realistic Personal Information. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho TB, editors. Advances in Knowledge Discovery and Data Mining. Berlin, Heidelberg: Springer; 2009. p. 507–14. (Lecture Notes in Computer Science). 10.1007/978-3-642-01307-2_47
10.1007/978-3-642-01307-2_47 -
Bohensky M. Chapter 4: Bias in data linkage studies. In: In: Harron K, Dibben C, Goldstein H, editors Methodological Developments in Data Linkage [Internet]. John Wiley & Sons, Ltd; 2015 [cited 2023 Dec 7]. p. 63–82. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119072454.ch4. 10.1002/9781119072454.ch4
10.1002/9781119072454.ch4 -
Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, et al. Data Linkage: A powerful research tool with potential problems. BMC Health Services Research. 2010 Dec 22;10(1):346. 10.1186/1472-6963-10-346
10.1186/1472-6963-10-346 -
Harron K, Hagger-Johnson G, Gilbert R, Goldstein H. Utilising identifier error variation in linkage of large administrative data sources. BMC Medical Research Methodology. 2017 Feb 7;17(1):23. 10.1186/s12874-017-0306-8
10.1186/s12874-017-0306-8 -
Fraser A, Macdonald-Wallis C, Tilling K, Boyd A, Golding J, Davey Smith G, et al. Cohort Profile: the Avon Longitudinal Study of Parents and Children: ALSPAC mothers cohort. Int J Epidemiol. 2013 Feb;42(1):97–110. 10.1093/ije/dys066
10.1093/ije/dys066 -
Boyd A, Golding J, Macleod J, Lawlor DA, Fraser A, Henderson J, et al. Cohort Profile: the ’children of the 90s’–the index offspring of the Avon Longitudinal Study of Parents and Children. Int J Epidemiol. 2013;42(1):111–27. 10.1093/ije/dys064
10.1093/ije/dys064 -
Nowok B, Raab GM, Dibben C. synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software. 2016 Oct 28;74:1–26. 10.18637/jss.v074.i11
10.18637/jss.v074.i11 -
Linacre R, Lindsay S, Manassis T, Slade Z, Hepworth T. Splink: Free software for probabilistic record linkage at scale. International Journal of Population Data Science [Internet]. 2022 Aug 25 [cited 2023 Jun 5];7(3). Available from: https://ijpds.org/article/view/1794. 10.23889/ijpds.v7i3.1794
10.23889/ijpds.v7i3.1794 -
Piantadosi ST. Zipf’s word frequency law in natural language: A critical review and future directions. Psychon Bull Rev. 2014 Oct 1;21(5):1112–30. 10.3758/s13423-014-0585-6
10.3758/s13423-014-0585-6 -
CLOSER-resource-NHS-Numbers-and-their-management-systems.pdf [Internet]. [cited 2024 Jan 3]. Available from: https://www.closer.ac.uk/wp-content/uploads/CLOSER-resource-NHS-Numbers-and-their-management-systems.pdf.
-
Office for National Statistics. Baby names in England and Wales statistical bulletins. [cited 2024 Jan 3]. Office for National Statistics. Available from: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/bulletins/babynamesenglandandwales/previousReleases.
-
National Records of Scotland. National Records of Scotland. National Records of Scotland; 2013 [cited 2024 Jan 3]. National Records of Scotland. Available from: https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/names/babies-first-names/
-
Betebenner DW. randomNames: Function for Generating Random Names and a Dataset [Internet]. 2021. Available from: https://cran.r-project.org/package=randomNames.
McElduff F, Mateos P, Wade A, Borja MC. What’s in a name? The frequency and geographic distributions of UK surnames. Significance. 2008;5(4):189–92. 10.1111/j.1740-9713.2008.00332.x
Danesh J, Gault S, Semmence J, Appleby P, Peto R. Postcodes as useful markers of social class: population based study in 26 000 British households. BMJ. 1999 Mar 27;318(7187):843–5. 10.1136/bmj.318.7187.843
Ministry of Housing, Communities & Local Government. English indices of deprivation [Internet]. GOV.UK; 2019 [cited 2024 Jan 3]. Available from: https://www.gov.uk/government/collections/english-indices-of-deprivation
Office for National Statistics. Ethnic Groups by Borough - London Datastore [Internet]. [cited 2024 Jan 3]. Available from: https://data.london.gov.uk/dataset/ethnic-groups-borough
Harron K, Gilbert R, Cromwell D, van der Meulen J. Linking Data for Mothers and Babies in De-Identified Electronic Health Data. Gebhardt S, editor. PLoS ONE. 2016 Oct 20;11(10):e0164667. 10.1371/journal.pone.0164667
Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM. 1964 Mar 1;7(3):171–6. 10.1145/363958.363994
Pollock JJ, Zamora A. Automatic spelling correction in scientific and scholarly text. Commun ACM. 1984 Apr;27(4):358–68. 10.1145/358027.358048
Herzog TN, Scheuren FJ, Winkler WE. Data Quality and Record Linkage Techniques [Internet]. New York, NY: Springer; 2007 [cited 2024 Jan 3]. Available from: http://link.springer.com/10.1007/0-387-69505-2. 10.1007/0-387-69505-2
Kukich K. Techniques for automatically correcting words in text. ACM Comput Surv. 1992 Dec;24(4):377–439. 10.1145/146370.146380
Black PE. Dictionary of Algorithms and Data Structures. NIST [Internet]. 1998 Oct 1 [cited 2024 Jan 3]. Available from: https://www.nist.gov/publications/dictionary-algorithms-and-data-structures.
Odell MK. The profit in records management. Systems (New York). 1956;20(20).
Holmes D, McCabe MC. Improving precision and recall for Soundex retrieval. In: Proceedings International Conference on Information Technology: Coding and Computing [Internet]. Las Vegas, NV, USA: IEEE Comput. Soc; 2002 [cited 2024 Jan 3]. p. 22–6. Available from: http://ieeexplore.ieee.org/document/1000354/. 10.1109/ITCC.2002.1000354
Cheriet M, Kharma N, Liu CL, Suen C. Character Recognition Systems: A Guide for Students and Practitioners. John Wiley & Sons; 2007.
Gambaro L, Joshi H. Moving home in the early years: what happens to children in the UK? Longitudinal and Life Course Studies. 2016 Jul 18;7(3):265–87. 10.14301/llcs.v7i3.375
Ludvigsson JF, Otterblad-Olausson P, Pettersson BU, Ekbom A. The Swedish personal identity number: possibilities and pitfalls in healthcare and medical research. Eur J Epidemiol. 2009 Nov 1;24(11):659–67. 10.1007/s10654-009-9350-y
Aldridge RW, Shaji K, Hayward AC, Abubakar I. Accuracy of Probabilistic Linkage Using the Enhanced Matching System for Public Health and Epidemiological Studies. PLOS ONE. 2015 Aug 24;10(8):e0136179. 10.1371/journal.pone.0136179
Fellegi IP, Sunter AB. A Theory for Record Linkage. Journal of the American Statistical Association. 1969;64(328):1183–210.
Eckhardt R. Stan Ulam, John von Neumann and the Monte Carlo Method. Los Alamos Science. 1987;100(15):131.
Harron K, Doidge JC, Goldstein H. Assessing data linkage quality in cohort studies. Annals of Human Biology. 2020 Feb 17;47(2):218–26. 10.1080/03014460.2020.1742379
Doidge JC, Morris JK, Harron KL, Stevens S, Gilbert R. Prevalence of Down’s Syndrome in England, 1998–2013: Comparison of linked surveillance data and electronic health records. International Journal of Population Data Science [Internet]. 2020 Mar 19 [cited 2023 Jun 11];5(1). Available from: https://ijpds.org/article/view/1157. 10.23889/ijpds.v5i1.1157
Nanayakkara C, Christen P, Ranbaduge T. An Anonymiser Tool for Sensitive Graph Data. In: CIKM (Workshops); 2020.