Reviews for Thirty-three myths and misconceptions about population data: from data capture and processing to linkage
By Peter Christen and Rainer Schnell
Article as submitted
Article Authors
Submission Date: 24/09/2022
Round 1 Reviews
Reviewer A
Anonymous Reviewer
Completed 03/11/2022
https://doi.org/10.23889/ijpds.v8i1.2115.review.r1.reviewa
The paper presents several improvements to support population data science with reasonable scientific accuracy. While the authors strive to cover the most critical misconceptions about conducting observational studies with large amounts of data, the manuscript also provides groundwork for building a population data science framework, with implications for data gathering, preprocessing and linkage. A deeper discussion of those topics may lead to significant advances in data centre organizations.
The authors may consider including omitted topics related to misconceptions:
- Information systems: at some point, the authors mention that "Many of the misconceptions about population data are about QIDs and how they are captured, processed, and used to link population databases to transform them into a form suitable for a research study[...]". It is essential to address the process of planning, designing, developing, supporting and updating the information systems often responsible for data capture. These systems are built to tackle administrative requirements by modelling a mini-world, implementing minimum viable versions and establishing an organic succession of updates. Including a deeper discussion of the role of information systems in population data science can help researchers (in computer science or statistics) and policy managers prevent related misconceptions.
- Federated linkage: there is an opportunity to include misconceptions exclusively related to the federated integration of large amounts of data.
- Equality and linkage methods: the sense that we can employ the same classification task to link every population present in the data may be a widely held misconception.
- Equality and linkage validation: similar to the previous suggestion, applying a single cutoff point or set of matching weights to disparate populations in data subsets may also be an error.
Some comments on recommendations:
- Some technical and strategic exchange should be encouraged between product owners, developers, database managers and data scientists.
- Adopting FAIR principles or well-known data models may not be enough to address misconception #25.
Despite the importance of the discussion in the paper, the appeal of the title may not be adequate. Additionally, I strongly suggest the authors replace "By demonstrating common misconceptions about population data..." with "By presenting common misconceptions about population data..."
Recommendation: Accept Submission
Reviewer B
Anonymous Reviewer
Completed 03/11/2022
https://doi.org/10.23889/ijpds.v8i1.2115.review.r1.reviewb
Thank you for the opportunity to review this interesting piece of research. Overall, I found the concept of the study a valid and useful one; however, there were parts where the methods became lost in translation and difficult to follow. I feel that the paper would benefit from an overhaul to ensure that the text is clear and the methods transparent.
I will present this review based upon the sections within the reviewers’ guidelines.
Problem Statement, Conceptual Framework, and Research Question
In essence it was clear what the purpose of the study was. The introduction provided a clear background and a case for the research question. I was also able to understand the question posed.
Reference to the Literature and Documentation
The references drawn upon seem both sensible, in context and reasonably up to date. They are correctly referred to within the text and it is clear how this supports the research question.
Relevance
The study seems relevant to the journal and its audience and addresses an important issue within many research studies – that of identifying the correct prevalence and population. It adds to the literature as it considers outcomes not previously studied. There are some generalisability issues, which are discussed in the discussion.
Research Design
The research design is good, however there are places where the text needs clarification as it is hard to follow. The design is appropriate for the question. Biases within the methods are intrinsic to the study question and as such are referred to within the discussion.
Instrumentation, Data Collection, and Quality Control
The methods of data collection are adequately described, however again in places they are hard to follow.
Population and Sample
There are places within the methods where results have got mixed in; for example, unless the inclusion/exclusion criteria are those aged 0 to 112, this is a result, and if it is part of the criteria, then why 112? I feel that this text could be simplified, which would make it easier to follow.
Data Analysis and Statistics
The analysis section should be relabelled to indicate that this is part of the methods. It needs to be made clear that the gold standard is a composite of data from APCDC, EDDC, RBDM and CODURF, i.e. the occurrence of the outcome of interest in any of these datasets. The formulae do not add anything to the explanation; a suitable reference could be used for those wishing to follow the methods in that depth. More information needs to be provided around the confidence intervals: was it applicable to adjust them for clustering by practice site? I have looked at a few of the confidence intervals and have not been able to replicate any – this may well be down to a lack of clarity and explanation rather than incorrect values. I also think that the study would benefit from more descriptive statistics within the text.
Reporting of Statistical Analyses
- Please give a measure of spread in the text for age.
- How many general practices are included?
- Please quote the national estimates – do the figures for this population sit in the confidence intervals around the estimates?
- The details of the selection process should be presented before you talk about the comparative numbers of outcomes in each dataset.
- When there is only male and female there is no need to quote both.
- Can you describe the times in a continuous measure as well as being categorised?
Presentation of Results
Generally the results can be followed, however there are places where the text is hard to follow and I feel more clarity is needed. I have also indicated above places where more information would provide context to the results.
Discussion and Conclusion: Interpretation
The conclusions are clearly stated; however, it is unclear where they are leading. They summarise the study well and show how it supports information already known. However, it leaves me feeling: what next? The implications for current research are demonstrated, but are there any suggestions or directions for improvement?
Title, Authors, and Abstract
The title is clear and informative, whilst being representative of the study. However, it does not indicate that the study is assessing how well the outcomes are matched across two datasets. The authorship is appropriate. The abstract provides a concise and accurate summary of the main paper.
Presentation and Documentation
As mentioned previously there are some parts in the text where potential errors have caused loss of clarity, for example
“However, limited such linkage has occurred to date in Australia.”
Do you mean: “However limited, such linkage has occurred to date in Australia?” This sentence isn’t clear.
“The decision to link datasets therefore requires a good understanding of the value add of bringing them together to answer specific questions.“
Some restructure within the methods section could make the paper easier to follow and the methods more replicable.
Recommendation: Revisions Required
Editor Decision
Marcos Barreto
Decision Date: 15/11/2022
https://doi.org/10.23889/ijpds.v8i1.2115.review.r1.dec
Decision: Request Revisions
Author Response
Prof Peter Christen
Response Date: 01/12/2022
We would like to thank you for taking the time to review our manuscript submitted to the International Journal of Population Data Science, and for providing valuable comments and suggestions which helped us to improve the quality of our manuscript.
In the following, we will address the specific comments made by the reviewers. We show the reviewers' comments in italic font. We number the comments made by the first reviewer from R1.1 to R1.8, and the comments made by the second reviewer from R2.1 to R2.8.
We have furthermore included some additional references to better describe certain points, and have included new examples. We number these A1 and A2, and describe these changes at the end of this response letter.
We have included the page numbers in the revised manuscript where we have addressed each of the reviewers’ comments. We mark all new text in the revised manuscript in blue font (in the version with tracked changes).
We hope to have addressed all comments raised by the reviewers.
Peter Christen and Rainer Schnell,
2 December 2022
Comments by the first reviewer:
R1.1: Information systems: at some point, the authors mention that "Many of the misconceptions about population data are about QIDs and how they are captured, processed, and used to link population databases to transform them into a form suitable for a research study[...]". It is essential to address the process of planning, designing, developing, supporting and updating the information systems often responsible for data capture. These systems are built to tackle administrative requirements by modelling a mini-world, implementing minimum viable versions and establishing an organic succession of updates. Including a deeper discussion on information systems' role in populational data science can help researchers (in computer science or statistics) and policy managers prevent related misconceptions.
We thank the reviewer for this important comment. We have now added the following new text (in bold below) to the first recommendation point on page 14:
- If possible, data scientists and researchers should aim to get involved in the capturing, processing, and linking of any data they plan to use for their research. This involves discussions with database owners about what data to collect in what format, how to ensure high quality of these data, and that adequate metadata are collected. It also means proper planning and designing of the information systems that are required for data capture, processing and linkage, and their adequate support as well as updates over time. It is vital to have the involvement of data scientists and data policy managers in these processes.
R1.2: Federated linkage: there is an opportunity to include misconceptions exclusively related to the federated integration of large amounts of data.
We agree with the reviewer that linking data from multiple (more than two) data sources, which we assume ‘federated linkage’ refers to, will add many more challenges and misconceptions. Instead of adding a new misconception, we have added the following sentence (in bold font below) to the introduction text of Section 3.3 on page 12:
Linking databases is generally based upon comparing the QID values of individuals, such as people’s names, addresses, and other personal details (as illustrated in Fig. 1), to find records that refer to the same person [25,38]. These QID values, however, can contain errors, be missing, and they can change over time. This can lead to incorrect linkage results even when modern linkage methods are employed [49]. Linking databases can therefore be the source of various misconceptions about a linked data set. While all of the following misconceptions can occur when data from two sources are being linked (or even when duplicate records need to be identified within a single database [22]), in situations where records from multiple (more than two) sources have to be linked, any of these misconceptions can become even more challenging and more difficult to deal with.
R1.3: Equality and linkage methods: the sense that we can employ the same classification task to link every population present on data might be a broadly committed misconception.
We thank the reviewer for this suggestion, which seems to be a new misconception (while related to misconception (32), we think it is different enough to warrant its own). We therefore have added the following new misconception:
(33) Linkage techniques and their settings are easily transferable. If a linkage method together with its parameter settings (for example, how blocking is conducted, how values are compared, and how a classification threshold is set) has been successfully deployed in a given linkage project, this does not mean that the same method and settings will provide comparably high linkage quality on a different linkage project. For each linkage project, different methods and corresponding parameter settings will need to be established. Furthermore, the same holds even when linking large disparate population databases, where different optimal parameter settings (such as classification thresholds) will need to be identified for different subpopulations. Finally, repeated linkages over time, for example a yearly update, may also require different parameter settings.
This new misconception will address both this and the following comment.
R1.4: Equality and linkage validation: Similar to the previous suggestion, setting a single cutoff point or matching weights to disparate populations on data subsets might also be an error.
We have addressed this point with the response to the previous comment by the reviewer (second part of the new misconception (33)).
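To make the second part of the new misconception (33) concrete, here is a toy sketch with entirely hypothetical similarity scores (not drawn from any real linkage project or from the manuscript), showing how a single classification threshold tuned on one subpopulation can miss every true match in another:

```python
# Hypothetical record-pair similarity scores for two subpopulations.
# True matches in subpopulation B score lower overall, for example
# because its QID values contain more variations and errors.
pop_a_true_matches = [0.92, 0.95, 0.90]
pop_a_non_matches = [0.60, 0.55, 0.48]
pop_b_true_matches = [0.78, 0.82, 0.75]
pop_b_non_matches = [0.50, 0.58, 0.42]

def classify(scores, threshold):
    """Classify each record pair as a link (True) or non-link (False)."""
    return [score >= threshold for score in scores]

threshold = 0.85  # tuned on subpopulation A

# On subpopulation A the threshold separates matches from non-matches...
print(classify(pop_a_true_matches, threshold))  # [True, True, True]
print(classify(pop_a_non_matches, threshold))   # [False, False, False]

# ...but on subpopulation B it misses every true match (Type II errors).
print(classify(pop_b_true_matches, threshold))  # [False, False, False]
```

In practice, this is why separate thresholds (or matching weights) may need to be estimated and validated per subpopulation rather than reused across a disparate population database.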
R1.5: Some technical and strategic exchange should be encouraged between product owners, developers, database managers and data scientists.
We believe this topic has to some extent already been covered in the second dot point in the recommendation section on page 14. We believe that with ‘product owners’ the reviewer meant the owners of the databases being used, processed and linked (if not, we would appreciate a clarification). However, to more specifically address this comment by the reviewer we have added the following (in bold font below):
- If at all possible, data scientists and IT personnel who are processing and linking population data need to work in close collaboration with the researchers who will conduct the actual analysis of these data. Both technical and strategic aspects of a project that involves population data should ideally be discussed with the analysts, data scientists, database managers, developers, project managers, as well as the owners of the population databases being used, processed, and linked. Forming multi-disciplinary teams with members skilled in data science, statistics, domain expertise, as well as ‘business’ aspects of research [5], will be highly beneficial for successful projects that rely upon population data. Interaction between data and domain experts might mean that a project based on population data becomes an iterative endeavour where data might have to be recaptured, reprocessed, and relinked until they are suitable for a research study.
R1.6: Adopting FAIR principles or well-known data models may not be enough to address misconception #25.
We agree with the reviewer that not all metadata-related issues might be addressed when using the FAIR principles, or guidelines such as GUILD or RECORD. We however believe that making researchers and data scientists working on population data aware of these will help improve overall data quality aspects and help reduce many misconceptions. There will be situations where, for example, privacy and confidentiality will prevent metadata from being provided or shared. Following these principles and guidelines will at least allow data scientists and researchers to highlight to data custodians that metadata would be highly beneficial for their work.
To address this comment by the reviewer we have added the following text (in bold font) to the fifth dot point in the recommendations (Section 3) on page 15:
- Existing guidelines and checklists, such as RECORD [58] and GUILD [59], should be employed and adapted to other research domains. Frameworks such as the Big Data Total Error model [18] can be adapted for population data to better characterise errors in such data. Furthermore, data management principles such as FAIR (Findable, Accessible, Interoperable, Reusable) [60] should be adhered to, although in some situations the sensitive nature of personal data [15] might limit or prevent such principles from being applied. In such situations, at least metadata and any software used in a study should be made public in an open research repository. Following these principles, guidelines, and checklists will allow data scientists and researchers to highlight to data custodians that having access to metadata would be highly beneficial for their work.
R1.7: Despite the importance of the discussion on paper, the appeal of the title may not be adequate.
We have discussed this point, and decided to change our title to: Thirty-three Myths and Misconceptions about Population Data: From Data Capture and Processing to Linkage. To clarify what is meant by myths and misconceptions, we furthermore added the following footnote (in bold font below) on page 2:
Therefore, the kind of problems we consider in this paper are usually underestimated by non-specialists, leading to inflated expectations. Such over-expectations might cause costly mismanagement in areas such as public health or in government decision-making. Furthermore, failing population data projects, such as census operations or health surveillance, might even result in the loss of trust in governments and science by the public [8,11]. In the context of research, myths and misconceptions1 about population data can lead to wrong outcomes of research studies that can result in conclusions with severe negative impact [12,13].
1 According to Merriam-Webster (https://www.merriam-webster.com), a myth is a “popular belief or tradition that has grown up around something or someone”, while a misconception is “a wrong or inaccurate idea or conception”. For brevity, throughout the paper we will only use misconception.
R1.8: Additionally, I strongly suggest the authors replace "By demonstrating common misconceptions about population data..." with "By presenting common misconceptions about population data…"
We have changed the wording of this sentence in the introduction section on page 2 according to the suggestion by the reviewer.
Comments by the second reviewer:
There are two common misconceptions that I feel the paper perpetuates:
R2.1: That linkage is a process that requires “two or more databases” (section 2 text and figure 2), whereas in practice perhaps the most common application of linkage is to records within the same database. To be fair, this is acknowledged much later in misconception (28) “A linked data set contains no duplicates”, but I suggest that this could be elaborated and highlighted. There is a pervasive misconception that population databases are created or maintained as person-level data, whereas the reality is that most population data sets must undergo internal linkage/deduplication before even they alone can be analysed at a person level.
We like to thank the reviewer for this very valuable comment, and we agree with the reviewer. We have added the following clarifications (in bold font) to misconception (5):
(5) Each individual in a population is represented by a single record in a database. A common assumption is that population databases are generated and maintained as person-level databases, with one record per person in a population. However, it is not uncommon for a population database to contain duplicate records referring to the same person due to errors or variations in QID values. While data entry validation may prevent exact duplicates, fuzzy or approximate duplicates [22] might be missed by (automatic) checks. In real-world settings, the same person can therefore be registered multiple times at different institutions, with their duplicate records not being identified as referring to the same individual.
Some duplicates are very difficult to find, for example, women who change their last name and address when they get married, so that only their first name, gender, and place and date of birth stay the same. The flip side is that several people with highly similar personal details (similar QID values), such as twins who only have different first names, might not be recognised as two individuals but instead as duplicates.
Duplicate records are possible even if entity identifiers (such as social security numbers or patient identifiers) are available that should prevent multiple records for the same individual from being added to a population database. Due to human behaviour and errors, such identifiers are not always provided or entered correctly. It might therefore be beneficial to apply deduplication (also known as internal linkage) [22] to a population database to identify any possible duplicate records that refer to the same person, and then handle such duplicates appropriately.
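As a minimal illustration of such deduplication by approximate matching (a hypothetical sketch using only Python's standard library, not the actual methods discussed in the manuscript), record pairs that agree on date of birth can be compared on name similarity:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalised similarity of two name strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Toy records: (record id, name, date of birth)
records = [
    (1, "Kate Smith", "1990-03-12"),
    (2, "Kate Smyth", "1990-03-12"),   # a likely fuzzy duplicate of record 1
    (3, "John Miller", "1985-07-01"),
]

# Compare only record pairs that agree on date of birth (a simple block),
# and flag pairs with highly similar names as potential duplicates.
threshold = 0.85  # an illustrative cut-off, not a recommended setting
potential_duplicates = [
    (r1[0], r2[0])
    for i, r1 in enumerate(records)
    for r2 in records[i + 1:]
    if r1[2] == r2[2] and name_similarity(r1[1], r2[1]) >= threshold
]
print(potential_duplicates)  # [(1, 2)]
```

An exact-match check would miss the ‘Smith’/‘Smyth’ pair, which is precisely why approximate (fuzzy) comparison is needed for internal linkage.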
R2.2: That there is a substantive distinction between unique entity identifiers and quasi-identifiers (QIDs). In my experience as a data linker, unique entity identifiers have never been more than relatively high-quality quasi-identifiers. Again, this is touched upon in (28) but could be elaborated. There is a pervasive misconception that unique entity identifiers such as NHS number, if recorded and valid, are accurate. For example the linkage algorithms employed by NHS Digital all disallow disagreement on a valid NHS number, regardless of any level of agreement on all other available quasi-identifiers. Valid but incorrect unique entity identifiers are more common than many people assume, even with the use of technologies such as checksum digits. Records get mixed up, by accident and by intent.
We thank the reviewer for this important comment based on practical experiences. Note that the issue of duplicate records is also the topic of misconception (5), as we discussed above under comment R2.1.
We have now elaborated on this topic by adding the following text (in bold font) to misconception (28):
(28) A linked data set contains no duplicates. When linking databases, pairs or groups of records that refer to the same individual might not be linked correctly (missed true links, Type II errors). One reason for this to occur is if a wrong entity identifier has been assigned to an individual (by accident or on purpose), as has been reported even in voter registration databases [34]. If a linkage requires agreement on such unique identifier values, then two records with different values in the unique identifier will not be linked even if they have many highly similar (or even agreeing) QID values. Another reason is if crucial QID values of an individual have changed over time, such as both their name and address details, or are missing, resulting in two records that are not similar to each other [22]. Therefore, many linked data sets do contain more than one record for some individuals in a population.
R2.3: “For administrative data, decades of experience have shown that most unusual patterns in large databases are due to data errors [10].” This statement seems vague, strongly worded and not especially well supported by the citation.
In the paper by David Hand [10], on page 562 he writes: “Unfortunately, one of the lessons that we have learnt from data mining practice over the past 20 years is that most of the unusual structures in large data sets arise from data errors, rather than anything of intrinsic interest.”
We might have paraphrased this sentence a bit too strongly, and have therefore rewritten it (on page 2) as:
For administrative data, decades of experience have shown that many unexpected patterns (such as unusual combinations of attribute values) in large databases are due to data errors instead of anything of actual interest [10].
R2.4: “It is less likely for an individual to provide incorrect values on an official government form or when opening a bank account compared to when ordering a book in an online store.” This statement could use referencing, or could be made more flexible. I expect that there are times when this would hold true and other times when the opposite might be true.
We agree with the reviewer on this comment and we have accordingly reworded and clarified this sentence. We also have added a new reference to the book ‘ Obfuscation: A User's Guide for Privacy and Protest’ by Brunton and Nissenbaum (2015). The modified (in bold font) text of the last paragraph in misconception (7) is now:
The decision to provide incorrect personal details, or to withhold them, is dependent upon the context in which this information is being collected. Generally, it is less likely for an individual to provide incorrect values (for non-crucial QIDs) on an official government form or when opening a bank account compared to when ordering a book in an online store. However, the opposite might be true in cases where an individual does not trust the institution that is collecting their data [40].
With: [40] Finn Brunton and Helen Nissenbaum. Obfuscation: A User's Guide for Privacy and Protest. MIT Press, 2015.
R2.5: Consider highlighting that sometimes different versions of the same coding system can be employed simultaneously by different organisations or regions contributing to the same population data set, for example when transition to a new standard does not occur simultaneously among all contributors.
We thank the reviewer for this valuable comment. We have added the following text (in bold font below) to Misconception (10) about coding schemes:
(10) Coding systems do not change over time. Categorical QID values and microdata are often coded using systems such as the International Standard Classification of Occupations (ISCO) or the International Classification of Diseases (ICD), the latter currently in its eleventh revision. It is commonly assumed that such codes are fixed over time and unique in that a certain item, such as an occupation or disease, is only assigned one code, and that this assignment does not change. However, many coding systems are revised over time, with new codes being added, outdated and unused codes being removed, and whole groups of codes being recoded (including codes being swapped). A database might therefore contain codes which are no longer valid. Furthermore, at a given time, different revisions of a coding system might be in simultaneous use in a population database, for example if the transition to a new version of a system is not conducted at the same time by all organisations that contribute to that database.
An example is provided by the codes of the Australian Pharmaceutical Benefits Scheme (PBS) [40], where the antidepressant Venlafaxine had the code N06AE06 until 1995, when it was given the code N06AA22, which was then changed to N06AX16 in 1999. Using such codes to group or categorise records can therefore lead to wrong results of a research study if records have been collected over time.
R2.6: Re (12): Consider mentioning effects of public holidays on data collection, e.g. weekly death registrations.
We thank the reviewer for this comment. While we have already mentioned weekly aspects and public holidays in misconception (12), to clarify we have now added the following example (in bold font below) to this misconception (we keep to the topic of COVID-19 as we discuss similar examples in other misconceptions):
Daily, weekly, monthly, or seasonal aspects can influence data measurements, as can events such as public holidays and religious festivities which likely only affect certain subpopulations. For example, daily reporting of new COVID-19 infections might be limited on weekends and the beginning of a week due to less testing and delayed laboratory diagnosis on weekends. Similar delays will happen during and after public holidays.
R2.7: Re (17): Consider also mentioning that the ordering of first and last names can be culture-dependent.
To clarify this point, we have added the following sentence (in bold below) to misconception (17):
(17) Data values are in their correct attributes. Data entry personnel do not always enter values into the correct attribute. Many Asian and some Western names, for example, can be used interchangeably as first and last names, leading to misinterpretation. For example, ‘Paul’, ‘Thomas’, ‘Chris’, and ‘Dennis’ are all used as first and last names. The ordering of how first and last names are written can also depend on the culture and origin of an individual.
We note that we also discuss names under misconception (9), where we write:
Furthermore, there are many cultural aspects of names, including different name structures, ambiguous transliterations from non-roman into the roman alphabet, or name changes over time for religious reasons, to name a few. Name variations are a known problem when names are compared between records when linking databases [38]. Working with names can therefore be a challenging undertaking that requires expertise in the cultural and ethnic aspects of names [39].
Editor Decision
Marcos Barreto
Decision Date: 09/12/2022
https://doi.org/10.23889/ijpds.v8i1.2115.review.r2.dec
Decision: Article Accepted