By Rainer Schnell and Severin V. Weiand

 

Article as submitted

Article Authors

Submission Date: 24/09/2022


Round 1 Reviews

Reviewer A

Anonymous Reviewer

Completed 23/11/2022


https://doi.org/10.23889/ijpds.v8i1.2122.review.r1.reviewa

This article describes a microsimulation of an ongoing population register including comprehensive testing of different linkage methods and levels of created errors in the data. The simulation is to inform the establishment of a regularly updated national educational data base for research. The results of such a simulation can be a valuable resource for working data linkage centres, who often do not have the opportunity or resources to undertake such work themselves, and generally have to utilise whatever data are available for linkage in datasets established for other purposes.

Implementation of the microsimulation

The authors are to be commended on carrying out a very comprehensive microsimulation.

One aspect that is not totally clear is the process for name generation in aligning the joint distribution of the encrypted identifiers to the distribution of unencrypted identifiers from another database. Does this mean that the frequency of a particular unencrypted identifier was made equivalent to the frequency of the encrypted identifier at the same point in the distribution? How were the frequencies of the additional names calibrated?

A minor point: The authors state that “names are essential for identifying people”. This is generally true and certainly so in this context, but there are datasets for which other combinations of variables can be used successfully for linkage, for example where two event dates are available.

It is an interesting and useful refinement to calibrate the distribution of day and month of birth using hospital data, taking into account that more babies are born on weekdays. Presumably, this trend has also changed slightly over the years.

Errors in dates were created by substituting one integer by another and in strings by one or two of substitutions, insertions, deletions and transpositions. Another common date error that could have been considered is the reversal of two digits. Errors in linkage can also be caused by variants in first names as for ‘Bill’ and ‘William’, which could be an issue for linkage especially if using PPRL methods. (The authors do not state whether they used a register of name equivalencies.)

Missing data is another common issue for linkage accuracy that may need to be considered.

Record linkage methods

The authors may wish to consider the intended audience for the article, as this section is somewhat dense in parts and may be difficult for less technical readers to follow. In particular, the explanation of the multiple matchkey approach is not as clear as the description in the source paper, Randall et al. (2019).

It is surprising that the authors encountered instability when linking 20 million records, as datasets of this size and larger are now routinely linked by a number of DL centres, and here each update would be less than a million records.

How was the threshold of 0.8 chosen for probabilistic linkage? In practice thresholds are often chosen after some examination of grey area linkage results and may depend to a degree on the nature and quality of the linkage data.

What exactly are the thresholds 1 to 19 in Figure 2? Is the order of these meaningful?

Results

The most important result is probably the decay in linkage over time. It suggests that for ongoing registries, efforts to improve quality with each update are worth undertaking. Bias in subpopulations is also a common issue and needs to be estimated and dealt with to inform any subsequent analysis of the data.

It is interesting that place of birth strengthens the linkage significantly. This variable is generally not available in large datasets and would usually need to be specifically collected. It does show that collecting a variable for linkage other than the standard ones can be of value. Address is often available but, of course, this can generally only be used as a confirming or distinguishing item. Sometimes an analytical variable can also be useful for linkage. In this case there could be one or more education variables that could be available.

Discussion and Conclusion

The discussion and conclusion are a good encapsulation of the issues but are possibly the wrong way round?

The discussion makes the critical point, not previously discussed directly, of the need to differentiate between statistical and administrative purposes, when considering the quality of linkage and what is an acceptable level of error.

The authors conclude that to maintain flexibility of linkage methodology into the future, it would be best to store plaintext QIDs. A further advantage of this is that it allows the possibility that peculiarities in the data can be identified over time to improve linkage. In such an ongoing register, knowledge of data problems generally increases with successive updates. It also enables linkage error rates to be estimated by clerical review of samples of the data.

There are also more general issues with using PPRL methods as they require a more complex data supply process with at least one additional third party involved. This can cause practical difficulties with routinely collected datasets, particularly as national registries may have multiple sources of collection. More generally the linkage methods that can be used may be affected by the auspices under which the data are collected.

Recommendation: Revisions Required


Reviewer B

Anonymous Reviewer

Completed 03/11/2022


https://doi.org/10.23889/ijpds.v8i1.2122.review.r1.reviewb

I was very glad to be given the opportunity to review this paper, which presents the results from a microsimulation study of educational attainment register linkage to predict the linkage quality.

I do not have any major objections to the methodology followed. Overall, I think that the paper describes novel and original work in a way that could become a template for similar work.

However, I have the following minor comments, which I feel the authors should address (or maybe clarify) before the camera-ready version of the paper is included in the Journal.

These are as follows:

1. On my copy, the formatting of the figures was all over the place. I think that this is the easiest thing to confirm before the final version is uploaded, but since I observed it on my copy I thought I would mention it.

2. Figure 1 was completely illegible on my printout (yes, I still have to read things in printed form). It is an important figure, so I would recommend that we double check that a high enough resolution image is uploaded for the final version.

3. Throughout the paper the authors make ample use of footnotes. Some of them are short side notes that do not detract from the "argument" that the authors are building up. But other footnotes are so long that if you make the mistake of following the footnote before finishing the paragraph, you forget what you were reading about (e.g. footnotes 1 and 4; in fact 4 is long enough to continue to the next page). What I would recommend is to limit them as much as possible and try to "take them in your stride". That is, if a footnote is important enough to warrant more than a line of text, then let's embed it in the main text flow.

4. Throughout the paper there is consistent use of the term "Population covering...". I think that a more fitting term would be "Population wide...". For example: "Population wide attainment registers are necessary...", "...population wide datasets are rare, and nothing..." and so on.

5. Figures 3, 4, 5. While I understand what these refer to, I notice that precision/recall are presented in a form that is akin to a time series. The reader sees one metric (either precision or recall) decreasing from year to year and wonders: is this the systematic error associated with certain processes on a year-to-year basis, or something else? I think that if the reason behind the divergence is that every year we get a cumulative effect of errors, it should be clarified somewhere in the text. Otherwise, I think that an average figure of the overall process would be enough.

6. As a point of note in the methodology: the authors set up a "process model" for the reasons behind the modification of the records, and each modification accepts a probability of error. The first two lines of the conclusion read "The novel microsimulation approach to studying the generation of QIDs of a register showed that the main problem of a register would be linkage bias caused by insufficient information to identify a person uniquely." Would it be possible to clarify if death, as a life event, was taken into account? When one monitors a population where the identity of a person, as inferred by their records, is uncertain, then "death" (or removing the "true" identifier from the pool) also modulates the "insufficient information to identify a person uniquely". When an identifier is removed from the pool, it increases the probability of other identifiers producing a positive match (>0.8) that is in fact a false positive. It might be interesting to see how much these accuracy figures change by taking into account an age-dependent death rate versus not taking it into account. In developed countries this might not matter much if child mortality is very low, but in developing countries this "death parameter" might contribute to quicker increases of the errors in each subsequent "generation" within the microsimulation.

7. Finally, I feel that the third paragraph in section "2.1 Data Generation" might need some clarification on what the matching of the two distributions implies. It is mentioned that "To get access to a joint distribution of encrypted identifiers, a formal request, including a detailed description of the project to the data guardian of a local educational register was necessary. For name generation, the joint distribution of the encrypted identifiers was aligned to the frequency distribution of an unencrypted commercial database." If the encryption algorithm is so weak as to allow the matching of two distributions, then there is no point in matching them; we can simply request the histogram of names from the data guardian, MAKING SURE that they return names that have more than X (e.g. 100) counts. I hope that it is understood that matching the histograms is an "attack" on the encryption. If the two histograms match, then I know what the encrypted data is, and if we have not excluded the rare names, then we can track those individuals down very easily. I think that it would be worth providing a bit more information on that processing step.


Other than this, I would like to wish you all the best for your paper with IJPDS.

Recommendation: Revisions Required


Editor Decision

Kim McGrail

Decision Date: 29/11/2022


https://doi.org/10.23889/ijpds.v8i1.2122.review.r1.dec

Dear Rainer Schnell, Severin V. Weiand,

Based on reviewer comments on your submission to the International Journal of Population Data Science, "Microsimulation of an educational attainment register to predict future record linkage quality", I am writing to let you know that we are requesting relatively minor revisions.

Please find the reviewer comments below, which focus mainly on details and clarity in your paper. Please address those comments and return to us one clean and one tracked-changes version of your revised manuscript, plus a point-by-point letter of response/rebuttal, by January 31, 2023.

Kind Regards,

Kim McGrail

Deputy Editor

Decision: Request Revisions


Author Response

Prof Peter Christen

Response Date: 01/12/2022

Article as resubmitted


Dear reviewers,

We would like to thank you all for reading our paper and for your helpful advice. We addressed all your concerns in a revised manuscript. Our changes are explained in the following point-by-point response.

Comment R1-C1: One aspect that is not totally clear is the process for name generation in aligning the joint distribution of the encrypted identifiers to the distribution of unencrypted identifiers from another database. Does this mean that the frequency of a particular unencrypted identifier was made equivalent to the frequency of the encrypted identifier at the same point in the distribution? How were the frequencies of the additional names calibrated?

Thank you for these questions. We replaced our description of the alignment with a more detailed explanation:

For name generation, the joint distribution of the encrypted identifiers was aligned to the marginal frequency distributions of an unencrypted commercial database. The alignment process for firstnames was stratified for sex (two strata). Additionally, the processes for firstnames and for surnames were stratified for migration status (two strata: foreign birth or not). The alignment used the frequency distribution of each token of firstnames (up to six tokens) and of each token of surnames (up to four tokens), grouped into percentiles. Within each stratum of the encrypted distribution, the hash code of the token was replaced by a randomly sampled token from the same percentile in the corresponding stratum of the unencrypted distribution. This process was repeated for each token of a full name in the encrypted database. Since the encrypted database contained about 400,000 names for 600,000 records, the repeated sampling from the same percentile resulted in 10 million names for 27 million records.
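To make this alignment step more concrete, a minimal sketch of the percentile replacement for a single stratum and token position follows. It is an illustration only: the column names (`token`, `freq`), the number of bins and the use of pandas are assumptions of this sketch, not our production code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def align_tokens(encrypted: pd.DataFrame, unencrypted: pd.DataFrame,
                 n_bins: int = 100) -> pd.Series:
    """For one stratum: replace each hashed token by a token randomly
    sampled from the same frequency percentile of the unencrypted data."""
    # Rank tokens by frequency and cut the ranks into percentile bins.
    enc_bin = pd.qcut(encrypted["freq"].rank(method="first"), n_bins, labels=False)
    une_bin = pd.qcut(unencrypted["freq"].rank(method="first"), n_bins, labels=False)
    # Pool the unencrypted tokens by percentile bin.
    pools = unencrypted.groupby(une_bin)["token"].apply(list)
    # Sample a replacement token from the matching percentile pool.
    return enc_bin.map(lambda b: rng.choice(pools[b]))
```

Repeating this per stratum (sex, migration status) and per token position yields the full alignment described above.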

Comment R1-C2: A minor point: The authors state that “names are essential for identifying people”. This is generally true and certainly so in this context, but there are datasets for which other combinations of variables can be used successfully for linkage, for example where two event dates are available.

Thank you for this reminder. We modified our statement accordingly:

Names are essential QIDs for identifying people (if no unique identification number or proxies given by reliable and stable attributes are available).

Comment R1-C3: It is an interesting and useful refinement to calibrate the distribution of day and month of birth using hospital data, taking into account that more babies are born on weekdays. Presumably, this trend has also changed slightly over the years.

Thank you for this comment. As shown by Lerchl 2005 (reference 18), there seem to have been no further changes in the day-of-week distribution after 1988 in Germany. Since we model a registry consisting primarily of children born after 1990, we considered this distribution as stable for our purpose.
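For readers who want to reproduce this calibration, a minimal sketch of sampling birth dates with a weekday surplus follows; the weekday shares below are purely hypothetical placeholders, not the values estimated from the hospital data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekday shares (Mon..Sun): more births on weekdays,
# fewer on weekends; the real shares come from hospital data.
p_weekday = np.array([0.152, 0.155, 0.154, 0.153, 0.150, 0.120, 0.116])

# All days of one year and their day of week (1970-01-01 was a Thursday).
days = np.arange("2000-01-01", "2001-01-01", dtype="datetime64[D]")
dow = (days.view("int64") + 3) % 7
# Sample birth dates proportional to the weekday share of their day of week.
weights = p_weekday[dow]
birthdays = rng.choice(days, size=1_000, p=weights / weights.sum())
```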

Comment R1-C4: Errors in dates were created by substituting one integer by another and in strings by one or two of substitutions, insertions, deletions and transpositions. Another common date error that could have been considered is the reversal of two digits. Errors in linkage can also be caused by variants in first names as for ‘Bill’ and ‘William’, which could be an issue for linkage especially if using PPRL methods. (The authors do not state whether they used a register of name equivalencies.)

Thank you for these comments. Transpositions in numerical data are common, but we are not aware of empirical data on transpositions in numerical data within administrative databases. Therefore, we used a uniform error distribution for errors in digits. Due to the very strict regulations on legally permitted firstnames in Germany, the number of different firstnames might be smaller in Germany than in many other countries. Furthermore, at least within administrative databases, the names should agree (except for errors) with the name given in the birth certificate. Finally, nicknames or standard replacements (such as Heini instead of Heinrich) are very rare in Germany and not permitted in administrative databases. Therefore, we didn't use name dictionaries.
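To illustrate the error operations named above, here is a minimal sketch of the string and digit perturbations; error rates, alphabets and the handling of edge cases are simplified assumptions of this sketch.

```python
import random
import string

random.seed(7)

def perturb_string(s: str) -> str:
    """Apply one random edit to a name: substitution, insertion,
    deletion or transposition."""
    i = random.randrange(len(s))
    op = random.choice(["sub", "ins", "del", "trans"])
    if op == "sub":
        return s[:i] + random.choice(string.ascii_lowercase) + s[i + 1:]
    if op == "ins":
        return s[:i] + random.choice(string.ascii_lowercase) + s[i:]
    if op == "del" and len(s) > 1:
        return s[:i] + s[i + 1:]
    if op == "trans" and i < len(s) - 1:
        return s[:i] + s[i + 1] + s[i] + s[i + 2:]
    return s

def perturb_date(date_str: str) -> str:
    """Replace one digit by a uniformly drawn different digit,
    matching the uniform digit error distribution described above."""
    positions = [i for i, c in enumerate(date_str) if c.isdigit()]
    i = random.choice(positions)
    new_digit = random.choice([d for d in string.digits if d != date_str[i]])
    return date_str[:i] + new_digit + date_str[i + 1:]
```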

Comment R1-C5: Missing data is another common issue for linkage accuracy that may need to be considered.

We fully agree with this point. However, one purpose of an educational system is granting certificates to individuals. Therefore, it is hardly possible that an educational register within a jurisdiction without a unique identification number does not include firstnames, lastnames, sex and DOB. Hence, for our simulation of an educational register, we didn't simulate missing data processes. But we agree that a simulation of a census or a medical registry should consider missing identifiers. We added a sentence on missing data at the end of the description of the error generation:

Since we simulated an educational register whose purpose is surveillance of awarded certificates, incomplete data – common in other administrative databases – was not simulated.

Comment R1-C6: The authors may wish to consider the intended audience for the article, as this section is somewhat dense in parts and may be difficult for less technical readers to follow. In particular, the explanation of the multiple matchkey approach is not as clear as the description in the source paper, Randall et al. (2019).

Thank you for this comment. We agree that the paper is densely written, but within the constraints given by a journal, the text should not be extended further. However, we modified the description of the Randall approach in the text:

In this multiple matchkey approach, the matchkeys are combinations of all non-missing QID fields. The QIDs are concatenated and hashed. Since every field combination yields a matchkey, a large number of matchkeys results. This number of matchkeys is reduced by removing matchkeys containing a superset of QIDs if a subset of these QIDs would be accepted as a match [26]. For example, if two matchkeys differ only in their surname usage, they can be reduced to the matchkey that does not contain the surname. The selected matchkeys are based on weights estimated by an EM algorithm. Only matchkeys exceeding a predefined threshold are used.
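As a rough illustration of the matchkey generation step (not the implementation used in the paper), a sketch follows; the QID list, the minimum number of fields and the hash function are assumptions of this sketch, and the reduction and EM-weighting steps are omitted.

```python
import hashlib
from itertools import combinations

# Assumed QID fields for this sketch.
QIDS = ["firstname", "surname", "dob", "sex", "place_of_birth"]

def matchkeys(record: dict, min_fields: int = 3) -> dict:
    """Hash every combination of at least `min_fields` non-missing QIDs."""
    present = [q for q in QIDS if record.get(q)]
    keys = {}
    for r in range(min_fields, len(present) + 1):
        for combo in combinations(present, r):
            concatenated = "|".join(str(record[q]) for q in combo)
            keys[combo] = hashlib.sha256(concatenated.encode()).hexdigest()
    return keys
```

Two records are then candidate matches if they share the hash value of at least one accepted matchkey.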

Comment R1-C7: It is surprising that the authors encountered instability when linking 20 million records, as datasets of this size and larger are now routinely linked by a number of DL centres, and here each update would be less than a million records.

We agree with this point. However, to the best of our knowledge, there is no stable, publicly available implementation of the Fellegi-Sunter model for dataset blocks exceeding 1 million records. Splink reported "out of memory" during the simulations despite 1 TB RAM. After some discussion with the developers, it became clear that Splink requires very strict and tailor-made blocking rules. FastLink within R, which was the environment desired by the client, was unable to handle larger blocks. Due to the time constraint given by the client, we opted for the only stable, publicly available program we could find (recordlinkage in Python). We modified the text accordingly:

There is no stable, proven, publicly available open-source software for RL using QIDs on data sets with more than 10 million records and large blocks.

Comment R1-C8: How was the threshold of 0.8 chosen for probabilistic linkage? In practice thresholds are often chosen after some examination of grey area linkage results and may depend to a degree on the nature and quality of the linkage data.

Thank you for your attention to this detail. The selection of the threshold was based on our experience with linking simulated German census data. We modified our text accordingly:

Based on previous experience in linking simulated German census data, we use a threshold of 0.8 for all probabilistic linkages.

Our contract with German official statistics does not allow us to cite or share the technical report.

Comment R1-C9: What exactly are the thresholds 1 to 19 in Figure 2? Is the order of these meaningful?

Thank you for this important question. We modified our text accordingly:

The multiple matchkey approach considers a matchkey as suitable if the sum of its weights in the Fellegi-Sunter model exceeds a threshold. To select this threshold, we linked the first datasets with thresholds within the observed range from 1 to 19 (see Figure 2).

Furthermore, we extended the description of the plot by adding the following sentence to the caption:

The plot shows the linkage quality (F2) depending on the tested thresholds (integer sums of weights in the Fellegi-Sunter model) of the multiple matchkey approach.
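To clarify how the sweep works, a minimal sketch follows; `evaluate_linkage` is a hypothetical stand-in for linking the first datasets at a given threshold and returning precision and recall.

```python
def f2(precision: float, recall: float) -> float:
    """F2 score: weights recall twice as heavily as precision (beta = 2)."""
    return 5 * precision * recall / (4 * precision + recall)

# Sweep the observed range of weight sums and keep the best threshold.
results = {t: f2(*evaluate_linkage(threshold=t)) for t in range(1, 20)}
best_threshold = max(results, key=results.get)
```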

Comment R1-C10: The most important result is probably the decay in linkage over time. It suggests that for ongoing registries, efforts to improve quality with each update are worth undertaking. Bias in subpopulations is also a common issue and needs to be estimated and dealt with to inform any subsequent analysis of the data.

Thank you for this observation. We think that linkage bias is already clearly described in the discussion:

However, linkage bias is likely as long as the error rates between subpopulations differ. Furthermore, ethnicity is consistently associated with higher identifier error rates and lower linkage rates [2, 14].

Concerning the improvement of data quality you mentioned, we fully agree. However, we consider this problem as out of scope for this paper since future changes in policies regarding data quality are difficult to simulate. Any discussion of the necessary modifications of the microsimulation would extend the manuscript further.

Comment R1-C11: It is interesting that place of birth strengthens the linkage significantly. This variable is generally not available in large datasets and would usually need to be specifically collected. It does show that collecting a variable for linkage other than the standard ones can be of value. Address is often available but, of course, this can generally only be used as a confirming or distinguishing item. Sometimes an analytical variable can also be useful for linkage. In this case there could be one or more education variables that could be available.

Thank you for this hint. We agree that fine-tuning the record linkage procedure in the educational register might help to improve record linkage quality. However, educational paths are increasingly complex. For example, in Germany specific vocational training gives access to selected university studies. A university degree might be followed by vocational training. Hence, using educational information as additional identifiers would be tricky. Therefore, we didn't simulate the use of educational information for linkage.

Comment R1-C12: The discussion and conclusion are a good encapsulation of the issues but are possibly the wrong way round?

We are not sure how we should understand this question. It could mean that the text of the ”discussion” should be the text for the ”conclusion” section and vice versa. Usually, the ”conclusion” should answer the research question. In our understanding, our ”conclusion” does that. If we misunderstood your statement, please elaborate.

Comment R1-C13: The discussion makes the critical point, not previously discussed directly, of the need to differentiate between statistical and administrative purposes, when considering the quality of linkage and what is an acceptable level of error.

Thank you for this nice statement.

Comment R1-C14: The authors conclude that to maintain flexibility of linkage methodology into the future, it would be best to store plaintext QIDs. A further advantage of this is that it allows the possibility that peculiarities in the data can be identified over time to improve linkage. In such an ongoing register, knowledge of data problems generally increases with successive updates. It also enables linkage error rates to be estimated by clerical review of samples of the data.

Thank you for this very helpful comment! We modified the text accordingly:

In general, plaintext allows clerical review and, in this way, estimation of error rates in subsamples. Furthermore, unexpected encodings or unknown variants of QIDs could be detected and used to improve preprocessing. Finally, plaintext will guarantee that future developments in RL can use existing datasets. In addition, historical linkages could also be updated using future methods. These advantages of plaintext are further arguments against using PPRL methods in official statistics [29].

Comment R1-C15: There are also more general issues with using PPRL methods as they require a more complex data supply process with at least one additional third party involved. This can cause practical difficulties with routinely collected datasets, particularly as national registries may have multiple sources of collection. More generally the linkage methods that can be used may be affected by the auspices under which the data are collected.

We fully agree with these statements. However, to avoid distracting readers, we do not want to extend our discussion of the disadvantages of PPRL within official statistics in this paper.

Comment R2-C1: On my copy, the formatting of the figures was all over the place. I think that this is the easiest thing to confirm before the final version is uploaded, but since I observed it on my copy I thought I would mention it.

We agree with this statement and are also irritated by the layout of the galleys. This is most probably due to the editorial management system used by IJPDS. We will provide vector graphics produced by R (currently, we use high-resolution JPGs). Therefore, the published version of the figures will be more readable.

Comment R2-C2: Figure 1 was completely illegible on my printout (yes, I still have to read things in printed form). It is an important figure, so I would recommend that we double check that a high enough resolution image is uploaded for the final version.

Please see our previous comment.

Comment R2-C3: Throughout the paper the authors make ample use of footnotes. Some of them are short side notes that do not detract from the "argument" that the authors are building up. But other footnotes are so long that if you make the mistake of following the footnote before finishing the paragraph, you forget what you were reading about (e.g. footnotes 1 and 4; in fact 4 is long enough to continue to the next page). What I would recommend is to limit them as much as possible and try to "take them in your stride". That is, if a footnote is important enough to warrant more than a line of text, then let's embed it in the main text flow.

Thank you for this comment. As you suggested, we either deleted a footnote or incorporated it into the main text. The current version reduces the number of footnotes from 16 to 1. The remaining one-line footnote gives a URL.

Comment R2-C4: Throughout the paper there is consistent use of the term "Population covering...". I think that a more fitting term would be "Population wide...". For example: "Population wide attainment registers are necessary...", "...population wide datasets are rare, and nothing..." and so on.

We replaced all six occurrences of "population covering" by "population wide".

Comment R2-C5: Figures 3, 4, 5. While I understand what these refer to, I notice that precision/recall are presented in a form that is akin to a time series. The reader sees one metric (either precision or recall) decreasing from year to year and wonders: is this the systematic error associated with certain processes on a year-to-year basis, or something else? I think that if the reason behind the divergence is that every year we get a cumulative effect of errors, it should be clarified somewhere in the text. Otherwise, I think that an average figure of the overall process would be enough.

Indeed, the figures reflect the process of cumulative errors. In the text, we summarized the cumulative effects already. For example, we noted ”Precision decreases 1. with increasing error rates, 2. with the duration of register activity, 3. if the place of birth is not available and 4. for foreign names.” Using only an average plot will reduce the required space a bit. However, we consider the visual impact of the cumulative effect necessary for our audience. Therefore, we would prefer to leave the plots unchanged.

Comment R2-C6: As a point of note in the methodology: the authors set up a "process model" for the reasons behind the modification of the records, and each modification accepts a probability of error. The first two lines of the conclusion read "The novel microsimulation approach to studying the generation of QIDs of a register showed that the main problem of a register would be linkage bias caused by insufficient information to identify a person uniquely." Would it be possible to clarify if death, as a life event, was taken into account? When one monitors a population where the identity of a person, as inferred by their records, is uncertain, then "death" (or removing the "true" identifier from the pool) also modulates the "insufficient information to identify a person uniquely". When an identifier is removed from the pool, it increases the probability of other identifiers producing a positive match (>0.8) that is in fact a false positive. It might be interesting to see how much these accuracy figures change by taking into account an age-dependent death rate versus not taking it into account. In developed countries this might not matter much if child mortality is very low, but in developing countries this "death parameter" might contribute to quicker increases of the errors in each subsequent "generation" within the microsimulation.

Thank you for this interesting observation. We simulated only 10 years, and the death rate in the relevant age group according to official statistics is quite low (99,126 of 100,000 live births reach the age of 25). Therefore, the results of our simulation will not change by considering age-specific death rates. However, we added a sentence to reflect your important observation:

This might increase the error rate in RL for a long-running register if no information on deaths or out-migration is considered.

Comment R2-C7: Finally, I feel that the third paragraph in section "2.1 Data Generation" might need some clarification on what the matching of the two distributions implies. It is mentioned that "To get access to a joint distribution of encrypted identifiers, a formal request, including a detailed description of the project to the data guardian of a local educational register was necessary. For name generation, the joint distribution of the encrypted identifiers was aligned to the frequency distribution of an unencrypted commercial database." If the encryption algorithm is so weak as to allow the matching of two distributions, then there is no point in matching them; we can simply request the histogram of names from the data guardian, MAKING SURE that they return names that have more than X (e.g. 100) counts. I hope that it is understood that matching the histograms is an "attack" on the encryption. If the two histograms match, then I know what the encrypted data is, and if we have not excluded the rare names, then we can track those individuals down very easily. I think that it would be worth providing a bit more information on that processing step.

Thank you for these observations. We agree that the alignment of the frequencies would allow an easy attack. However, in this case, from the point of view of the data guardian, we are the attacker. The data guardian considered the safeguards provided by the encoding as sufficient for their purpose (not providing cleartext). We extended our description of the frequency match as discussed in our first answer to the first reviewer above (R1-C1). We hope that the matching process is now clearer.

We hope you find our revisions sufficient and complete.

Yours sincerely,
Rainer Schnell & Severin V. Weiand


Editor Decision

Kim McGrail

Decision Date: 09/12/2022

https://doi.org/10.23889/ijpds.v8i1.2122.review.r2.dec

Decision: Article Accepted