A Privacy Attack on Multiple Dynamic Matchkey based PrivacyPreserving Record Linkage
Main Article Content
Abstract
Introduction
Over the last decade, the demand for linking records about people across databases has increased in various domains. Privacy challenges associated with linking sensitive information led to the development of privacypreserving record linkage techniques. The multiple dynamic matchkey encoding approach recently proposed by Randall et al. (IJPDS, 2019) is such a technique aimed at providing sufficient privacy for linkage applications while obtaining high linkage quality. However, the use of this encoding in large databases can reveal frequency information that can allow the reidentification of encoded values.
Objectives
We propose a frequencybased attack to evaluate the privacy guarantees of multiple dynamic matchkey encoding. We then present two improvements to this matchkey encoding approach to prevent such a privacy attack.
Methods
The proposed attack analyses the frequency distributions of individual matchkeys in order to identify the attributes used for each matchkey, where we assume the adversary has access to a plaintext database with similar characteristics as the encoded database. We employ a set of statistical correlation tests to compare the frequency distributions of matchkey values between the encoded and plaintext databases. Once the attribute combinations used for matchkeys are discovered, we then reidentify encoded sensitive values by utilising a frequency alignment method. Next, we propose two modifications to the matchkey encoding; one to alter the original frequency distributions and another to make the frequency distributions uniform. Both will help to prevent frequencybased attacks.
Results
We evaluate our privacy attack using two large realworld databases. The results show that in certain situations the attack can successfully reidentify a set of sensitive values encoded using the multiple dynamic matchkey encoding approach. On the databases used in our experiments, the attack is able to reidentify plaintext values with a precision and recall of both up to 98%. Furthermore, we show that our proposed improvements are able to make this attack harder to perform with only a small reduction in linkage quality.
Conclusions
Our proposed privacy attack demonstrates the weaknesses of multiple matchkey encoding that should be taken into consideration when linking databases that contain sensitive personal information. Our proposed modifications ensure that the multiple dynamic matchkey encoding approach can be used securely while retaining high linkage quality.
Introduction
Privacypreserving record linkage (PPRL) is the process of linking records that belong to the same individual across different databases while preserving the privacy of the individuals that are represented by the records in these databases [1]. Existing PPRL approaches have a tradeoff between linkage quality, scalability, and privacy [1], with some being vulnerable to privacy attacks [2,3,4].
The multiple dynamic matchkey approach [5] is a recently proposed method for PPRL which aims to provide privacy protection against frequency attacks, while achieving high linkage quality. As we describe next, the main idea of this approach is to generate distinct hashcodes (called matchkey values) using attribute value combinations (called matchkeys) and compare these matchkey values across databases to identify matching pairs of records.
Multiple dynamic matchkey approach
We now briefly describe the main steps of multiple dynamic matchkey encoding as proposed by Randall et al. [5]. Common notations used throughout the paper are shown in Table 1. We assume a sensitive database, D, with n records, where each record has k attributes. A matchkey is a combination of attributes from D. Given there are k attributes in D, there can be 2^{k}  1 attribute combinations. A list of matchkeys mk_{e} is first selected and then matchkey values are created for each record r ∈ D. These matchkey values are hashcodes generated by hashing the concatenated values of each matchkey mk_{e} ∈ mk_{e}, as illustrated in Figure 1. An encoded database (consisting of hashcodes) generated from D is denoted as H, a matrix of hashcodes. Essentially the matrix H is generated with n rows and m columns where n = D and m = mk_{e}. A row in H corresponds to a record r ∈ D and each row contains a list of m hashcodes. A column in H corresponds to a certain matchkey mk_{e} ∈ mk_{e}, where a given column contains the hashcodes generated for all r ∈ D. Hence a cell H_{ij} represents a hashcode, h_{ij}, generated for the i^{th} record using the j^{th} matchkey, where 1 ≤ i ≤ n and 1 ≤ j ≤ m.
Notation  Definition 

D  Sensitive database 
V  Plaintext database 
r  A record in a database 
k  Number of attribute values in a database 
mk_{e}, mk_{p}  List of encoded matchkeys, and list of plaintext matchkeys 
H, H_{*j}  Encoded matrix of hashcodes, and a column in H which represents a mk_{e} ∈ mk_{e} 
M, M_{*j}  Plaintext matchkey matrix, and a column in M which represents a mk_{p} ∈ mk_{p} 
C, C_{i*}, C_{*j}  Correlation matrix, a row in C (mk_{p} ∈ mk_{p}), and a column in C (correlation metric) 
f  A frequency distribution 
i, j  ith row and jth column in a matrix 
The list mk_{e} is determined using the probabilistic record linkage method proposed by Fellegi and Sunter [6], which utilises the error characteristics of the databases that are being linked. As described in detail by Randall et al. [5], for each potential matchkey a weight score w is calculated using the method proposed by Fellegi and Sunter [6]. For the multiple dynamic matchkey approach, these weight scores w are then compared against a userdefined weight threshold, w_{t}. Only those matchkeys that have a weight score of at least w_{t} are selected for mk_{e} to generate matchkey values. The authors of the multiple matchkey approach also recommended selecting matchkeys with at least two attributes [5].
Due to the large number of matchkeys that are possible for k attributes (2^{k}  (k+1) in total), the authors have employed a superset pruning approach to reduce the number of matchkeys that are being selected in order to reduce the computational costs [5]. For instance, if the matchkey FirstName, LastName has a total score of at least w_{t} then all the supersets of this matchkey are pruned. This is because having a superset such as FirstName, LastName, Gender will not result in any additional matches compared to the matches already identified by the matchkey FirstName, LastName.
Once a list of matchkeys mk_{e} is selected, matchkey values are generated for each matchkey using the corresponding attribute values of each record. To generate matchkey values, a keyed cryptographic hash function, such as the hashed message authentication code (HMAC) [7] that uses a hash function and a secret key (known only to the owners of the databases that are being encoded for linkage) can be used. If one of the attribute values used in a matchkey is missing, then that matchkey value is left empty. Given the list of selected matchkeys mk_{e}, each record in the database D will have up to mk_{e} matchkey values assigned to it (less if the record has missing values in one or more of the attributes used in mk_{e}). We provide an example of this generation of matchkey values in Figure 1. The matchkey values for each r ∈ D are stored as lists (i.e. with an order), and only matchkey values at the same position in these lists are compared between records. This will reduce the computational complexity of the comparison since one matchkey value of a record will only be compared with one matchkey value of another record.
In a linkage protocol, the owners of the sensitive databases to be linked first generate such a list of matchkey values for each record in their database. Assuming two databases, D_{A} and D_{B}, these databases are encoded into two matrices of matchkey values (hashcodes), H_{A} and H_{B}, respectively. In the comparison and classification step, assumed to be conducted by a third party known as the linkage unit [1], the corresponding matchkey values (hashcodes in the same column in H_{A} and H_{B}) are then compared to classify each candidate record pair either as a match or nonmatch. Following Randall et al. [5], a record pair is classified as a match (assumed to refer to the same individual) if one or more of its matchkey values are the same. If none of the corresponding matchkey values of a record pair are the same, then the record pair will be classified as a nonmatch (assumed to refer to different individuals).
Contribution
In this paper, we analyse the privacy guarantees of this multiple dynamic matchkey approach using a privacy attack that aims to reidentify the plaintext values that were encoded into hashcodes, using a known (publicly available) plaintext database. The attack tries to identify the attributes used for matchkeys by analysing the frequency distributions of individual matchkey values. Using two large realworld databases, we show that the attack can reidentify plaintext values with an acceptable level of accuracy. We then propose two improvements as countermeasures for multiple matchkey encoding and discuss how these methods can strengthen the privacy guarantees of the matchkey encoding approach and prevent this type of privacy attacks.
Methods
We define the following terms which we use to describe the matchkey approach and the proposed privacy attack (as also summarised in Table 1). Assume a sensitive database D is encoded using the multiple dynamic matchkey approach, which results in a matrix H of hashcodes, as shown in Figure 1. Each column in H contains hashcodes generated using a unique matchkey, mk_{e} ∈ mk_{e}. We refer to these columns as encoded matchkeys. As with existing attacks on PPRL [2,3,4], we assume the adversary has access to a plaintext database V with a frequency distribution of attribute values similar to the encoded database D. Such plaintext databases can be sourced externally (e.g. a telephone directory or voter database) or internally (if the attack is conducted by an insider who has access to some type of plaintext database that is similar to the encoded database). We then generate a list of matchkeys, mk_{p}, from the plaintext database V to compare against the encoded matchkeys. We refer to these matchkeys as plaintext matchkeys. The resulting plaintext matchkey matrix is denoted as M where each column is a plaintext matchkey mk_{p} ∈ mk_{p}. We assume that the plaintext database V contains some or all attributes that are used for the encoded matchkeys in D. We also assume different levels of knowledge of the adversary, as we discuss in the Results section.
As we describe in detail below, the proposed privacy attack consists of five main steps:
 Calculate the frequency distributions of encoded matchkeys (f_{j}^{e}) and plaintext matchkeys (f_{j}^{p}).
 Calculate the correlations between the frequency distributions of each encoded matchkey H_{*j} ∈ H with all plaintext matchkeys M_{*j} ∈ M, and for each H_{*j} build a correlation matrix, C.
 Select candidate plaintext matchkeys for each encoded matchkey H_{*j} by comparing correlation values in C.
 Assign highly correlated plaintext matchkeys mk_{p} to each encoded matchkey mk_{e}.
 Reidentify plaintext values using frequency alignment of plaintext matchkeys.
The first two steps of the attack are illustrated in Figure 2. We discuss these two steps in detail in the following two sections.
Step 1: Calculating frequency distributions
In the first step of the attack we calculate the frequency distributions of matchkey values, f_{j}^{e}, for each encoded matchkey (i.e. column H_{*j} ∈ H) where 1 ≤ j ≤ mk_{e}. Since the matchkey values of records are stored as lists (i.e. with an order), we are able to calculate frequency distributions of matchkey values for each encoded matchkey.
Next, we generate different matchkeys of the plaintext database V. Assuming there are k attributes in V (including some or all attributes used for the encoded matchkeys) we generate all matchkeys from length two to k, assuming matchkeys have only been generated from combinations of two or more attributes [5]. For the resulting list of plaintext matchkeys, mk_{p}, for each record r ∈ V, we generate a list of plaintext matchkey values using all mk_{p} ∈ mk_{p}. This results in a matrix of plaintext matchkey values, M, where a column M_{*j} ∈ M represents a plaintext matchkey mk_{p} ∈ mk_{p}. Then, for each column M_{*j}, we calculate the frequency distribution, f_{j}^{p}, of concatenated attribute values in mk_{p} from M.
For instance, if there are three attributes FirstName, LastName, and BirthYear in the plaintext matchkey database M, we can generate four matchkeys M_{*j} = (FirstName+LastName), (FirstName+BirthYear), (LastName+BirthYear), (FirstName+LastName+BirthYear). Here + is used to represent the string concatenation. We then calculate frequency distributions for all four matchkeys. At the end of the first step the adversary has the frequency distributions of all encoded matchkeys from H (f_{j}^{e}) and all plaintext matchkeys from M (f_{j}^{p}).
Step 2: Measuring the correlation of frequency distributions
In the second step, we use a series of statistical tests to compare the frequency distributions of all plaintext matchkeys, f_{j}^{p}, with one given encoded matchkey, f_{j}^{e}. For each encoded matchkey, H_{*j}, we build a correlation matrix, C which contains all correlations calculated between each plaintext matchkey, M_{*j}, and that encoded matchkey, H_{*j} (each pair of f_{j}^{e} and f_{j}^{p}), using different correlation metrics. Then by comparing these correlation values, we obtain the most similar plaintext matchkeys for the current encoded matchkey, H_{*j}. In the following we discuss this process in more detail.
The statistical tests we use can be divided into two categories. The first consists of basic statistics which can be calculated for a distribution, while the second consists of correlation tests which can be used to compare two distributions. As for the basic statistics we calculate the mean, standarddeviation, variance, and skewness of two frequency distributions and calculate the absolute difference of those values for different pairs of f_{j}^{e} and f_{j}^{p}. Then we employ the following correlation metrics to compare the two distributions f_{j}^{e} and f_{j}^{p}:
 Earth mover's distance (EMD) [8]: The EMD is used to measure the dissimilarity between two probability distributions over some feature space. Given one distribution as the baseline, EMD calculates the least amount of work needed to reach the second distribution.
 KolmogorovSmirnov (KS) test [9]: The KS test is used to measure the equality of two continuous probability distributions to inspect if one distribution is a sample of another distribution.
 Pearson's correlation coefficient [10]: Given two continuous variables x and y, Pearson's correlation coefficient measures the correlation between these two variables. The relationship calculation is based on the method of covariance.
 Spearman's rank correlation [10]: Spearman's Rank correlation measures the ranked correlation between two variables. Specifically, it calculates the strength and direction of association between two ranked variables.
 Relative entropy (KullbackLeibler divergence) [11]: Relative entropy is used to measure how one probability distribution differs from another reference probability distribution. We use relative entropy to measure how much one frequency distribution is different from the second distribution.
 Histogram intersection [12]: Histogram intersection measures the similarity of two probability distributions by calculating the intersection of their histograms. For this calculation the two histograms of the probability distributions should be of the same length or both these distributions can be placed into a same number of bins using an appropriate binning technique such as equal depth binning. Given two histograms, x and y, with b bins in each, the intersection of the two histograms can be calculated as: \(I(x,y)=\sum_{(i=1)}^b min(x_i,y_i),\) where min() outputs the minimum of x_{i} and y_{i}. To normalise the intersection value, we can divide I(x,y) by the average of the two histograms: \(I(x,y)_{norm}=\frac{2I(x,y)}{\sum_{(i=1)}^b x_i+y_i }.\) In our attack, we use histogram intersection to calculate the intersection of the frequencies of distribution pairs f_{j}^{e} and f_{j}^{p}. The higher the value of I(x,y)_{norm}, the more similar the two distributions are.
These six metrics are commonly used to compare the correlations between probability or frequency distributions. Given we are interested in measuring the correlations between two frequency distributions, we aim to identify correlated matchkeys by employing these metrics. Furthermore, our evaluation using these metrics will help us to determine the usability of these correlation metrics in such an attack scenario.
The correlation values calculated using the above statistical tests will be added to a single vector, c, of ten values (i.e. four values from the basic statistical metric differences and six values from the above discussed correlation metrics) for each pair of frequency distributions f_{j}^{e} and f_{j}^{p}. For each encoded matchkey, H_{*j}, and its corresponding frequency distribution, f_{j}^{e}, we obtain a set of correlation vectors by comparing f_{j}^{e} with the frequency distributions f_{j}^{p} of all possible plaintext matchkeys in M. This results in a correlation matrix C_{j} (for that particular encoded matchkey H_{*j}) where C_{i*} (row i) represents a correlation vector c calculated for a certain plaintext matchkey and C_{*j} (column j) represents a vector of correlations for a certain correlation metric (i.e. one of the above ten tests). In practice it is important to understand which correlation metrics are robust when calculating correlations with different distributions. This aspect is outside the scope of this paper and we considered the performance of all correlation metrics to be equal in our context. In the following steps we describe how we can use the obtained correlation values to filter plaintext match keys that are unlikely to match to encoded match keys.
Frequency based filtering: Given there are k attributes in V, potentially there are 2^{k}  (k+1) attribute combinations that we need to analyse. Hence, we propose a filtering step to reduce this number of candidate plaintext matchkeys by comparing their frequencies. We identify not possible plaintext matchkeys through their top most frequent plaintext values. For instance, an encoded matchkey with the highest frequency of 50 will not be compared with a plaintext matchkey that has a highest frequency of 150. This is assuming that the two databases have a similar number of records and frequency distributions in their attribute values. We define a frequency range parameter, ε, for the range of frequency values that can be used to filter plaintext matchkeys. An encoded matchkey with highest frequency of max(f_{j}^{e}), will only be compared with a plaintext matchkey with highest frequency of max(f_{j}^{p}) if (max(f_{j}^{e})ε) ≤ max(f_{j}^{p}) ≤ (max(f_{j}^{e})+ε). Here max() represents the top frequency (highest count) of that particular frequency distribution. The value for ε needs to be adjusted based on the encoded and plaintext database sizes, D and V.
Step 3: Filtering candidate plaintext matchkeys
In the third step of the attack we first normalise the correlation values in the matrix C for each statistical test using minmax normalisation. Here correlation values are normalised in such a way that the highest correlation will become 1.0 and the lowest correlation will become 0.0. However, the distance values (such as Earth mover's distance and Relative entropy) are normalised in such a way that the highest value will become 0.0 and the lowest value will become 1.0. This ensures that all values obtained using different statistical tests are in the same range and comparable.
Next, for each correlation metric, C_{*j}, we sort the plaintext matchkeys mk_{p} in descending order based on their normalised correlation values. We then select the top plaintext matchkeys based on this sorted list. For each pair of plaintext matchkeys, mk_{x} and mk_{x+1}, we calculate the difference ratio, α_{d}, of their normalised correlation values. We do this to filter the candidate plaintext matchkeys that have the highest correlation values with small differences between them. Only plaintext matchkeys that have a difference ratio below a user defined difference ratio limit of α will be considered. Assuming the correlation values of mk_{x} and mk_{x+1} are s_{x} and s_{x+1} respectively, the difference ratio between those two values, α_{d}, is calculated as:
If α_{d} ≤ α, then we select both matchkeys mk_{x} and mk_{x+1}. If α_{d} > α then we stop the selection of candidate matchkeys at that point. This process is performed for all statistical tests, resulting in a filtered matrix of plaintext matchkeys for each encoded matchkey. Table 2 shows an example matrix C obtained after the filtering process for the encoded matchkey (LastName+Gender+BirthYear). Plaintext matchkeys with crossedout correlation values are not selected for that particular correlation metric due to having a too low correlation value, as discussed in the above filtering step.
Plaintext matchkey  Mean  Skewness  …  EMD  Entropy  KStest 

(FirstName+LastName+Gender) 
0.985 
0.541 
…  0.967  0.901  0.389 
(FirstName+LastName+BirthYear) 
0.662 
0.877  …  0.989 
0.578 
0.998 
(LastName+Gender+BirthYear) 
0.988  0.921  …  0.969  0.974  0.986 
At the end of this step, we obtain one filtered correlation matrix C_{j} per encoded matchkey, H_{*j}. This C_{j} will contain plaintext matchkeys, mk_{p}, that have the highest correlation values based on the most similar frequency distributions to that encoded matchkey H_{*j}.
Step 4: Reidentifying encoded matchkeys
In the fourth step of the attack, we try to identify the correct plaintext matchkey for each encoded matchkey, H_{*j}, from its corresponding matrix C_{j}. In this section, since we are analysing a single encoded matchkey at a time, we exclude j from all notations to improve readability. Therefore, it is important to note that the following process is applied for each encoded matchkey H_{*j} separately.
For a given filtered matrix C, we first calculate an overall correlation, a_{c}, of each plaintext matchkey mk_{p} by averaging the correlation value vector of that matchkey. As we discussed above, this vector consists of the normalised correlation values for the ten different statistical tests. Given that a plaintext matchkey mk_{p} ∈ C has a correlation vector c_{p}, the overall correlation a_{c} can be calculated as:
Next we look at the number of unique matchkey values the plaintext matchkey (number of unique concatenated plaintext values) and the encoded matchkey (number of unique hashcodes) have. Note that even if the distributions of an encoded matchkey and a plaintext matchkey is similar these two matchkeys might have different numbers of unique values. Therefore, apart from correlation measurements, the number of unique values can also be used to compare the two matchkeys. If the encoded matchkey has q_{e} unique values and a plaintext matchkey mk_{p} ∈ C has q_{p} unique values, we calculate the difference ratio d_{c} of those values as:
where abs() calculates the absolute value. We next calculate the average of the two values, a_{c} and d_{c}. This can be calculated as a weighted average, where we assign certain weights to each value. Assuming the weight values for a_{c} and d_{c} are ω and 1  ω respectively (with 0 ≤ ω ≤ 1), the average value s_{c} is calculated as:
For each encoded matchkey we obtain C of these average values where C is the number of plaintext matchkeys assigned to that encoded matchkey. We then sort the plaintext matchkeys according to these s_{c} values and assign the plaintext matchkey with the highest s_{c} value to that encoded matchkey. We repeat this step for all the encoded matchkeys H_{*j}. At the end of this step we get a ranked list of plaintext matchkeys that are likely to be more similar to each of the encoded matchkeys.
Based on the superset pruning method suggested by the authors of the multiple dynamic matchkey approach [5], we can also check for super or subsets in the identified plaintext matchkeys for each encoded matchkey H_{*j}. If any of the encoded matchkeys, H_{*j}, has already been assigned to a plaintext matchkey mk_{p}^{x}, then any of the super or subsets of the mk_{p}^{x} cannot be assigned to another encoded matchkey, under the assumption that the first assignment of mk_{p}^{x} is correct. An example is discussed in the Introduction section.
Step 5: Reidentification of plaintext values
Once the plaintext matchkeys for each encoded matchkey H_{*j} are identified, the final step of the attack is to perform plaintext value reidentification by applying a frequency alignment method [2]. For each encoded matchkey, H_{*j}, we obtain the frequency distribution of matchkey values, and for the identified plaintext matchkey, mk_{p}^{j}, of H_{*j}, we obtain the frequency distribution of the corresponding plaintext values. Both these frequency distributions are then sorted in descending order as shown in Table 3. First, we adjust these frequency values according to the sizes of the databases D and V. We can adjust either the frequency values of matchkeys by multiplying each value by V / D, or alternatively we can adjust the frequency values of plaintext values by multiplying each value by D / V. Then we loop over each pair of frequency aligned matchkey value and plaintext value and assign the i^{th} plaintext value to the corresponding i^{th} matchkey value. We continue this process until the differences of their frequencies is greater than Δ_{t}, a user defined threshold. Given the frequencies of the i^{th} matchkey value and plaintext values are f_{j}^{i} and f_{c}^{i} we calculate the frequency difference Δ_{i} as:
where abs() returns the absolute value. If Δ_{i} > Δ_{t}, then we stop the alignment of plaintext values to encoded matchkey values (hashcodes).
Furthermore, if several adjacent frequencies of matchkey values and plaintext values in the sorted lists are the same then it will be difficult to determine which plaintext value corresponds to which matchkey value. For instance if f_{j}^{i} = f_{j}^{i}^{+1} = f_{c}^{i} = f_{c}^{i}^{+1}, we cannot be certain that the i^{th} matchkey value refers to i^{th} plaintext or (i+1)^{th} plaintext value. In such scenarios we assign both the i^{th} and (i+1)^{th} plaintext values to both the i^{th} and (i+1)^{th} matchkey values, respectively. An example of this plaintext alignment process is illustrated in Table 3. As shown, records with the same frequency values are aligned together in order to improve the accuracy of the alignment. It is also important to note that this alignment works when the underlying attribute value distributions are similar in both V and D. In cases where attribute value distributions are dissimilar this frequency alignment can potentially lead to incorrect assignments.
Encoded database D  Plaintext database V  

Record ID  Matchkey value  Frequency  Record ID  (F+L+B)  Frequency  
d1  heYcgrjawf3AVtt  10  v1  brittany+nicole+1987  10  
d2  JbsJvZQ7lucFDcE  7  v2  brian+johnson+1968  7  
d3  NDZ5v3pzT7tvlko  7  v3  james+smith+1991  7  
d4  1XW13iYzExn4KGZ  7  v4  ronald+young+1982  7  
d5  5qIiMWET4suKARu  4  v5  ashley+johnson+1975  4  
d6  b97VABaDFZSg9OM  4  v6  johnny+motley+1989  4 
Using the above discussed process we can now identify the combined plaintext values from V for the most frequent matchkey values in each encoded matchkey H_{*j}. In the following section we describe experiments conducted using our proposed attack method and discuss the results we obtained.
Results
We evaluated our proposed attack method on the multiple dynamic matchkey approach [5] using two large realworld databases. We investigated different parameter settings under different attack scenarios to measure the feasibility of our attack in real linkage situations.
Databases
We first used a North Carolina Voter Registration (NCVR) database ^{1}, where we used six snapshots collected in October, August, and February 2019, October 2018 and 2017, and October 2011. We used the snapshot from October 2019 as the sensitive encoded database, D, and the other snapshots as the plaintext databases, V, to represent several time intervals (2 months, 8 months, 1 year, 2 years, and 8 years, respectively) between the encoded and the plaintext databases. Our aim is to evaluate how the temporal differences in records, such as attribute changes, between databases affects the accuracy and scalability of the attack. Our hypothesis is that the larger the time difference the larger the differences in attribute value distributions between D and V, as more voters change their names and/or addresses.
The second database was the Michigan voter registration (MVR) database^{2}, where we used four snapshots collected in September and January 2016, September 2014, and June 2013. We used the snapshot from September 2016 as the sensitive encoded database, D, and the other snapshots as the plaintext databases, V. As with the NCVR databases, our aim is to evaluate how the time difference between databases affects the attack.
The selected databases contained between 6,233,661 and 8,114,702 records. The number of changes in attribute values monotonically increase as the time differences between database pairs increases for both the NCVR and MVR databases. For example, between the October 2019 and October 2011 snapshots of NCVR, the attributes StreetAddress and MiddleName had 47.7% and 8.1% changes in their values, respectively, while between the September 2016 and June 2013 snapshots of MVR, the attributes StreetNumber and LastName had only 2% and 1% changes in their values, respectively.
Evaluation criteria
To evaluate the quality of the identification of plaintext matchkeys for each encoded matchkey H_{*j}, we analysed the top three to five plaintext matchkeys assigned for each encoded matchkey and see whether the attack was able to identify the correct matchkey with a high correlation. We then evaluated the accuracy of the reidentified plaintext values for matchkey values using precision and recall [13,14] of how many plaintext values identified are correct and how many correct values are in the identified set of plaintext values.
We also define three different attack scenarios based on the assumed knowledge an adversary has about the matchkey encoding settings. The scenarios we considered are as follows;
 Knowledge on attribute combinations (Comb): In this scenario we assume that the adversary has knowledge about the actual attribute combinations (matchkeys) used in the matchkey generation process, but does not know which encoded matchkey H_{*j} corresponds to which plaintext matchkey M_{*j}. In this situation the adversary would not have to consider all possible plaintext matchkeys. Only the known combinations need to be evaluated to measure the correlation of frequency distributions. This scenario can be considered as the best possible attack scenario on an encoded database from an adversary’s point of view.
 Knowledge on used attributes (Attr): In this scenario we assumed that the adversary has knowledge only about the set of attributes used for the matchkey generation, but does not know which combinations of attributes are used. Unlike the previous scenario, in this situation the adversary would have to conduct the first step of the attack (as discussed in the Methods section) using all possible plaintext matchkeys to identify the best matching plaintext matchkeys for each encoded matchkey, and then perform the second step of the attack.
 Domain knowledge (Dom): In this final scenario we assumed that the adversary does not have any specific knowledge about the attributes, nor the matchkeys, used in the matchkey generation process. The adversary only knows what kind of entities are encoded in H (such people with their names, addresses, and so on) and therefore can partially assume what kind of quasiidentifiers are being used for the encoding. Hence, the adversary needs to run the attack on all possible matchkeys using all attributes available in V that potentially were encoded into matchkeys. This scenario can be considered as the worst case scenario on an encoded database for an adversary.
We implemented all attack methods using Python 2.7 and the experiments were run on a server with 64bit Xeon 2.1 GHz 16Core CPU, 512 GBytes of memory and running Ubuntu 18.04. To facilitate reproducibility the prototype programs are available from: https://dmm.anu.edu.au/pprlattack/.
For attack parameter settings, we set α = 0.05, ω = 0.7, Δ_{t} = 0.2. An adaptive value is selected for the parameter ε based on the maximum frequency value of the considered encoded matchkey. A ratio from 0.1 to 0.9 was first selected and then the selected ratio is multiplied by the maximum frequency value of the encoded matchkey to calculate ε. These settings provided good results in a series of setup experiments.
The parameter α can be set to an appropriate value by analysing the correlation values obtained for each correlation metric. If the frequency distributions of the encoded and plaintext matchkeys are similar then α can be set to a low value such as α < 0.1. Inversely, with contrasting frequency distributions, α can be set to a higher value such as α = 0.3 to select a wide range of candidate plaintext matchkeys. However, setting α to even a higher value such as α > 0.4 will potentially result in a large selection of candidate plaintext match keys. We set the weight ω of to a higher value than 0.5 because the average correlation value a_{c} was consistently more accurate in correctly identifying encoded matchkeys than the difference ratio d_{c} in our experiments. The value for Δ_{t} should be set to a low value such as Δ_{t} < 0.3 to get accurate reidentifications. If the frequency distributions of the two databases are known to have differences, a higher value such as Δ_{t} = 0.4 can be set. This should however be done carefully because this could lead to inaccurate plaintext reidentifications, because allowing substantial differences in value distributions to be mapped can be wrongly used by the attack to map dissimilar plaintexts with each other. It is also worth noting that the optimal values for these parameters are data dependent and therefore should be set over several iterations of experiments.
Matchkey generation
Following the originally proposed dynamic matchkey encoding [5], we used the following sets of attributes for the experiments for the two databases NVCR and MVR.
 NCVR: FirstName, MiddleName, LastName, BirthYear, StreetAddress, and ZipCode
 MVR: FirstName, MiddleName, LastName, BirthYear, StreetNumber, StreetName, and ZipCode
These attributes are commonly used in linking records across different databases [1]. Following Randall et al. [5], we selected a list of matchkeys that provided the highest linkage quality (according to the Fmeasure) using the above attributes to encode the sensitive databases.
Encoded matchkey with its top frequency  Identified plaintext matchkeys with different time intervals D to V  

2 months  8 months  2 years  8 years  
F+M+L+B+S 4  F+M+L+B+S / 0.996  
M+L+B+S+Z / 0.895 
None 
None 
None 

F+M+B+S+Z / 0.893  
F+L+B+S+Z 4  F+L+B+S+Z / 0.983  F+L+B+S+Z / 0.964  
F+M+L+B+Z / 0.917  F+M+L+B+Z / 0.909 
None 
None 

F+M+L+B+S / 0.908  F+M+L+B+S / 0.906  
F+M+L+B+Z 4  F+M+L+B+Z / 0.981  
F+L+B+S+Z / 0.948 
None 
None 
None 

M+L+B+Z / 0.922  
F+M+L+S+Z 4  F+M+L+S+Z / 0.990  F+M+L+S+Z / 0.989  
F+L+S+Z / 0.912  F+L+S+Z / 0.905 
None 
None 

F+M+S+Z / 0.776  F+M+S+Z / 0.763  
M+L+B+S+Z 4  M+L+B+S+Z / 0.982  M+L+B+S+Z / 0.975  M+L+B+S+Z / 0.967  
F+M+B+S+Z / 0.950  F+M+B+S+Z / 0.963  F+L+B+S+Z / 0.915 
None 

F+M+L+B+S / 0.893  F+M+L+B+S / 0.867  F+M+L+B+S / 0.912  
F+M+B+S+Z 5  F+M+B+S+Z / 0.997  F+M+B+S+Z / 0.973  M+L+B+S+Z / 0.966  
M+L+B+S+Z / 0.964  M+L+B+S+Z / 0.971  F+M+B+S+Z / 0.962 
None 

F+M+L+B+S / 0.915  F+M+L+B+S / 0.922  F+M+L+B+S / 0.936  
F+M+L+B 8  F+M+L+B / 0.991  F+M+L+B / 0.981  F+M+L+B / 0.925  
None 
F+M+L+S / 0.780  M+L+S+Z / 0.781  F+M+L+Z / 0.731  
M+L+S+Z / 0.774  F+M+L+Z / 0.776  M+L+B+Z / 0.710  
F+M+L+Z 6  F+M+L+Z / 0.986  F+M+L+Z / 0.955  
None 
None 
M+L+B+Z / 0.973  M+L+B+Z / 0.936  
F+L+B+Z / 0.916  F+L+B+Z / 0.843  
F+L+B+Z 4  F+L+B+Z / 0.984  M+L+B+Z / 0.943  
None 
None 
F+L+B+S+Z / 0.957  F+M+L+Z / 0.874  
M+L+B+Z / 0.955  F+L+B+Z / 0.868  
F+M+L+S 20  F+M+L+S / 0.982  
None 
None 
F+B+S+Z / 0.721 
None 

M+B+S+Z / 0.639  
F+L+S+Z 4  F+L+S+Z / 0.987  
None 
None 
F+M+L+S+Z / 0.940 
None 

F+M+S+Z / 0.811  
L+B+S+Z 14  L+B+S / 0.938  
None 
None 
None 
L+B+S+Z / 0.928  
F+M+S+Z / 0.920  
M+B+S+Z 26  M+B+S+Z / 0.965  
None 
None 
None 
M+B+S / 0.960  
F+B+S+Z / 0.874  
F+B+S+Z 11  F+B+S / 0.916  
None 
None 
None 
F+B+S+Z / 0.915  
F+M+B+Z / 0.795 
Encoded matchkey with its top frequency  Identified plaintext matchkeys with different time intervals D to V  

8 months  2 years  3 years  
L+F+M+B+S_{n} 
L+F+M+B+S_{d}+S_{n} / 0.830 
L+F+M+B+S_{d}+S_{n} / 0.844 
L+F+M+B+S_{d}+S_{n} / 0.844 
L+F+M+B+S_{d}+Z / 0.827 
L+F+M+B+S_{d}+Z / 0.839 
L+F+M+B+S_{d}+Z / 0.840 

L+F+M+B+S_{n}+Z / 0.827 
L+F+M+B+S_{n}+Z / 0.838 
L+F+M+B+S_{n}+Z / 0.837 

L+F+M+B+S_{d} / 0.825 
L+F+M+B+S_{d} / 0.835 
L+F+M+B+S_{d} / 0.835 

L+F+M+B+S_{n} / 0.825 
L+F+M+B+S_{n} / 0.834 
L+F+M+B+S_{n} / 0.833 

L+F+M+B+S_{d} 
L+F+M+B+S_{d}+S_{n} / 0.830 
L+F+M+B+S_{d}+S_{n} / 0.844 
L+F+M+B+S_{d}+S_{n} / 0.844 
L+F+M+B+S_{d}+Z / 0.827 
L+F+M+B+S_{d}+Z / 0.839 
L+F+M+B+S_{d}+Z / 0.840 

L+F+M+B+S_{n}+Z / 0.827 
L+F+M+B+S_{n}+Z / 0.838 
L+F+M+B+S_{n}+Z / 0.837 

L+F+M+B+S_{d} / 0.825 
L+F+M+B+S_{d} / 0.835 
L+F+M+B+S_{d} / 0.835 

L+F+M+B+S_{n} / 0.825 
L+F+M+B+S_{n} / 0.834 
L+F+M+B+S_{n} / 0.833 

L+M+B+S_{d} 
L+M+B+S_{n}+Z / 0.974 
L+M+B+S_{d}+S_{n}+Z / 0.979 
L+M+B+S_{d}+Z / 0.920 
L+F+M+B+S_{d}+Z / 0.960 
L+M+B+S_{d}+Z / 0.952 
L+F+B+S_{n}+Z / 0.910 

L+F+M+B+S_{d} / 0.954 
L+M+B+S_{d} / 0.938 
L+M+B+S_{n}+Z / 0.890 

L+F+M+B+S_{n} / 0.953 
L+M+B+S_{d}+S_{n} / 0.911 
L+F+B+S_{n} / 0.889 

L+M+B+S_{d} / 0.946 
L+F+B+S_{n}+Z / 0.894 
L+M+B+S_{d}+S_{n} / 0.884 

F+M+B+S_{d} 
L+M+B+S_{n} / 0.970 
L+M+B+S_{n} / 0.967 
F+M+B+S_{d}+S_{n}+Z / 0.957 
F+M+B+S_{d}+Z / 0.947 
F+M+B+S_{d}+S_{n} / 0.965 
F+M+B+S_{d}+S_{n} / 0.936 

L+F+M+S_{d}+S_{n} / 0.929 
F+M+B+S_{d}+S_{n}+Z / 0.953 
L+F+M+S_{d}+S_{n} / 0.934 

F+M+B+S_{d}+S_{n} / 0.928 
F+M+B+S_{d}+Z / 0.949 
L+F+M+S_{d}+S_{n}+Z / 0.927 

L+F+M+S_{d}+Z / 0.918 
L+M+B+S_{n}+Z / 0.947 
F+M+B+S_{d}+Z / 0.926 

L+F+B+S_{d} 
L+F+B+S_{d}+Z / 0.957 
L+F+M+B+Z / 0.953 
L+F+B+S_{d} / 0.953 
L+F+B+S_{d}+S_{n} / 0.955 
L+F+B+S_{d}+Z / 0.947 
L+F+B+S_{d}+Z / 0.950 

L+F+M+B+S_{n} / 0.933 
L+F+B+S_{d}+S_{n}+Z / 0.945 
L+F+B+S_{n}+Z / 0.908 

L+F+M+B+S_{d} / 0.933 
L+F+B+S_{d}+S_{n} / 0.900 
L+F+B+S_{d}+S_{n}+Z / 0.903 

L+F+M+B+S_{d}+S_{n} / 0.922 
L+M+B+S_{d}+Z / 0.894 
L+F+B+S_{d}+S_{n} / 0.883 

L+F+M+S_{d} 
L+F+M+S_{d} / 0.978 
L+F+M+S_{d} / 0.987 
L+F+M+S_{d}+S_{n} / 0.985 
L+M+B+S_{n} / 0.976 
L+F+M+S_{d}+S_{n}+Z / 0.987 
L+F+M+S_{d}+S_{n}+Z / 0.977 

L+F+M+S_{d}+Z / 0.974 
L+F+M+S_{d}+Z / 0.985 
F+M+B+S_{d}+S_{n} / 0.974 

L+F+M+S_{d}+S_{n}+Z / 0.965 
L+F+M+S_{d}+S_{n} / 0.984 
L+F+M+S_{d}+Z / 0.973 

L+F+M+S_{d}+S_{n} / 0.962 
F+M+B+S_{d}+S_{n}+Z / 0.975 
L+F+M+S_{d} / 0.972 

L+F+M+B+S_{d} 
L+F+M+B+S_{d}+S_{n} / 0.830 
L+F+M+B+S_{d}+S_{n} / 0.844 
L+F+M+B+S_{d}+S_{n} / 0.844 
L+F+M+B+S_{d}+Z / 0.827 
L+F+M+B+S_{d}+Z / 0.839 
L+F+M+B+S_{d}+Z / 0.840 

L+F+M+B+S_{n}+Z / 0.827 
L+F+M+B+S_{n}+Z / 0.838 
L+F+M+B+S_{n}+Z / 0.837 

L+F+M+B+S_{d} / 0.825 
L+F+M+B+S_{d} / 0.835 
L+F+M+B+S_{d} / 0.835 

L+F+M+B+S_{n} / 0.825 
L+F+M+B+S_{n} / 0.834 
L+F+M+B+S_{n} / 0.833 
Discussion
In Tables 4 and 5 we show the reidentification results for each encoded matchkey used in different databases from different time intervals using the NCVR and MVR databases. As can be seen, with NCVR we are able to identify the attribute combinations used for each encoded matchkey with high accuracy despite the time difference and corresponding percentages of attribute changes between these databases. The correct matchkey was always amongst the top three identified plaintext matchkeys. Furthermore, the attack was able to identify the correct combination as the first (the matchkey with highest correlation) 20 out of 24 times.
Time interval D to V  Comb  Attr  Dom  

Prec / Reca (%)  Total reident.  Prec / Reca (%)  Total reident.  Prec / Reca (%)  Total reident.  
NCVR  2 months  98.6 / 98.4  1137  98.6 / 98.4  1137  91.8 / 91.1  951 
8 months  81.7 / 81.4  14415  81.7 / 81.4  14415  77.0 / 80.9  736  
1 year  69.7 / 69.3  705  58.0 / 59.9  705  70.4 / 70.1  887  
2 years  47.7 / 41.4  451  58.3 / 54.0  1481  61.4 / 62.1  1480  
8 years  30.0 / 30.4  1196  31.7 / 29.4  1186  16.3 / 15.3  1798  
MVR  8 months  48.6 / 48.5  4972  46.6 / 35.6  45  14.2 / 11.3  30 
2 years  32.1 / 33.2  1675  37.2 / 37.7  1657  42.4 / 45.6  1657  
3 years  26.6 / 27.6  1671  32.3 / 32.3  1727  28.7 / 32.0  1735 
With the MVR databases the reidentification accuracy of the matchkeys are lower compared to the NCVR databases. However, as can be seen in Table 5, the identified matchkeys are always a superset or a subset of the correct matchkey. The correct matchkey was amongst the top five identified matchkeys 17 out of 21 reidentifications. The lower accuracy results are caused by having low and similar values in the frequency distributions f_{j}^{e} of matchkeys. As shown in Table 5, the highest frequency values of encoded matchkeys in MVR are only 3 and 2 compared to the NCVR databases where they are ranging from 4 to 26. Therefore, the correlation values measured using different plaintext matchkeys (especially the supersets and subsets of the correct matchkey) are all highly similar and it is difficult to distinguish the correct matchkey from other possible candidates. While the results on the NCVR databases illustrate the robustness of the attack in reidentifying encoded matchkeys when there is enough frequency information in matchkeys, at the same time the results on the MVR databases highlight the need of enough frequency information and distinct frequency distributions for the attack to be highly successful.
Table 6 shows the precision, recall, and the number of reidentifications of plaintext values for the NCVR and MVR databases. These are the results obtained in the final step of the attack as discussed in the Methods section. As can be seen, with the NCVR databases, the attack was able to reidentify plaintext values encoded in matchkey values with high accuracy when the time difference between databases was small. However, both precision and recall drop when the time difference between databases is increasing. This is because, even if the attack was able to identify the correct attribute combinations used for encoded matchkeys by analysing the frequency distributions, exact reidentifications of plaintext values as described in the fifth step of the attack are difficult due to having changes in actual attribute values. With different scenarios the attack maintains fairly consistent results. As expected with the Dom scenario the numbers of reidentifications are slightly lower than the other scenarios because the accuracy of the reidentification of matchkeys is lower.
With the MVR databases, the plaintext alignment accuracy is relatively low compared to the NCVR database experiments. This is because plaintext alignment is done only using the top most selected plaintext matchkey as discussed in the Methods section. According to Table 5, the attack was not able to identify the correct encoded matchkey as the top selected plaintext matchkey most of the times. Therefore, the plaintext alignment did not perform as accurately as with NCVR databases. However, we are still able to identify some of the attribute values correctly because all identified plaintext matchkeys for the MVR are either a superset or a subset of the correct encoded matchkeys.
With regard to the time efficiency, the first and the second steps are the most time consuming steps of the attack requiring a maximum of 3,840 and 7,352 seconds respectively. The third and fourth steps took less than one second for all experiments, and the fifth step took a maximum of 307 seconds. Overall the attack took less than 3.5 hours to complete for all databases and attack scenarios ^{3}.
Out of all the tests we utilised to compare frequency distributions, the correlation metrics Earth mover’s distance, KS test, Pearson's correlation, Spearman’s rank correlation, Relative entropy, and Histogram intersection provided consistently good results with different attack scenarios and databases with different time intervals compared to the four basic statistical measures used. This is due to the ability of correlation metrics to identify common characteristics of distributions even with differences in actual frequency values.
Privacy improvements and recommendations
Considering the identified privacy vulnerabilities in the matchkey generation process, we now propose two recommendations to strengthen the privacy guarantees of the multiple dynamic matchkey encoding approach. We further show empirical evidence to support our claim of using these recommendations to make the matchkey encoding resilient to frequencybased privacy attacks.
Recommendation 1: As we discussed in the Introduction section, matchkeys are encoded and stored as lists so that only corresponding matchkey values are compared when matching two records. For instance, if records are encoded using four matchkey values, [mk_{e}^{1}, mk_{e}^{2}, mk_{e}^{3}, mk_{e}^{4}], when a record pair r1 and r2 is compared, r1[mk_{e}^{1}] is only compared with r2[mk_{e}^{1}] and r1[mk_{e}^{2}] is only compared with r2[mk_{e}^{2}], and so on. This is because the matchkey values are stored in the same order for all records.
This ordering of matchkey values allows the calculation of frequency distributions of each matchkey in the attack. An adversary can safely assume that a single column in the encoded database D (a column H_{*j} ∈ H) corresponds to a single encoded matchkey. To prevent such assumptions by the adversary we propose to store matchkey values in sets as opposed to lists. Each row H_{i*} ∈ H in the encoded database should be a set of matchkey values where there is no order. Now the adversary cannot assume a single column represents a certain encoded matchkey.
As opposed to improved privacy, there is a tradeoff with this technique. A single matchkey value from a record now needs to be compared with every matchkey value from another record which will lead to increased time consumption. Furthermore, the possibility of obtaining false positives will also increase. For instance if two record r1 and r2 have the following attribute values: r1: FirstName = “Paul”, LastName = “Thomas”, BirthYear = “1992” and r2: FirstName = “Thomas”, LastName = “Paul”, BirthYear = “1992”, then matchkey (F+B) of r1 and matchkey (L+B) of r2 will have the same value and the record pair (r1, r2) will be classified as a match. Increasing false positives will reduce the overall quality of the linkage. However, this can be solved by prefixing an attribute specific value (such as an attribute qualifier or an attribute index) to the plaintext value before hashing. For example if we use ‘F’ for first name and ‘L’ for last name, we obtain matchkey values such as “FThomas1992” and “LThomas1992”. This approach is different from record specific salting values as used in the context of Bloom filter encoding to avoid frequency attacks. Therefore, adding an attribute specific string however may not affect the frequency distribution, rather it is recommended to avoid the false positive rate.
Recommendation 2: As discussed in the Methods section, our proposed attack compares the correlations between the frequency distributions of encoded matchkeys with plaintext matchkeys. In order for this to be successful, the encoded matchkeys need to have distinct frequencies for different matchkey values. If the frequency distributions are close to uniform, an adversary will not be able to compare those distributions with plaintext frequency distributions.
As our second recommendation, we propose making the frequency distributions of encoded matchkeys close or equal to uniform. In the encoded database H, we set the matchkey values that have frequency larger than x, where x ≥ 1, as missing. This will ensure that all the remaining matchkey values (hashcodes) for each encoded matchkey are unique and the frequencies of all those matchkey values will be between 1 and x making the distribution close to uniform. Hence, the adversary can no longer conduct any correlation analysis on these distributions since they are uniform and cannot be distinguished from each other. As with our first recommendation, this method also has a tradeoff for improved privacy at the cost of reduced linkage quality. Setting matchkey values as missing will lower the overall recall of the linkage since potential matches could be missed. At the same time, overall precision of the linkage will be improved since potential false positives could be removed when using this technique.
We conducted multiple experiments with x = 50, 20, 10, 5, 2, 1 where we set the matchkey values with frequency > x as missing. As expected, the precision of the linkage has increased with a minimum and a maximum increase of +0.69% and +10.4%, respectively. The recall has dropped due to the true matches being missed because their matchkey values are removed. The overall recall had a minimum and a maximum decrease of 0.04% and 1.29% respectively. The small changes in both precision and recall is because the number of records with matchkey value frequencies > 1 is small compared to the 7 million records in the databases. However, if the databases have more errors or missing values in attributes this will potentially lead to a selection of encoded matchkeys, H_{*j}, with a small number of attributes which will improve the recall of the linkage. As we further discuss below, selecting encoded matchkeys with a smaller number of attribute values (such as only 2 or 3 attributes in a H_{*j}) will increase the frequencies of those encoded matchkeys. In such a case using this recommendation to improve privacy will also lead to a potentially substantial reduction of recall.
It is possible to apply one or both of the above two recommendations to increase the privacy of the sensitive values in the encoded database. However, we recommend to initially apply the first recommendation and then check if any distinct frequency distributions can still be obtained and aligned between the plaintext and encoded databases. Depending on the results of the first modification we can then apply the second recommendation if further improvement of privacy is required. Furthermore, it is also worth noting that if the second recommendation is applied first, then there is no practical advantage of applying the first recommendation next because the attack will not be possible after the application of the second recommendation.
Practical considerations of the multiple matchkey encoding
While the original multiple dynamic matchkey encoding proposed by Randall et al. [5] provides high linkage quality with acceptable privacy, there are few practical aspects that need to be taken into consideration when using this encoding for PPRL in realworld applications.
 The selection of matchkeys depends on the Fellegi and Sunter agreement and disagreement weights of individual attributes. In [5] the authors have conducted their experiments on selected matchkey that provided the highest Fmeasure values for a given database. However, in practice these agreement and disagreement weights need to be estimated based on partially available ground truth or domain knowledge. Hence, with estimated values, there is no guarantee that the Fmeasure and thus linkage quality of the final record linkage will be optimal.
 The multiple matchkey encoding approach depends on exact matching of hashcodes. As also discussed by the authors in [5] this could be a disadvantage if the matching databases contain errors or variations in attribute values. Even a small typographical error or a character change (Christine vs. Christina) in an attribute value will lead to a completely different hashcode, whereas with approximate matching techniques such as Bloom filter encoding [15] this might not lead to false negatives. Furthermore, in our experiments, we observed that selected matchkeys always contain a name attribute (FirstName, MiddleName, or LastName), as can be seen in Tables 4 and 5. This is because name attributes generally have more discriminatory power compared to other attributes used. Therefore, if the name attributes in the databases to be linked contain more errors or changes in their values, then this could lead to lower linkage quality despite low amounts of errors in other used attributes.
 When the sizes of the databases get bigger, it is more likely for them to have a distinct frequency distribution for each matchkey (attribute combination). Even if the authors of the multiple matchkey encoding [5] suggested to have two or more attribute values in a matchkey, our experiments showed that having even three attribute values in a matchkey will still lead to a discrete frequency distribution for that particular matchkey which can lead to reidentifications. In our experiments on databases with seven million records, the frequencies start to get close to uniform with matchkeys with four or more attributes. But then again, if we are using five or more attributes for a matchkey we need to make sure the attributes do not contain a large number of errors or missing values because that will lead to lower linkage quality.
Conclusion
We have presented a privacy attack on the recently proposed multiple dynamic matchkey encoding method for PPRL [5]. The proposed attack employs correlation calculations on frequency distributions to identify attribute combinations used to generate matchkey values. Based on the identified matchkeys, the attack then reidentifies attribute values that corresponds to individual matchkey values (hashcodes). The experimental results showed that multiple dynamic matchkey encoding is susceptible to frequency attacks under certain parameter settings. As countermeasures, we have proposed two recommendations to improve the privacy of the multiple matchkey encoding approach and have discussed the tradeoffs between linkage quality and improved privacy when using these methods. These recommendations show that with a small number of errors and missing values in databases, privacy can be strengthened while keeping high linkage quality.
As future work, we plan to investigate methods of improving the performance and the accuracy of the attack by analysing potential filtering techniques for matchkeys. One such technique could be to analyse if multiple matchkeys are the same for a given record pair. Such patterns of same matchkeys in record pairs can potentially lead to the identification of attribute values used in encoded matchkeys and thereby the reidentification of encoded plaintext values.
Acknowledgments
This work was funded by the Australian Research Council under DP160101934.
Ethics statement
No ethics approval was required for this study, because all the databases used in this study are publicly available and necessary references are provided for them.
Statement on conflicts of interest
The authors have no conflicts of interest.
Footnotes

1
Available at: http://dl.ncsbe.gov/?prefix=data

2
Available at: http://michiganvoters.info

3
Additional evaluation results of the proposed attack such as time complexity and database analysis (excluded from the paper due to space constraints) are available at: https://dmm.anu.edu.au/pprlattack/
References

Vatsalan D, Sehili Z, Christen P, Rahm E. Privacypreserving record linkage for Big Data: current approaches and research challenges. In: Handbook of Big Data Technologies. Springer; 2017. p. 851–95. DOI: 10.1007/9783319493404_25
https://doi.org/10.1007/9783319493404_25 
Christen P, Ranbaduge T, Vatsalan D, Schnell R. Precise and fast cryptanalysis for Bloom filter based privacypreserving record linkage. IEEE Trans Knowl Data Eng. 2019 Nov;31(11):2164–77 DOI: 10.1109/TKDE.2018.2874004
https://doi.org/10.1109/TKDE.2018.2874004 
Christen P, Vidanage A, Ranbaduge T, Schnell R. Patternmining based cryptanalysis of Bloom filters for privacypreserving record linkage. In: Adv Knowl Discov Data Min. Melbourne; 2018. p. 530–42. DOI: 10.1007/9783319930404_42
https://doi.org/10.1007/9783319930404_42 
Vidanage A, Ranbaduge T, Christen P, Schnell R. Efficient pattern mining based cryptanalysis for privacypreserving record linkage. In: Proc Int Conf Data Eng. Macau; 2019. p. 1698–701. DOI: 10.1109/ICDE.2019.00176
https://doi.org/10.1109/ICDE.2019.00176 
Randall S, Brown A, Ferrante A, Boyd J. Privacy preserving linkage using multiple dynamic matchkeys. Int J Popul Data Sci. 2019;4(1). DOI: 10.23889/ijpds.v4i1.1094
https://doi.org/10.23889/ijpds.v4i1.1094 
Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183–210. DOI: 10.1080/01621459.1969.10501049
https://doi.org/10.1080/01621459.1969.10501049 
Bellare M, Canetti R, Krawczyk H. Keying hash functions for message authentication. In: Annual International Cryptology Conference. 1996. p. 1–15. DOI: 10.1007/3540686975_1
https://doi.org/10.1007/3540686975_1 
Levina E, Bickel P. The Earth Mover’s distance is the Mallows distance: some insights from statistics. In: Proc IEEE Int Conf Comput Vis. 2001;2:251–6 DOI: 10.1109/ICCV.2001.937632
https://doi.org/10.1109/ICCV.2001.937632 
Wozniak PJ. Applied Nonparametric Statistics (2nd ed.). Technometrics. 1991;33(3):364–5. DOI: 10.1080/00401706.1991.10484849
https://doi.org/10.1080/00401706.1991.10484849 
de Winter JCF, Gosling SD, Potter J. Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data. Psychol Methods. 2016 Sep;21(3):27390. DOI: 10.1037/met0000079
https://doi.org/10.1037/met0000079 
Hershey JR, Olsen PA. Approximating the Kullback Leibler divergence between Gaussian mixture models. In: Proc IEEE Int Conf Acoust Speech Signal Process. 2007;4:IV317IV–320. DOI: 10.1109/ICASSP.2007.366913
https://doi.org/10.1109/ICASSP.2007.366913 
Barla A, Odone F, Verri A. Histogram intersection kernel for image classification. In: Proc Int Conf Image Proc. 2003;3:513–6. DOI: 10.1109/ICIP.2003.1247294
https://doi.org/10.1109/ICIP.2003.1247294 
Christen P. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer; 2012. DOI: 10.1007/9783642311642
https://doi.org/10.1007/9783642311642 
Hand, D and Christen, P. A note on using the Fmeasure for evaluating record linkage algorithms. Stat Comput. 2018;28(3):539–47. DOI: 10.1007/s1122201797466
https://doi.org/10.1007/s1122201797466 
Schnell R, Bachteler T, Reiher J. Privacypreserving record linkage using Bloom filters. BMC Med Inf Decis Mak. 2009;9(1). DOI: 10.1186/14726947941
https://doi.org/10.1186/14726947941