Clerical review in probabilistic data linkage is the manual process of checking record pairs for which the linkage model is uncertain whether they match. A record pair typically contains attributes such as name, date of birth, date of death, gender and address, which are compared to assess similarity. Clerical review is therefore time consuming for large datasets and requires expert knowledge to judge the similarity of record pairs. Machine learning can improve the clerical review process by making use of the knowledge captured in past review decisions.
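To make the idea of "uncertain" pairs concrete, the following is a minimal sketch of how pairs might be routed to clerical review. It assumes each pair already has a match score between 0 and 1 and that two cut-offs separate clear matches, clear non-matches and an uncertain middle band. The function name `triage` and the threshold values are hypothetical and are not taken from this work.

```python
from typing import Literal

Decision = Literal["match", "clerical_review", "non_match"]


def triage(score: float, lower: float = 0.2, upper: float = 0.9) -> Decision:
    """Route one record pair by its match score.

    The thresholds are illustrative assumptions; in practice they would be
    chosen per linkage project.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    if score >= upper:
        return "match"            # confident link, no manual check needed
    if score <= lower:
        return "non_match"        # confident non-link
    return "clerical_review"      # uncertain band, sent to an expert linker
```

Widening the band between `lower` and `upper` sends more pairs to manual review, which is the workload that a trained classifier is intended to reduce.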
Objectives and Approach
To improve the clerical review process as a tool for expert data linkers, we developed machine learning models that classify whether a record pair is a match. We trained the models on a large number of record pairs already labelled as match or non-match by expert linkers, so that the models capture the features used in the clerical review process. Both traditional machine learning and deep learning methods were used for model development. In both approaches, we developed diverse features so that the similarity of record pairs can be classified accurately. All models were developed in our secured internal integration authority environment.
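The kind of features described above can be sketched as follows. This is an illustration only, not the feature set used in this work: the field names (`name`, `address`, `dob`, `gender`), the helper names and the choice of `difflib.SequenceMatcher` as the string comparator are all assumptions.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer spaces."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return 0.0  # both values missing: treat as no evidence of similarity
    return SequenceMatcher(None, a, b).ratio()


def exact_agreement(a, b) -> float:
    """1.0 when both values are present and equal, otherwise 0.0."""
    return float(bool(a) and a == b)


def pair_features(rec_a: dict, rec_b: dict) -> dict:
    """Turn one record pair into a numeric feature vector (hypothetical fields)."""
    return {
        "name_sim": string_similarity(rec_a.get("name", ""), rec_b.get("name", "")),
        "address_sim": string_similarity(
            rec_a.get("address", ""), rec_b.get("address", "")
        ),
        "dob_equal": exact_agreement(rec_a.get("dob"), rec_b.get("dob")),
        "gender_equal": exact_agreement(rec_a.get("gender"), rec_b.get("gender")),
    }


a = {"name": "Jon Smith", "dob": "1980-01-01", "gender": "M", "address": "1 High St"}
b = {"name": "John Smith", "dob": "1980-01-01", "gender": "M", "address": "1 High Street"}
features = pair_features(a, b)
```

Feature vectors of this shape, together with the match or non-match labels assigned by expert linkers, would form the training data for a classifier such as a Random Forest.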
Our initial modelling results show an F-score of at least 98% in classifying record pairs as match or non-match using the Random Forest technique. We further implemented deep learning methods, which achieved similar accuracy, with the aim of improving model accuracy in future when more data become accessible.
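For readers unfamiliar with the metric, the sketch below shows how an F-score can be computed from predicted and expert-assigned labels. It assumes the balanced F1 variant (the harmonic mean of precision and recall) with "match" as the positive class; the abstract does not state which variant was used, and the labels in the example are made up for illustration.

```python
def f_score(y_true, y_pred, positive=1) -> float:
    """F1 score for the positive class, computed from two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true matches found
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false links
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed links
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = match, 0 = non-match), not real linkage results.
expert_labels = [1, 1, 1, 0, 0]
model_labels = [1, 1, 0, 1, 0]
score = f_score(expert_labels, model_labels)
```

Because the F-score combines precision and recall, it penalises both false links and missed links, which matters when non-matching pairs greatly outnumber matching ones.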
Conclusion / Implications
We developed machine learning models to improve the clerical review process in probabilistic data linkage. When linking data for health and wellbeing research, it is often highly desirable to capture as many links as possible, and we have a strong manual clerical review process in place. We found that augmenting clerical review with machine learning makes decision making more efficient when linkage projects are large and complex.
This work is licensed under a Creative Commons Attribution 4.0 International License.