Synthetic data generator for testing record linkage routines in Brazil.


Vitor Trentin, Valeria Bastos, Myrian Costa, Kenneth Camargo, Rejane Sobrino, Luis Carlos Guillen, Claudia Coeli
Published online: Aug 28, 2018


Introduction
Record linkage has been increasingly used in Brazil. However, only a few studies report the quality of the linkage process. Synthetic test data can be used to evaluate the quality of data linkage.


Objectives and Approach
To develop a synthetic data generator that creates test datasets with attributes and error characteristics similar to those found in Brazilian databases.
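As an illustration of what introducing error characteristics into a test dataset can look like, the following minimal Python sketch applies a single random character edit to a value with a given probability. The function name `corrupt`, the three edit types, and the error rate are assumptions made for this example; they are not taken from GeCo or from the error patterns measured in the study.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def corrupt(value, error_rate, rng):
    """With probability error_rate, apply one random single-character edit.

    Hypothetical helper for illustration only; the edit types and rates
    are assumptions, not the error model used by the authors.
    """
    if len(value) < 2 or rng.random() >= error_rate:
        return value
    pos = rng.randrange(len(value))
    op = rng.choice(["delete", "substitute", "transpose"])
    if op == "delete":
        return value[:pos] + value[pos + 1:]
    if op == "substitute":
        return value[:pos] + rng.choice(ALPHABET) + value[pos + 1:]
    # transpose: swap the character with its right-hand neighbour
    i = min(pos, len(value) - 2)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]

rng = random.Random(2013)
clean = "MARIA APARECIDA"
noisy = [corrupt(clean, 0.3, rng) for _ in range(5)]
```

Pairing each corrupted value with its clean original yields record pairs whose true match status is known, which is what makes synthetic data usable as a linkage benchmark.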


We analyzed the 2013 mortality database from Rio de Janeiro State to determine the characteristics and frequency distributions of its attributes (name, mother’s name, sex, date of birth and address).
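One way such frequency distributions can feed a generator is by sampling synthetic values in proportion to their observed counts. The sketch below assumes a toy list of first names standing in for a real database column; the helper names `build_frequency_table` and `sample_attribute` are hypothetical and are not part of GeCo.

```python
import random
from collections import Counter

def build_frequency_table(values):
    """Return (distinct values, counts) for one attribute column."""
    counts = Counter(values)
    items = sorted(counts)  # sorted so the table is deterministic
    return items, [counts[v] for v in items]

def sample_attribute(table, n, rng):
    """Draw n synthetic values, weighted by the observed counts."""
    items, weights = table
    return rng.choices(items, weights=weights, k=n)

# Toy stand-in for a real column; these values are made up for the example.
observed_first_names = ["MARIA", "MARIA", "JOSE", "ANA", "MARIA", "JOSE"]

table = build_frequency_table(observed_first_names)
rng = random.Random(42)
synthetic = sample_attribute(table, 1000, rng)
```

Sampling each attribute independently, as done here, ignores correlations between attributes (for example between sex and first name); a fuller generator would condition on those.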


We used Python and C++ to customize and add routines to GeCo (http://dlrep.org/dataset/GeCo), a personal data generation tool developed by Tran et al. (DOI:10.1145/2505515.2508207).


Results
Brazilian names have characteristics that distinguish them from naming patterns in other countries: multiple family names are common, as are compound first names, and yet homonyms are frequent. A person’s family name may include all or only part of the father’s family names, the mother’s family names, or both, so family names vary widely among children and family members do not necessarily share a common family name.
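The variation described above can be sketched as a small composition rule. In this hedged Python example, a child's family name is built from the father's names, the mother's names, or both, keeping either the full set or only the last element from each parent. The equal probabilities, the choice of the last element as the "part", the mother-before-father ordering, and the function names are all assumptions for illustration, not rules reported by the authors.

```python
import random

def pick_part(surnames, rng):
    """Keep all of a parent's family names or only the last one."""
    if len(surnames) > 1 and rng.random() < 0.5:
        return surnames[-1:]
    return list(surnames)

def child_family_name(father, mother, rng):
    """Compose a child's family name from one or both parents.

    Assumed rule set for illustration: the source (father, mother, both)
    is chosen uniformly, and the mother's part is placed first.
    """
    source = rng.choice(["father", "mother", "both"])
    parts = []
    if source in ("mother", "both"):
        parts += pick_part(mother, rng)
    if source in ("father", "both"):
        parts += pick_part(father, rng)
    return " ".join(parts)

rng = random.Random(7)
father = ["SOUZA", "LIMA"]      # made-up example names
mother = ["COSTA", "PEREIRA"]
siblings = [child_family_name(father, mother, rng) for _ in range(5)]
```

Because each child is drawn independently, siblings generated this way can end up with different family names, which is the situation a linkage routine relying on a shared family name would struggle with.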


Conclusion/Implications
Because of the specific national characteristics of name construction in Brazil, modeling synthetic data is particularly challenging: the generator needs more flexible rules to produce databases that actually allow the quality of data linkage processes to be assessed.


