In an era where data-driven decisions shape policy and research, accurate data linkage is critical. However, databases are often designed around Indo-European languages. This means that people's names are romanised in ways that ignore nuances such as diacritical marks that signal pronunciation (e.g. Brontë or Muñoz), multiple surnames, and non-Latin scripts. This oversight systematically excludes ethnic minorities and migrant populations, leading to underrepresentation in research, policy, and service provision, and may introduce biases that disproportionately affect these groups.
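As a small illustration of the mechanics (our own sketch in Python, not code from the study), a common ASCII-folding step silently strips diacritics, so the stored form of a name no longer matches the name a person actually uses:

```python
import unicodedata

def ascii_fold(name: str) -> str:
    """Lossy normalisation used by many legacy pipelines: decompose the
    string, then discard the combining diacritical marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("Brontë"))            # 'Bronte' -- the diacritic is silently lost
print(ascii_fold("Muñoz") == "Muñoz")  # False -- exact match against the true name fails
```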

A new study focusing on Chinese names, published in the International Journal of Population Data Science (IJPDS), compares the systems currently used to romanise Chinese characters. It found that Jyutping (for romanising Cantonese) and Pinyin (for romanising Mandarin) outperform HKG-romanisation, the Hong Kong Government Cantonese Romanisation system, in reducing linkage errors. This finding calls for a rethink of database design, management, and linkage practices.

Data linkage relies on name-matching algorithms to connect records across large administrative databases, but romanised names, especially those produced by a non-standardised romanisation system, are more likely to be mismatched: either unrelated records are falsely linked together, or records belonging to the same person fail to match at all. This study, conducted by researchers from University College London, examined 771 Hong Kong student names. It showed that Jyutping and Pinyin romanisation correctly identified more than 95% of true matches, significantly outperforming HKG-romanisation's 68.8%. This disparity highlights a flaw in current data processing systems: a failure to account for language-specific variations, tonal differences, and naming conventions.
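To see how a non-standardised system produces missed matches, consider a minimal exact-matching sketch (our own illustration, not the study's code). The spelling pair "Tsui"/"Chui" for the surname 徐 is an assumed example of HKG-romanisation's one-character-many-spellings problem, and "ceoi4" is its standardised Jyutping form:

```python
# Hypothetical records from two databases, each storing record_id -> surname.
db_a = {"rec1": "Tsui"}   # 徐 under one common HKG-style spelling
db_b = {"rec9": "Chui"}   # the same 徐, spelled differently elsewhere

# The same records with the surname in standardised Jyutping:
db_a_jp = {"rec1": "ceoi4"}
db_b_jp = {"rec9": "ceoi4"}

def exact_match(a: dict, b: dict) -> list:
    """Naive exact-match linkage: pair records whose name strings are equal."""
    return [(ka, kb) for ka, va in a.items() for kb, vb in b.items() if va == vb]

print(exact_match(db_a, db_b))        # [] -- a missed match under HKG spellings
print(exact_match(db_a_jp, db_b_jp))  # [('rec1', 'rec9')] -- recovered under Jyutping
```

Real linkage pipelines use more forgiving comparators than string equality, but the same spelling ambiguity degrades those as well.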

To resolve these issues, romanisation systems need to be standardised with enhanced phonetic encoding, and data linkage pipelines should incorporate language-specific pre-processing strategies. Database managers should prioritise Unicode adoption and develop tone-sensitive linkage algorithms to improve equity and accuracy in data representation.
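A hedged sketch of what "tone-sensitive" could mean in practice appears below; the helper names and the syllables si1/si4 are our own illustrative assumptions, not an interface described in the paper. The point is simply that the tone digit carried by Jyutping or Pinyin is identity-bearing information that a tone-blind cleaning step throws away:

```python
import re

def toneless_key(romanised: str) -> str:
    """Tone-blind comparison key: strips tone digits, as a naive cleaner might."""
    return re.sub(r"\d", "", romanised.lower())

def tone_sensitive_key(romanised: str) -> str:
    """Tone-sensitive comparison key: the tone digit stays part of the identity."""
    return romanised.lower()

# Two Jyutping syllables that differ only by tone, and so denote different names:
a, b = "si1", "si4"

print(toneless_key(a) == toneless_key(b))              # True  -- risks a false link
print(tone_sensitive_key(a) == tone_sensitive_key(b))  # False -- kept distinct
```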

Lead researcher Joseph Lam stated, "Our findings highlight a gap in data system design and linkage practices. The current systems fail to capture linguistic diversity accurately, which leads to systemic bias. By adopting tone-sensitive and standardised romanisation methods, we can enhance the reliability of linked records and ensure fair representation for all communities."

As digital records increasingly dictate policy decisions, data linkage must evolve to reflect the diversity of global populations. Standardising romanisation methods and incorporating phonetic encoding can significantly reduce bias, ensuring that no community is left out of critical research and decision-making. The call to action is clear: to achieve data equity, database owners and policymakers must embrace inclusive, linguistically informed reforms in data management. The future of fair and accurate data linkage depends on it.

Joseph Lam, Research Assistant, Population, Policy & Practice Department, University College London

Lam, J., Cortina-Borja, M., Aldridge, R., Blackburn, R. and Harron, K. (2023) “Decolonising Data Systems: Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage”, International Journal of Population Data Science, 8(5). doi: 10.23889/ijpds.v8i5.2935.