<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "JATS-journalpublishing1.dtd" [
]>
<article xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML"
  dtd-version="1.2" article-type="abstract">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">IJPDS</journal-id>
      <journal-title-group>
        <journal-title>International Journal of Population Data Science</journal-title>
        <abbrev-journal-title>IJPDS</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="epub">2399-4908</issn>
      <publisher>
        <publisher-name>Swansea University</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.23889/ijpds.v10i4.3036</article-id>
      <article-id pub-id-type="publisher-id">10:3:25</article-id>
      <title-group>
        <article-title>Alternative Name Encodings - Using Jyutping or Pinyin as tonal
          representations of Chinese names for data linkage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Lam</surname>
            <given-names initials="J">Joseph</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Cortina-Borja</surname>
            <given-names initials="M">Mario</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Aldridge</surname>
            <given-names initials="R">Robert</given-names>
          </name>
          <xref ref-type="aff" rid="affil-2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Blackburn</surname>
            <given-names initials="R">Ruth</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Harron</surname>
            <given-names initials="K">Katie</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
      </contrib-group>
      <aff id="affil-1"><label>1</label><institution>Great Ormond Street Institute of Child Health,
        University College London, London, United Kingdom</institution></aff>
      <aff id="affil-2"><label>2</label><institution>Institute for Health Metrics and Evaluation,
        University of Washington, Seattle, USA</institution></aff>
      <pub-date date-type="pub" publication-format="electronic">
        <day>01</day>
        <month>06</month>
        <year>2025</year>
      </pub-date>
      <pub-date date-type="collection" publication-format="electronic">
        <year>2025</year>
      </pub-date>
      <volume>10</volume>
      <issue>4</issue>
      <elocation-id>3036</elocation-id>
      <permissions>
        <license license-type="open-access"
          xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International
            License.</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://ijpds.org/article/view/3036">This article is available from the
        IJPDS website at: https://ijpds.org/article/view/3036</self-uri>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Objectives</title>
      <p>Accurate data linkage across large administrative databases is crucial for addressing
        complex research and policy questions. However, linkage errors stemming from inconsistent
        name representations can introduce bias, predominantly affecting names not given in
        English. We identify three issues in processing non-English names that data linkers have
        not sufficiently considered: language-specific variation in romanisation, the loss of tonal
        information inherent to tonal languages, and discrepancies in name-order conventions.</p>
    </sec>
    <sec>
      <title>Methods</title>
      <p>This work examines the impact of romanisation on linkage accuracy, focusing on Chinese
        names and comparing two standardised systems (Jyutping and Pinyin) with the non-standardised
        Hong Kong Government Cantonese Romanisation (HKG-romanisation). Using a dataset of 771 Hong
        Kong student names, we derived Jyutping and Pinyin for each character of each Chinese name
        with the pinyin_jyutping package in Python 3.8, which draws on CC-Canto, an open-source
        online Cantonese dictionary. We then compared how uniquely each of the three romanisation
        systems (HKG-romanisation, Jyutping, Pinyin) represented the original Chinese characters,
        as an indication of how well each system balances the sensitivity and specificity of
        linkage.</p>
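      <p>As a minimal sketch of the uniqueness comparison described above, the following Python
        snippet computes, for each romanisation system, the number of distinct encodings divided by
        the number of distinct character names. This ratio is an assumed illustration of
        uniqueness, not necessarily the measure used in the study. The five surnames and their
        romanisations are hand-entered toy data, not study data and not output of the
        pinyin_jyutping package. Tones are written as trailing digits, and the names
        <monospace>NAMES</monospace> and <monospace>uniqueness</monospace> are hypothetical.</p>
      <code language="python" xml:space="preserve">
```python
# Hand-entered toy surnames (illustrative only, NOT the study data).
# Tones are trailing digits; Pinyin tone marks are written as numbers.
NAMES = {
    "黃": {"hkg": "Wong", "jyutping": "wong4", "pinyin": "huang2"},
    "王": {"hkg": "Wong", "jyutping": "wong4", "pinyin": "wang2"},
    "汪": {"hkg": "Wong", "jyutping": "wong1", "pinyin": "wang1"},
    "何": {"hkg": "Ho", "jyutping": "ho4", "pinyin": "he2"},
    "賀": {"hkg": "Ho", "jyutping": "ho6", "pinyin": "he4"},
}


def uniqueness(names, system, keep_tone=True):
    """Distinct encodings divided by distinct character names.

    This ratio is an assumed, hypothetical way to quantify uniqueness.
    A value of 1.0 means every character name has its own encoding.
    """
    codes = set()
    for romanised in names.values():
        code = romanised[system].lower()
        if not keep_tone:
            # Drop the trailing tone digit, if any.
            code = code.rstrip("0123456789")
        codes.add(code)
    return len(codes) / len(names)


if __name__ == "__main__":
    for system in ("hkg", "jyutping", "pinyin"):
        print(system, uniqueness(NAMES, system), uniqueness(NAMES, system, keep_tone=False))
```
      </code>
      <p>On this toy data the HKG spellings collapse the five surnames into two codes, whereas
        tone-marked Jyutping and Pinyin keep four and five distinct codes respectively. Dropping
        the tone digits reduces these to two and three.</p>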
    </sec>
    <sec>
      <title>Results</title>
      <p>Our analysis shows that standardised romanisation systems enhance the uniqueness and
        consistency of name representations, thereby improving linkage precision and recall compared
        to HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking
        strategies, whereas HKG-romanisation reached only 68.8%. We also explored the tonal
        distribution of Jyutping and Pinyin in our sample. Incorporating tonal information further
        improved recall and holds potential for character re-encoding when developed using a
        larger, more representative Chinese name database.</p>
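      <p>Blocking recall can be read as the share of truly matching record pairs that fall into the
        same block. The sketch below assumes exact-match blocking on a single name key, which is a
        simplification and may differ from the blocking strategies evaluated in the study. The
        record pairs are hand-entered toy data using common variant HKG spellings, and the names
        <monospace>TRUE_MATCHES</monospace>, <monospace>JYUTPING</monospace> and
        <monospace>blocking_recall</monospace> are hypothetical.</p>
      <code language="python" xml:space="preserve">
```python
# Toy true-match pairs (illustrative only, NOT the study data):
# (surname in characters, HKG spelling in file A, HKG spelling in file B)
TRUE_MATCHES = [
    ("周", "Chow", "Chau"),
    ("徐", "Tsui", "Chui"),
    ("陳", "Chan", "Chan"),
    ("黃", "Wong", "Wong"),
]

# Hand-entered Jyutping for the same characters (tone as trailing digit).
JYUTPING = {"周": "zau1", "徐": "ceoi4", "陳": "can4", "黃": "wong4"}


def blocking_recall(key_pairs):
    """Share of true match pairs whose two blocking keys agree exactly."""
    key_pairs = list(key_pairs)
    if not key_pairs:
        raise ValueError("at least one true match pair is required")
    kept = sum(1 for key_a, key_b in key_pairs if key_a == key_b)
    return kept / len(key_pairs)


# Blocking on the recorded HKG spelling: variant spellings split true matches.
hkg_recall = blocking_recall((a.lower(), b.lower()) for _, a, b in TRUE_MATCHES)

# Blocking on Jyutping derived from the characters. This assumes both files
# store the same characters and that each character has a single reading.
jyutping_recall = blocking_recall((JYUTPING[c], JYUTPING[c]) for c, _, _ in TRUE_MATCHES)

if __name__ == "__main__":
    print("HKG:", hkg_recall, "Jyutping:", jyutping_recall)
```
      </code>
      <p>In this toy example, two of the four true pairs are lost when blocking on HKG spellings
        because the same surname was romanised differently in each file. A key derived
        deterministically from the characters keeps all four, although only under the stated
        assumptions.</p>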
    </sec>
    <sec>
      <title>Conclusion</title>
      <p>These findings underscore the necessity of adopting standardised, tone-sensitive
        romanisation systems and flexible database designs to reduce linkage errors and promote data
        equity for under-represented groups. We advocate for the implementation of phonetic
        encodings in databases, alongside language-specific pre-processing protocols, to ensure more
        inclusive and accurate data linkage processes.</p>
    </sec>
  </body>
</article>