<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "JATS-journalpublishing1.dtd" [
]>
<article xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML"
  dtd-version="1.2" article-type="abstract">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">IJPDS</journal-id>
      <journal-title-group>
        <journal-title>International Journal of Population Data Science</journal-title>
        <abbrev-journal-title>IJPDS</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="epub">2399-4908</issn>
      <publisher>
        <publisher-name>Swansea University</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.23889/ijpds.v9i5.2900</article-id>
      <article-id pub-id-type="publisher-id">9:5:407</article-id>
      <title-group>
        <article-title>The Fundamental Role of Linkage Uncertainty in Epidemiological Analysis of Big Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Bor</surname>
            <given-names initials="J">Jacob</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
          <xref ref-type="aff" rid="affil-2">2</xref>
          <xref ref-type="aff" rid="affil-3">3</xref>
          <xref ref-type="aff" rid="affil-4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Lauren</surname>
            <given-names initials="E">Evelyn</given-names>
          </name>
          <xref ref-type="aff" rid="affil-5">5</xref>
          <xref ref-type="aff" rid="affil-3">3</xref>
        </contrib>
      </contrib-group>
      <aff id="affil-1"><label>1</label><institution>Department of Global Health, Boston University School of Public Health</institution></aff>
      <aff id="affil-2"><label>2</label><institution>Department of Epidemiology, Boston University School of Public Health</institution></aff>
      <aff id="affil-3"><label>3</label><institution>Health Economics and Epidemiology Research Office</institution></aff>
      <aff id="affil-4"><label>4</label><institution>Africa Health Research Institute</institution></aff>
      <aff id="affil-5"><label>5</label><institution>Department of Biostatistics, Boston University School of Public Health</institution></aff>
      <pub-date date-type="pub" publication-format="electronic">
        <day>18</day>
        <month>09</month>
        <year>2024</year>
      </pub-date>
      <pub-date date-type="collection" publication-format="electronic">
        <year>2024</year>
      </pub-date>
      <volume>9</volume>
      <issue>5</issue>
      <elocation-id>2900</elocation-id>
      <permissions>
        <license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License.</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://ijpds.org/article/view/2900">This article is available from the IJPDS website at: https://ijpds.org/article/view/2900</self-uri>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>Epidemiologists increasingly work with linked “big data”. Uncertainty in record linkage may lead to biased inferences but is often overlooked. We evaluate the impact of linkage uncertainty on statistical inference in linked big data.</p>
    </sec>
    <sec>
      <title>Methods</title>
      <p>We developed a graphical framework for describing linkage uncertainty when linking multiple representations of the same entity and applied it to de-identified data from South Africa’s national laboratory database. Through simulation, we systematically introduced linkage errors and measured their impact on linkage accuracy, expressed as sensitivity and positive predictive value (PPV). We then evaluated how linkage errors affect bias and variance in point estimates of a hypothetical parameter of interest in clinical epidemiology: 24-month retention in care among HIV patients. Finally, we compared sampling error and linkage error as fundamental sources of uncertainty in datasets of varying sizes.</p>
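      <p>As an illustration only, sensitivity and PPV can be computed at the link level by comparing a set of predicted links against a set of true links. The sketch below assumes links are undirected pairs of record identifiers; the function names and the toy records are hypothetical and are not taken from the study, whose accuracy measures may have been defined differently.</p>
      <code language="python"><![CDATA[
```python
from typing import Iterable, Set, Tuple

Link = Tuple[str, str]


def normalize(links: Iterable[Link]) -> Set[Link]:
    """Treat links as undirected, so (a, b) and (b, a) are the same link."""
    return {tuple(sorted(pair)) for pair in links}


def link_accuracy(true_links: Iterable[Link],
                  predicted_links: Iterable[Link]) -> Tuple[float, float]:
    """Return (sensitivity, PPV) for predicted links against true links.

    sensitivity = true links that were found / all true links
    PPV         = predicted links that are true / all predicted links
    """
    truth = normalize(true_links)
    predicted = normalize(predicted_links)
    true_positives = len(truth & predicted)
    sensitivity = true_positives / len(truth) if truth else float("nan")
    ppv = true_positives / len(predicted) if predicted else float("nan")
    return sensitivity, ppv


# Hypothetical toy data: four true links, five predicted links.
true_links = {("r1", "r2"), ("r2", "r3"), ("r4", "r5"), ("r6", "r7")}
predicted_links = {
    ("r2", "r1"),   # true link, reversed order
    ("r4", "r5"),   # true link
    ("r6", "r7"),   # true link
    ("r3", "r8"),   # false link
    ("r9", "r10"),  # false link
}

sensitivity, ppv = link_accuracy(true_links, predicted_links)
print(f"sensitivity={sensitivity:.2f}, PPV={ppv:.2f}")
```
]]></code>
      <p>In this toy example one true link is missed, which lowers sensitivity, and two false links are added, which lowers PPV; the two kinds of error can therefore be varied independently in a simulation.</p>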
    </sec>
    <sec>
      <title>Results</title>
      <p>We simulated a population of 14,393 HIV patients with a “true” 24-month retention of 38.7% and 338,056 true links. Introducing 4,200 false links reduced PPV by 5%; removing 21,500 existing links reduced sensitivity by 5%. Across 10 simulation runs, 95% sensitivity led, on average, to a 7.4% overestimate of entries to care and a 2.2% (range: 2.1–2.4%) underestimate of 24-month retention. A PPV of 95% led, on average, to a 7.5% underestimate of entries to care and a 1.8% (range: 1.5–2.0%) overestimate of 24-month retention.</p>
    </sec>
    <sec>
      <title>Conclusion</title>
      <p>In a large sample, linkage uncertainty has minimal impact on the variance of point estimates but can bias them substantially, in either direction, which distinguishes it from typical sampling error.</p>
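      <p>The contrast with sampling error can be sketched numerically. The example below uses the textbook standard error of a proportion, which shrinks as the sample grows, and compares it with a fixed bias. It assumes simple random sampling and, purely for illustration, reads the reported 2.2% underestimate as an absolute difference of 2.2 percentage points, which the abstract does not specify; the sample sizes other than 14,393 are hypothetical.</p>
      <code language="python"><![CDATA[
```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


TRUE_RETENTION = 0.387  # "true" 24-month retention reported in the Results
# ASSUMPTION: the 2.2% underestimate is treated as 0.022 on the absolute
# scale and held constant across sample sizes, for illustration only.
ASSUMED_BIAS = 0.022

for n in (1_000, 14_393, 1_000_000):
    se = standard_error(TRUE_RETENTION, n)
    # The ratio grows with n because the bias does not shrink with sample size.
    print(f"n={n:>9,}  SE={se:.4f}  bias/SE={ASSUMED_BIAS / se:.1f}")
```
]]></code>
      <p>Under these assumptions the standard error falls with the square root of the sample size while the linkage-induced bias stays put, so in large linked datasets the bias can come to dominate the total error.</p>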
    </sec>
  </body>
</article>