<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "JATS-journalpublishing1.dtd" [
]>
<article xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML"
  dtd-version="1.2" article-type="abstract">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">IJPDS</journal-id>
      <journal-title-group>
        <journal-title>International Journal of Population Data Science</journal-title>
        <abbrev-journal-title>IJPDS</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="epub">2399-4908</issn>
      <publisher>
        <publisher-name>Swansea University</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.23889/ijpds.v10i3.3271</article-id>
      <article-id pub-id-type="publisher-id">10:3:237</article-id>
      <title-group>
        <article-title>Probabilistic Linkage Pipeline Improving Linkage Quality and Explainability
          in Healthcare</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Laidler</surname>
            <given-names initials="J">Jonny</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Blanco</surname>
            <given-names initials="A">Amaia Imaz</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Balasubramanian</surname>
            <given-names initials="D">Divya</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
      </contrib-group>
      <aff id="affil-1"><label>1</label><institution>NHS England, Leeds, United Kingdom</institution></aff>
      <pub-date date-type="pub" publication-format="electronic">
        <day>01</day>
        <month>06</month>
        <year>2025</year>
      </pub-date>
      <pub-date date-type="collection" publication-format="electronic">
        <year>2025</year>
      </pub-date>
      <volume>8</volume>
      <issue>4</issue>
      <elocation-id>3271</elocation-id>
      <permissions>
        <license license-type="open-access"
          xlink:href="https://creativecommons.org/licences/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International
            License.</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://ijpds.org/article/view/3271">This article is available from the
        IJPDS website at: https://ijpds.org/article/view/3271</self-uri>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Objectives</title>
      <p>Current methods for data linkage or indexing in the healthcare industry rely on
        deterministic algorithms that are not transparent to end users and often produce
        sub-optimal outcomes. Our work focuses on building and implementing a probabilistic
        algorithm that improves both the quality and the explainability of the linkage.</p>
    </sec>
    <sec>
      <title>Methods</title>
      <p>In this project we used Splink, a probabilistic linkage Python package developed by the
        UK Ministry of Justice, to build a pipeline that links any health data set to the Personal
        Demographics Service (PDS), a dataset containing information about all patients registered
        with a GP in England, which acts as our linkage “spine”. We investigated and evaluated
        every step of the process to assure the quality of the linkage. This covered data set
        ingestion, preprocessing, blocking rules, distance metric hierarchies, and explainability,
        including how to present the data appropriately to users of linked data.</p>
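      <p>The roles of blocking rules and distance metric hierarchies can be illustrated with a
        minimal, standard-library sketch of Fellegi-Sunter style scoring. It does not call Splink
        and is not the pipeline described above. The field names, the m and u probabilities, the
        prior, and the 0.8 similarity threshold are all hypothetical, and
        <monospace>difflib.SequenceMatcher</monospace> stands in for the string distance metrics a
        real pipeline would use.</p>
      <code language="python"><![CDATA[
```python
import math
from difflib import SequenceMatcher

# Hypothetical m/u probabilities per comparison level (illustrative, not estimated from data).
# m = P(level | records match), u = P(level | records do not match).
LEVELS = {
    "surname": {"exact": (0.90, 0.01), "fuzzy": (0.08, 0.04), "else": (0.02, 0.95)},
    "dob": {"exact": (0.95, 0.001), "else": (0.05, 0.999)},
}
PRIOR = 0.001  # assumed probability that two randomly chosen records match


def level(field, a, b):
    """Return the strictest comparison level that the two values satisfy."""
    if a == b:
        return "exact"
    if "fuzzy" in LEVELS[field] and SequenceMatcher(None, a, b).ratio() >= 0.8:
        return "fuzzy"
    return "else"


def match_probability(rec, spine_rec):
    """Sum log2 Bayes factors (match weights) and convert the total to a probability."""
    weight = math.log2(PRIOR / (1 - PRIOR))
    for field in LEVELS:
        m, u = LEVELS[field][level(field, rec[field], spine_rec[field])]
        weight += math.log2(m / u)
    return 2 ** weight / (1 + 2 ** weight)


def link(records, spine, block_key="postcode"):
    """Score only pairs sharing the blocking key and keep the best spine candidate."""
    blocks = {}
    for s in spine:
        blocks.setdefault(s[block_key], []).append(s)
    results = []
    for r in records:
        scored = [(match_probability(r, s), s["id"]) for s in blocks.get(r[block_key], [])]
        if scored:
            prob, spine_id = max(scored)
            results.append((r["id"], spine_id, prob))
    return results


# Toy, made-up records; "spine" plays the role of the reference dataset.
spine = [
    {"id": "S1", "surname": "johnson", "dob": "1990-01-01", "postcode": "LS1"},
    {"id": "S2", "surname": "patel", "dob": "1985-05-05", "postcode": "LS1"},
]
records = [{"id": "R1", "surname": "jonson", "dob": "1990-01-01", "postcode": "LS1"}]
links = link(records, spine)
```
]]></code>
      <p>In this sketch the misspelt surname falls to the “fuzzy” level rather than being
        rejected outright, and each link carries a probability that can be shown to users of
        the linked data.</p>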
    </sec>
    <sec>
      <title>Results</title>
      <p>Throughout the development process we compared our results with those of the existing
        deterministic linkage algorithm and clerically reviewed the results that differed between
        the two models, with the aim of improving the new model. This also allowed us to build a
        ground truth dataset of the records we reviewed, recording whether each was a true link.
        Furthermore, we performed bias analysis for evaluation purposes. Whilst we aim to continue
        improving the model, preliminary results show that the new methodology has improved
        linkage quality by up to 19% in comparison to the existing methodology. We are building an
        in-house capability to deliver this methodology at scale. Our model, available in a public
        repository, provides additional metrics, such as linkage probability, to end users as
        part of improving transparency and explainability.</p>
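      <p>One way such a comparison against clerically reviewed ground truth could be scored is
        sketched below. This is an assumption for illustration only: the abstract does not state
        which quality measure underlies the reported improvement, and the function name, record
        identifiers, and toy data are hypothetical.</p>
      <code language="python"><![CDATA[
```python
def precision_recall(predicted, truth):
    """Score predicted links against clerically reviewed pairs.

    predicted: set of (record_id, spine_id) links produced by a model.
    truth: dict mapping reviewed (record_id, spine_id) pairs to True/False.
    Only predicted pairs that were reviewed contribute to precision.
    """
    reviewed = {pair for pair in predicted if pair in truth}
    true_pos = sum(truth[pair] for pair in reviewed)
    true_links = sum(truth.values())
    precision = true_pos / len(reviewed) if reviewed else 0.0
    recall = true_pos / true_links if true_links else 0.0
    return precision, recall


# Made-up review outcomes: True means the reviewer judged the pair a true link.
truth = {
    ("R1", "S1"): True,
    ("R2", "S2"): True,
    ("R3", "S9"): False,
    ("R4", "S4"): True,
}
deterministic = {("R1", "S1"), ("R3", "S9")}
probabilistic = {("R1", "S1"), ("R2", "S2"), ("R4", "S4")}

det_precision, det_recall = precision_recall(deterministic, truth)
prob_precision, prob_recall = precision_recall(probabilistic, truth)
```
]]></code>
      <p>Because only the pairs on which the models disagreed were reviewed, a ground truth
        built this way is not a random sample, so scores like these are best read as a relative
        comparison between models.</p>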
    </sec>
    <sec>
      <title>Conclusion</title>
      <p>Our research and implementation provide evidence that probabilistic linkage algorithms are
        more sustainable indexing methods for the continuous improvement of data quality,
        particularly in the healthcare industry. We believe the additional explainability measures
        will allow end users to make informed decisions when creating their products, ultimately
        improving patient health.</p>
    </sec>
  </body>
</article>