<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "JATS-journalpublishing1.dtd" [
]>
<article xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML"
  dtd-version="1.2" article-type="abstract">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">IJPDS</journal-id>
      <journal-title-group>
        <journal-title>International Journal of Population Data Science</journal-title>
        <abbrev-journal-title>IJPDS</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="epub">2399-4908</issn>
      <publisher>
        <publisher-name>Swansea University</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.23889/ijpds.v9i5.2633</article-id>
      <article-id pub-id-type="publisher-id">9:5:149</article-id>
      <title-group>
        <article-title>Enhancing Disease Detection in Electronic Medical Records: Integrating Human Expertise and Large Language Models with Application to Diabetes, Hypertension, and Acute Myocardial Infarction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Pan</surname>
            <given-names initials="J">Jie</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
          <xref ref-type="aff" rid="affil-2">2</xref>
          <xref ref-type="aff" rid="affil-3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Lee</surname>
            <given-names initials="S">Seungwon</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
          <xref ref-type="aff" rid="affil-4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Cheligeer</surname>
            <given-names initials="C">Cheligeer</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
          <xref ref-type="aff" rid="affil-4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Martin</surname>
            <given-names initials="E">Elliot</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
          <xref ref-type="aff" rid="affil-4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Riazi</surname>
            <given-names initials="K">Kiarash</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
          <xref ref-type="aff" rid="affil-2">2</xref>
          <xref ref-type="aff" rid="affil-3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Quan</surname>
            <given-names initials="H">Hude</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
          <xref ref-type="aff" rid="affil-2">2</xref>
          <xref ref-type="aff" rid="affil-3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Li</surname>
            <given-names initials="N">Na</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
          <xref ref-type="aff" rid="affil-2">2</xref>
          <xref ref-type="aff" rid="affil-3">3</xref>
        </contrib>
      </contrib-group>
      <aff id="affil-1"><label>1</label><institution>Centre for Health Informatics, Cumming School of Medicine, University of Calgary</institution></aff>
      <aff id="affil-2"><label>2</label><institution>Department of Community Health Sciences, Cumming School of Medicine, University of Calgary</institution></aff>
      <aff id="affil-3"><label>3</label><institution>Libin Cardiovascular Institute, University of Calgary</institution></aff>
      <aff id="affil-4"><label>4</label><institution>Provincial Research Data Services, Alberta Health Services</institution></aff>
      <pub-date date-type="pub" publication-format="electronic">
        <day>18</day>
        <month>09</month>
        <year>2024</year>
      </pub-date>
      <pub-date date-type="collection" publication-format="electronic">
        <year>2024</year>
      </pub-date>
      <volume>9</volume>
      <issue>5</issue>
      <elocation-id>2633</elocation-id>
      <permissions>
        <license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License.</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://ijpds.org/article/view/2633">This article is available from the IJPDS website at: https://ijpds.org/article/view/2633</self-uri>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Objective</title>
      <p>Electronic medical records (EMRs) are widely available and can complement administrative data in disease surveillance and healthcare performance evaluation. However, defining conditions from EMRs is labour-intensive, requires advanced medical informatics knowledge, and is challenging without effective data extraction tools. This study developed a high-throughput pipeline to detect diseases in EMRs.</p>
    </sec>
    <sec>
      <title>Methods</title>
      <p>We developed a pipeline that leverages a generative large language model (LLM) to analyze, understand, and interpret EMR notes by following prompts designed by clinical experts. The pipeline was applied to detect diabetes, hypertension, and acute myocardial infarction (AMI) in the EMRs of a cardiac patient cohort in Calgary, Canada. Pipeline performance was evaluated against clinician-validated diagnoses as the reference standard.</p>
    </sec>
    <sec>
      <title>Results</title>
      <p>The cohort consisted of 3,413 patients with 551,095 clinical notes. The prevalence of diabetes, hypertension, and AMI was 27.8%, 66.3%, and 54.3%, respectively. Detection performance varied by condition: diabetes had 90.5% sensitivity, 83% specificity, and 67% positive predictive value (PPV); hypertension had 94.2% sensitivity, 30.2% specificity, and 73.8% PPV; and AMI had 86.4% sensitivity, 61% specificity, and 75.3% PPV. Monthly prevalence trends for the detected cases and the reference standard showed similar patterns.</p>
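      <p>As a rough consistency check on the figures above, PPV can be related to sensitivity, specificity, and prevalence through Bayes’ rule. The sketch below uses notation introduced here for illustration (Se for sensitivity, Sp for specificity, p for prevalence) and assumes that all three quantities are measured on the same patient-level population, which the abstract does not state explicitly; rounding and differing denominators can shift the result slightly. Using the reported diabetes values (sensitivity 90.5%, specificity 83%, prevalence 27.8%):</p>
      <disp-formula>
        <tex-math><![CDATA[
\mathrm{PPV} = \frac{\mathrm{Se}\cdot p}{\mathrm{Se}\cdot p + (1-\mathrm{Sp})(1-p)}
= \frac{0.905 \times 0.278}{0.905 \times 0.278 + 0.17 \times 0.722}
\approx \frac{0.2516}{0.2516 + 0.1227} \approx 0.67
        ]]></tex-math>
      </disp-formula>
      <p>This is in line with the reported 67% PPV for diabetes. Under the same assumption, the relation also suggests why hypertension retains a PPV above 70% despite 30.2% specificity: with roughly two-thirds of the cohort affected, true positives still outnumber false positives among the detected cases.</p>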
    </sec>
    <sec>
      <title>Conclusion</title>
      <p>The proposed pipeline demonstrated reasonable accuracy and high efficiency in disease detection without manually curated labels, indicating the potential for automated real-time disease surveillance using EMRs.</p>
    </sec>
    <sec>
      <title>Implication</title>
      <p>Variations in documentation practices in clinical notes can affect detection performance across diseases. Hence, an automated pipeline integrating LLMs with expert knowledge may improve detection accuracy and reduce labour costs while also indicating documentation quality.</p>
    </sec>
  </body>
</article>