<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "JATS-journalpublishing1.dtd" [
]>
<article xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML"
  dtd-version="1.2" article-type="abstract">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">IJPDS</journal-id>
      <journal-title-group>
        <journal-title>International Journal of Population Data Science</journal-title>
        <abbrev-journal-title>IJPDS</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="epub">2399-4908</issn>
      <publisher>
        <publisher-name>Swansea University</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.23889/ijpds.v10i3.3265</article-id>
      <article-id pub-id-type="publisher-id">10:3:230</article-id>
      <title-group>
        <article-title>Handling of missing values in whole-population electronic health records: a
          simulation study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Sampri</surname>
            <given-names initials="A">Alexia</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Ip</surname>
            <given-names initials="S">Samantha</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Petitjean</surname>
            <given-names initials="C">Carmen</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Madley-Dowd</surname>
            <given-names initials="P">Paul</given-names>
          </name>
          <xref ref-type="aff" rid="affil-2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Morris</surname>
            <given-names initials="T">Tim P</given-names>
          </name>
          <xref ref-type="aff" rid="affil-3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Tilling</surname>
            <given-names initials="K">Kate</given-names>
          </name>
          <xref ref-type="aff" rid="affil-2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Sterne</surname>
            <given-names initials="J">Jonathan A C</given-names>
          </name>
          <xref ref-type="aff" rid="affil-2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Wood</surname>
            <given-names initials="A">Angela M</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
      </contrib-group>
      <aff id="affil-1"><label>1</label><institution>University of Cambridge, Department of Public
        Health and Primary Care, Cambridge, United Kingdom</institution></aff>
      <aff id="affil-2"><label>2</label><institution>University of Bristol, Bristol Medical School,
        Bristol, United Kingdom</institution></aff>
      <aff id="affil-3"><label>3</label><institution>MRC Clinical Trials Unit at UCL, University
        College London, London, United Kingdom</institution></aff>
      <pub-date date-type="pub" publication-format="electronic">
        <day>01</day>
        <month>06</month>
        <year>2025</year>
      </pub-date>
      <pub-date date-type="collection" publication-format="electronic">
        <year>2025</year>
      </pub-date>
      <volume>10</volume>
      <issue>3</issue>
      <elocation-id>3265</elocation-id>
      <permissions>
        <license license-type="open-access"
          xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International
            License.</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://ijpds.org/article/view/3265">This article is available from the
        IJPDS website at: https://ijpds.org/article/view/3265</self-uri>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Objectives</title>
      <p>This study evaluates the scalability of multiple imputation methods, specifically
        Multivariate Imputation by Chained Equations (MICE), for addressing missing data in
        whole-population electronic health records (EHRs). We investigate how the number of
        imputations, subsampling strategies, missing data mechanisms, and missingness levels
        affect the accuracy of the resulting estimates.</p>
    </sec>
    <sec>
      <title>Methods</title>
      <p>A simulation study was conducted using whole-population NHS England EHRs (primary care,
        secondary care, and COVID-19-related records) from January 2024. We simulated scenarios
        with missing body mass index (BMI) values under missing at random (MAR) and missing not at
        random (MNAR) mechanisms. We examined the effect of MICE combined with subsampling
        strategies (i.e., the imputation and analysis models each fitted in a subsample and/or the
        full population) on the accuracy and precision of estimates of the association between
        COVID-19 vaccination status and severe outcomes, adjusted for BMI and other confounders,
        using logistic regression. We evaluated the accuracy of the estimates, the computational
        efficiency, and the environmental impact of the imputations.</p>
    </sec>
    <sec>
      <title>Results</title>
      <p>Preliminary results indicate that MICE effectively manages missing BMI data within large
        EHR datasets, preserving the integrity and accuracy of statistical outputs through grouped
        logistic regression. However, subsampling strategies (e.g., deriving the imputation model
        in a 20% random sample whilst executing the analysis on the full population) and reducing
        the number of imputations can substantially reduce runtimes, memory usage and
        environmental impact, but they compromise the accuracy of the adjusted log odds ratios and
        their corresponding standard errors. This underscores the importance of carefully chosen
        imputation strategies. We plan to extend this methodology to other critical variables with
        missing values within EHRs, such as blood pressure, ethnicity, and cholesterol levels, to
        further validate the versatility of MICE in handling diverse data types.</p>
    </sec>
    <sec>
      <title>Conclusion</title>
      <p>Scalable imputation methods such as MICE are promising for robust analysis of EHR
        datasets, ensuring accuracy and data completeness while optimising computational resources
        and minimising the carbon footprint of data-intensive analyses.</p>
    </sec>
  </body>
</article>