<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "JATS-journalpublishing1.dtd" [
]>
<article xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML"
  dtd-version="1.2" article-type="abstract">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">IJPDS</journal-id>
      <journal-title-group>
        <journal-title>International Journal of Population Data Science</journal-title>
        <abbrev-journal-title>IJPDS</abbrev-journal-title>
      </journal-title-group>
      <issn pub-type="epub">2399-4908</issn>
      <publisher>
        <publisher-name>Swansea University</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.23889/ijpds.v10i3.3158</article-id>
      <article-id pub-id-type="publisher-id">10:3:124</article-id>
      <title-group>
        <article-title>Towards High Performance Data Curation Statistical Disclosure Control Tooling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Lungley</surname>
            <given-names initials="D">Deirdre</given-names>
          </name>
          <xref ref-type="aff" rid="affil-1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>George</surname>
            <given-names initials="S">Simon</given-names>
          </name>
          <xref ref-type="aff" rid="affil-2">2</xref>
        </contrib>
      </contrib-group>
      <aff id="affil-1"><label>1</label><institution>UKDS, University of Essex, Colchester, United Kingdom</institution></aff>
      <aff id="affil-2"><label>2</label><institution>Simon George Limited, Wrexham, United Kingdom</institution></aff>
      <pub-date date-type="pub" publication-format="electronic">
        <day>01</day>
        <month>06</month>
        <year>2025</year>
      </pub-date>
      <pub-date date-type="collection" publication-format="electronic">
        <year>2025</year>
      </pub-date>
      <volume>10</volume>
      <issue>3</issue>
      <elocation-id>3158</elocation-id>
      <permissions>
        <license license-type="open-access"
          xlink:href="https://creativecommons.org/licenses/by/4.0/">
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International
            License.</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://ijpds.org/article/view/3158">This article is available from the
        IJPDS website at: https://ijpds.org/article/view/3158</self-uri>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Objectives</title>
      <p>Surveys are a widely used and important research resource, whose creation and curation involve skilled, labour-intensive tasks. This abstract details an initiative to improve the tooling available to the data community for controlling the risk of unintended disclosure, in line with the Anonymisation Decision-Making Framework.</p>
    </sec>
    <sec>
      <title>Methods</title>
      <p>An initial step in assessing the disclosure risk of a dataset is to identify its key variables (KVs), that is, variables which, when combined, can identify individual units, and then to compute frequency counts for combinations of these variables. These counts are a prerequisite for assessing k-anonymity and can also be used in further risk calculations. Because they are central to our processes, we set out to improve the performance of the frequency-count algorithm, and achieved a significant improvement over the implementation in the original sdcMicro R package. We do this by decomposing KV values into "bitmasks" (sequences of 0s and 1s) that can be manipulated efficiently with native CPU instructions.</p>
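      <p>The bitmask idea described above can be illustrated with a minimal Python sketch. It assumes one plausible layout, which this abstract does not confirm: one bitmask per distinct KV value, with bit i set when record i holds that value, so that the frequency count for a record is the number of set bits in the bitwise AND of its value masks. The function names (build_bitmasks, frequency_counts) and the toy data are hypothetical and are not taken from the C++ implementation or from sdcMicro.</p>
      <code language="python"><![CDATA[
```python
from collections import defaultdict


def build_bitmasks(column):
    """Map each distinct value to an int whose bit i is set when
    record i holds that value (assumed layout, for illustration only)."""
    masks = defaultdict(int)
    for i, value in enumerate(column):
        masks[value] |= 1 << i
    return dict(masks)


def popcount(mask):
    """Count the set bits in a non-negative integer."""
    return bin(mask).count("1")


def frequency_counts(columns):
    """For each record, count the records that share its combination
    of key-variable values."""
    n = len(columns[0])
    per_column = [build_bitmasks(col) for col in columns]
    counts = []
    for i in range(n):
        mask = (1 << n) - 1  # start with every record selected
        for col, masks in zip(columns, per_column):
            mask &= masks[col[i]]  # keep records equal on this KV
        counts.append(popcount(mask))
    return counts


# Toy data (hypothetical): three key variables, five records.
age_band = ["30-39", "30-39", "40-49", "30-39", "40-49"]
sex = ["F", "F", "M", "F", "M"]
region = ["Wales", "Wales", "Essex", "Essex", "Essex"]

fk = frequency_counts([age_band, sex, region])
k = min(fk)  # the data are k-anonymous on these KVs for this k
```
]]></code>
      <p>Here fk is [2, 2, 2, 1, 2]: the fourth record is unique on these three KVs, so the toy dataset is only 1-anonymous. In a compiled implementation the AND and bit-count steps would operate on machine words rather than on Python integers.</p>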
    </sec>
    <sec>
      <title>Results</title>
      <p>In the sdcMicro R package these calculations use the data.table library which, while performant, can be improved upon by our algorithm, especially in the common case of a dataset containing missing values.</p>
      <p>We tested using the UK Quarterly Labour Force Survey on a Dell XPS 15 9520 laptop. Our implementation uses all available CPU cores and runs the combinations in parallel. For a single combination of 4 KVs we achieve the following timings<fn><p>Average over 20 iterations</p></fn>.</p>
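      <p>One way to see why missing values suit a bitmask representation is the following minimal sketch. It assumes a wildcard convention in which a missing value matches any category; this convention, the function name wildcard_counts and the toy data are assumptions made for illustration, not a description of either implementation. Under that assumption, a record's match set on one KV is the mask of its own value OR-ed with the mask of the missing records, and a record that is itself missing on a KV places no restriction on that KV.</p>
      <code language="python"><![CDATA[
```python
def build_bitmasks(column):
    """Map each distinct value (including the missing marker) to an int
    whose bit i is set when record i holds that value."""
    masks = {}
    for i, value in enumerate(column):
        masks[value] = masks.get(value, 0) | (1 << i)
    return masks


def wildcard_counts(columns, missing=None):
    """Frequency counts where a missing value matches any category
    (an assumed convention, for illustration only)."""
    n = len(columns[0])
    everyone = (1 << n) - 1
    per_column = [build_bitmasks(col) for col in columns]
    counts = []
    for i in range(n):
        mask = everyone
        for col, masks in zip(columns, per_column):
            value = col[i]
            if value == missing:
                continue  # a missing value restricts nothing
            # Records with the same value, plus records missing on this KV.
            mask &= masks[value] | masks.get(missing, 0)
        counts.append(bin(mask).count("1"))
    return counts


# Toy data (hypothetical): two key variables, four records.
sex = ["F", "F", None, "M"]
region = ["Wales", None, "Wales", "Wales"]

counts = wildcard_counts([sex, region])
```
]]></code>
      <p>For this toy input counts is [3, 3, 4, 2]. Handling the missing category costs one extra OR per KV, with no additional pass over the records.</p>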
      <list list-type="bullet">
        <list-item>
          <p>Time to compute bitmasks 0.408s</p>
        </list-item>
        <list-item>
          <p>Time to compute frequencies 0.285s</p>
        </list-item>
        <list-item>
          <p>Total time 0.693s</p>
        </list-item>
      </list>
      <p>For all 4-way combinations from the 8 KVs (70 combinations) we achieve the following.</p>
      <list list-type="bullet">
        <list-item>
          <p>Time to compute bitmasks 1.489s</p>
        </list-item>
        <list-item>
          <p>Time to compute frequencies 28.586s</p>
        </list-item>
        <list-item>
          <p>Total time 30.075s</p>
        </list-item>
      </list>
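      <p>The figure of 70 combinations is the number of ways of choosing 4 KVs from 8. The sketch below enumerates them and evaluates each one concurrently. It is a structural illustration only: the KV names, the toy records and the smallest_cell helper are hypothetical, the frequency count uses a plain Counter rather than bitmasks, and a Python thread pool stands in for native multi-core execution (CPython threads do not run this kind of work in parallel on multiple cores).</p>
      <code language="python"><![CDATA[
```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import comb

# Hypothetical names: the abstract does not list the eight KVs used.
KEY_VARIABLES = ["kv1", "kv2", "kv3", "kv4", "kv5", "kv6", "kv7", "kv8"]

# Toy records: KV number j takes the value i mod (j + 2) for record i.
records = [
    {kv: i % (j + 2) for j, kv in enumerate(KEY_VARIABLES)}
    for i in range(60)
]


def smallest_cell(rows, combo):
    """Return the smallest frequency count for one combination of KVs."""
    cells = Counter(tuple(row[kv] for kv in combo) for row in rows)
    return min(cells.values())


combos = list(combinations(KEY_VARIABLES, 4))
n_combos = len(combos)  # 70, i.e. C(8, 4)

# Evaluate every 4-way combination; each task is independent of the others.
with ThreadPoolExecutor() as pool:
    smallest = list(pool.map(lambda c: smallest_cell(records, c), combos))

results = dict(zip(combos, smallest))
```
]]></code>
      <p>Because each combination is evaluated independently, the work can be distributed across cores without coordination between tasks.</p>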
    </sec>
    <sec>
      <title>Conclusion</title>
      <p>Our chief aim is to contribute code to the community that allows seamless integration of these performant computations into regular Python applications. Therefore, following publication, this Python-wrapped C++ code will be made available via GitHub. For broader applicability, integrating our algorithm into sdcMicro itself could also prove useful.</p>
    </sec>
  </body>
</article>