Designing a utility evaluation framework for synthetic health data


Amir Marashi
Sadiq Dohadwalla
Patrick Browne
Ben Hachey

Abstract

Objectives
Synthetic data (SD) promises to unlock health data for training, research, and innovation. However, where utility evaluation is performed at all, it is applied ad hoc to a single task of interest. We produce an initial design for a robust benchmark that spans a range of tasks.


Approach
We undertook several projects as a prototyping experiment to gather requirements. These projects replicated previous studies performed on the Medical Information Mart for Intensive Care (MIMIC), a dataset used in more than 4,000 studies. From this work, we refined definitions, identified personas, drafted a user statement, and collected requirements.


Results
Definitions: We define utility as an extrinsic measure of the effect of SD on a larger system, most often obtained by comparing system performance on SD with system performance on real data. This contrasts with fidelity, which measures the accuracy of SD through direct comparison to real data.
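The distinction between the two definitions can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the authors' framework: the toy one-feature cohort, the midpoint-threshold classifier standing in for the "larger system", and the function names `fidelity_gap` and `utility_gap`. Fidelity compares the datasets directly, while utility compares downstream performance on held-out real data.

```python
import random
import statistics


def make_data(n, shift, rng):
    """Toy cohort: one numeric feature x and a binary label y (illustrative only)."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(shift * y, 1.0)
        rows.append((x, y))
    return rows


def fit_threshold(rows):
    """'Train' a one-feature classifier: the midpoint between the two class means."""
    m0 = statistics.mean(x for x, y in rows if y == 0)
    m1 = statistics.mean(x for x, y in rows if y == 1)
    return (m0 + m1) / 2


def accuracy(threshold, rows):
    """Fraction of rows where 'x above threshold' matches the label."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)


def fidelity_gap(real, synth):
    """Direct comparison of the data: absolute difference in feature means."""
    real_mean = statistics.mean(x for x, _ in real)
    synth_mean = statistics.mean(x for x, _ in synth)
    return abs(real_mean - synth_mean)


def utility_gap(real_train, synth_train, real_test):
    """Extrinsic comparison: accuracy on real test data after training on real vs. synthetic."""
    acc_real = accuracy(fit_threshold(real_train), real_test)
    acc_synth = accuracy(fit_threshold(synth_train), real_test)
    return acc_real - acc_synth


rng = random.Random(0)
real_train = make_data(2000, 2.0, rng)
real_test = make_data(2000, 2.0, rng)
# Stand-in for a generator's output: same structure, slightly different distribution.
synth_train = make_data(2000, 1.5, rng)

print("fidelity gap:", fidelity_gap(real_train, synth_train))
print("utility gap:", utility_gap(real_train, synth_train, real_test))
```

In this toy setting the two measures can disagree: a visible shift in the feature distribution may cost the downstream classifier very little accuracy, which is one reason to evaluate utility separately from fidelity.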


Personas: Data custodian, User of SD, SD researcher.


User statement: As a technical stakeholder, I need a reliable way to measure the utility of datasets and a benchmark to compare generation techniques.


Requirements: SD researchers can focus on generation, not evaluation; supports comparison and leaderboards; based on relevant applications; comprehensive across study types and applications; future-proof for population research requiring linking.


Conclusion
We propose the following design:



  • Data pipelines follow an extract-generate-evaluate workflow.

  • Study types include cross-sectional and longitudinal.

  • Applications include predictive modelling and clinical research.


This results in a comprehensive utility benchmarking suite that complements current frameworks for fidelity and privacy of SD.
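As a rough sketch of how the proposed design might fit together, the code below wires an extract-generate-evaluate loop over tasks tagged with a study type and an application, and ranks generators by how closely task results on SD track results on real data. All names here (`Task`, `run_benchmark`, the toy mortality task, and the two placeholder generators) are hypothetical and chosen for illustration; they are not the authors' implementation, and the copy generator exists only as a zero-gap reference point.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass(frozen=True)
class Task:
    name: str
    study_type: str                     # e.g. "cross-sectional" or "longitudinal"
    application: str                    # e.g. "predictive modelling" or "clinical research"
    extract: Callable[[Rows], Rows]     # builds the study cohort from source data
    evaluate: Callable[[Rows], float]   # computes the task result on a dataset


def run_benchmark(source: Rows, tasks: List[Task],
                  generators: Dict[str, Callable[[Rows], Rows]]) -> Dict[str, float]:
    """Return, per generator, the mean absolute gap between real and synthetic task results."""
    scores = {}
    for gen_name, generate in generators.items():
        gaps = []
        for task in tasks:
            real = task.extract(source)      # extract
            synth = generate(real)           # generate
            gaps.append(abs(task.evaluate(real) - task.evaluate(synth)))  # evaluate
        scores[gen_name] = sum(gaps) / len(gaps)
    return scores


def extract_over_65(rows: Rows) -> Rows:
    return [r for r in rows if r["age"] >= 65]


def mortality_rate(rows: Rows) -> float:
    return sum(r["died"] for r in rows) / len(rows)


def copy_generator(rows: Rows) -> Rows:
    """Reference point only: returns the real rows, so it offers no privacy at all."""
    return [dict(r) for r in rows]


def make_bootstrap(seed: int) -> Callable[[Rows], Rows]:
    """Placeholder 'generator' that resamples real rows with replacement."""
    rng = random.Random(seed)

    def generate(rows: Rows) -> Rows:
        return [dict(rng.choice(rows)) for _ in rows]

    return generate


rng = random.Random(1)
# Invented toy records; no real patient data is involved.
source = [{"age": rng.randint(18, 95), "died": int(rng.random() < 0.2)} for _ in range(500)]
tasks = [Task("over-65 mortality", "cross-sectional", "clinical research",
              extract_over_65, mortality_rate)]
generators = {"copy": copy_generator, "bootstrap": make_bootstrap(2)}

scores = run_benchmark(source, tasks, generators)
leaderboard = sorted(scores.items(), key=lambda kv: kv[1])  # smaller gap ranks higher
print(leaderboard)
```

Keeping the generator behind a single callable is one way to meet the requirement that SD researchers focus on generation rather than evaluation: a new technique is added to `generators` and the tasks are left untouched.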


How to Cite
Marashi, A., Dohadwalla, S., Browne, P. and Hachey, B. (2024) “Designing a utility evaluation framework for synthetic health data”, International Journal of Population Data Science, 9(5). doi: 10.23889/ijpds.v9i5.2765.
