Data that is highly valuable for public good research can also be extremely sensitive. As a result, when researchers want to use it to plan research, collaborate across research teams, or even for training, it can (rightly) be hard to get hold of. Synthetic data is made to mimic real data without containing any information about real individuals. It can therefore preserve privacy while still providing some level of utility for the researcher.

However, while synthetic data emerges as a possible option for some use cases, a number of questions arise. What standards exist for its production? What governance is in place for its provision? What examples can we draw on to understand better how it’s used? Finding answers to these fundamental questions is stymied by a lack of agreed definitions of the terminology often associated with synthetic data.

In a new article published in the International Journal of Population Data Science (IJPDS), the authors, who represent a mixture of stakeholders, argue that unsettled definitions of key synthetic data terms pose a significant challenge. This ambiguity can lead to miscommunication and misunderstanding, and can slow the routine production and use of synthetic data for publicly beneficial purposes.

They approached the problem through an analysis of the literature, homing in on four key terms that are particularly relevant to privacy preservation: synthetic data, utility, utility measures, and fidelity. They suggest definitions for these terms and make recommendations for their future use.

Emily Oliver, one of the authors from Administrative Data Research UK (ADR UK), is exploring how synthetic data can be used to smooth the journey for researchers looking to utilise secure data, which is often slow to access and eats into their award period. She says: “Getting everyone on the same page when it comes to terminology for synthetic data is an important starting point. From there we can start to think about frameworks for more routine provision, and build trust and confidence amongst providers, users and the public that synthetic data is worthwhile and value for money.”

Defining terminology goes beyond providing clarity. It plays an important role in how technologies are developed and how policy and legal agendas evolve. It also steers the direction of public discourse – particularly around the safeguarding of personal data, a heavily debated topic. The authors conclude with a set of recommendations to advance this discussion. They welcome opinions from readers in response to their article and encourage further debate on this.

Read the full article: https://doi.org/10.23889/ijpds.v10i2.2967

Emily Oliver, Head of Research & Capacity Building, ADR UK Strategic Hub, UK

Frayling, L., Suarj Bharat, S., Pattinson, E., Stock, J., Lugg-Widger, F., Gordon, E. and Oliver, E. (2025) “A Review of Synthetic Data Terminology for Privacy Preserving Use Cases”, International Journal of Population Data Science, 10(2). doi: 10.23889/ijpds.v10i2.2967.