Synthetic administrative data can support important research
Generating synthetic, or artificial, versions of administrative data could improve the efficiency of data analysis, allowing research to be conducted more quickly. Administrative data are routinely collected during the administration of services such as the National Health Service, or through the education, employment, or justice systems. Synthetic data can be used for research and training purposes in place of the original data, whilst still preserving the structure and some of the patterns in the original data and protecting the privacy of individuals.
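To make the idea of preserving "the structure and some of the patterns" concrete, here is a minimal sketch in Python. It is not the method described in the paper: it uses the simplest possible approach, sampling each column independently from its observed values. The function name `synthesise_independent` and the toy records are hypothetical and invented purely for illustration. This approach keeps the column layout and, approximately, each column's distribution, but it deliberately breaks the relationships between columns, and on its own it does not guarantee protection against disclosure (for example, rare values are copied as-is).

```python
import random


def synthesise_independent(records, n, seed=0):
    """Return n synthetic rows, sampling each column independently.

    Hypothetical helper for illustration only: it preserves the column
    structure and approximate per-column distributions, but not the
    relationships between columns.
    """
    rng = random.Random(seed)
    columns = list(records[0].keys())
    # Pool of observed values for each column
    pools = {col: [row[col] for row in records] for col in columns}
    # Build each synthetic row by drawing one value per column
    return [{col: rng.choice(pools[col]) for col in columns} for _ in range(n)]


# Toy records invented for this example (not real administrative data)
original = [
    {"age_band": "0-4", "region": "London", "admissions": 2},
    {"age_band": "5-9", "region": "Leeds", "admissions": 0},
    {"age_band": "0-4", "region": "Bristol", "admissions": 1},
    {"age_band": "10-14", "region": "London", "admissions": 0},
]

synthetic = synthesise_independent(original, n=10, seed=42)
```

A table like `synthetic` has the same columns and value types as the original, so analysis code can be written and tested against it, but any results would need to be re-run on the real data.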
In the UK, administrative data have helped us gain a picture of public service users and their needs. Administrative data contain personal and sensitive information, which must be protected so that individuals can never be identified. However, the approvals and governance processes associated with accessing administrative data are extremely time-consuming, which can threaten the timeliness of research.
Synthetic data can help resolve the time lag for researchers wanting to access administrative datasets, and it can also be used for training and education purposes. By using synthetic data, researchers can familiarise themselves with the data and develop analysis plans whilst waiting for access to the real data needed for their final analysis. Being able to explore the datasets, understand what is available, and test code will help streamline the research process and enable researchers to make informed decisions and plan their research thoroughly in a low-risk setting.
Lead author Dr Theodora Kokosi said, “Administrative synthetic data could go a long way in speeding up the research process and could enable researchers to gain great insights and increase their understanding of datasets containing sensitive information, without the risk of disclosing any personal information.”
Although the idea of synthetic data was introduced around 30 years ago, its acceptability to the public has yet to be assessed. It is therefore important for data holders to communicate the benefits of synthetic data effectively, allow data users to familiarise themselves with the idea, and help build further acceptance and trust.
In an article published in the International Journal of Population Data Science (IJPDS), Dr Kokosi provides a comprehensive overview of the main synthetic data generation methods in the context of UK administrative data research. She discusses the benefits and challenges, and proposes simplified terms that would help data holders and data users familiarise themselves with the concept while promoting collaboration and engagement between them.
For more detailed information on the project “Creating longitudinal datasets for linked administrative data research using synthetic data”, visit https://www.adruk.org/our-work/browse-all-projects/adr-uk-research-fellows-methodological-developments-within-administrative-data-research-433/.
Dr Theodora Kokosi, Department of Population, Policy and Practice, UCL Great Ormond Street Institute of Child Health, London, UK
Kokosi, T., De Stavola, B., Mitra, R., Frayling, L., Doherty, A., Dove, I., Sonnenberg, P. and Harron, K. (2022) “An overview on synthetic administrative data for research”, International Journal of Population Data Science, 7(1). doi: 10.23889/ijpds.v7i1.1727.