Data covering the people of whole populations are often described as the new oil of the Big data age. They allow companies to recommend products tailored to you and help governments gain insight into their citizens’ needs. Such population data have also been highly valuable for understanding the spread and effects of the COVID-19 pandemic. There are, however, various misconceptions about population data that can lead researchers to make mistakes when processing, linking, and analysing such data, potentially resulting in poor real-world decisions.

Population data are generally not collected with research in mind. Rather, their primary use is for administrative or commercial purposes, such as billing patients for their doctor’s visits. As a result, data about people might be incomplete (children will not appear in employment records) and biased (homeless people are unlikely to pay electricity bills). There can also be multiple records for a single individual, for example when somebody moves or changes their name on getting married. People also provide their details in different forms; Robert, for example, uses his full name when providing his details to the government, but uses Bob when shopping online. Over twenty such issues originating from how data are captured have been identified.

But there are further misconceptions about how data are processed, many of them due to falsely assuming that data processing is error-free. Because much data processing involves humans, errors do happen for reasons such as time pressure, misunderstood requirements, or incorrect use of software. Often multiple population databases need to be linked to allow advanced data analysis, for example to explore the effects of people’s education and employment on their health. Such data linkage often involves complex methods and processes that lead to subtle technical problems which are easily missed.
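To make the linkage problem concrete, here is a minimal Python sketch of the Robert and Bob example. It is not the method described in the article; the nickname table, field names, records, and similarity threshold are all hypothetical and chosen purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table (an assumption for illustration only);
# real linkage systems rely on far larger lookup lists.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def clean(value: str) -> str:
    """Trim whitespace and lowercase a name."""
    return value.strip().lower()


def canonical_first(name: str) -> str:
    """Map a known nickname to its formal first name."""
    cleaned = clean(name)
    return NICKNAMES.get(cleaned, cleaned)


def exact_match(a: dict, b: dict) -> bool:
    """Naive linkage: every field must be identical."""
    return a == b


def tolerant_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Require equal birth dates and approximately equal names."""
    if a["dob"] != b["dob"]:
        return False
    first = SequenceMatcher(
        None, canonical_first(a["first"]), canonical_first(b["first"])
    ).ratio()
    last = SequenceMatcher(None, clean(a["last"]), clean(b["last"])).ratio()
    return min(first, last) >= threshold


# Made-up records for the same person in two databases.
gov = {"first": "Robert", "last": "Smith", "dob": "1980-04-12"}
shop = {"first": "Bob", "last": "Smith ", "dob": "1980-04-12"}
# Made-up record for a different person.
other = {"first": "Robert", "last": "Smith", "dob": "1975-01-30"}

print(exact_match(gov, shop))     # exact comparison misses the link
print(tolerant_match(gov, shop))  # normalised comparison finds it
```

Even this small sketch shows the trade-off: a tolerant comparison recovers links that exact matching misses, but it can also link two different people who happen to share a name and a date of birth.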

A guiding principle in science is that data need to be collected and processed in rigorous ways to ensure the quality of any data analysis. However, the way population data are collected, and even how they are processed and linked, is commonly outside the control of a researcher. Properly conducting science when using population data can therefore be challenging. Remarkably, many of the misconceptions identified are due to the social nature of data collection and are therefore missed by purely technical approaches to data processing.

The article “Thirty-three myths and misconceptions about population data: from data capture and processing to linkage” by Profs. Peter Christen and Rainer Schnell, two world-leading experts with decades of experience in working with population data, describes over thirty misconceptions about population data and provides recommendations to help researchers and practitioners recognise and overcome such misconceptions.

They conclude: “Because good data management is a key aspect of good science, it is vital for anybody who uses population data to be aware of underlying assumptions concerning this kind of data. Our aim is to help identify and prevent misleading conclusions and poor real-world decisions being made, and ensure that population data will become the new oil of the Big data era.”

Peter Christen, School of Computing, The Australian National University, Canberra, ACT 2600, Australia; Scottish Centre for Administrative Data Research (SCADR), University of Edinburgh, UK

Rainer Schnell, Methodology Research Group, University of Duisburg-Essen, Germany

Christen, P. and Schnell, R. (2023) “Big Data is not the New Oil: Common Misconceptions about Population Data”, International Journal of Population Data Science, 8(1). doi: 10.23889/ijpds.v8i1.2115.