Researchers from the University of Manitoba, Canada, have documented current research and practices on natural language processing (NLP) preprocessing methods in an attempt to describe, and potentially improve, the quality of unstructured text data (UTD), including UTD found in electronic medical record (EMR) databases.

Data quality can be simply described as ‘fitness for an intended use’ and is usually based on factors such as accuracy, completeness, consistency, reliability and whether the data are up to date. Measuring the quality of structured data is relatively straightforward because such data can be organized into rows and columns, where, for example, each row corresponds to an individual person and the columns contain information about the attributes of that person, such as their age, where they live, and their health conditions. However, a growing volume of data is presented in an unstructured, free-flowing format. For example, EMRs, in which physicians record information about their patients, often contain free-form sections. So, how can the quality of these text data be assessed?

This research, published in the International Journal of Population Data Science (IJPDS), reviewed more than 40 published studies to summarize measures and tools for assessing the quality of unstructured text data. Examples of measures identified in this review include the number of spelling errors, the repetitiveness of words, and the number of stop words with no inherent meaning (such as “is” or “the”). NLP, a form of artificial intelligence in which machines try to understand aspects of language, can help researchers measure text data quality.
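As a rough illustration of how such measures might be computed, the short Python sketch below counts stop words, words missing from a dictionary (a crude stand-in for spelling errors), and repeated words in a single note. It is not the method used in the review: the function name `text_quality_measures`, the tiny word lists, and the sample note are all assumptions made up for this example, and real work would rely on specialized dictionaries and NLP software.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
# A real assessment would use a full stop-word list and a domain dictionary.
STOP_WORDS = {"is", "the", "a", "an", "of", "and", "to", "in", "with"}
KNOWN_WORDS = STOP_WORDS | {"patient", "reports", "pain", "chest", "mild", "fever"}


def text_quality_measures(text):
    """Return simple counts that loosely mirror the measures named above."""
    # Lowercase the text and keep runs of letters (and apostrophes) as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        "tokens": total,
        "stop_words": sum(1 for t in tokens if t in STOP_WORDS),
        # Words not found in the dictionary are treated as possible misspellings.
        "unknown_words": sum(1 for t in tokens if t not in KNOWN_WORDS),
        # Share of tokens that repeat an earlier token (0.0 means no repetition).
        "repetition_rate": (total - len(counts)) / total if total else 0.0,
    }


# Made-up sample note containing one deliberate misspelling ("feverr").
note = "The patient reports mild chest pain and the patient reports feverr"
measures = text_quality_measures(note)
print(measures)
```

For the made-up note, the sketch finds 11 words, of which 3 are stop words and 1 is not in the dictionary. A dictionary lookup like this will also flag correctly spelled words it simply does not contain, such as drug names or abbreviations, which is one reason a specialized dictionary matters for clinical text.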

While structured data gives us a bird's-eye view of individuals, unstructured text data can provide a much deeper understanding of behaviors and feelings. If the quality of text data can be described, then we can be confident about its use in research and the conclusions that stem from that research. Therefore, having tools available to assess the quality of data benefits both the producers and consumers of research.

Lead author Marcello Nesca explained that “Data quality is multifaceted and depends on the context in which the data will be used. While there are many similarities between the characteristics of high-quality structured data and unstructured text data, assessing the quality of unstructured text data requires access to specialized dictionaries and natural language processing software tools.”

You can read the full details of Quality of Unstructured Text Data: A Scoping Review here.


Marcello Nesca, Natural Language Processing Data Analyst, Manitoba Centre for Health Policy, University of Manitoba, Winnipeg, Canada

Nesca, M., Katz, A., Leung, C. and Lix, L. (2022) “Quality of Unstructured Text Data: A Scoping Review”, International Journal of Population Data Science, 7(1). doi: 10.23889/ijpds.v7i1.1757.