International Journal of Population Data Science

A scoping review of preprocessing methods for unstructured text data to assess data quality



Introduction
Unstructured text data (UTD) are increasingly found in many databases that were never intended to be used for research, including electronic medical record (EMR) databases. Data quality can impact the usefulness of UTD for research. UTD are typically prepared for analysis (i.e., preprocessed) and analyzed using natural language processing (NLP) techniques. Different NLP methods are used to preprocess UTD and may affect data quality.

Objective
Our objective was to systematically document current research and practices about NLP preprocessing methods to describe or improve the quality of UTD, including UTD found in EMR databases.

Methods
A scoping review was undertaken of peer-reviewed studies published between December 2002 and January 2021. Scopus, Web of Science, ProQuest, and EBSCOhost were searched for literature relevant to the study objective. Information extracted from the studies included article characteristics (i.e., year of publication, journal discipline), data characteristics, types of preprocessing methods, and data quality topics. Study data were presented using a narrative synthesis.

Results
A total of 41 articles were included in the scoping review; over 50% were published between 2016 and 2021. Almost 20% of the articles were published in health science journals. Common preprocessing methods included removal of extraneous text elements such as stop words, punctuation, and numbers, word tokenization, and parts of speech tagging. Data quality topics for articles about EMR data included misspelled words, security (i.e., de-identification), word variability, sources of noise, quality of annotations, and ambiguity of abbreviations.

Conclusions
Multiple NLP techniques have been proposed to preprocess UTD, with some differences in techniques applied to EMR data. There are similarities in the data quality dimensions used to characterize structured data and UTD. While a few general-purpose measures of data quality exist that do not require external data, most of these focus on the measurement of noise.
Keywords: review; data quality; natural language processing

Introduction
Routinely collected electronic health data are generated during the process of managing and monitoring the healthcare system [1,2]. Unstructured text data (UTD) are common in electronic medical records (EMRs), which are one type of routinely collected electronic health data. Further examples of UTD found in other types of routinely collected electronic health data are laboratory testing results and clinical registry files.
The quality of data has been defined as their fitness for use [3][4][5], that is, that data meets the needs of the user for a specific task or purpose, such as identifying individuals with a specific health condition. Given that routinely collected electronic health data are increasingly being used for research, it is important to consider their fitness for research, including epidemiologic studies or health services utilization studies. There are several consequences of poor data quality. For example, Kiefer noted that poor data quality can "slow down innovation processes" [6]. Poor data quality may also increase the time required to prepare a dataset for use, which can impact the timeliness of research outputs. Data quality is a multidimensional construct; it encompasses such dimensions as relevance, consistency, accuracy, comparability, timeliness, accessibility and usability [4,5,7,8]. Most data quality frameworks and assessment methods have been developed for structured data. However, Kiefer [6] argued that most, if not all, data quality dimensions developed for structured data are also relevant for UTD, although she emphasized the importance of relevance, interpretability, and accuracy when assessing UTD fitness for use. Kiefer also noted that there has been little research about data quality dimensions and indicators of these dimensions for UTD [6].
The usability of UTD for research generally requires the application of natural language processing (NLP) techniques, including topic modeling, sentiment analysis, aspect mining (e.g., identifying different parts of speech), text summarization, and named entity recognition (e.g., identifying people, places, and other entities in unstructured data) [9][10][11][12][13][14]. To prepare UTD for one or more of these NLP techniques, preprocessing of the data is an essential step. Preprocessing of UTD includes such actions as removing stop words (i.e., common words in a language), removing punctuation, tagging (i.e., identifying or labelling) parts of speech, and transforming abbreviations into words or phrases so that they can be easily interpreted. Accordingly, some researchers have suggested that indicators that measure the outputs of data preprocessing steps, such as the number or percent of abbreviations and the number or percent of spelling errors, could be used to characterize UTD quality. Some types of NLP, such as named entity recognition, involve training classification models to learn to identify data entities, such as parts of speech, diseases, or geographic locations [10]. This requires the use of annotated databases as a "gold standard", which have been tagged (e.g., parts of speech have been documented or labelled) using manual or automated methods [10]. The quality of these annotated databases and the accuracy of classification models based on NLP applications involving annotated databases have also been proposed as indicators of UTD quality. Additionally, several unsupervised or supervised methods, which are used for both internal or external validation have been used for preprocessing and may be used to develop indicators for data quality. For example, in a study by Zennaki et al. [15], a recurrent neural network was used for unsupervised and semisupervised parts of speech tagging in languages for which there are no labeled training data.
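The restructuring actions described above (lowercasing, removing punctuation and stop words, and tokenization) can be sketched in a few lines of Python. This is a minimal illustration only; the regular expression and the tiny stop-word list are simplifying assumptions for this sketch, not resources drawn from the reviewed studies (real pipelines typically use larger, language-specific stop-word lists).

```python
import re

# A deliberately small, illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "was"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace punctuation and numbers with spaces
    tokens = text.split()                  # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The patient was seen in the clinic on 2021-01-05."))
# → ['patient', 'seen', 'clinic']
```

Counting the tokens removed at each step (e.g., the number of out-of-vocabulary words) is one way such preprocessing outputs could be turned into the data quality indicators discussed above.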
The data quality paradigm places a high value on the representation of truth from the perspective of the patient [16]. In a citizens' jury study by Ford et al. [16], a representative sample of citizens listened to subject matter experts discuss the sharing of UTD within EMRs for research. The jury then deliberated to reach conclusions on the questions they were asked. With respect to data quality, the jurors noted that text data may contain information about patients, judgments, and offhand comments that may be misinterpreted by the researcher [16]. The concern with the representation of truth is a form of external validation (i.e., assessing data veracity). A study by Pantazos et al. [17] discussed the preservation of medical correctness, readability, and consistency after EMR records were de-identified, to ensure data quality from a representativeness perspective. At the same time, NLP technologies still have difficulty with context and understanding language [18,19]. Preprocessing activities to prepare text data for research do not address the contextual concerns that the jurors raised. Thus, it should be noted that this study primarily focuses on assessing the fitness of text data for analytics.
In summary, there are potentially many indicators that could be used to describe UTD quality, and these are primarily based on the use of NLP techniques to preprocess UTD. However, there is little relevant literature and few, if any, guidelines on the data quality indicators that might be recommended for inclusion in data quality frameworks for UTD, or that might be used to guide data preprocessing in studies that apply NLP methods to UTD. In addition, there have been few studies that have investigated the impact of UTD quality assessment on the performance of text analyses using NLP. The objective was to systematically document current research and practices about NLP preprocessing methods for UTD to describe or improve its quality.

Methods
To achieve the research objective, we undertook a scoping review of published literature about NLP preprocessing methods and data quality. The purpose of a scoping review is to map and describe the literature on a new topic or research area and identify key concepts, gaps in the research area, and types and sources of evidence to inform future research [20]. We adopted the Arksey and O'Malley framework [20] for scoping reviews, which has the following steps: 1) define the research question, 2) identify relevant studies, 3) select studies, 4) chart the data, and 5) collate, summarize, and report results.

Search strategy
The search strategy included the concepts of (1) data quality, (2) NLP, and (3) data preprocessing (see Figure 1). The selection of search terms was informed by a systematic review on extracting text from EMRs to improve case detection [21], a scoping review about quality of routinely-collected electronic health data [22], and keywords related to preprocessing identified from an initial search of the literature [23][24][25]. We consulted a librarian who assisted in developing and refining the list of search terms. Our initial literature review revealed few articles that included NLP, data quality, and preprocessing in the health science discipline, thus we expanded our search to include relevant literature in all disciplines.
The review included empirical research articles and review articles and was conducted over two time periods. An initial search was executed with no minimum date restriction and an end date of April 15, 2020. The search was then updated with an end date of May 15, 2021. We searched Scopus, Web of Science, EBSCOhost, and ProQuest. In addition, the reference sections of the selected articles were hand searched to identify additional relevant articles.

Inclusion and exclusion criteria
An article was selected for inclusion if it met one or both of the following criteria: (1) it described research about preprocessing methods for UTD or UTD quality measures or methods, or it was a review article that discussed preprocessing methods to restructure or reorganize UTD for analysis; (2) it was about methods or processes to create a gold standard (or reference) dataset to validate UTD.
An article was excluded if it met one or more of the following criteria: (1) it was about methods for sentiment analysis, ontologies, semantic models, geo-spatial analysis, or qualitative research (e.g., methods to analyze interview or focus group data); (2) it was about methods to construct lexicon databases, dictionaries, or language databases; (3) it focused on the creation of software programs or proprietary solutions for text analysis; (4) it was not available in English; (5) it was an article from the ProQuest database that was neither an empirical article nor a scholarly article.

Article screening
Title and abstract screening was conducted for all articles identified through the search strategy, after duplicates were removed. Training was undertaken first: title and abstract screening was completed by two authors on a 10% sample of all articles identified from the initial search strategy, to ensure consistency in applying the inclusion and exclusion criteria. Percent agreement and its 95% confidence interval (CI) were calculated. After training was completed, all remaining articles were screened by one author. Rayyan [26], a web application for systematic and scoping reviews, was used to manage and organize articles through the process of title and abstract screening. Differences of opinion on the title and abstract screening were resolved by consensus. Full text screening of all articles selected after title and abstract screening was then conducted, to identify the articles to retain in the scoping review.

Data extraction and analysis
The following types of information were extracted from each article: (1) characteristics of the article, (2) characteristics of the text data, and (3) characteristics of preprocessing methods to restructure or reorganize the UTD. The systematic review conducted by Hinds et al. [22] was used to inform the types of information extracted from the articles, such as the characteristics of articles and validation methods. The reviews conducted by Spasic et al. [27,28] and Kiefer [6] provided guidance on the characteristics of text data and the preprocessing methods that were included in the data extraction form [6,27,28]. We extracted information about the specific dimensions (i.e., types) of data quality that were mentioned in each of the articles. These dimensions were identified from existing data quality frameworks, such as those developed by the Manitoba Centre for Health Policy [7]. Lastly, the data quality topics for UTD in articles that used EMR data were explored. The information extracted included: (1) methods, (2) strengths and limitations, and (3) use cases.
Data extraction training was completed by two authors on a 10% sample of the articles selected for full data extraction. A data dictionary was created to ensure consistency of the data extraction methods. Percent agreement and its 95% CI were calculated. Variables with open-ended responses were excluded from this calculation. Any differences in agreement were resolved by consensus.
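The percent agreement and 95% CI described above can be computed without external libraries. The sketch below uses the Wilson score interval, which is one common choice for a proportion CI; the article does not state which interval method was used, and the counts in the example are hypothetical.

```python
import math

def percent_agreement_ci(agreements, total, z=1.96):
    """Percent agreement with a Wilson score 95% confidence interval."""
    p = agreements / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, centre - half, centre + half

# Hypothetical counts: 95 agreements out of 114 double-screened articles.
p, lo, hi = percent_agreement_ci(95, 114)
print(f"{p:.1%} (95% CI: {lo:.1%}, {hi:.1%})")
```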

Results
The initial search (Figure 1, numbers not in parentheses), which encompassed the period up to April 15, 2020, identified a total of 1134 articles. The updated search (Figure 1, numbers in parentheses) identified an additional 154 articles. Thus, in total, 1288 articles were retrieved from the search before duplicates were removed, and 1226 remained after duplicates were removed (Figure 2). Initial training for title and abstract screening yielded 83.3% (95% CI: 75.2%, 89.2%) agreement. After full text screening, a total of 41 articles remained for data extraction (Figure 2).

Data extraction results
Ten percent of articles from the initial search were selected for the calculation of agreement for full text extraction; the overall agreement was 94.6% (95% CI: 88.9%, 97.4%). Table 1 summarizes the characteristics of the articles selected for full data extraction.
In total, 90.2% of the articles reported the results of empirical research and another 7.3% (n = 3) were review articles. Only one article was classified as a case study. More than half (51.2%) of the articles were published between 2016 and 2021. No articles were published prior to 2002. In terms of disciplinary area, 61% of articles were deemed to be from computer science and engineering disciplines, while 39% were from the health sciences, social sciences, humanities, and business.
The size of the UTD described in the articles was characterized in many ways (Table 2). Documents and words were mentioned most frequently (e.g., citing how many documents were used or how many words were found in a type of UTD). More than half (53.7%) of the articles mentioned the volume of documents, and 46.3% of the articles mentioned the count of words in each document. Elements often used to describe the size of structured data, such as how many rows (records) or how many columns (features), were among the least common ways to describe UTD size.
Almost all (85.4%) of the articles discussed using restructuring and reorganizing methods to prepare UTD for analysis (e.g., removing stop words, punctuation, and URLs). Figure 3 describes the types of restructuring and reorganizing methods that were used for all articles and for the subset of articles from health science disciplines.
We identified the articles that explicitly mentioned a data quality dimension (i.e., we counted the words pertaining to data quality). There was no prior determination of the quality dimension criteria when capturing words in relation to data quality. The three data quality dimensions that were most frequently mentioned were: accuracy (68.3%), relevance (34.1%), and comparability (31.7%). Furthermore, "data quality" or "quality" as terms were described or referenced in several ways among the 41 articles. Several articles discussed quality either from the perspective of data quality (or information quality), or using terminology from data or information quality dimensions (e.g., accuracy, correctness, interpretability) [13,17,47,56,58]. Other articles discussed enhancing data quality by focusing on utilizing or improving preprocessing methods [31, 34-37, 40, 42, 46, 50, 52, 54-56, 63, 64]. Several of these articles only mentioned "data quality" in passing; the main focus of these articles was the preprocessing methods. Other articles discussed quality from an "annotation" perspective, whereby tagging documents or words was intended to enhance the quality of algorithms either through the creation of treebanks or manual annotation using expert opinion [13, 30, 32, 33, 38, 39, 47-49, 51, 53, 59, 65, 66]. Lastly, some articles discussed quality through utilizing preprocessing methods that were primarily focused on achieving a more "accurate or relevant" outcome for algorithmic model performance [29, 41, 43-45, 57, 58, 60, 62, 67, 68]. Overall, while articles mentioned terminology such as "accuracy" or "relevance", data quality as a concept (or its individual dimensions) was referenced or described in multiple ways. Articles that focused exclusively on data quality dimensions or their measurement were rare.
Overall, many of the articles described an improvement in algorithm performance achieved through a variety of preprocessing techniques that enhance UTD quality. However, no articles reported on indicators of UTD quality before preprocessing. While data quality dimensions were not mapped to specific preprocessing methods, "accuracy" was used both to evaluate outcomes in a confusion matrix and as a descriptor for preprocessing methods that enable unsupervised and supervised algorithm outcomes to be more accurate. That is, all preprocessing methods were closely tied to the data quality dimension of accuracy.
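Accuracy as an outcome evaluated in a confusion matrix, as mentioned above, reduces to a simple calculation over the four cell counts; F1, precision, and recall (the annotation-quality measures reported later for EMR data) follow from the same counts. The counts in this sketch are hypothetical.

```python
def confusion_matrix_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = confusion_matrix_metrics(tp=40, fp=10, fn=5, tn=45)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# → 0.85 0.8 0.889 0.842
```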

UTD quality topics for EMR data
Seven data quality topics were discussed specifically in reference to EMR data. The topic areas were: (1) misspelled words, (2) security/de-identification, (3) reducing word variability, (4) sources of noise, (5) quality of annotations, (6) ambiguous abbreviations, and (7) reducing manual annotations. Further details such as the methods used, strengths and limitations of the methods, and the data used (i.e., use case) are provided in Supplementary Appendix 1.
Several of these quality topics focused on the reduction of possible error or variability (e.g., correcting misspelled words, reducing word variability, addressing sources of noise, or addressing ambiguity in abbreviations). Two articles reported on the assessment of misspelled words. The methods utilized were a combination of rule-based and machine learning approaches, such as dictionaries, regular expressions, a string-to-string edit distance known as the Levenshtein-Damerau distance, and word sense disambiguation for words with similar parts of speech (e.g., distinguishing between two similar words that are classified as verbs). One article, by Assale et al. [31], was about reducing word variability arising from typographical errors, where typographical errors in one word can create variability in how the word is spelled. Rule-based approaches were used in this article, such as preprocessing methods to restructure and reorganize UTD (e.g., removal of stop words), counting word frequencies, and utilizing the Levenshtein-Damerau distance metric. One article addressed sources of noise, where the removal of noise in text involves preprocessing methods that restructure and reorganize UTD; these methods include (but are not limited to) tokenization, converting uppercase letters to lowercase, and removal of stop words. Lastly, ambiguity in abbreviations was addressed in one article in which the authors used deep learning methods (i.e., convolutional neural networks); no feature engineering or preprocessing was required. The convolutional neural network was trained on word embeddings, which are vector representations of words; these word embeddings were extracted from journal articles found in PubMed.
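The Levenshtein-Damerau distance mentioned above can be illustrated with the optimal string alignment variant of the algorithm; this is a generic textbook implementation, not the code used in the reviewed studies.

```python
def damerau_levenshtein(a, b):
    """Optimal string alignment distance: counts insertions, deletions,
    substitutions, and transpositions of two adjacent characters."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(damerau_levenshtein("paitent", "patient"))  # → 1 (one adjacent transposition)
```

A spell-correction step of the kind Assale et al. describe would then search a list of frequent words for candidates at distance 1 or 2 from each suspected typo.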
Other topics focused on the evaluation or reduction of manual efforts such as assessing the quality of annotations or solutions involving reducing manual annotation of text data. Two articles focused on the quality of annotations. One focused on manual annotation of clinical notes for fall-related injuries. In the second article, tags for parts of speech and named entities were applied by annotators with backgrounds in computational linguistics, while physicians in training solved any disagreements between annotators.

Discussion
The scoping review has documented practices for preprocessing UTD to describe or improve its quality. Few articles in the scoping review discussed the quality of UTD before analysis or preprocessing was initiated. The main topics raised in the selected studies were about the challenges of defining data quality, the choice of data quality assessments for UTD and how this is influenced by the context of the text data, and differences in the data quality challenges associated with EMR UTD when compared to other types of UTD.
The scoping review reveals the following key points: 1) The most common preprocessing methods used in health science articles differed from the most common preprocessing methods used across all disciplines combined. To elaborate, the most common preprocessing methods in health science articles included removal of stop words, removal of punctuation, removal of numbers, tokenization, parts of speech tagging, and converting characters to lowercase. 2) Few dimensions of data quality were considered when assessing UTD; accuracy, relevance, and comparability were the most commonly reported dimensions. 3) Quality indicator topics addressed potential challenges in preprocessing, such as the quality of annotations, the presence of spelling errors, and the presence of ambiguous abbreviations.
One difficulty with describing the quality of UTD is the lack of standardized terminology. Strong et al. [69] differentiated "information quality" from "data quality" in terms of its specific goals; information quality is about assessing the needs of information users while data quality refers to the fitness of data for its intended use. However, Chen and Tseng [58] did not make that distinction; their indicators of information quality are similar to those found in existing data quality frameworks.
Amongst the data quality measures identified from the scoping review, some were tailored specifically to the data being assessed. This emphasizes that the quality of UTD depends on the context or type of data being assessed. Language usage is contextual to the environment (i.e., the vernacular used in product reviews differs from the vernacular used in EMR notes). For example, several of the data quality measures for UTD referenced in Chen and Tseng's article, such as the "objectivity" dimension and assessments of whether a product review is an opinion rather than factual, do not necessarily apply to data in EMRs [58].
Some of the challenges with the quality of UTD in EMRs differ from the challenges associated with UTD found in social media data or organizational reports. One of the biggest challenges of the former is ensuring privacy, anonymity, and confidentiality of patient data. Pantazos et al. [17] stated that as the use of UTD in EMRs increases, two goals must be achieved: (1) the data must be de-identified and anonymized, and (2) de-identified EMRs must contain accurate information about the de-identified patient and be coherent to the reader. The scoping review revealed that high-quality UTD within EMRs must be readable, correct, and consistent with a patient's record.
This research has some limitations. The scoping review was restricted to English language articles, and grey literature was not searched. Accordingly, the scoping review may not represent all published articles about UTD quality. Furthermore, articles that discussed preprocessing methods solely to improve algorithmic modelling outcomes were not included; it was not feasible to address all modelling techniques for text data (e.g., to collect all articles that conducted a sentiment analysis or applied other methods). The articles selected were those that focused on the quality of text data, or that focused on preprocessing methods and also mentioned data quality.
Choosing key words for the scoping review search strategy was challenging, due in large part to the lack of standardized terminology and the diverse terminology within the NLP and data quality literatures; this may have resulted in some articles being missed.
Despite these limitations, the major strength of this research is that it used a systematic approach to examine preprocessing methods to describe data quality for UTD. Data quality for UTD is an important area for research in multiple fields, including the health sciences. Written language in EMRs and other patient-related documents differs from other types of text data found in textbooks or social media because it is typically short and written in point form.
Several opportunities for future research exist. First, a scoping review could be conducted to identify operational definitions for data quality dimensions specific to text data from routinely collected electronic health data. Akin to Weiskopf et al.'s scoping review to discover methods and dimensions [70], such a review could be used to develop definitions for dimensions of data quality for UTD. This research has shown that there are unique characteristics of text data that are not present in numerical data (e.g., grammar rules or punctuation). Thus, operational data quality definitions that specifically address text data properties are an important step towards structuring a quality framework for text data. Second, a documentation project that maps operational definitions for data quality dimensions to the data quality indicators for routinely collected electronic health data should be explored [70]. Third, a case study could be undertaken for several of the text data quality indicator topics identified in this research (e.g., disambiguation of abbreviations). While there have been studies to disambiguate abbreviations [34] and correct spelling/typographical errors [31,35], it would also be beneficial to identify the impact of word variability on algorithm outputs in text data. Lastly, additional scoping or systematic reviews could be conducted. In particular, quality indicators for short text documents, such as social media posts, reviews, and EMRs, could be explored [13, 54-56, 58, 71]. Since EMRs are characterized by short texts, it would be interesting to examine other text quality indicators appropriate for these types of data.

Conclusion
Data quality is a multidimensional construct that depends on the context in which the data will be used. However, there are many similarities between the dimensions of data quality for structured and unstructured text and the methods to assess data quality. Assessing data quality in UTD often requires access to specialized gold standard datasets or dictionaries. However, there are a few general-purpose measures of data quality that do not require external data; most of these focus on the measurement of noise in the data.

Supplementary Appendix 1 (excerpt)

Data quality topic: Reducing word variability (Assale et al.)
Methods: 1) Counted word frequencies: words occurring above an 80% frequency threshold were considered correct; less frequent words were checked for presence in an Italian dictionary or a medical dictionary; any remaining words were considered typographical errors. 2) Used a distance metric between strings, the "Levenshtein distance", searching the 80% of most frequent words that had a distance of 1 from the typos; distance 1 signifies that the typographical word differs from the original word by one letter insertion, deletion, or substitution. 3) Also took into account the "Damerau-Levenshtein distance" metric, which additionally accounts for inversions between letters (distance 2), because it is common to invert two adjacent letters. 4) Manually inspected corrections to verify that there were no association errors; ambiguous associations were discarded. 5) Multi-associated words (i.e., words with the same meaning but varied in spelling) were replaced with the most frequent one.
Strengths: Proposed a method for reducing word variability.
Limitations: 1) The number of false positives is high. 2) The method cannot guarantee correction of all errors.
Use case: Applied to anamnestic summaries of endocrinology and rheumatology.

Data quality topic: Sources of noise (Berndt et al.)
Methods: 1) converting text to lowercase; 2) tokenization; 3) removal of tokens with fewer than three characters or no alphabetical characters; 4) normalizing terms; 5) removal of stop words; 6) removing tokens that occur only once.
Strengths: Addressed text noise by introducing preprocessing methods to reduce noise.
Limitations: The study used only one dataset.
Use case: Applied to clinical progress notes.

Data quality topic: Quality of annotations (He et al.)
Methods: Annotation methods included 1) word segmentation; 2) parts of speech tagging (with shallow and full parsing of parts of speech tags); 3) named entity tagging; 4) relational tagging (i.e., finding relationships among named entities). Annotation quality was evaluated using the F1 measure, precision, and recall.
Strengths: Built a concept of data quality into the creation of a corpus by assessing the quality of annotations.
Limitations: Because annotation resources were limited, the corpus covered only two departments of a hospital, and thus lacks medical terminology from other departments.

Data quality topic: Quality of annotations (Liang et al.)
Methods: Utilized a KNN classifier (a supervised method) to predict document type (i.e., whether a document concerns "diagnostic errors" or "device related complications"). The process was as follows: 1) noise removal: punctuation removed, words set to lowercase, white space removed, stop words removed; 2) each document converted to a document-term matrix; 3) documents pre-annotated as either "diagnostic errors" or "device related complications" to create a gold standard comparison; 4) evaluation using the F measure and accuracy.
Strengths: Demonstrated a process that enhances automatic annotation processes and includes data quality elements.
Limitations: The sample size in the study was small, and the method was not demonstrated on other types of data such as EMR clinical notes.
Use case: Applied to publicly available patient safety documents from WebM&M.