Obtaining structured clinical data from unstructured data using natural language processing software
IJPDS (2017) Issue 1, Vol 1:359, Proceedings of the IPDLN Conference (August 2016)


Arron S Lacey
Beata Fonferko-Shadrach
Ronan A Lyons
Mike P Kerr
David V Ford
Mark I Rees
Owen W Pickrell



Free text documents in healthcare settings contain a wealth of information not captured in electronic healthcare records (EHRs). Epilepsy clinic letters are an example of an unstructured data source containing a large amount of intricate disease information. Extracting meaningful and contextually correct clinical information from free text sources, to enhance EHRs, remains a significant challenge. SCANR (Swansea University Collaborative in the Analysis of NLP Research) was set up to use natural language processing (NLP) technology to extract structured data from unstructured sources.

IBM Watson Content Analytics software (ICA) uses NLP technology. It enables users to define annotations based on dictionaries and language characteristics to create parsing rules that highlight relevant items. These include clinical details such as symptoms and diagnoses, medication and test results, as well as personal identifiers.
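The dictionary-driven annotation idea described above can be illustrated with a minimal sketch. It is not ICA's API or rule language: it uses plain Python regular expressions, and the `DICTIONARIES` contents, the `annotate` function, and the sample letter text are all hypothetical, chosen only to show how dictionary terms can be turned into labelled text spans.

```python
import re

# Hypothetical dictionaries: illustrative terms only, not those used in the study.
DICTIONARIES = {
    "medication": ["lamotrigine", "levetiracetam", "sodium valproate"],
    "investigation": ["EEG", "MRI", "CT"],
    "epilepsy_type": ["focal epilepsy", "generalized epilepsy"],
}

def annotate(text):
    """Return (concept, matched_text, start, end) tuples for dictionary terms found in text."""
    annotations = []
    for concept, terms in DICTIONARIES.items():
        for term in terms:
            # Word boundaries stop a short term such as "CT" matching inside a longer word.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                annotations.append((concept, match.group(0), match.start(), match.end()))
    # Order annotations by their position in the text.
    return sorted(annotations, key=lambda a: a[2])

# Invented example sentence, not taken from a real clinic letter.
letter = "Diagnosis: focal epilepsy. MRI was normal. She continues on lamotrigine 100 mg twice daily."
for annotation in annotate(letter):
    print(annotation)
```

A real pipeline would also need rules for context such as negation, uncertainty, and family history, which simple dictionary matching does not handle.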


To use ICA to build a pipeline to accurately extract detailed epilepsy information from clinic letters.

We used ICA to retrieve important epilepsy information from 41 pseudo-anonymized unstructured epilepsy clinic letters. The 41 letters consisted of 13 ‘new’ and 28 ‘follow-up’ letters (for 15 different patients) written by 12 different doctors in different styles. We designed dictionaries and annotators to enable ICA to extract epilepsy type (focal, generalized or unclassified), epilepsy cause, age of onset, investigation results (EEG, CT and MRI), medication, and clinic date. Epilepsy clinicians assessed the accuracy of the pipeline.

The accuracy (sensitivity, specificity) of each concept was: epilepsy diagnosis 98% (97%, 100%), focal epilepsy 100%, generalized epilepsy 98% (93%, 100%), medication 95% (93%, 100%), age of onset 100% and clinic date 95% (95%, 100%).

Precision and recall for each concept were, respectively, 98% and 97% for epilepsy diagnosis, 100% each for focal epilepsy, 100% and 93% for generalized epilepsy, 100% each for age of onset, 100% and 93% for medication, 100% and 96% for EEG results, 100% and 83% for MRI scan results, and 100% and 95% for clinic date.
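For readers less familiar with these measures, the sketch below shows how accuracy, sensitivity (which is the same quantity as recall), specificity, and precision follow from a confusion matrix. The function name `evaluation_metrics` and the counts passed to it are hypothetical and are not the study's data.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute standard evaluation measures from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return NaN rather than raising when a measure is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # identical to recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }

# Hypothetical counts for illustration only.
metrics = evaluation_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With no false positives, precision and specificity both equal 100%, which is consistent with the pattern reported for most concepts above.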

ICA is capable of extracting detailed, structured epilepsy information from unstructured clinic letters to a high degree of accuracy. This data can be used to populate relational databases and be linked to EHRs. Researchers can build in custom rules to identify concepts of interest from letters and produce structured information. We plan to extend our work to hundreds and then thousands of clinic letters, to provide phenotypically rich epilepsy data to link with other anonymised, routinely collected data.


Hospital length of stay (LOS) is a widely used measure for assessing health system performance across jurisdictions and informs resource allocation decisions. However, the accuracy of existing LOS risk adjustment models is limited because they are mostly derived from administrative data, which largely contain clinical and diagnostic information but lack detailed information on relevant demographic, socio-economic status (SES), and self-reported health-related quality of life (HRQOL) risk factors. These factors have been shown to improve the accuracy of LOS risk adjustment models. This study investigates the relative contribution of demographic, socio-economic, and health status risk factors, derived through data linkage, to improving the accuracy of LOS risk adjustment models.


Population-based data on 8000 individuals hospitalized for coronary heart disease were obtained from the Alberta Provincial Project on Outcomes Assessment in Coronary Heart Disease (APPROACH) registry and linked to the Alberta Discharge Abstract Database (DAD). SES was measured using a multi-domain measure of SES derived from area-level census information, while health-related quality of life was measured using the Seattle Angina Questionnaire. A LOS risk adjustment model based on hierarchical logistic regression was developed to assess the relative impact of each SES and HRQOL measure on the predictive accuracy of LOS risk adjustment. The relative impact of each predictor was assessed by its adjusted odds ratio (OR) and by its improvement over the predictive accuracy of a reference model that included patients' clinical risk factors only.
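As a reminder of how an adjusted odds ratio relates to a fitted logistic regression, the sketch below converts a coefficient and its standard error into an OR with a Wald 95% confidence interval. The function name `odds_ratio_with_ci` and the numeric inputs are hypothetical. The abstract does not report coefficients, and this is not the authors' hierarchical model.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logistic-regression coefficient and standard error to (OR, lower, upper)."""
    # The OR is the exponentiated coefficient; the Wald interval is built on the log-odds scale.
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient and standard error, for illustration only.
or_, lower, upper = odds_ratio_with_ci(beta=0.40, se=0.10)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to a predictor that is statistically significant at roughly the 5% level.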


More than 80% of the hospitalized individuals had a prolonged LOS of more than 10 days. The HRQOL and single-domain measures of SES had a significant impact on the accuracy of LOS prediction. However, including the multi-domain measure of SES did not significantly improve the accuracy of LOS risk adjustment models.


Using large population-based Canadian data, our study suggests that the inclusion of patients' SES and health status information through data linkage can improve the accuracy of LOS risk adjustment models. The development of more accurate risk adjustment models can aid the identification of individuals at risk of prolonged LOS and the comparison of health system performance across jurisdictions.


How to Cite
Lacey, A. S., Fonferko-Shadrach, B., Lyons, R. A., Kerr, M. P., Ford, D. V., Rees, M. I. and Pickrell, O. W. (2017) “Obtaining structured clinical data from unstructured data using natural language processing software: IJPDS (2017) Issue 1, Vol 1:359 Proceedings of the IPDLN Conference (August 2016)”, International Journal of Population Data Science, 1(1). doi: 10.23889/ijpds.v1i1.381.
