Using natural language processing to extract structured epilepsy data from unstructured clinic letters


Beata Fonferko-Shadrach
Arron Lacey
Ashley Akbari
Simon Thompson
David Ford
Ronan Lyons
Mark Rees
Owen Pickrell


Electronic health records (EHRs) are a powerful resource for large-scale healthcare research. However, EHRs often lack the detailed disease-specific information that is recorded as free text in clinical settings. Natural Language Processing (NLP) can address this gap by extracting detailed clinical information from free text.

Objectives and Approach
Using a training sample of 40 letters, we built custom rule sets in the General Architecture for Text Engineering (GATE) framework for nine categories of epilepsy information, as well as clinic date and date of birth. We then compared the output of our algorithm on a validation set of 200 clinic letters with a separate manual review by a clinician, evaluating each category with both a “per item” and a “per letter” approach.
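To give a feel for what a rule-based extractor does, here is a minimal sketch in plain Python. It is not the study's GATE rule set, which the abstract does not reproduce: the drug list, the regular expression and the function name `extract_medications` are all illustrative assumptions, and a regular expression stands in for GATE's own rule language.

```python
import re

# Hypothetical mini-gazetteer of drug names (an assumption for illustration;
# the study's actual GATE gazetteers and rules are not given in the abstract).
ANTIEPILEPTIC_DRUGS = ["lamotrigine", "levetiracetam", "sodium valproate", "carbamazepine"]

# A drug name, optionally followed by a numeric dose and a unit (mg or g).
DRUG_PATTERN = re.compile(
    r"\b(?P<drug>" + "|".join(re.escape(d) for d in ANTIEPILEPTIC_DRUGS) + r")\b"
    r"(?:\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|g))?",
    re.IGNORECASE,
)


def extract_medications(letter_text):
    """Return one dict per medication mention found in the letter text."""
    items = []
    for match in DRUG_PATTERN.finditer(letter_text):
        dose = match.group("dose")
        unit = match.group("unit")
        items.append({
            "drug": match.group("drug").lower(),
            "dose": float(dose) if dose else None,
            "unit": unit.lower() if unit else None,
        })
    return items


# Made-up sentence, not taken from any real clinic letter.
letter = "She continues on Lamotrigine 100 mg twice daily. Levetiracetam was stopped."
for item in extract_medications(letter):
    print(item)
```

A real rule set would also need to handle negation, history and context (for example "was stopped"), which this sketch deliberately ignores.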

The “per item” approach identified 1,939 items of information with overall precision, recall and F1-score of 92.7%, 77.7% and 85.6%, respectively. Precision and recall for the epilepsy-specific categories were: diagnosis (85.3%, 92.4%), type (93.7%, 83.2%), focal seizure (99.0%, 68.3%), generalised seizure (92.5%, 57.0%), seizure frequency (92.0%, 52.3%), medication (96.1%, 94.0%), CT (66.7%, 47.1%), MRI (96.6%, 51.4%) and EEG (95.8%, 40.6%). By combining all items per category within each letter (the “per letter” approach), we achieved higher precision, recall and F1-score of 94.6%, 84.2% and 89.0% across all categories.
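The difference between the two scoring approaches can be sketched as follows. This is one plausible reading, not the authors' evaluation code: it assumes that "per item" scoring counts every mention separately, while "per letter" scoring merges repeated mentions of the same category and value within a letter. The function names and the toy annotations are hypothetical.

```python
from collections import Counter


def precision_recall_f1(tp, fp, fn):
    """Standard definitions; F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def score(predicted, gold, per_letter=False):
    """predicted / gold: dict mapping letter id -> list of (category, value) mentions."""
    tp = fp = fn = 0
    for letter_id in set(predicted) | set(gold):
        p = Counter(predicted.get(letter_id, []))
        g = Counter(gold.get(letter_id, []))
        if per_letter:
            # Assumed "per letter" behaviour: repeated mentions count once.
            p, g = Counter(set(p)), Counter(set(g))
        matched = sum((p & g).values())  # multiset intersection
        tp += matched
        fp += sum(p.values()) - matched
        fn += sum(g.values()) - matched
    return precision_recall_f1(tp, fp, fn)


# Toy annotations (invented): the clinician marked lamotrigine twice in one
# letter, but the algorithm found only one of the two mentions.
gold = {"letter1": [("medication", "lamotrigine"),
                    ("medication", "lamotrigine"),
                    ("diagnosis", "epilepsy")]}
predicted = {"letter1": [("medication", "lamotrigine"),
                         ("diagnosis", "epilepsy")]}

print("per item:  ", score(predicted, gold))
print("per letter:", score(predicted, gold, per_letter=True))
```

In this toy case the missed duplicate mention lowers recall under per-item scoring but not under per-letter scoring, which matches the direction of the improvement reported above.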

Our results demonstrate that NLP techniques can accurately extract rich phenotypic details from clinic letters that are often missing from routinely collected data. Capturing these new data types provides a platform for novel precision neurology research, and the approach may be applicable to other disease areas.


Mechanical ventilation (MV) is an important intervention in critically ill patients. Accurately identifying MV use in Hospital Discharge Abstracts would be extremely useful for population-based research. Although the Canadian Institute for Health Information collects information on MV for all hospitalizations, the validity of this information in intensive care unit (ICU) patients is unknown.

Objectives and Approach

We validated MV use within ICU patients in Hospital Discharge Abstracts. In the Winnipeg Regional Health Authority (WRHA) ICU database, MV use is recorded prospectively by trained nurses. All patients admitted to a WRHA ICU (82 beds) between April 1, 2000 and March 31, 2012 were identified in Hospital Discharge Abstracts. MV was identified in Hospital Discharge Abstracts using International Classification of Diseases (ICD-9-CM) codes prior to 2004 and Canadian Classification of Health Interventions (CCI) codes from 2004 onwards. Agreement between the WRHA database (gold standard) and Hospital Discharge Abstracts for invasive ventilation, non-invasive ventilation or neither was calculated at the ICU encounter level.


There were 54,680 WRHA ICU admissions during the study period. Linkage of the two sources was highly successful, with accurate identification exceeding 99%. There were 26,083 mechanical ventilations (25,387 invasive; 696 non-invasive) in the Hospital Discharge Abstracts and 30,455 (28,315 invasive; 4,554 non-invasive) in the ICU data. For identifying mechanical ventilation, Hospital Discharge Abstracts had a sensitivity of 82.8%, specificity of 96.4%, positive predictive value (PPV) of 96.7% and negative predictive value (NPV) of 81.7%. For invasive ventilation, sensitivity was 85.5%, specificity 95.6%, PPV 95.4% and NPV 86.0%. For non-invasive ventilation, sensitivity (9.38%) and PPV (61.35%) were poor, while specificity was 99.5% and NPV 92.36%.
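The four validity measures reported here follow from a two-by-two comparison of each encounter's gold-standard label with its Hospital Discharge Abstract label. The sketch below shows that calculation on invented toy labels; the function name `diagnostic_metrics` and the label strings are assumptions, and none of the values are study data.

```python
def diagnostic_metrics(gold, test, positive):
    """Compare encounter-level labels, treating labels in `positive` as 'ventilated'.

    gold: labels from the reference source (here, the ICU database)
    test: labels from the source being validated (here, discharge abstracts)
    Returns None for any measure whose denominator is zero.
    """
    tp = fp = fn = tn = 0
    for g, t in zip(gold, test):
        g_pos, t_pos = g in positive, t in positive
        if g_pos and t_pos:
            tp += 1
        elif t_pos:
            fp += 1
        elif g_pos:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy encounter-level labels (invented for illustration).
gold = ["invasive", "invasive", "non-invasive", "neither", "neither", "invasive"]
test = ["invasive", "neither", "neither", "neither", "invasive", "invasive"]

print("any MV:      ", diagnostic_metrics(gold, test, {"invasive", "non-invasive"}))
print("invasive:    ", diagnostic_metrics(gold, test, {"invasive"}))
print("non-invasive:", diagnostic_metrics(gold, test, {"non-invasive"}))
```

Passing a different `positive` set gives the one-versus-rest breakdown used above: any ventilation, invasive only, or non-invasive only.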


Hospital Discharge Abstracts are a good source for identifying mechanically ventilated patients in hospital stays that include an ICU admission, especially for invasive mechanical ventilation. Future research needs to explore the poor agreement for non-invasive mechanical ventilation.


How to Cite
Fonferko-Shadrach, B., Lacey, A., Akbari, A., Thompson, S., Ford, D., Lyons, R., Rees, M. and Pickrell, O. (2018) “Using natural language processing to extract structured epilepsy data from unstructured clinic letters”, International Journal of Population Data Science, 3(4). doi: 10.23889/ijpds.v3i4.699.
