A significant amount of valuable information in Electronic Health Records (EHRs), such as laboratory test results or echocardiogram interpretations, is embedded in lengthy free-text fields. These narratives often include patients’ personal information as well. Privacy legislation in different jurisdictions requires this information to be de-identified before it is made available for research, a process that can be challenging and time-consuming. In particular, rule-based algorithms may over-mask essential medical terms, conditions, or devices that are named after individuals.
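The over-masking problem can be illustrated with a small Python sketch. Everything in it is hypothetical and for illustration only: the surname list, the context word list, the sample note, and the hand-written context rule are assumptions, not the ICES rule-based application or its replacement.

```python
import re

# Hypothetical surname dictionary; a real rule-based system would use a far larger one.
SURNAMES = {"Smith", "Foley", "Parkinson"}

# Hypothetical words that suggest the preceding surname is part of a medical eponym.
EPONYM_CONTEXT = {"catheter", "disease", "syndrome"}

# Split text into alternating runs of letters and non-letters so it can be rejoined exactly.
TOKEN = re.compile(r"[A-Za-z]+|[^A-Za-z]+")


def mask_naive(text):
    """Mask every dictionary surname, regardless of context."""
    return "".join("[NAME]" if tok in SURNAMES else tok for tok in TOKEN.findall(text))


def mask_with_context(text):
    """Mask a dictionary surname unless the next word marks it as an eponym."""
    tokens = TOKEN.findall(text)
    words = [(i, tok) for i, tok in enumerate(tokens) if tok.isalpha()]
    out = list(tokens)
    for pos, (i, tok) in enumerate(words):
        if tok not in SURNAMES:
            continue
        next_word = words[pos + 1][1].lower() if pos + 1 < len(words) else ""
        if next_word in EPONYM_CONTEXT:
            continue  # e.g. "Foley catheter": keep the medical term
        out[i] = "[NAME]"
    return "".join(out)


note = "Dr. Smith inserted a Foley catheter; history of Parkinson disease."
print(mask_naive(note))
print(mask_with_context(note))
```

The naive rule removes the device and the condition along with the clinician's name, while the context check keeps them. A learned model aims to capture this kind of context from annotated data rather than from a fixed word list.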
Objectives and Approach
We aimed to enhance ICES’ existing rule-based application by applying Artificial Intelligence (AI) to make it context-driven. The ICES team collaborated with Evenset, a Toronto-based software company, and with computer scientists at the University of Manchester who had already published work in this area. Based on the University of Manchester de-identification framework, three machine learning-based algorithms for named entity recognition were implemented: a conditional random field (CRF) model and bidirectional long short-term memory (BiLSTM) recurrent neural networks with GloVe and with ELMo word embeddings. The models were trained on three different types of ICES data: laboratory results, Electronic Medical Record (EMR) data and echocardiogram data. Evenset developed the user interface and the masking modules.
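As a rough sketch of how a masking step might consume the output of a named entity recognition model, the Python function below replaces tagged spans with placeholders. The BIO tagging scheme, the `NAME` label, the function name `mask_entities`, and the hand-written tags standing in for model predictions are all assumptions for illustration; the abstract does not describe the label set or the interface of the actual masking modules.

```python
def mask_entities(tokens, tags):
    """Replace each tagged entity span with a single [TYPE] placeholder.

    tokens: list of word strings.
    tags:   list of BIO labels of the same length, e.g. "O", "B-NAME", "I-NAME"
            (an assumed label scheme, not one taken from the source).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    masked = []
    previous_type = None  # entity type of the span currently being masked, if any
    for token, tag in zip(tokens, tags):
        if tag == "O":
            masked.append(token)
            previous_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        # Start a new placeholder on "B-", or on an "I-" that does not continue a span.
        if prefix == "B" or entity_type != previous_type:
            masked.append(f"[{entity_type}]")
        previous_type = entity_type
    return " ".join(masked)


# Hand-written tags standing in for model predictions on a made-up sentence.
tokens = ["Mr", "John", "Smith", "has", "Parkinson", "disease"]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "O"]
print(mask_entities(tokens, tags))
```

Because the tagger labels "Parkinson" as outside any entity in this context, the condition survives masking while the patient's name does not.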
Preliminary tests have generated very promising results. To improve the accuracy of the models, additional data annotation to expand the training datasets is under way at ICES. The final framework will be released to the public as an open-source tool.
Conclusion / Implications
A collaborative approach for solving complex problems like de-identification of text-based medical data is highly efficient, especially where there are unique sets of expertise, resources, data and clinical knowledge among stakeholders.
This work is licensed under a Creative Commons Attribution 4.0 International License.