Clinical Text De-identification and Other Large Scale Processing Tasks in Resource Constrained Environments
IJPDS (2017) Issue 1, Vol 1:157, Proceedings of the IPDLN Conference (August 2016)

Richard Jackson
Richard Dobson
Robert Stewart

Abstract

Objectives
Clinical text de-identification is a common requirement of the ‘enclave’ governance model of ethical EHR research. However, little consideration is often given to the engineering required to scale de-identification across the hundreds of millions of clinical documents containing personal identifiers that reside in the data repositories of a typical NHS Trust. Similarly, natural language processing is an increasingly important field of clinical data science, yet it requires fault-tolerant approaches to data processing. This work concerns the development of “turbo-laser”, a distributed document processing architecture based upon the popular, ‘battle-hardened’ Spring Batch framework, an industry standard for large-scale processing tasks.


Approach
Using Spring Batch, we developed a highly scalable framework for processing unstructured data, built on the concept of remote partitioning. Remote partitioning allows processing tasks to be offloaded to any and all computers in a network. With this approach, it is possible to harness all of the compute available to an organisation, whether that is an office of 15 desktop PCs that go unused overnight or a compute cluster of a thousand processors. This method is especially valuable in the NHS, where the provision of sufficient compute for large-scale analytics is often hindered by a lack of available hardware, or by difficulties in navigating technical governance policies that are ill-equipped for the demands of modern data science.
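The idea behind partitioning can be illustrated with a minimal sketch. It is not the turbo-laser implementation and does not use the Spring Batch API: it is a standalone Python illustration that assumes documents are addressed by a contiguous range of integer IDs, and it uses local threads as stand-ins for remote worker machines. All function names (`partition_id_range`, `process_partition`, `run_job`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_id_range(min_id, max_id, grid_size):
    """Split the inclusive ID range [min_id, max_id] into at most
    grid_size contiguous, non-overlapping (start, end) partitions."""
    if grid_size < 1:
        raise ValueError("grid_size must be at least 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    size = -(-total // grid_size)  # ceiling division
    partitions = []
    start = min_id
    while start <= max_id:
        end = min(start + size - 1, max_id)
        partitions.append((start, end))
        start = end + 1
    return partitions


def process_partition(bounds):
    """Stand-in for a worker step. A real worker would read, process and
    write each document in its range; here we only count the IDs."""
    start, end = bounds
    return end - start + 1


def run_job(min_id, max_id, grid_size):
    """Partition the ID range and hand each partition to a worker.
    Threads are an assumption standing in for remote machines."""
    partitions = partition_id_range(min_id, max_id, grid_size)
    with ThreadPoolExecutor(max_workers=grid_size) as pool:
        counts = list(pool.map(process_partition, partitions))
    return partitions, counts
```

For example, `partition_id_range(1, 10, 3)` returns `[(1, 4), (5, 8), (9, 10)]`. Because each partition is independent of the others, a failed partition could in principle be retried without repeating the rest of the job, which is the property that makes this pattern attractive for fault-tolerant processing.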


Results
Turbo-laser was developed with the processing challenges common in the NHS in mind. Currently, four types of ‘job’ are available: de-identification, using the Cognition algorithm; generic GATE output; text extraction from binary files such as MS Office, PDF and scanned documents; and a document re-compiler to deal with EHR legacy issues. Examples of turbo-laser usage include the processing of 9 million binary documents on modest hardware within 48 hours.
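As a rough check on what the 9 million document figure implies, the short calculation below converts it into an average rate. This is simple arithmetic on the numbers quoted above, not a benchmark reported by the authors, and the function name is hypothetical. Since the job finished within 48 hours, the result is a lower bound on the average throughput.

```python
def min_sustained_throughput(documents, hours):
    """Average documents per second needed to finish within the given hours."""
    return documents / (hours * 3600)


rate = min_sustained_throughput(9_000_000, 48)
print(f"{rate:.1f} documents per second")  # prints "52.1 documents per second"
```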


Conclusion
Turbo-laser is an enterprise-grade processing tool, in keeping with the software engineering pattern of ‘batch processing’ that has been at the forefront of the informatics movement. It is an open source project, and it is hoped that others will contribute to it and extend its principles, lowering the barrier to large-scale data processing throughout the NHS.

How to Cite
Jackson, R., Dobson, R. and Stewart, R. (2017) “Clinical Text De-identification and Other Large Scale Processing Tasks in Resource Constrained Environments: IJPDS (2017) Issue 1, Vol 1:157, Proceedings of the IPDLN Conference (August 2016)”, International Journal of Population Data Science, 1(1). doi: 10.23889/ijpds.v1i1.176.