The Dementias Platform UK (DPUK) Data Portal is a secure, accessible environment that facilitates the provision of rich data for the world’s largest community of dementia, cognition and ageing cohort studies. DPUK also provides services that help cohort studies and researchers maximise the research potential of the programme’s community.
Objectives and Approach
As part of their engagement with the Data Portal, DPUK cohorts will upload data to the DPUK instance of the UK Secure eResearch Platform infrastructure. The Data Portal gives cohorts access to a collaborative working space in which they can enrich their own data, perform their own analysis, and enhance the research potential of their data, while drawing on expertise at various DPUK sites in areas such as data linkage, curation and multi-modal specialisms. Because cohort data is organised into ontologies, researchers can access the data specific to their study needs and request it from multiple cohorts simultaneously.
By using the Data Portal, researchers have access to cohort data that has been prepared for dementia epidemiology using the agreed ontologies, which gives more rapid access to cohort data that might otherwise be large and complex. The knowledge and experience of DPUK staff and collaborators can also help guide nascent cohorts and feasibility studies towards producing research-ready datasets, enabling them to achieve greater impact with their data. A range of analytical tools is provided on the Data Portal, making analysis of a cohort’s own data, or of multiple independent datasets, more accessible. Alongside data curation, DPUK also facilitates data linkage to routine sources, beginning with a Wales-wide use case that will expand to the UK over the course of the project.
Making data from international sources accessible through a central platform permits international collaboration, with ontologies allowing previously disparate data to be combined and analysed to build knowledge and research impact. DPUK projects create policy-leading results and operational research standards, enhancing cohort impact and the discovery of benefits for dementia patients.
Electronic health records (EHRs) are a powerful resource for enabling large-scale healthcare research. However, EHRs often lack the detailed disease-specific information that is recorded in free text in clinical settings. This challenge can be addressed by using Natural Language Processing (NLP) to derive and extract detailed clinical information from free text.
Objectives and Approach
With a training sample of 40 letters, we used the General Architecture for Text Engineering (GATE) framework to build custom rule sets for nine categories of epilepsy information, as well as clinic date and date of birth. We then used a validation set of 200 clinic letters to compare the results of our algorithm with a separate manual review by a clinician, evaluating a “per item” and a “per letter” approach for each category.
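To illustrate what rule-based extraction from free text can look like, the following is a minimal Python sketch using regular expressions. It is not GATE and does not reproduce the rule sets described above; the medication list, the date patterns, the category names and the function `extract_items` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, deliberately tiny term list; a real lookup list would be far larger.
MEDICATIONS = ["lamotrigine", "levetiracetam", "carbamazepine", "sodium valproate"]

# Assumed letter conventions: labelled dates written as DD/MM/YYYY.
DATE = r"(\d{1,2}/\d{1,2}/\d{4})"
CLINIC_DATE_RULE = re.compile(
    r"\b(?:clinic date|date of clinic)\s*[:\-]?\s*" + DATE, re.IGNORECASE
)
DOB_RULE = re.compile(r"\b(?:date of birth|dob)\s*[:\-]?\s*" + DATE, re.IGNORECASE)

# A medication name, optionally followed by a dose such as "100 mg".
MED_RULE = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in MEDICATIONS) + r")\b"
    r"(?:\s+(\d+(?:\.\d+)?)\s*(mg|g)\b)?",
    re.IGNORECASE,
)


def extract_items(letter_text):
    """Return a list of (category, value) pairs found in one letter."""
    items = []
    for match in CLINIC_DATE_RULE.finditer(letter_text):
        items.append(("clinic_date", match.group(1)))
    for match in DOB_RULE.finditer(letter_text):
        items.append(("date_of_birth", match.group(1)))
    for match in MED_RULE.finditer(letter_text):
        name = match.group(1).lower()
        # The dose groups are None when no dose follows the medication name.
        dose = f" {match.group(2)}{match.group(3).lower()}" if match.group(2) else ""
        items.append(("medication", name + dose))
    return items


if __name__ == "__main__":
    letter = (
        "Clinic date: 12/03/2017\n"
        "DOB: 01/02/1980\n"
        "She continues on Lamotrigine 100 mg twice daily and levetiracetam."
    )
    for category, value in extract_items(letter):
        print(category, value)
```

Hand-written rules of this kind tend to favour precision over recall: a pattern rarely fires on the wrong text, but any phrasing the rule author did not anticipate is missed.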
The “per letter” approach identified 1,939 items of information with overall precision, recall and F1-score of 92.7%, 77.7% and 85.6%, respectively. Precision and recall for epilepsy-specific categories were: diagnosis (85.3%, 92.4%), type (93.7%, 83.2%), focal seizure (99.0%, 68.3%), generalised seizure (92.5%, 57.0%), seizure frequency (92.0%, 52.3%), medication (96.1%, 94.0%), CT (66.7%, 47.1%), MRI (96.6%, 51.4%) and EEG (95.8%, 40.6%). By combining all items per category, per letter, we were able to achieve higher precision, recall and F1-scores of 94.6%, 84.2% and 89.0%, respectively, across all categories.
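The metrics reported above can be sketched as follows. The abstract does not define the two evaluation approaches in detail, so this example assumes one plausible reading: “per item” compares every individual mention, while “per letter” first collapses repeated mentions of the same value within a letter and category. The tuple layout, the function names and the data are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 from two collections of hashable items."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    false_pos = len(predicted - gold)
    false_neg = len(gold - predicted)
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def collapse_per_letter(items):
    """Assumed 'per letter' view: drop the mention offset so that repeated
    mentions of one value in the same letter and category count once."""
    return {(letter, category, value) for letter, category, _offset, value in items}


# Hypothetical annotations: (letter_id, category, character_offset, value).
gold = [
    ("L1", "medication", 10, "lamotrigine"),
    ("L1", "medication", 85, "lamotrigine"),  # second mention of the same drug
    ("L1", "diagnosis", 40, "epilepsy"),
    ("L2", "medication", 12, "levetiracetam"),
]
predicted = [
    ("L1", "medication", 10, "lamotrigine"),
    ("L1", "diagnosis", 40, "epilepsy"),
    ("L2", "medication", 12, "levetiracetam"),
    ("L2", "eeg", 60, "normal"),  # spurious extraction
]

per_item = precision_recall_f1(predicted, gold)
per_letter = precision_recall_f1(
    collapse_per_letter(predicted), collapse_per_letter(gold)
)

if __name__ == "__main__":
    print("per item:   P=%.3f R=%.3f F1=%.3f" % per_item)
    print("per letter: P=%.3f R=%.3f F1=%.3f" % per_letter)
```

Under this reading, a missed repeat mention lowers recall in the per-item view but not in the per-letter view, which is one way that combining items per letter can raise the scores.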
Our results demonstrate that NLP techniques can be used to accurately extract rich phenotypic details from clinic letters that are often missing from routinely collected data. Capturing these new data types provides a platform for conducting novel precision neurology research, and the approach is potentially applicable to other disease areas.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.