Validating epilepsy diagnoses in routinely collected data
Primary healthcare records are used for studies within large data repositories. One limitation of using these routinely collected data for epilepsy research is the possibility of including incorrectly recorded diagnoses. To our knowledge, the accuracy of UK GP diagnosis codes for epilepsy has been only partially validated.
Objectives and Approach
We aimed to validate the accuracy of case ascertainment algorithms in identifying people with epilepsy in routinely collected Welsh healthcare data.
A reference population of 150 people with definite epilepsy and 150 people without epilepsy was ascertained from hospital records and linked to records held within the Secure Anonymised Information Linkage (SAIL) databank in Wales. We used three different algorithms to identify the reference population: a) individuals with an epilepsy diagnosis code and two consecutive antiepileptic drug (AED) prescription codes; b) individuals with an epilepsy diagnosis code only; c) individuals with two consecutive AED prescription codes only.
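The three algorithms can be sketched as simple predicates over a patient's GP records. This is an illustrative reconstruction only: the record fields (`type`, `drug_class`, `date`) and code values are hypothetical stand-ins, not the SAIL schema or the actual Read code lists used in the study.

```python
def has_diagnosis(records):
    """True if any record carries an epilepsy diagnosis code."""
    return any(r["type"] == "epilepsy_diagnosis" for r in records)

def has_two_consecutive_aed(records):
    """True if two consecutive prescription records are both AEDs."""
    rx = [r for r in sorted(records, key=lambda r: r["date"])
          if r["type"] == "prescription"]
    return any(a["drug_class"] == "AED" == b["drug_class"]
               for a, b in zip(rx, rx[1:]))

def algorithm_a(records):
    """a) epilepsy diagnosis code AND two consecutive AED prescriptions."""
    return has_diagnosis(records) and has_two_consecutive_aed(records)

def algorithm_b(records):
    """b) epilepsy diagnosis code only."""
    return has_diagnosis(records)

def algorithm_c(records):
    """c) two consecutive AED prescription codes only."""
    return has_two_consecutive_aed(records)
```

Each patient is then classified as a case or non-case by whichever algorithm is being evaluated against the reference population.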
We applied the algorithms to all patients and to adults and children separately. For all patients, combining diagnosis and AED prescription codes had a sensitivity of 84% (95% CI 77–90) and a specificity of 98% (95–100) in identifying people with epilepsy; diagnosis codes alone had a sensitivity of 86% (80–91) and a specificity of 97% (92–99); and AED prescription codes alone achieved a sensitivity of 77% (70–83) and a specificity of 73% (65–80). Using AED codes only was more accurate in children, achieving a sensitivity of 88% (75–95) and a specificity of 98% (88–100). This can be explained by the widespread use of AEDs for indications other than epilepsy in adults, which is not the case for children.
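The reported figures follow from comparing each algorithm's classification against the known reference status. A minimal sketch of that computation, with illustrative counts rather than the study's actual tabulations:

```python
def sensitivity_specificity(results):
    """results: list of (predicted_case, true_case) boolean pairs.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    tp = sum(p and t for p, t in results)
    fn = sum((not p) and t for p, t in results)
    tn = sum((not p) and (not t) for p, t in results)
    fp = sum(p and (not t) for p, t in results)
    return tp / (tp + fn), tn / (tn + fp)
```

For example, if an algorithm flagged 126 of the 150 true cases and wrongly flagged 3 of the 150 non-cases, it would score a sensitivity of 0.84 and a specificity of 0.98.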
GP epilepsy diagnosis and AED prescription codes can be used to identify people with epilepsy in anonymised healthcare records in Wales. In children, using AED prescription codes alone is an accurate way to identify epilepsy cases. These results are generalisable to other studies that use UK primary care records.
Ligo: an open source framework for administrative data linking
Ligo is an open source application that provides a framework for managing and executing administrative data linking projects. Ligo provides an easy-to-use web interface that lets analysts select among data linking methods including deterministic, probabilistic and machine learning approaches and use these in a documented, repeatable, tested, step-by-step process.
Objectives and Approach
The linking application has two primary functions: identifying common entities within a dataset (de-duplication) and identifying common entities between datasets (linking). The application is being built from the ground up in a partnership between the Province of British Columbia’s Data Innovation (DI) Program and Population Data BC, with input from data scientists. The simple web interface allows analysts to streamline the processing of multiple datasets in a straightforward and reproducible manner.
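Both functions reduce to grouping or pairing records by entity. The sketch below illustrates the idea with exact matching on a normalised key; the field names are hypothetical and this is not Ligo's API.

```python
def entity_key(record):
    """Normalise the identifying fields into a comparable key."""
    return (record["surname"].strip().lower(), record["dob"])

def deduplicate(dataset):
    """De-duplication: group records within one dataset that share a key."""
    groups = {}
    for r in dataset:
        groups.setdefault(entity_key(r), []).append(r)
    return [g for g in groups.values() if len(g) > 1]

def link(left, right):
    """Linking: pair records across two datasets that share a key."""
    index = {}
    for r in right:
        index.setdefault(entity_key(r), []).append(r)
    return [(l, r) for l in left for r in index.get(entity_key(l), [])]
```

Real linking methods refine the key-comparison step; the two functions' shapes stay the same.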
Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms, with a plugin architecture that supports the simple addition of new formulae. Currently, deterministic approaches to linking have been implemented and probabilistic methods are in alpha testing. A fully functional alpha, including deterministic and probabilistic methods, is expected to be ready in September, with a machine learning extension expected soon after.
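Where deterministic methods require exact agreement on chosen fields, probabilistic methods score partial agreement. A minimal sketch of the probabilistic idea (Fellegi–Sunter style), with made-up weights and threshold, not Ligo's implementation:

```python
# Per-field agreement weights (conceptually log-likelihood ratios);
# the fields and values here are illustrative only.
WEIGHTS = {
    "surname": 4.0,
    "dob": 5.5,
    "postcode": 3.0,
}
DISAGREE = -2.0  # penalty applied when a field disagrees

def match_score(a, b):
    """Sum field-level weights: agreement adds, disagreement subtracts."""
    return sum(w if a.get(f) == b.get(f) else DISAGREE
               for f, w in WEIGHTS.items())

def is_link(a, b, threshold=6.0):
    """Classify a record pair as a link if its score clears the threshold."""
    return match_score(a, b) >= threshold
```

A pair agreeing on all three fields scores 12.5 and is linked; a pair agreeing only on surname scores 0.0 and is not. Tuning weights and thresholds from the data is what distinguishes real probabilistic linkage from this sketch.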
Ligo has been designed with enterprise users in mind. The application is intended to make the processes of data de-duplication and linking simple, fast and reproducible. By making the application open source, we encourage feedback and collaboration from across the population research and data science community.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.