Managing Emerging Data Types in the NGLMS

IJPDS (2017) Issue 1, Vol 1:248, Proceedings of the IPDLN Conference (August 2016)

James Farrow
Published online: Apr 18, 2017


ABSTRACT


Objectives
We describe the data management approach of the Next Generation Linkage Management System (NGLMS), built for SA.NT DataLink in Adelaide, Australia. The NGLMS is a bespoke system built on freely available open source components. It stores a ‘more natural’ representation of linked records explicitly in a graph database, using a graph structure in the computer science sense: records are vertices and relationships are edges between vertices.


Approach
The NGLMS is designed to manage linked data effectively and permit fast individual cluster extraction while retaining rich relationship information. It holds probabilistic and statically linked data by storing all significant pair-wise relationships between records as edges in a graph, allowing clustering with different parameters to be performed dynamically. Records are heterogeneous and may contain different data types: birth records, hospital separations, census data, pharmaceutical prescriptions, educational data. The relationships between records are also heterogeneous and may represent arbitrary relationships, not just a probabilistic record similarity. Examples include familial (parent/child) relationships, tribal kinship structures, genomic (and other omic) information, employer/employee relationships, educational information, living arrangements, and census information. Storing this information allows for richer queries than just ‘do these records represent the same entity’. For example, a single rich query to the database could be ‘find all records of all siblings’, ‘create genealogies based on birth information’, ‘create household groups based on census/cohabitation information’, or ‘find employees working in areas affected by recent floods with hospitalisations during that time period’.
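The idea of keeping every significant pair-wise relationship as an edge and deciding clusters only at query time can be illustrated with a minimal sketch. This is not the NGLMS implementation: the record identifiers, the `similarity` and `sibling` edge labels, the weights, and the `clusters` function are all hypothetical, and a plain in-memory union-find stands in for the graph database.

```python
from collections import defaultdict

# Hypothetical edge list: (record_a, record_b, relationship_type, weight).
# All identifiers and weights are invented for illustration.
EDGES = [
    ("birth:1", "hosp:7", "similarity", 0.95),
    ("hosp:7", "census:3", "similarity", 0.60),
    ("birth:2", "hosp:9", "similarity", 0.88),
    ("birth:1", "birth:2", "sibling", 1.0),  # not a same-entity edge
]


def clusters(edges, threshold):
    """Group records joined by 'similarity' edges at or above threshold."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, kind, weight in edges:
        # Register both vertices so unlinked records become singleton clusters.
        find(a)
        find(b)
        if kind == "similarity" and weight >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for record in parent:
        groups[find(record)].add(record)
    return sorted(sorted(group) for group in groups.values())
```

Because the edges are stored rather than a fixed cluster assignment, calling `clusters(EDGES, 0.9)` and `clusters(EDGES, 0.5)` yields stricter or looser groupings from the same data, and the `sibling` edge remains available for other queries without affecting clustering.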


Results
We present details of the loading of birth and perinatal data incorporating parent (mother and father) relationships for some South Australian datasets and the technical configuration of the NGLMS to support this. We discuss the queries made possible as a result. Rich non-traditional data is stored in the same manner as probabilistic record similarities and has allowed clustering queries which mix explicit deterministic statements about the data and probabilistic statements concerning record relationships.
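A query that mixes explicit deterministic statements with probabilistic ones, such as ‘find all records of all siblings’, might look like the following sketch. It assumes hypothetical `SIMILARITY` edges (probabilistic) and `MOTHER_OF` edges (deterministic, mother record to birth record); none of these names, identifiers, or weights come from the NGLMS itself.

```python
from collections import defaultdict, deque

# Hypothetical probabilistic edges: (record_a, record_b, weight).
SIMILARITY = [
    ("birth:1", "hosp:7", 0.95),
    ("birth:2", "hosp:9", 0.91),
    ("mum:a", "mum:b", 0.93),
]
# Hypothetical deterministic edges: (mother_record, child_birth_record).
MOTHER_OF = [("mum:a", "birth:1"), ("mum:b", "birth:2")]


def entity_of(similarity, threshold):
    """Label each record with its cluster, using edges at or above threshold."""
    adj = defaultdict(set)
    for a, b, weight in similarity:
        adj[a], adj[b]  # register both vertices
        if weight >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    label = {}
    for start in sorted(adj):
        if start in label:
            continue
        label[start] = start
        queue = deque([start])
        while queue:  # breadth-first walk of one connected component
            for nxt in adj[queue.popleft()]:
                if nxt not in label:
                    label[nxt] = start
                    queue.append(nxt)
    return label


def sibling_records(record, similarity, mother_of, threshold):
    """Return all records of entities sharing a mother entity with `record`."""
    label = entity_of(similarity, threshold)

    def ent(r):
        return label.get(r, r)  # unclustered records are their own entity

    members = defaultdict(set)
    for r in set(label) | {x for pair in mother_of for x in pair} | {record}:
        members[ent(r)].add(r)
    me = ent(record)
    mothers = {ent(m) for m, c in mother_of if ent(c) == me}
    siblings = {ent(c) for m, c in mother_of if ent(m) in mothers} - {me}
    return sorted(r for s in siblings for r in members[s])
```

In this toy data the two mother records are only linked probabilistically, so whether `birth:2` counts as a sibling of `birth:1` depends on the threshold chosen at query time, which is the effect of determining clusters late.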


Conclusion
Rich queries over data may be expressed by storing rich heterogeneous information about records and relationships explicitly as a graph and by determining clusters late in the extraction process. Modern graph database technologies make this effective even for datasets containing tens to hundreds of millions of records and billions of edge relationships.


Objectives

“Clearly, details about an individual's mental health, for example, are generally much more ‘sensitive’ than whether they have a broken leg.” UK Information Commissioner's Office

There is a perceived wisdom, based on issues such as social taboos, religious sensitivities, or financial implications linked to health status, that some health data are more sensitive than others. This distinction is present in many of the regulatory interpretations of privacy law (e.g. the UK Information Commissioner's Office interpretation of the EU Data Directive, illustrated above), and is factored into the thinking of ethics and other regulatory decision-making committees. However, these particularly ‘sensitive’ data are defined at a regulatory level in broad terms (e.g. mental health), yet must be implemented by researchers in precise terms. In 2013 our longitudinal research study was given approval by the UK Secretary of State for Health to access identifiable patient health records, with the exception of those relating to mental health, sexual health or termination of pregnancy. Our objective therefore was to develop a generalisable informatics approach which enabled us to filter out sensitive records at the point of extraction.

Approach

We developed a methodology based on the Cochrane systematic review approach. Firstly, using internationally recognised definitions of health concepts and reference texts (e.g. the British National Formulary drug manual), we identified keywords associated with sensitive health events (including symptom and diagnostic terms, drug and appliance codes, and community and secondary care references). Secondly, by data-mining code terminologies, using both code terms and information embedded within the structure of the schema itself, we identified code values relating to these keywords. Thirdly, we refined our results by filtering out spurious matches via manual review. Finally, the resulting code lists were cross-referenced with other terminologies to ensure interoperability.
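The keyword search, structural expansion, and manual-review steps can be sketched as follows. The terminology, codes, keywords, and rejection list are invented for illustration and are not real Read codes; the sketch also assumes, purely for the example, that a code's descendants share its prefix once trailing padding dots are removed.

```python
# Toy terminology: code -> term. Codes and terms are invented for
# illustration only; they are not taken from any real coding scheme.
TERMINOLOGY = {
    "A1...": "Mental disorder",
    "A11..": "Anxiety state",
    "A12..": "Panic attack",
    "B1...": "Fracture of leg",
    "B2...": "Stress fracture",
}
KEYWORDS = {"mental", "stress"}   # hypothetical keywords from reference texts
REJECTED = {"B2..."}              # hypothetical spurious hit found on manual review


def candidate_codes(terminology, keywords):
    """Match keywords against terms, then add codes beneath each hit."""
    hits = {
        code
        for code, term in terminology.items()
        if any(keyword in term.lower() for keyword in keywords)
    }
    expanded = set(hits)
    for hit in hits:
        stem = hit.rstrip(".")  # assumed: hierarchy is encoded in the prefix
        expanded |= {code for code in terminology if code.startswith(stem)}
    return expanded


def code_list(terminology, keywords, rejected):
    """Candidate codes minus those rejected by manual review."""
    return sorted(candidate_codes(terminology, keywords) - rejected)
```

Here the structural expansion picks up `A11..` and `A12..`, whose terms contain no keyword, while the manual-review list removes the spurious `stress` match on a fracture code.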

Results

We produced separate definitions of mental health and sexual health events initially using Read codes. Using NHS cross-reference tables we were able to translate Read observation and diagnostic codes to the SNOMED CT vocabulary, but were unable to translate Read drug codes into the SNOMED/DM+D vocabulary.

Conclusion

We have demonstrated a systematic and partially interoperable approach to defining ‘sensitive’ health information. However, any such exercise is likely to include decisions which will be open to interpretation and open to change over time. As such, the application of this technique should be embedded within an appropriate governance framework which can accommodate misclassification while minimising potential patient harm.
