How can we be sure that machine learning models don’t inadvertently breach privacy?
Machine learning (ML) is a computing technique for finding new insights in data. Increasingly these models are built using sensitive personal data, so we need to be sure that no individual's data can be accidentally released when the models are deployed. A consortium of universities, funded by DARE UK and led by the University of Dundee, developed mathematical tools to measure disclosure risk, together with high-level guidelines for decision-makers. However, this work has highlighted many unresolved issues in the field, including how the results of such risk assessments should be interpreted.
ML models are growing in popularity, particularly in health, where they can play an important role supporting clinical and operational practice. For example, models are trained to recognise early-stage carcinomas, or to predict demand for a service to improve resource scheduling. It is important that the models are trained on the same sort of data that they will encounter when deployed. Sensitive data such as health data is often made available for research through secure computing facilities called ‘trusted research environments’ (TREs). TREs are considered best practice as they allow researchers great flexibility to analyse data in a controlled environment which prevents unauthorised use or disclosure of the data. TRE staff check outputs for residual ‘disclosure risk’ (for example, a table showing the age distribution of patients might reveal the illness associated with an elderly individual) before releasing results into the public domain. Increasingly, there is a desire for ML models to be trained on data only available within TREs. However, TREs are still learning how to securely support such research.
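To make this concrete, the kind of output check described above can be sketched in a few lines. This is a minimal illustration of a threshold-style rule, not any TRE's actual policy: the frequency table, the cell labels, and the threshold of 10 are all invented for the example.

```python
# Illustrative sketch of a "threshold rule" output check of the kind a TRE
# checker might apply to a frequency table before release. The table and
# the threshold of 10 are assumptions for illustration, not a real policy.

def flag_small_cells(freq_table, threshold=10):
    """Return the cells whose count falls below the release threshold."""
    return {cell: n for cell, n in freq_table.items() if n < threshold}

# Hypothetical age distribution of patients with a given diagnosis.
age_counts = {"18-39": 124, "40-64": 310, "65-79": 57, "80+": 3}

risky = flag_small_cells(age_counts)
print(risky)  # {'80+': 3} -- a cell this small could identify individuals
```

For a simple table, a rule like this is easy to state and easy to apply; the difficulty discussed below is that ML model outputs offer no equally obvious thing to count.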
ML models can create new risks. First, ML models generate much more complex results than traditional statistical analysis, so it is no longer possible to eyeball the results and decide “Yes, this is/isn’t disclosive”. Second, certain types of ML models hold within them a copy of some or all of the data they were trained on. Third, traditional outputs can be checked with simple rules: for example, blocking results which show that everyone in a group has the same genetic marker. ML models produce so many parameters that it is almost certain that some combinations are unique, but does this matter? Put another way, a needle in a sofa is a problem. A needle in a haystack is theoretically the same problem, but do we need to check the haystack before we sit on it? Applying such simple rules to ML models can severely reduce their usefulness.
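The second risk, that a model can “hold within it” its training data, can be illustrated with the simplest possible case. A nearest-neighbour classifier is nothing more than the stored training set, so releasing the fitted model releases the raw records. The class, the patient features, and the labels below are all invented for illustration; real models memorise in subtler ways, but the principle is the same.

```python
# A minimal sketch of why some model types hold a copy of their training
# data: a 1-nearest-neighbour classifier *is* its training set, so anyone
# who receives the fitted model receives the records. All data invented.

class OneNearestNeighbour:
    def fit(self, X, y):
        # "Training" simply stores the sensitive records verbatim.
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        # Predict the label of the closest stored record (squared distance).
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in self.X]
        return self.y[dists.index(min(dists))]

patients = [(54, 1.2), (61, 3.4), (47, 0.8)]   # hypothetical patient features
labels = ["benign", "malignant", "benign"]

model = OneNearestNeighbour().fit(patients, labels)
print(model.predict((60, 3.0)))  # malignant

# Anyone given the model object can read the training data straight back out:
print(model.X)  # [(54, 1.2), (61, 3.4), (47, 0.8)]
```

Most deployed ML models do not store records this literally, which is precisely why an output checker cannot simply inspect the model and see the needle.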
That presents a problem for TREs, which are keen (and well suited) to support ML modelling but face making decisions on outputs without clear, unambiguous rules. Without understanding what disclosure could theoretically happen, or even what disclosure means in this context, TREs risk either releasing raw or disclosive data from the safe environment, or being unfit for purpose and closing the door to this kind of analysis. This is likely preventing a productive partnership between TREs and ML modellers.
To allow ML modelling to fully exploit TRE infrastructure, urgent research is needed in both technical areas (better understanding the types of ML models and their risks) and operational practices (designing controls and checks, and helping TRE managers develop their decision-making processes).
Felix Ritchie, Professor of Applied Economics and Director of Bristol Centre for Economics and Finance, University of the West of England
Ritchie, F., Tilbrook, A., Cole, C., Jefferson, E., Krueger, S., Mansouri-Bensassi, E., Rogers, S. and Smith, J. (2023) “Machine learning models in trusted research environments – understanding operational risks”, International Journal of Population Data Science, 8(1). doi: 10.23889/ijpds.v8i1.2165.