Use of Machine Learning and Linked Population Health Data to Develop Predictive Risk Algorithms for Population Health Decision-Making
Abstract
Introduction
Data from population health surveys, administrative health records and environmental monitoring are increasingly being linked at the individual level. As these linked data become available to health researchers, there is a growing need for methods that can make sense of large, noisy and heterogeneous data and model complex relationships. Applied to these data, machine learning methods have the potential to produce population health risk algorithms that perform better than those developed with traditional statistical approaches.
Objectives and Approach
The objective of this work is to explore the use of machine learning methods for the development, validation and implementation of predictive risk algorithms designed specifically for population health planning purposes. Algorithms to predict risk of dementia and avoidable hospitalizations are in development using the Canadian Community Health Survey, geographic sociodemographic information, administrative health care utilization data and vital statistics. Methods being explored include naïve Bayes, gradient boosting, support vector machines and neural networks.
Results
Risk algorithms for population health should generally prioritize calibration over discrimination because of the implications for resource allocation decisions. Approaches that minimize the risk of overfitting should be used, and reweighting of unbalanced data should be avoided because it distorts the population-level nature of the data. It is important to guard against propagating underlying bias in the data or exacerbating existing health inequities; this risk can be evaluated in part by assessing calibration across relevant population subgroups. Approaches that consider multi-level data structures are needed to appropriately incorporate neighbourhood-level measures with individual-level information. To maximize population health impact and acceptability, model transparency and interpretability should be prioritized.
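As an illustration of the subgroup calibration assessment described above, the following minimal sketch compares observed with expected event counts within each subgroup. It is not the authors' implementation: the function name `calibration_by_subgroup`, the subgroup labels and the toy data are all hypothetical, and the observed-to-expected ratio is only one of several possible calibration summaries.

```python
from collections import defaultdict


def calibration_by_subgroup(y_true, y_pred, groups):
    """Summarize calibration per subgroup (hypothetical helper).

    y_true: observed outcomes (0 or 1)
    y_pred: predicted risks (probabilities)
    groups: subgroup label for each individual
    Returns, per subgroup, the number of individuals, the observed event
    count, the expected event count (sum of predicted risks) and the
    observed-to-expected (O/E) ratio.
    """
    totals = defaultdict(lambda: {"n": 0, "observed": 0.0, "expected": 0.0})
    for outcome, risk, group in zip(y_true, y_pred, groups):
        entry = totals[group]
        entry["n"] += 1
        entry["observed"] += outcome
        entry["expected"] += risk

    report = {}
    for group, entry in totals.items():
        # An O/E ratio near 1 suggests the predicted risks match the
        # observed event count in that subgroup; above 1 suggests
        # under-prediction, below 1 over-prediction.
        if entry["expected"] > 0:
            ratio = entry["observed"] / entry["expected"]
        else:
            ratio = float("nan")
        report[group] = {**entry, "oe_ratio": ratio}
    return report


# Toy data, invented for illustration only.
y_true = [1, 0, 0, 1, 1, 1, 0, 0]
y_pred = [0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25]
groups = ["urban"] * 4 + ["rural"] * 4

report = calibration_by_subgroup(y_true, y_pred, groups)
for group, summary in report.items():
    print(group, summary)
```

In this toy example the predicted risks match the observed events in the hypothetical "urban" subgroup but under-predict them in the "rural" subgroup, which is the kind of pattern that could lead to resources being misallocated if only overall calibration were examined.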
Conclusion
There is tremendous potential for machine learning approaches to leverage large volumes of linked population data to produce predictive risk algorithms that will inform population health decision-making. Future work will explore use of complex environmental remote sensing and built environment data.