Introduction & Background
Social media data is increasingly recognised as an important source of behavioural data. It can provide insights into patterns of life and into how individuals and groups are feeling. However, many studies of social media’s relationship to mental health and well-being have been limited by weak ground truth: they rely on assumed labels and on data from a single timepoint. As a result, the accuracy of their models at future timepoints cannot be assessed.
Collecting Twitter data from cohort studies offers a solution, because cohorts hold many years of high-quality data that can serve as ground truth. Cohorts in turn benefit from the higher-resolution data provided by social media, which can supplement their traditional data collection methods.
Objectives & Approach
We used Twitter data collected with consent from two generations of the Avon Longitudinal Study of Parents and Children (ALSPAC) (N=656). The data is linked to two surveys, completed in April-May 2020 and May-July 2020, that provide validated outcome measures of anxiety, depression, and general well-being.
We used the LIWC and VADER sentiment algorithms to identify the sentiment categories most strongly associated with each outcome, and used those categories to develop a multiple regression model for each of anxiety, depression and general well-being at the first survey timepoint. The error of these models in predicting the second timepoint allowed us to assess how well each outcome is predicted across demographic groups.
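The workflow described above can be illustrated with a minimal sketch. It is not the study's implementation: the data is synthetic, the function names are hypothetical, and three choices are assumptions made for illustration only, namely Pearson correlation for ranking sentiment categories, ordinary least squares for the multiple regression, and mean absolute error as the error measure compared across groups.

```python
import numpy as np

def select_top_features(X, y, k):
    """Indices of the k columns of X with the largest absolute Pearson correlation with y."""
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:k]

def fit_ols(X, y):
    """Fit a multiple linear regression with an intercept by least squares."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # coef[0] is the intercept

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def mae_by_group(y_true, y_pred, groups):
    """Mean absolute prediction error within each demographic group."""
    return {
        str(g): float(np.mean(np.abs(y_true[groups == g] - y_pred[groups == g])))
        for g in np.unique(groups)
    }

# Synthetic stand-in data (assumption): 200 participants, 6 sentiment-category scores.
rng = np.random.default_rng(0)
n, p = 200, 6
groups = rng.choice(["G0", "G1"], size=n)  # hypothetical demographic groups

def simulate_outcome(X):
    # Only categories 0 and 3 drive the simulated outcome score.
    return 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=len(X))

X_t1 = rng.normal(size=(n, p))  # sentiment scores around the first survey
y_t1 = simulate_outcome(X_t1)   # outcome measured at the first survey
X_t2 = rng.normal(size=(n, p))  # sentiment scores around the second survey
y_t2 = simulate_outcome(X_t2)   # outcome measured at the second survey

top = select_top_features(X_t1, y_t1, k=2)  # choose categories at timepoint 1
coef = fit_ols(X_t1[:, top], y_t1)          # fit the model at timepoint 1
y_pred_t2 = predict(coef, X_t2[:, top])     # predict the timepoint 2 outcome
errors = mae_by_group(y_t2, y_pred_t2, groups)
print(sorted(int(i) for i in top), errors)
```

Fitting only on the first timepoint and scoring only on the second keeps the evaluation out of sample, which is what makes a comparison of error across groups informative.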
Relevance to Digital Footprints
Digital footprint data can complement traditional data sources to provide a more nuanced view of health inequalities. These data are typically quicker to collect than data gathered by traditional methods (census, survey), allowing a more reactive response to emergent issues such as the cost-of-living crisis.
This study illustrates how the collection of digital footprint data can be integrated into existing long-term studies, which in turn can provide multiple points of ground-truth data.
Conclusions & Implications
This study has shown that collecting Twitter data and integrating it into cohort studies is feasible, and that cohort data provides multiple ground-truth options. Such time-series data is important for assessing the feasibility of mental health inference from online behavioural data, which this study shows may vary across personal characteristics.
In future research we plan to link subsequent ALSPAC surveys to provide more ground-truth timepoints, and to explore the temporal stability of predictions and the impact of model drift on performance.
This work is licensed under a Creative Commons Attribution 4.0 International License.