Developing a population data science approach to assess increased risk of COVID-19 associated with attending large events

Mark Drakesmith
Gemma Hobson
Gareth John
Emily Stegall
Ashley Gould
John Parkinson
Daniel Rhys Thomas


In summer 2021, as rates of COVID-19 decreased and social restrictions were relaxed, live entertainment and sporting events were resumed. In order to inform policy on the safe re-introduction of spectator events, a number of test events were organised in Wales, ranging in setting, size and audience.

To design and test a method to assess whether test events were associated with an increase in risk of confirmed COVID-19, in order to inform policy.

We designed a cohort study with fixed follow-up time and measured relative risk of confirmed COVID-19 in those attending two large sporting events. First, we linked ticketing information to individual records on the Welsh Demographic Service (WDS), a register of all people living in Wales and registered with a GP, and identified NHS numbers for attendees. Where NHS numbers were not found we used combinations of other identifiers such as email, name, postcode and/or mobile number. We then linked attendees to routine SARS-CoV-2 test data to calculate positivity rates in people attending each event for the period one to fourteen days following the event. We selected a comparison cohort from WDS for each event, individually matched by age band, gender and locality of residence. As many people attended events in family groups we explored the possibility of also matching on household clusters within the comparison group. Risk ratios were then computed for the two events.

We successfully assigned NHS numbers to 91% and 84% of people attending the two events respectively. Other identifiers were available for the remainder. Only a small number of attendees (<10) had a record of confirmed COVID-19 following attendance at each event (14-day cumulative incidence: 36 and 26 per 100,000, respectively). There was no evidence of significantly increased risk of COVID-19 at either event. However, the event that did not include pre-event testing among its mitigations had a higher risk ratio (3.0 compared to 0.3).

We demonstrate the potential for using population data science methods to inform policy. We conclude that, at that point in the epidemic, and with the mitigations that were in place, attending large outdoor sporting events did not significantly increase risk of COVID-19. However, these analyses were carried out between epidemic waves when background incidence and testing rates were low, and need to be repeated during periods of greater transmission. Having a mechanism to identify attendees at events is necessary to calculate risk, and the feasibility and acceptability of data sharing should be considered.


Mass events have been associated with increased transmission of communicable disease [1–3]. The higher population density and movement of people associated with these events may contribute to this increased transmission. Social and environmental factors, such as temperature, humidity, venue type, crowd size and mood, age, alcohol and drug use, can also increase risk. It has therefore become accepted good practice to carry out public health surveillance during mass gathering events [4].

The COVID-19 pandemic has had a major impact on the staging of mass gatherings worldwide [5], including sporting events [6]. However, in summer 2021, as COVID-19 case rates decreased and social restrictions were reduced, policy makers needed to make decisions about the resumption of live entertainment and sporting events [7]. Some countries piloted the resumption of events by staging pilot or ‘test events’, events that fell outside extant COVID-19 regulations but were staged in order to test the safety of progressing to reduced or no restrictions on indoor or outdoor gathering [8, 9].

Policy decisions on non-pharmaceutical interventions for prevention and control of COVID-19 require good evidence [5]. In the United Kingdom, health is a responsibility devolved to the constituent nations and the work described in this article was prompted by a gap in evidence identified by a Welsh Government group set up in Summer 2021 to assess the safety of a series of ‘test events’ [10]. In order to assess the safety of re-introducing spectator events, nine ‘test events’ were organised in Wales during May-June 2021, ranging in setting, size and type of audience [11]. The primary objective of staging these events was to assess whether safe events could be delivered in practice by event organisers with risk mitigation in place, such as physical distancing in audiences, wearing of face masks in enclosed spaces and pre-event health checks. Whilst the emphasis of the test events was on operational issues, such as event management and compliance of staff and customers with mitigation measures, questions were raised about how best to assess epidemiological risk associated with attending events. Therefore, a secondary objective was to assess whether attendance at live events was associated with increased transmission of SARS-CoV-2, and which types of events, if any, present a higher risk.

In the present study we aim to demonstrate the feasibility of addressing the secondary objective by linking ticketing information to routine public health surveillance data for the two largest test events, in order to compare cases in attendees with a reference population. Linking these data sources permits estimation of relative and absolute risk associated with event attendance, which would otherwise not be possible due to the absence of an appropriate denominator. Demonstrating this methodology makes future analysis of outbreaks associated with large events, such as sporting events or music festivals, feasible.


Study design

Cohort study with fixed follow-up period, with a matched comparison (unexposed) cohort.


Both events were in Wales, and attendance was restricted to residents of Wales, which has a population of 3.13 million [12].

At the time of the test events, COVID-19 incidence was decreasing following a peak in winter 2020-2021. Key indicators for all Wales in the 2-week period following event 1 (24th May–6th June 2021) were: cumulative 14-day incidence of 21.6 per 100,000, testing rate of 1,788 per 100,000 and positivity of 1.2%. In the 2-week period following event 2 (8th–20th June 2021) key indicators were: cumulative 14-day incidence of 61.4 per 100,000, testing rate of 2,266 per 100,000 and positivity of 2.7% [13].

Restrictions that had been in place over the winter were eased during this period [14]. On 11th May indoor hospitality and entertainment was re-opened, and international travel resumed on the 17th May. However, the Welsh government advised football fans against international travel. From 4th June, groups of 30 people were permitted to meet outdoors and attend larger outdoor events. A decision to re-open large sporting events had not yet been taken, pending the outcome of the test events programme.

Description of events

Of the 9 organised test events, the two largest events were deemed suitable for analysis of risk. The other events were smaller (<1,000 attendees) and were not deemed feasible for the analysis. The two events analysed are referred to as ‘event 1’ and ‘event 2’, and their characteristics are given in Table 1. Both events were football matches in Wales. One was a Premiership play-off match held at Swansea Liberty Stadium. The other was a friendly international in advance of the, held at Cardiff City Stadium. Both had COVID-19 risk mitigations in place. Event 2 had an additional requirement for proof of a negative SARS-CoV-2 test in advance of the event.

| Event | Date | Venue | Capacity | Mitigations in place | Attendees recorded |
|---|---|---|---|---|---|
| 1: Premiership football play-off | 22 May 2021 | Liberty Stadium, Swansea | 3,000 | Social distancing at 2m, wearing face masks, temperature checks on arrival, controlled ingress and egress and selling tickets to household bubbles. No testing requirement. | Ticket buyers (not necessarily all ticket holders), plus staff and hospitality |
| 2: International men’s football | 5 June 2021 | Cardiff City Stadium | 6,500 | As above plus PCR test up to 5 days prior to the event or home LFD test up to 24 hours prior to the event | All ticket holders. No staff or hospitality. |

Table 1: Description of the two test events.

Following the events, the event organisers then sent lists of all attendees, including paying spectators, guests and staff where available (see Table 1) to Public Health Wales via a secure file sharing platform.

Record linkage

Record linkage of the attendee list was performed in 3 stages (a flow diagram of record linkage for each event is provided in SI):

Linking to NHS numbers in WDS

Where name, address, and date of birth or age were available, individual attendees were matched with the Welsh Demographic Service (WDS), an administrative register of all Welsh residents registered with a general practice [15], and each individual’s NHS number as recorded in WDS was identified. The NHS number is a unique number assigned to all users of the National Health Service in the United Kingdom.

Linking to SARS-CoV-2 test data

Individual event attendees were linked with Public Health Wales records of all people in Wales who have received a polymerase chain reaction (PCR) test for SARS-CoV-2. In the first instance, matching was performed using the NHS number obtained from stage 1. Where an NHS number was not available, matching was performed iteratively on combinations of other available identifiers, in the following order: name, postcode and date of birth or age. Matches were made where the sample collection date for the PCR test lay between 14 days before and 14 days after the event. The matches were then manually checked for obvious mismatches, which were excluded from further analysis.

Linking to contact tracing data

Additional matches were found by linking to the Test Trace Protect (TTP) contact tracing data management system, which, in addition to name, postcode and date of birth, also holds contact e-mail and mobile phone number for cases. Matching was performed iteratively on combinations of available identifiers in the following order: name, mobile number, e-mail address, postcode and date of birth or age. Matches were made where the laboratory test result date was between 14 days before and 14 days after the event. The matches were then manually checked for obvious mismatches, which were excluded from further analysis.
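The staged linkage described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the field names, the exact identifier combinations in `KEY_SETS` and the rule of accepting only a single unambiguous hit are all assumptions made for illustration, since the text lists the identifiers and their order but not the precise combinations used.

```python
from datetime import date, timedelta

# Hypothetical identifier combinations, tried in order. The source gives the
# identifiers and their order, but not the exact combinations that were used.
KEY_SETS = [
    ("nhs_number",),
    ("name", "postcode", "dob"),
    ("name", "mobile"),
    ("name", "email"),
]


def link_record(attendee, test_records, event_date, window_days=14):
    """Return (record, keys) for the first key set that yields exactly one
    match inside the +/- window_days window, otherwise (None, None)."""
    start = event_date - timedelta(days=window_days)
    end = event_date + timedelta(days=window_days)
    in_window = [r for r in test_records if start <= r["sample_date"] <= end]
    for keys in KEY_SETS:
        # Skip a key set if the attendee lacks any of its identifiers.
        if not all(attendee.get(k) for k in keys):
            continue
        hits = [r for r in in_window
                if all(r.get(k) == attendee[k] for k in keys)]
        # Accept only an unambiguous hit; otherwise try the next key set.
        if len(hits) == 1:
            return hits[0], keys
    return None, None
```

In this sketch, records that match on no key set, or ambiguously on every key set, stay unlinked, which mirrors the role of the manual check for obvious mismatches described in the text.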

Exposed cohort

The exposed cohort was identified from the linked attendee list obtained in the previous step. Any records with duplicate NHS numbers were removed. To be deemed ‘matchable’ for identification of the unexposed cohort, a record had to belong to a resident of Wales and have a minimal set of matching variables on WDS (see ‘Unexposed cohorts’). All records not deemed matchable were removed from the exposed cohort.

Unexposed cohorts

We tested two different methods of sampling the unexposed cohort. In both methods, we selected comparison cohorts from the general population of Wales, that is: a group who could have attended the event but did not. In order to take into account the demographic profile of those attending each event and control for background community transmission, we selected from the WDS, individuals who were matched to the exposed cohort on a range of demographic variables. We aimed to match the unexposed to exposed cohort with a 3:1 ratio. This ratio was chosen as general guidance is that matching ratios of greater than 3:1 offer negligible gains in statistical power [16].

Unexposed cohort A (without household cluster matching)

Individuals were matched to members of the exposed cohort for age, sex, area of residence and neighbourhood deprivation. For event attendees who had a linkable identifier (NHS number), individual matching was carried out on the following criteria: age (within 5 years), sex, Middle Super Output Area (MSOA) of residence, and deprivation quintile (within +/-1 quintile). Three unexposed individuals were selected for each exposed individual. A full 3:1 unexposed-to-exposed ratio was achieved for both events.
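The individual matching criteria for unexposed cohort A can be sketched in Python as follows. The `Person` type, the function names and the greedy, first-come selection without replacement are assumptions for illustration only; the text does not state how candidates were ordered or sampled, and the MSOA codes in any usage are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    pid: str        # hypothetical record identifier
    age: int
    sex: str
    msoa: str       # Middle Super Output Area of residence
    quintile: int   # deprivation quintile, 1 to 5


def is_match(exposed, candidate, age_tol=5, quintile_tol=1):
    """Criteria from the text: same sex and MSOA, age within 5 years,
    deprivation quintile within +/-1."""
    return (candidate.sex == exposed.sex
            and candidate.msoa == exposed.msoa
            and abs(candidate.age - exposed.age) <= age_tol
            and abs(candidate.quintile - exposed.quintile) <= quintile_tol)


def match_unexposed(exposed_cohort, register, ratio=3):
    """Greedily pick up to `ratio` unexposed matches per exposed person.
    Assumption: attendees cannot serve as controls and each control is
    used at most once."""
    exposed_ids = {p.pid for p in exposed_cohort}
    used = set()
    matches = {}
    for person in exposed_cohort:
        chosen = []
        for cand in register:
            if len(chosen) == ratio:
                break
            if cand.pid in exposed_ids or cand.pid in used:
                continue
            if is_match(person, cand):
                chosen.append(cand.pid)
                used.add(cand.pid)
        matches[person.pid] = chosen
    return matches
```

A production version would more plausibly sample candidates at random rather than in register order, so that the comparison cohort is not biased towards whichever records happen to appear first.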

Unexposed cohort B (with household cluster matching)

As many people attended events in family groups, we attempted to simulate similar levels of clustering within the comparison/control group, in order to account for commonalities within household clusters, including risk factors for within-household infection. We used unique property reference numbers (UPRNs), which are recorded against the vast majority of individuals on the WDS, to group individuals into a family or household. We used exactly the same individual-level matching criteria as for unexposed cohort A. However, if, for example, a 50-year-old male attended one of the events with his two daughters, aged 12 and 17 years, our controls would be drawn from a similar household (within the same MSOA and a deprivation quintile within +/-1 quintile), in which there was also a male aged 50 (+/-5) plus two females aged 12 (+/-5) and 17 (+/-5), respectively. This limited the number of controls that it was possible to find, especially where there were 4 or 5 attendees from the same household. Therefore, unlike in cohort A, we were unable to achieve a full 3:1 unexposed-to-exposed ratio. Actual ratios achieved were 2.5:1 and 2.6:1 for events 1 and 2, respectively.

Case definition

Cases were defined as individuals who were linked to a positive SARS-CoV-2 PCR test result in the period 1 to 14 days after the date of the event. Non-cases were defined as individuals who either were linked to a negative SARS-CoV-2 PCR test result or were not linked to a test result. We make the assumption that absence of linkage is equivalent to absence of a test result.
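The case definition amounts to a simple date-window check, sketched below in Python. The function name and arguments are hypothetical, and the dates are assumed to be those of positive PCR samples already linked to the individual.

```python
from datetime import date


def is_case(event_date, positive_sample_dates, first_day=1, last_day=14):
    """True if any linked positive PCR sample falls 1 to 14 days after the
    event. An empty list (no linked positive test) gives a non-case."""
    return any(first_day <= (d - event_date).days <= last_day
               for d in positive_sample_dates)


event_1 = date(2021, 5, 22)
print(is_case(event_1, [date(2021, 5, 22)]))  # day 0, outside the window
print(is_case(event_1, [date(2021, 6, 5)]))   # day 14, inside the window
```

Excluding day 0 means that a positive sample taken on the day of the event itself does not count towards the post-event case total.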


All analysis was carried out in R v4.0.3 [17]. All analyses were performed for the two unexposed cohort definitions, separately.

Descriptive statistics

Attendees at each event were described by age, sex and area of residence. Median age and sex ratios were calculated, and geographic distribution of attendees was plotted on a point map. 14-day cumulative incidence (number of new cases per 100,000) over the 14 days post-event period was calculated for the exposed and unexposed cohorts.
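As a worked illustration of the cumulative incidence calculation, the short Python sketch below expresses new cases per 100,000 people. The function name is hypothetical. The example figures (one validated case among 2,745 and 3,918 recorded attendees) are taken from the Results; using recorded attendees as the denominator is an inference from the reported values of 36 and 26 per 100,000, not something the text states explicitly.

```python
def cumulative_incidence_per_100k(cases, population):
    """New cases over the follow-up period, per 100,000 people."""
    return cases / population * 100_000


# One validated post-event case at each event (figures from the Results).
print(round(cumulative_incidence_per_100k(1, 2745)))  # event 1
print(round(cumulative_incidence_per_100k(1, 3918)))  # event 2
```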

Analysis of risk

Our first approach was to quantify relative and absolute risk difference. Risk ratios with 95% confidence intervals were calculated using unconditional maximum likelihood estimation with confidence intervals calculated using the Wald normal approximation, using the riskratio function from the epitools R package. Equivalent risk differences (RD) were also calculated from the estimated risk ratios.
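The risk ratio and its Wald interval can be reproduced by hand. The Python sketch below is an illustration rather than the epitools code: it uses the standard large-sample standard error of the log risk ratio, sqrt(1/a - 1/n1 + 1/c - 1/n0), and the function and argument names are assumptions. The counts in the example are the event 1, unexposed cohort A figures reported in the Results (1 case in 2,412 exposed, 1 case in 7,236 unexposed).

```python
import math


def risk_ratio_wald(a, n1, c, n0, z=1.96):
    """Risk ratio with a Wald interval on the log scale, plus the risk
    difference. a/n1 is the exposed risk, c/n0 the unexposed risk.
    Requires a > 0 and c > 0."""
    risk_exposed = a / n1
    risk_unexposed = c / n0
    rr = risk_exposed / risk_unexposed
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    rd = risk_exposed - risk_unexposed
    return rr, lower, upper, rd


rr, lower, upper, rd = risk_ratio_wald(1, 2412, 1, 7236)
print(f"RR = {rr:.2f} ({lower:.2f} to {upper:.1f}), RD = {rd:.2%}")
```

With a single case in each group the interval is necessarily very wide, and the estimate cannot be computed at all when the unexposed group has zero cases, which is the situation reported for event 1 with unexposed cohort B.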

When computing risk ratios, where matched cohorts are treated as independent cohorts, there is a possibility of a matching bias being introduced [18–20]. This may be a particular issue for unexposed cohort B, where matching is performed on household clusters with varying sizes. A common way to control for this is with conditional logistic regression (CLR), which provides an estimate of odds ratios, controlling for matching. Although CLR is usually applied in the context of case-control studies, it is also applicable to the analysis of matched cohort studies [19]. CLR was performed by fitting an equivalent time-independent Cox model [21] with strata defined by the matched groups, as implemented in the clogit function in the survival R package [22]. 95% confidence intervals were calculated using the Wald normal approximation.
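To show how CLR conditions on the matched sets, the Python sketch below works through a deliberately simplified special case rather than the general clogit fit. It assumes every informative matched set has one exposed and m unexposed members and contains exactly one case; this structure is an assumption for illustration and is not stated in the text. Under it, a set whose case is exposed contributes e^b / (e^b + m) to the conditional likelihood and a set whose case is unexposed contributes 1 / (e^b + m), so the maximum is at an odds ratio of m * a / c, where a and c count the two kinds of set.

```python
import math


def clr_single_case_sets(a, c, m=3, z=1.96):
    """Conditional MLE of the odds ratio for 1:m matched sets that each
    contain exactly one case.
    a = sets whose case is the exposed member,
    c = sets whose case is an unexposed member (both must be > 0).
    Log-likelihood: a*b - (a + c) * log(exp(b) + m)."""
    odds_ratio = m * a / c
    # Observed information (negative second derivative) at the maximum.
    information = (a + c) * odds_ratio * m / (odds_ratio + m) ** 2
    se = 1 / math.sqrt(information)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Event 1, cohort A: one exposed case and one unexposed case, assumed to
# sit in different matched sets.
print(clr_single_case_sets(1, 1))
```

Under this assumption the sketch returns an odds ratio of 3.00 for event 1 and, with a = 1 and c = 10, 0.30 for event 2, in line with the reported CLR estimates. That agreement is consistent with, but does not confirm, the assumed structure of the matched sets.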


Cases identified in the exposed cohort through record linkage were followed up by manual search of the TTP contact tracing management system. This system is used to record the work of contact tracers including responses of cases to the contact tracing questionnaire covering their movements in the days before and after their infection onset. Searching was undertaken based on name and address. Contact tracers’ notes were checked both for confirmation of attendance at the events and for other exposures which may have resulted in infection.


As part of the test event programme in Wales, the organisers of these events were approached prior to the event and, in partnership with Public Health Wales and Welsh Government, incorporated a statement of consent to share attendee details with Public Health Wales for the purpose of evaluating the safety of these events. Providing consent was a condition of attendance which was agreed to at the point of purchase, under special arrangement with Welsh Government to assess the safety of holding large events. The following privacy statement was added to the terms and conditions when people registered for tickets to attend test events:

“Sharing your personal information

This event is one of a series of Welsh Government Test Event Series. Your personal data may be shared with Public Health Wales and linked to routine data on COVID-19 test results for the purposes of evaluating the test event. Public Health Wales is a NHS Trust and data will be held in accordance with the General Data Protection Regulations and the usual NHS information governance conditions. For further information on how we use your data please see Public Health Wales privacy notice:


Description of attendees

For event 1, details of a total of 2,745 attendees were recorded; for event 2, details of a total of 3,918 attendees were recorded. People attending these events were predominantly male (Figure 1a): 81% of those attending event 1 and 85% of those attending event 2 were male. People attending event 1 had a bimodal and generally older age distribution: median age was 43 years (range: 2, 93) as compared to 35 years (range: 1, 90) for event 2. Attendees at event 1 were predominantly resident in South West Wales, most residing close to the event venue, whereas attendees at event 2 had a wider geographic distribution (Figure 1b).

Figure 1: (a) Age and sex distributions of event attendees. (b) Geographic distribution of event attendees.

Selection of exposed cohort and linkage results

For event 1, NHS numbers were ascertained for 2,499 (91%) of the 2,745 attendees recorded (Table 2). A total of 97 (4%) attendees were successfully linked to a SARS-CoV-2 PCR test result. For the 14-day period prior to the event, 24 tests were linked, one of which was positive. For the 14-day post-event period, we linked 73 PCR tests (3% of attendees), of which one was positive. This represents a 14-day cumulative incidence of 36 per 100,000 attendees in the post-event period.

For event 2, details of a total of 3,918 attendees were recorded. NHS numbers were ascertained for 3,295 (84%) (Table 2). A total of 2,658 (68%) attendees were successfully linked to a SARS-CoV-2 PCR test result. The majority of these were tests obtained shortly before the event to comply with requirements to have a negative test result prior to attendance (Figure 2). In the pre-event period, we identified 2,633 tests, of which 5 were positive. For the 14 days post-event, we linked 25 PCR tests (0.6% of attendees), of which 2 were positive. However, only one of these was successfully validated (see ‘Validation’). This represents a 14-day cumulative incidence of 26 per 100,000 attendees in the post-event period.

| Measure | Event 1 | Event 2 |
|---|---|---|
| Attendees recorded | 2,745 | 3,918 |
| Attendees who were matchable | 2,412 | 3,277 |
| Number of attendees with a PCR test in 14 days post-event (% of linkable attendees) | 73 (3%) | 25 (0.6%) |
| Number of confirmed attendees with a positive PCR test in 14 days post-event (% of post-event tests positive) | 1 (1.4%) | 1 (8%) |
| Proportion with a positive SARS-CoV-2 test within 14 days of the event: attendees (exposed) | 1/2,412 (0.04%) | 1/3,277 (0.03%) |
| Proportion with a positive SARS-CoV-2 test within 14 days of the event: comparison cohort A (non-exposed) | 1/7,236 (0.01%) | 10/9,831 (0.10%) |
| Proportion with a positive SARS-CoV-2 test within 14 days of the event: comparison cohort B (non-exposed) | 0/5,964 (0.00%) | 2/8,535 (0.02%) |
| Risk ratio and risk difference (95% confidence intervals; significance): comparison cohort A | RR = 3.00 (0.18–47.9); RD = 0.03% (–0.22%–0.66%); p = 0.50 | RR = 0.30 (0.04–2.34); RD = –0.07% (–0.79%–0.23%); p = 0.23 |
| Risk ratio and risk difference (95% confidence intervals; significance): comparison cohort B | RR not available; RD* = 0.04% | RR = 1.30 (0.12–14.35); RD = 0.007% (–0.26%–0.33%); p = 0.81 |
| Odds ratio from CLR (95% confidence intervals; significance): comparison cohort A | OR = 3.00 (0.19–48.0); p = 0.44 | OR = 0.30 (0.04–2.34); p = 0.25 |
| Odds ratio from CLR (95% confidence intervals; significance): comparison cohort B | Not available | Not available |

Table 2: Numbers of linkable attendees, proportions of positive SARS-CoV-2 test in exposed and non-exposed groups and associated risk ratios following each event. *Due to zero unexposed cases, only absolute risk difference can be estimated.

Figure 2: SARS-CoV-2 tests by date sample taken, in the period 14 days before to 14 days after each event.

Selection of unexposed cohorts and linkage results

For unexposed cohort A, three unexposed individuals were selected for each attendee traced to an NHS number, based on the criteria described earlier. For event 1, there were 2,412 (88%) attendees who met these criteria. There were 3,277 (84%) for event 2.

For event 1, of the 7,236 people in unexposed cohort A, there was just one individual with a positive SARS-CoV-2 PCR test result, and for event 2, of the 9,831 unexposed controls, there were 10 with a positive SARS-CoV-2 PCR test result.

For unexposed cohort B, attempts were made to find unexposed individuals for all attendees traced to an NHS number, confirmed as being Welsh residents, and with their address matched to a valid UPRN. For test events 1 and 2 there were 1,988 and 2,845 individuals who met these criteria, respectively. Due to the more stringent matching criteria for unexposed cohort B, 5,964 and 8,535 unexposed controls were found for test events 1 and 2, respectively, representing unexposed-to-exposed ratios of 2.5:1 and 2.6:1, respectively.

For event 1, none of the 5,964 individuals in unexposed cohort B had a positive SARS-CoV-2 PCR test result within 14 days of the event, and for event 2, two unexposed individuals had a positive SARS-CoV-2 PCR test result.


Using unexposed cohort A, a risk ratio of 3.00 (95% CI: 0.18–47.9; p = 0.50) was obtained for event 1, and a risk ratio of 0.30 (95% CI: 0.04–2.34; p = 0.23) was obtained for event 2. Equivalent risk differences were 0.03% (–0.22%–0.66%; p = 0.50) and –0.07% (–0.79%–0.23%; p = 0.23). Odds ratios estimated with CLR were very similar: 3.00 (95% CI: 0.19–48.0; p = 0.44) for event 1 and 0.30 (95% CI: 0.04–2.34; p = 0.25) for event 2.

Using unexposed cohort B was more problematic due to failure to find cases to include in all groups. A risk ratio and CLR could not be computed for event 1 due to the absence of any non-exposed cases. For event 2, a risk ratio of 1.30 (95% CI: 0.12–14.35; p = 0.81) was obtained. CLR also could not be performed for event 2, due to the absence of any matchable exposed cases. We attempted to approximate a CLR via pair-substitution (see Supplementary Material).


All cases in the exposed cohorts who were identified through record linkage were also identified in the TTP system. Contact tracing notes identified that two of these three cases had reported attending the events as identified through record linkage. One case linked to event 2 reported not attending the event, and therefore was excluded from the exposed cohort. Cases who reported having attended an event said that they had adhered to the COVID-19 protection measures at the event including social distancing and mask wearing, and also reported that they observed a high level of compliance with the measures in others attending the event. However, high-risk exposures associated with activities surrounding the event, such as in pubs or coaches, were reported. These included a lack of social distancing, singing and shouting unmasked and physical contact with others.


We have demonstrated that it is possible to design and conduct an epidemiological study of communicable disease at large spectator events using record linkage of ticketing information with routine microbiological test results. We also proved the feasibility of using routine administrative data to select an unexposed cohort, that is: people who might have attended the event but didn’t, in order to quantify risk. The quantification of risk provides advantages over approaches which rely only on contact tracing data [23, 24]. This was useful for assessing risk of COVID-19, but could be applied equally to other infectious disease risks at mass gatherings, such as gastrointestinal disease. In addition, use of qualitative information, recorded in the contact tracing system, provided useful contextual information when seeking to validate the quantitative findings.

Our results have not identified evidence of a significantly different risk of COVID-19 associated with attendance at these test events. This study took place at a time of low community incidence, with very low numbers of cases identified in both the exposed and unexposed cohorts. Testing rates were also low in this period, which resulted in a low linkage rate. There was a recognition that risk would be difficult to assess given the small size of most events and the background of low community transmission at the time [25]. Nevertheless, the information provided on the number of cases identified in people attending the two larger test events provided some reassurance at the time and supported the behavioural insights work being carried out as part of the test-events programme [26]. Although the present study was not able to arrive at any conclusions on the risk associated with these events, it does demonstrate a methodology that could be better applied to other events in higher-incidence settings.

Whilst risk ratios and CLR-derived odds ratios did not indicate significantly higher or lower risk of attendance at either event, it is interesting to note that the risk ratio was less than one for the event that included pre-event PCR testing, but above one for the event that didn’t (when using unexposed cohort A). The policy to test attendees prior to one of the test events may have led to individuals who were pre-symptomatic not attending, which might have in turn reduced the number of cases detected after the event. This may lead to an apparent protective effect of event 2. This apparent effect may also be magnified by the higher background incidence during event 2, compared to event 1. We also note, while there is a large apparent difference in relative risk, looking at absolute risk, there is only an additional 30 in 100,000 risk associated with event 1, and a reduction of 70 per 100,000 in risk associated with event 2. These numbers are in the same order of magnitude as the background incidence at the time, which was negligible in the wider context of the pandemic.

With respect to the matching methodology, this approach is applicable to international public health agencies with access to large-scale registers of health care users, equivalent to WDS. We have also demonstrated two strategies for linkage, using household structures as a linking variable in addition to traditional scalar demographic variables. The lack of statistically significant results in the presented data makes it difficult to derive conclusions, but further comparisons with other datasets will provide more information on the added benefit of the household matching. The stringent matching criteria introduced by the household matching meant that not enough cases survived selection to be able to make any estimates of risk (see Supplementary Material for an attempt to overcome this issue). Comparing risk ratios with odds ratios derived from CLR, we find very little difference in the results, suggesting that there is little impact of matching bias, at least for individually matched cohorts. Supplementary analysis suggests there may be greater impact of matching bias when using household matching, but clear conclusions are difficult to make (see Supplementary Material).

Despite the numbers of cases being small, the qualitative data collected for validation indicated that behaviours surrounding an event may be pertinent to transmission. Behavioural observations at some of the other test events in Wales have demonstrated that adherence to COVID-safe behaviours appears to break down in specific situations [27], such as where there is a lack of structured support (lack of signage, lack of stewards) or at physical crunch points where flow is disrupted and bottlenecks emerge (such as entry and exit). Since the test events took place, incidence of COVID-19 has risen in the UK, and a number of large events have been implicated in large numbers of cases. In particular, the Boardmasters Festival in August 2021 is believed to have contributed to approximately 4,700 COVID-19 cases [28]. The demographic of this event was largely young adults, many of whom were unvaccinated. In addition, the size and duration of the event, the living arrangements on the site and the shared modes of transport used are likely to contribute to additional risk associated with the event. The methods demonstrated in this paper would have high utility in quantifying the risk of COVID-19 at such events, to categorise risk across different types of events and (through use of contact tracing information) different types of exposures. This has the potential to provide important information which could be used alongside behavioural insights to inform public health advice, to event organisers and potential/actual attendees.

We carried out work to explore possible study designs to inform this process, but also to test the feasibility of carrying out this type of analysis in response to outbreaks associated with future events, such as large sporting events or music festivals. In order to calculate risk it is necessary to be able to identify a denominator population and to be able to link cases to event attendance. This was possible for ‘test events’ as, with agreement from event organizers, a privacy statement was included with the terms and conditions of ticket purchase. The special arrangements for these events introduced a number of mitigations which would not be in place under normal circumstances. In order to repeat this work, for example in the context of an alert level zero (where many restrictions would be relaxed) or during a period of higher incidence, seeking a similar mechanism for obtaining consent for data sharing would be valuable. This method could also be applied to any infectious disease risk at a large event with known attendance.


It is not possible to say whether cases associated with the test events acquired or transmitted infection while attending a specific event. The events are likely to have incurred increased use of hospitality venues outside the event venue and increased social mixing due to the easing of lockdown measures during this period. The behaviour surrounding attendance at events, rather than attendance itself, may have contributed to Wales’ rising rate in this period and further research is needed in this area.

Post-event test results were only obtained for a very small percentage of total attendees. Our case definition included only cases who were tested with PCR tests. Testing by lateral flow devices was not considered. Hence, the majority of asymptomatic SARS-CoV-2 infections would not be detected by the present method.

Another limitation is the consistency of the attendee data. The organizers of event one collected details of ticket buyers only, whereas the organizers of event two collected details of all ticket holders. Furthermore, the event one organizers also collected information on staff and hospitality whereas no such details were recorded for event two. The practice of selling tickets on may also result in misclassification of individuals as having attended or not attended an event, if data are based on ticket sales. Furthermore, for both events, recorded ticket holders may not have attended due to a positive test result. If the presented method is to be scaled up to cover a wider range of events, harmonization of attendee data will become a significant issue.

Practical limitations are also a major consideration. To run test events, personal identifiers need to be gathered routinely to allow effective epidemiological investigation. This requires significant operational investment and cost on the part of event organizers and raises privacy concerns, both of which are barriers to carrying out routine surveillance of this nature. Despite these limitations, the methodology presented here could usefully be applied to events in periods of rising incidence and to higher-risk events such as music festivals, so that, with increased application and further improvements to the methodology, advice could be tailored to specific event types.

Mass events provide an excellent opportunity for operational public health research [29]. While the present analysis cannot identify whether the events directly contributed to increased risk, future work could explore this question further through mediation analysis [30]. Propensity scores may also be useful in this context, to control for demographic characteristics associated with the probability of attending an event [31]. Identifying matched controls using propensity scores may improve the selection of controls and possibly overcome the limitations experienced when using the household group matching criteria. There is also potential to combine behavioural observations with this analytical epidemiological approach to give a more precise understanding of real risk at mass events. The present analysis does not differentiate between infection risk from attendance at the event and that from associated activities around the event.
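To illustrate how controls might be selected on propensity scores, the following is a minimal sketch of greedy one-to-one nearest-neighbour matching with a caliper. It assumes the scores have already been estimated (for example, from a logistic regression of attendance on age band, gender and locality). The function name, identifiers, scores and caliper value are all hypothetical and are not taken from this study.

```python
def match_on_propensity(attendees, candidates, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    attendees, candidates: dicts mapping a person identifier to an
    already-estimated propensity score (hypothetical inputs).
    Returns a dict mapping attendee id -> matched control id. Attendees
    with no candidate within the caliper are left unmatched.
    """
    available = dict(candidates)  # copy, so the caller's dict is not modified
    matches = {}
    # Process attendees in ascending score order for reproducible results
    for person_id, score in sorted(attendees.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining candidate by absolute difference in score
        best_id = min(available, key=lambda cid: abs(available[cid] - score))
        if abs(available[best_id] - score) <= caliper:
            matches[person_id] = best_id
            del available[best_id]  # each control is used at most once
    return matches


# Hypothetical scores: a3 has no candidate within the caliper.
attendee_scores = {"a1": 0.30, "a2": 0.70, "a3": 0.95}
candidate_scores = {"c1": 0.32, "c2": 0.68, "c3": 0.50}
print(match_on_propensity(attendee_scores, candidate_scores))
```

Greedy matching is simple but depends on processing order. Optimal matching, or matching with replacement, are alternatives that trade simplicity for better balance when suitable controls are scarce.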


We demonstrate the potential for using population data science methods to inform policy. We conclude that, at the point in the pandemic when this analysis was carried out, and with the mitigations that were in place, attending large sporting events did not significantly increase risk of COVID-19. However, these analyses were carried out between epidemic waves when background incidence was low, and need to be repeated during periods of higher incidence. This method could be applied to any infectious disease risk at large events with available attendee details.

Statement on conflicts of interest

Daniel Rhys Thomas and Ashley Gould were members of a Welsh Government Advisory Group on Test Events. Daniel Rhys Thomas attended a test event as a guest, although this was not one of the test events in this analysis.

Ethics statement

The work has a clear public health benefit and as such has a legal basis under Public Health Wales’ Establishment Order.

This epidemiological work was part of a wider programme of work to evaluate test events in Wales, and research ethics approval was obtained from Bangor University Research Ethics Committee. Ref. DC210610.

This work involved the secondary analysis of existing data. A Data Protection Impact Assessment (DPIA) was written, and all data were processed and analysed at Public Health Wales in accordance with Public Health Wales’ IG policies.


We are grateful to Matthew Daniel, Katie Hughes and Luke Hughes of Swansea City Football Club, and Wayne Nash and Mona Sabbuba of Cardiff City Football Club for providing data, Rob Holt of Welsh Government for supporting DPO this work, and Lisa Partridge of Public Health Wales for advising on information governance. We thank the reviewers for providing valuable insights that contributed significantly to the development of this paper.


  1. Duizer E, Timen A, Morroy G, Husman AM de R. Norovirus outbreak at an international scout jamboree in the Netherlands, July-August 2004: international alert. Euro Surveill. [Internet] 2004 Aug. [cited 2021 19];8:2523. Available from: 10.2807/esw.08.33.02523-en
  2. Schenkel K, Williams C, Eckmanns T, Poggensee G, Benzler J, Josephsen J, et al. Enhanced surveillance of infectious diseases: the 2006 FIFA World Cup experience, Germany. Euro Surveill. 2006;11:234–8. 10.2807/ESM.11.12.00670-EN
  3. Abubakar I, Gautret P, Brunette G, Blumberg L, Johnson D, Poumerol G, et al. Global perspectives for prevention of infectious diseases associated with mass gatherings. Lancet Infect. Dis. [Internet] 2012 Jan. [cited 2021 19];12:66–74. Available from: 10.1016/S1473-3099(11)70246-8
  4. Kaiser R, Coulombier D. Epidemic intelligence during mass gatherings. Euro Surveill. [Internet] 2006 Dec. [cited 2021 19];11:3100. Available from: 10.2807/esw.11.51.03100-en
  5. Tam JS, Barbeschi M, Shapovalova N, Briand S, Memish ZA, Kieny M-P. Research agenda for mass gatherings: a call to action. Lancet Infect. Dis. [Internet] 2012 Mar. [cited 2021 20];12:231–9. Available from: 10.1016/S1473-3099(11)70353-X
  6. Grix J, Brannagan PM, Grimes H, Neville R. The impact of Covid-19 on sport. 10.1080/19406940.2020.1851285 [Internet] 2020 [cited 2021 20];13:1–12. Available from: 10.1080/19406940.2020.1851285
  7. Bond AJ, Cockayne D, Ludvigsen JAL, Maguire K, Parnell D, Plumley D, et al. COVID-19: the return of football fans. Manag. Sport Leis. [Internet] 2020 [cited 2021 19];25. Available from: 10.1080/23750472.2020.1841449
  8. UK Government. Events Research Programme: Phase I findings [Internet]. 2021 [cited 2021 19]; Available from:

  9. Revollo B, Blanco I, Soler P, Toro J, Izquierdo-Useros N, Puig J, et al. Same-day SARS-CoV-2 antigen test screening in an indoor mass-gathering live music event: a randomised controlled trial. Lancet Infect. Dis. [Internet] 2021 Oct. [cited 2021 19];21:1365–72. Available from: 10.1016/S1473-3099(21)00268-1
  10. Welsh Government. Wales pilot test events get underway [Internet]. 2021 [cited 2021 19]; Available from:

  11. Welsh Government. Pilot events: report on findings [Internet]. 2021 [cited 2022 17]. Available from:

  12. Welsh Government. Mid year estimates of the population: mid-2019 [Internet]. 2019 [cited 2022 19]. Available from:

  13. Public Health Wales. Rapid Covid-19 Surveillance Dashboard [Internet]. 2020; Available from:!/vizhome/RapidCOVID-19virology-Public/Headlinesummary

  14. Senedd Research. Coronavirus timeline: the response in Wales [Internet]. 2022 [cited 2022 19]. Available from:

  15. Allan Dickinson. National Architecture. Welsh Demographic Service [Internet]. 2006. Available from: Demographic Service v1-1 (13.12.06).doc

  16. Principles of Research Methodology. New York, NY: Springer New York; 2012.
  17. R Core Team. R: A language and environment for statistical computing. [Internet]. 2020; Available from:

  18. Pearce N. Analysis of matched case-control studies. BMJ [Internet] 2016 Feb. [cited 2022 25];352. Available from: 10.1136/BMJ.I969
  19. Cummings P, McKnight B. Analysis of Matched Cohort Data. Stata J. [Internet] 2004 Aug.;4:274–81. Available from: 10.1177/1536867X0400400305
  20. Kuo C-L, Duan Y, Grady J. Unconditional or Conditional Logistic Regression Model for Age-Matched Case–Control Data? Front. Public Heal. 2018 Mar.;0:57. 10.3389/FPUBH.2018.00057
  21. Gail MH, Lubin JH, Rubinstein L V. Likelihood calculations for matched case-control studies and survival studies with tied death times. Biometrika [Internet] 1981 Dec. [cited 2022 25];68:703–7. Available from: 10.1093/BIOMET/68.3.703
  22. Terry M Therneau, Thomas Lumley, Atkinson Elizabeth CC. Package “survival” [Internet]. 2021; Available from:

  23. Ferguson J, Dunn S, Best A, Mirza J, Percival B, Mayhew M, et al. Validation testing to determine the sensitivity of lateral flow testing for asymptomatic SARS-CoV-2 detection in low prevalence settings: Testing frequency and public health messaging is key. PLOS Biol. [Internet] 2021 Apr. [cited 2021 7];19:e3001216. Available from: 10.1371/journal.pbio.3001216
  24. Ryan BJ, Coppola D, Williams J, Swienton R. COVID-19 Contact Tracing Solutions for Mass Gatherings. Disaster Med. Public Health Prep. [Internet] 2021 Jun. [cited 2021 20];15:e1–7. Available from: 10.1017/dmp.2020.241
  25. Torjesen I. Covid-19: Events pilot finds “no substantial outbreaks,” but experts point to gaps in evidence. BMJ [Internet] 2021 Jun. [cited 2021 19];373. Available from: 10.1136/BMJ.N1658
  26. Gould A, Lewis L, Evans L, Greening L, Howe-Davies H, Naughton M, et al. COVID-19 personal protective behaviors during mass events: Lessons from observational measures in Wales, UK. PsyArXiv [Internet] 2021 [cited 2021 19]; Available from: 10.31234/OSF.IO/8JSR3
  27. Welsh Government. Behavioural observations of pilot events in Wales [Internet]. 2021 [cited 2021 20]; Available from:

  28. BBC News. Boardmasters: 4,700 Covid cases “may be linked” to Newquay festival [Internet]. 2021 [cited 2021 19]; Available from:

  29. Thackway S, Churches T, Fizzell J, Muscatello D, Armstrong P. Should cities hosting mass gatherings invest in public health surveillance and planning? Reflections from a decade of mass gatherings in Sydney, Australia. BMC Public Heal. [Internet] 2009 Sep. [cited 2021 20];9:1–10. Available from: 10.1186/1471-2458-9-324.
  30. Richiardi L, Bellocco R, Zugna D. Mediation analysis in epidemiology: Methods, interpretation and bias. Int. J. Epidemiol. [Internet] 2013 Oct. [cited 2021 20];42:1511–9. Available from: 10.1093/ije/dyt127
  31. Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behav. Res. [Internet] 2011 May [cited 2021 20];46:399. Available from: /pmc/articles/PMC3144483/. 10.1080/00273171.2011.568786

Article Details

How to Cite
Drakesmith, M., Hobson, G., John, G., Stegall, E., Gould, A., Parkinson, J. and Thomas, D. R. (2022) “Developing a population data science approach to assess increased risk of COVID-19 associated with attending large events”, International Journal of Population Data Science, 6(3). doi: 10.23889/ijpds.v6i3.1711.