Developing a population data science approach to assess increased risk of COVID-19 associated with attending large events

Abstract

Introduction: In summer 2021, as rates of COVID-19 decreased and social restrictions were relaxed, live entertainment and sporting events were resumed. In order to inform policy on the safe re-introduction of spectator events, a number of test events were organised in Wales, ranging in setting, size and audience.

Objectives: To design and test a method to assess whether test events were associated with an increase in risk of confirmed COVID-19, in order to inform policy.

Methods: We designed a cohort study with fixed follow-up time and measured relative risk of confirmed COVID-19 in those attending two large sporting events. First, we linked ticketing information to individual records on the Welsh Demographic Service (WDS), a register of all people living in Wales and registered with a GP, and identified NHS numbers for attendees. Where NHS numbers were not found we used combinations of other identifiers such as email, name, postcode and/or mobile number. We then linked attendees to routine SARS-CoV-2 test data to calculate positivity rates in people attending each event for the period one to fourteen days following the event. We selected a comparison cohort from WDS for each event, individually matched by age band, gender and locality of residence. As many people attended events in family groups, we explored the possibility of also matching on household clusters within the comparison group. Risk ratios were then computed for the two events.

Results: We successfully assigned NHS numbers to 91% and 84% of people attending the two events, respectively. Other identifiers were available for the remainder. Only a small number of attendees (<10) had a record of confirmed COVID-19 following attendance at each event (14-day cumulative incidence: 36 and 26 per 100,000, respectively). There was no evidence of significantly increased risk of COVID-19 at either event. However, the event that did not include pre-event testing in its mitigations had a higher risk ratio (3.0 compared to 0.3).

Conclusions: We demonstrate the potential for using population data science methods to inform policy. We conclude that, at that point in the epidemic, and with the mitigations that were in place, attending large outdoor sporting events did not significantly increase risk of COVID-19. However, these analyses were carried out between epidemic waves, when background incidence and testing rates were low, and need to be repeated during periods of greater transmission. Having a mechanism to identify attendees at events is necessary to calculate risk, and the feasibility and acceptability of data sharing should be considered.



Introduction

Mass events have been associated with increased transmission of communicable disease [1][2][3]. The higher population density and movement of people associated with these events may contribute to this increased transmission. Social and environmental factors, such as temperature, humidity, venue type, crowd size and mood, age, and alcohol and drug use, can also increase risk. It has therefore become accepted good practice to carry out public health surveillance during mass gathering events [4].

The COVID-19 pandemic has had a major impact on the staging of mass gatherings worldwide [5], including sporting events [6]. However, in summer 2021, as COVID-19 case rates decreased and social restrictions were reduced, policy makers needed to make decisions about the resumption of live entertainment and sporting events [7]. Some countries piloted the resumption of events by staging 'test events': events that fell outside extant COVID-19 regulations but were staged in order to test the safety of progressing to reduced or no restrictions on indoor or outdoor gathering [8,9].

Policy decisions on non-pharmaceutical interventions for prevention and control of COVID-19 require good evidence [5]. In the United Kingdom, health is a responsibility devolved to the constituent nations, and the work described in this article was prompted by a gap in evidence identified by a Welsh Government group set up in summer 2021 to assess the safety of a series of 'test events' [10]. In order to assess the safety of re-introducing spectator events, nine 'test events' were organised in Wales during May-June 2021, ranging in setting, size and type of audience [11]. The primary objective of staging these events was to assess whether safe events could be delivered in practice by event organisers with risk mitigation in place, such as physical distancing in audiences, wearing of face masks in enclosed spaces and pre-event health checks. Whilst the emphasis of the test events was on operational issues, such as event management and compliance of staff and customers with mitigation measures, questions were raised about how best to assess epidemiological risk associated with attending events. Therefore, a secondary objective was to assess whether attendance at live events was associated with increased transmission of SARS-CoV-2, and which types of event, if any, present a higher risk.

In the present study we aim to demonstrate the feasibility of addressing the second objective, by linking ticketing information to routine public health surveillance data for the two largest test events to compare cases in attendees with a reference population. Linking these data sources permits estimation of the relative and absolute risk associated with event attendance, which would otherwise not be possible due to the absence of an appropriate denominator. Demonstrating this methodology makes future analysis of outbreaks associated with large events, such as sporting events or music festivals, feasible.


Methods


Study design

Cohort study with fixed follow-up period, with a matched comparison (unexposed) cohort.


Setting

Both events were in Wales, and attendance was restricted to residents of Wales, which has a population of 3.13 million [12].

At the time of the test events, COVID-19 incidence was decreasing following a peak in winter 2020-2021. Key indicators for all Wales in the 2-week period following event 1 (24th May-6th June 2021) were: cumulative 14-day incidence of 21.6 per 100,000, testing rate of 1,788 per 100,000 and positivity [...] incidence of 61.4 per 100,000, testing rate of 2,266 per 100,000 and positivity of 2.7% [13].

Restrictions that had been in place over the winter were eased during this period [14]. On 11th May indoor hospitality and entertainment were re-opened, and international travel resumed on 17th May. However, the Welsh Government advised football fans against international travel. From 4th June, groups of 30 people were permitted to meet outdoors and attend larger outdoor events. A decision to re-open large sporting events had not yet been taken, pending the outcome of the test events programme.


Description of events

Of the 9 organised test events, the two largest were deemed suitable for analysis of risk. The other events were smaller (<1,000 attendees) and were not deemed feasible for the analysis. The two largest events are referred to as 'event 1' and 'event 2', and their characteristics are given in Table 1. Both events were football matches in Wales. One was a Premiership play-off match held at Swansea Liberty Stadium. The other was a friendly international in advance of the [...], held at Cardiff City Stadium. Both had COVID-19 risk mitigations in place. Event 2 had an additional requirement for proof of a negative SARS-CoV-2 test in advance of the event.

Following the events, the event organisers then sent lists of all attendees, including paying spectators, guests and staff where available (see Table 1) to Public Health Wales via a secure file sharing platform.


Record linkage

Record linkage of the attendee list was performed in 3 stages (a flow diagram of record linkage for each event is provided in SI):


Linking to NHS numbers in WDS

Where name, address, and date of birth or age were available, individual attendees were matched with the Welsh Demographic Service (WDS), an administrative register of all Welsh residents registered with a general practice [15], and each individual's NHS number as recorded in WDS was identified. The NHS number is a unique number assigned to all users of the National Health Service in the United Kingdom.

Matches were made where the sample collection date for the PCR test lay between 14 days before and 14 days after the event. The matches were then manually checked for obvious mismatches, which were excluded from further analysis.


Linking to contact tracing data

Additional matches were found by linking to the Test Trace Protect (TTP) contact tracing data management system which, in addition to name, postcode and date of birth, also holds contact e-mail and mobile phone number for cases. Matching was performed iteratively on combinations of identifiers in the following order: name, mobile number, e-mail address, postcode and date of birth or age, where available. Matches were made where the laboratory test result date was between 14 days before and 14 days after the event. The matches were then manually checked for obvious mismatches, which were excluded from further analysis.
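The iterative matching described above can be illustrated with a minimal Python sketch. It is not the code used in the study (which was carried out in R); the field names, the particular identifier combinations in `MATCH_KEYS`, the event date and all records are hypothetical, and candidate links are returned for the manual checking step rather than accepted automatically.

```python
from datetime import date, timedelta

# Identifier combinations tried in turn, strictest first.
# The exact combinations are an assumption made for illustration.
MATCH_KEYS = [
    ("name", "mobile", "email", "postcode", "dob"),
    ("name", "mobile"),
    ("name", "email"),
    ("name", "postcode", "dob"),
]


def normalise(value):
    """Lower-case and strip whitespace so trivial differences do not block a match."""
    return "".join(str(value).lower().split()) if value else None


def link_attendees(attendees, test_records, event_date, window_days=14):
    """Return {attendee_id: (candidate record ids, identifiers used)}."""
    window = timedelta(days=window_days)
    # Keep only records dated within +/- window_days of the event.
    in_window = [r for r in test_records
                 if abs(r["result_date"] - event_date) <= window]
    links = {}
    for keys in MATCH_KEYS:
        index = {}
        for rec in in_window:
            key = tuple(normalise(rec.get(k)) for k in keys)
            if None not in key:
                index.setdefault(key, []).append(rec["record_id"])
        for att in attendees:
            if att["attendee_id"] in links:
                continue  # already linked on a stricter combination
            key = tuple(normalise(att.get(k)) for k in keys)
            if None not in key and key in index:
                links[att["attendee_id"]] = (index[key], keys)
    return links


# Hypothetical example data.
event_date = date(2021, 6, 5)
attendees = [
    {"attendee_id": 1, "name": "Ann Example", "mobile": "07000 000001"},
    {"attendee_id": 2, "name": "Bob Sample", "email": "bob@example.org"},
]
test_records = [
    {"record_id": "r1", "name": "ann example", "mobile": "07000000001",
     "result_date": date(2021, 6, 10)},
    {"record_id": "r2", "name": "Bob Sample", "email": "BOB@example.org",
     "result_date": date(2021, 7, 30)},  # outside the 14-day window
]
links = link_attendees(attendees, test_records, event_date)
```

In this toy example only attendee 1 is linked, because the record resembling attendee 2 falls outside the 14-day window. Trying the strictest combination first means that a looser combination is used only when the stricter ones fail.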


Exposed cohort

The exposed cohort was identified from the linked attendee list obtained in the previous step. Any records with duplicate NHS numbers were removed. To be deemed 'matchable' for identification of the unexposed cohort, a record had to belong to a resident of Wales and have a minimal set of matching variables on WDS (see 'Unexposed cohorts'). All records not deemed matchable were removed from the exposed cohort.


Unexposed cohorts

We tested two different methods of sampling the unexposed cohort. In both methods, we selected comparison cohorts from the general population of Wales, that is: a group who could have attended the event but did not. In order to take into account the demographic profile of those attending each event and control for background community transmission, we selected from the WDS individuals who were matched to the exposed cohort on a range of demographic variables. We aimed to match the unexposed to exposed cohort with a 3:1 ratio. This ratio was chosen as general guidance is that matching ratios of greater than 3:1 offer negligible gains in statistical power [16].

Unexposed cohort A (without household cluster matching)

Individuals were matched to members of the exposed cohort for age, sex, area of residence and neighbourhood deprivation.

Of the event attendees who had linkable identifiers (NHS number), individual matching was carried out on the following criteria: age (within 5 years), sex, Middle Super Output Area (MSOA) of residence, and deprivation quintile (within +/−1 quintile). Three unexposed individuals were selected for each exposed individual. A full 3:1 unexposed-to-exposed ratio was achieved for both events.
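As an illustration of the individual matching criteria, the following Python sketch selects up to three unexposed individuals per attendee from a register. It is a simplified sketch rather than the study's implementation: the record layout, the random selection among eligible candidates and the choice to sample without replacement are all assumptions, and the register contents are synthetic.

```python
import random


def is_eligible(exposed, candidate, age_tol=5, quintile_tol=1):
    """Apply the matching criteria: sex, MSOA, age tolerance, deprivation tolerance."""
    return (candidate["sex"] == exposed["sex"]
            and candidate["msoa"] == exposed["msoa"]
            and abs(candidate["age"] - exposed["age"]) <= age_tol
            and abs(candidate["quintile"] - exposed["quintile"]) <= quintile_tol)


def match_unexposed(exposed_cohort, register, ratio=3, seed=0):
    """Return {exposed id: list of matched unexposed ids} (up to `ratio` each)."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    exposed_ids = {p["id"] for p in exposed_cohort}
    used = set()  # assumption: each unexposed person is used at most once
    matches = {}
    for person in exposed_cohort:
        pool = [c for c in register
                if c["id"] not in exposed_ids
                and c["id"] not in used
                and is_eligible(person, c)]
        chosen = rng.sample(pool, min(ratio, len(pool)))
        used.update(c["id"] for c in chosen)
        matches[person["id"]] = [c["id"] for c in chosen]
    return matches


# Tiny synthetic register (all values hypothetical).
exposed_cohort = [{"id": "e1", "age": 40, "sex": "M", "msoa": "A", "quintile": 3}]
register = exposed_cohort + [
    {"id": "u1", "age": 44, "sex": "M", "msoa": "A", "quintile": 2},
    {"id": "u2", "age": 36, "sex": "M", "msoa": "A", "quintile": 4},
    {"id": "u3", "age": 41, "sex": "M", "msoa": "A", "quintile": 3},
    {"id": "u4", "age": 39, "sex": "M", "msoa": "A", "quintile": 3},
    {"id": "u5", "age": 47, "sex": "M", "msoa": "A", "quintile": 3},  # age gap > 5
    {"id": "u6", "age": 40, "sex": "F", "msoa": "A", "quintile": 3},  # different sex
    {"id": "u7", "age": 40, "sex": "M", "msoa": "B", "quintile": 3},  # different MSOA
    {"id": "u8", "age": 40, "sex": "M", "msoa": "A", "quintile": 5},  # quintile gap > 1
]
matches = match_unexposed(exposed_cohort, register)
```

When fewer than three eligible candidates exist, the function returns as many as it can find, so the achieved ratio can be checked afterwards.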


Unexposed cohort B (with household cluster matching)

As many people attended events in family groups, we attempted to simulate similar levels of clustering within the comparison group, in order to account for commonalities within household clusters, including risk factors for within-household infection. We used unique property reference numbers (UPRNs), which are recorded against the vast majority of individuals on the WDS, to group individuals into a family or household. We used exactly the same individual-level matching criteria as for unexposed cohort A. However, if, for example, a 50-year-old male attended one of the events with his two daughters, aged 12 and 17 years, our controls would be drawn from a similar household (within the same MSOA and a deprivation quintile within +/−1 quintile), in which there was also a male aged 50 (+/−5) plus two females aged 12 (+/−5) and 17 (+/−5), respectively. This limited the number of controls that it was possible to find, especially where there were 4 or 5 attendees from the same household. Therefore, unlike in cohort A, we were unable to achieve a full 3:1 unexposed-to-exposed ratio. Actual ratios achieved were 2.5:1 and 2.6:1 for events 1 and 2, respectively.
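The household criterion can be sketched in Python as a check that a candidate household contains a distinct, suitably matched member for every attendee in the attending group. This is only one possible reading of the procedure: the function names and record layout are hypothetical, area-level criteria (MSOA and deprivation quintile) are assumed to have been applied beforehand, and a brute-force search is used because attending groups are small.

```python
from itertools import permutations


def member_ok(attendee, candidate, age_tol=5):
    """Individual-level criteria within a household: same sex, age within tolerance."""
    return (attendee["sex"] == candidate["sex"]
            and abs(attendee["age"] - candidate["age"]) <= age_tol)


def household_match(attending_group, household, age_tol=5):
    """Return one distinct household member per attendee, or None if impossible."""
    if len(household) < len(attending_group):
        return None
    # Try every ordered selection of household members against the attendees.
    for combo in permutations(household, len(attending_group)):
        if all(member_ok(a, c, age_tol) for a, c in zip(attending_group, combo)):
            return list(combo)
    return None


# The example from the text: a 50-year-old male attending with daughters aged 12 and 17.
attending_group = [
    {"sex": "M", "age": 50},
    {"sex": "F", "age": 12},
    {"sex": "F", "age": 17},
]
# Hypothetical candidate households.
household_a = [
    {"sex": "M", "age": 53},
    {"sex": "F", "age": 45},
    {"sex": "F", "age": 14},
    {"sex": "F", "age": 15},
]
household_b = [
    {"sex": "M", "age": 48},
    {"sex": "F", "age": 10},
    {"sex": "F", "age": 30},
]
match_a = household_match(attending_group, household_a)
match_b = household_match(attending_group, household_b)
```

Here `household_a` yields a match, whereas `household_b` does not, because neither female member is within 5 years of the 17-year-old attendee. The requirement to satisfy every attendee simultaneously is what makes suitable households scarce for larger groups.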


Case definition

Cases were defined as individuals who were linked to a positive SARS-CoV-2 PCR test result in the period 1 to 14 days after the date of the event. Non-cases were defined as individuals who either were linked to a negative SARS-CoV-2 PCR test result or were not linked to a test result. We make the assumption that absence of linkage is equivalent to absence of a test result.
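The case definition amounts to a simple date-window rule, sketched below in Python. The function name, the representation of test results and the event date are hypothetical, and which test date is compared with the event date (sample collection or result date) is left as an assumption of the sketch.

```python
from datetime import date


def classify(event_date, linked_tests, first_day=1, last_day=14):
    """Return 'case' if any linked positive PCR test falls 1 to 14 days after the event.

    linked_tests is a list of (test_date, result) tuples; an empty list means
    no test was linked, which is treated as a non-case.
    """
    for test_date, result in linked_tests:
        days_after = (test_date - event_date).days
        if result == "positive" and first_day <= days_after <= last_day:
            return "case"
    return "non-case"


event_date = date(2021, 6, 5)  # hypothetical event date
examples = {
    "positive on day 3": classify(event_date, [(date(2021, 6, 8), "positive")]),
    "positive on day 0": classify(event_date, [(date(2021, 6, 5), "positive")]),
    "positive on day 15": classify(event_date, [(date(2021, 6, 20), "positive")]),
    "negative on day 3": classify(event_date, [(date(2021, 6, 8), "negative")]),
    "no linked test": classify(event_date, []),
}
```

Only the first example is classified as a case; tests on the day of the event or more than 14 days afterwards fall outside the follow-up window.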


Analysis

All analysis was carried out in R v4.0.3 [17]. All analyses were performed for the two unexposed cohort definitions separately.


Descriptive statistics

Attendees at each event were described by age, sex and area of residence. Median age and sex ratios were calculated, and the geographic distribution of attendees was plotted on a point map. The 14-day cumulative incidence (number of new cases per 100,000) over the 14-day post-event period was calculated for the exposed and unexposed cohorts.
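As a worked example of the cumulative incidence calculation, the short Python sketch below uses the attendee totals and validated post-event case counts reported in the Results (one case among 2,745 attendees for event 1 and one among 3,918 for event 2). The function name is hypothetical, and treating all recorded attendees as the denominator is an assumption based on the reported figures.

```python
def cumulative_incidence_per_100k(new_cases, cohort_size):
    """New cases over the follow-up period, expressed per 100,000 people."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return new_cases / cohort_size * 100_000


event1 = cumulative_incidence_per_100k(1, 2745)
event2 = cumulative_incidence_per_100k(1, 3918)
print(round(event1), round(event2))  # 36 26
```

With so few cases, a single additional positive test would change these figures substantially, which is why they are presented alongside confidence intervals for the risk ratios.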


Analysis of risk

Our first approach was to quantify relative risk and absolute risk difference. Risk ratios with 95% confidence intervals were calculated using unconditional maximum likelihood estimation, with confidence intervals calculated using the Wald normal approximation, using the riskratio function from the epitools R package. Equivalent risk differences (RD) were also calculated from the estimated risk ratios. When computing risk ratios, where matched cohorts are treated as independent cohorts, there is a possibility of a matching bias being introduced [18][19][20]. This may be a particular issue for unexposed cohort B, where matching is performed on household clusters of varying sizes. A common way to control for this is with conditional logistic regression (CLR), which provides an estimate of odds ratios, controlling for matching. Although CLR is usually applied in the context of case-control studies, it is also applicable to the analysis of matched cohort studies [19]. CLR was performed by fitting an equivalent time-independent Cox model [21] with strata defined by the matched groups, as implemented in the clogit function in the survival R package [22]. 95% confidence intervals were calculated using the Wald normal approximation.
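For readers unfamiliar with the calculation, the following Python sketch computes a risk ratio with a Wald confidence interval on the log scale, together with the crude risk difference. It is an illustration of the standard formulae, not a reproduction of the epitools code, and the counts used (one case among 2,412 matchable attendees and one among 7,236 unexposed individuals for event 1) are our reading of the reported results and should be treated as an assumption.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_ratio_wald(cases_exp, n_exp, cases_unexp, n_unexp, level=0.95):
    """Risk ratio, Wald CI (normal approximation on the log scale) and risk difference."""
    if cases_exp == 0 or cases_unexp == 0:
        # With an empty cell the ratio or its standard error is undefined.
        raise ValueError("risk ratio cannot be estimated with zero cases in a cohort")
    risk_exp = cases_exp / n_exp
    risk_unexp = cases_unexp / n_unexp
    rr = risk_exp / risk_unexp
    # Standard error of log(RR) for two independent cohorts.
    se_log_rr = sqrt(1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    rd = risk_exp - risk_unexp  # crude difference in risk
    return rr, lower, upper, rd


# Assumed counts for event 1 with unexposed cohort A (see lead-in).
rr, lower, upper, rd = risk_ratio_wald(1, 2412, 1, 7236)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.1f}, RD = {rd:.2%}")
```

The interval is very wide because each cohort contributes a single case. The explicit check for zero cases mirrors the situation encountered with unexposed cohort B, where no estimate could be obtained for event 1.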


Validation

Cases identified in the exposed cohort through record linkage were followed up by manual search of the TTP contact tracing management system. This system is used to record the work of contact tracers, including responses of cases to the contact tracing questionnaire covering their movements in the days before and after their infection onset. Searching was undertaken based on name and address. Contact tracers' notes were checked both for confirmation of attendance at the events and for other exposures which may have resulted in infection.


Consent

As part of the test event programme in Wales, the organisers of these events were approached prior to the event and, in partnership with Public Health Wales and Welsh Government, incorporated a statement of consent to share attendee details with Public Health Wales for purposes of evaluation of the safety of these events. Providing consent was a condition of attendance which was agreed to at the point of purchase, under special arrangement with Welsh Government to assess the safety of holding large events. The following privacy statement was added to the terms and conditions when people registered for tickets to attend test events: "Sharing your personal information This event is one of a series of Welsh [...]


Results


Description of attendees

For event 1, a total of 2,745 attendees' details were recorded; for event 2, a total of 3,918 attendees' details were recorded. People attending these events were predominantly male (Figure 1a): 81% of those attending event 1 and 85% of those attending event 2 were male. People attending event 1 had a bimodal and generally older age distribution: median age was 43 years (range: 2, 93), as compared to 35 years (range: 1, 90) for event 2. Attendees at event 1 were predominantly resident in South West Wales, most residing close to the event venue, whereas attendees at event 2 had a wider geographic distribution (Figure 1b).


Selection of exposed cohort and linkage results

For event 1, NHS numbers were ascertained for 2,499 (91%) attendees. A total of 97 (4%) attendees were successfully linked to a SARS-CoV-2 PCR test result. For the 14-day period prior to the event, 24 tests were linked, one of which was positive. For the 14-day post-event period, we linked 73 PCR tests (3% of attendees), of which one was positive. This represents a cumulative incidence of 36 per 100,000 attendees in the post-event period.

For event 2, a total of 3,918 attendees' details were recorded. NHS numbers were ascertained for 3,295 (84%) (Table 2). A total of 2,658 (68%) attendees were successfully linked to a SARS-CoV-2 PCR COVID-19 test result. The majority of these were tests obtained shortly before the event to comply with requirements to have a negative test result prior to attendance (Figure 2). In the pre-event period, we identified 2,633 tests, of which 5 were positive. For the 14 days post-event, we linked 25 PCR tests (0.6% of attendees), of which 2 were positive. However, only one of these was successfully validated (see 'Validation'). This represents a 14 [...]


Selection of unexposed cohorts and linkage results

For unexposed cohort A, three unexposed individuals were selected for each attendee traced to an NHS number, based on the criteria described earlier. For event 1, there were 2,412 (88%) attendees who met these criteria. There were 3,277 (84%) for event 2.

For event 1, of the 7,236 people in unexposed cohort A, there was just one individual with a SARS-CoV-2 PCR COVID-19 test result, and for event 2, of the 9,831 unexposed controls, there were 10 with a SARS-CoV-2 PCR COVID-19 test result.

For unexposed cohort B, attempts were made to find unexposed individuals for all attendees traced to an NHS number, confirmed as being Welsh residents, and with their address matched to a valid UPRN. For test events 1 and 2 there were 1,988 and 2,845 individuals who met these criteria, respectively. Due to the more stringent matching criteria for unexposed cohort B, 5,964 and 8,535 unexposed controls [...]


Risks

Using unexposed cohort A, a risk ratio of 3.00 (95% CI: 0.18-47.9; p = 0.50) was obtained for event 1, and a risk ratio of 0.30 (95% CI: 0.04-2.34; p = 0.23) was obtained for event 2. Equivalent risk differences were 0.03% (−0.22% to 0.66%; p = 0.50) and −0.07% (−0.79% to 0.23%; p = 0.23). Odds ratios estimated with CLR were very similar: 3.00 (95% CI: 0.19-48.0; p = 0.44) for event 1 and 0.30 (95% CI: 0.04-2.34; p = 0.25) for event 2.

Using unexposed cohort B was more problematic due to failure to find cases to include in all groups. A risk ratio and CLR could not be estimated for event 1 due to the absence of any non-exposed cases. For event 2, a risk ratio of 1.30 (95% CI: 0.12-14.35; p = 0.81) was obtained. CLR also could not be performed for event 2, due to the absence of any matchable exposed cases. An attempt was made to approximate a CLR via pair-substitution (see Supplementary Material).


Validation

All cases in the exposed cohorts who were identified through record linkage were also identified in the TTP system. Contact tracing notes identified that two of these three cases had reported attending the events as identified through record linkage. One case linked to event 2 reported not attending the event, and was therefore excluded from the exposed cohort. Cases who reported having attended an event said that they had adhered to the COVID-19 protection measures at the event, including social distancing and mask wearing, and also reported that they observed a high level of compliance with the measures in others attending the event. However, high-risk exposures associated with activities surrounding the event, such as in pubs or coaches, were reported. These included a lack of social distancing, singing and shouting unmasked, and physical contact with others.


Discussion

We have demonstrated that it is possible to design and conduct an epidemiological study of communicable disease at large spectator events using record linkage of ticketing information with routine microbiological test results. We also demonstrated the feasibility of using routine administrative data to select an unexposed cohort, that is: people who might have attended the event but did not, in order to quantify risk. The quantification of risk provides advantages over approaches which rely only on contact tracing data [23,24]. This was useful for assessing risk of COVID-19, but could be applied equally to other infectious disease risks at mass gatherings, such as gastrointestinal disease. In addition, use of qualitative information recorded in the contact tracing system provided useful contextual information when seeking to validate the quantitative findings.

Our results have not identified evidence of a significantly different risk of COVID-19 associated with attendance at these test events. This study took place at a time of low community incidence, with very low numbers of cases identified in both the exposed and unexposed cohorts. Testing rates were also low in this period, which resulted in a low linkage rate. There was a recognition that risk would be difficult to assess given the small size of most events and the background of low community transmission at the time [25]. Nevertheless, the information provided on the number of cases identified in people attending the two larger test events provided some reassurance at the time and supported the behavioural insights work being carried out as part of the test events programme [26]. Although the present study was not able to arrive at any conclusions on the risk associated with these events, it does demonstrate a methodology that could be better applied to other events in higher incidence settings.

Whilst risk ratios and CLR-derived odds ratios did not indicate significantly higher or lower risk of attendance at either event, it is interesting to note that the risk ratio was less than one for the event that included pre-event PCR testing, but above one for the event that did not (when using unexposed cohort A). The policy to test attendees prior to one of the test events may have led to individuals who were presymptomatic not attending, which might in turn have reduced the number of cases detected after the event. This may lead to an apparent protective effect of event 2. This apparent effect may also be magnified by the higher background incidence during event 2, compared to event 1. We also note that, while there is a large apparent difference in relative risk, in terms of absolute risk there is only an additional 30 per 100,000 risk associated with event 1, and a reduction of 70 per 100,000 in risk associated with event 2. These numbers are of the same order of magnitude as the background incidence at the time, which was negligible in the wider context of the pandemic.

With respect to the matching methodology, this approach is applicable to international public health agencies with access to large-scale registers of health care users, equivalent to WDS. We have also demonstrated two strategies for linkage, using household structures as a linking variable in addition to traditional scalar demographic variables. The lack of statistically significant results in the presented data makes it difficult to derive conclusions, but further comparisons with other datasets will provide more information on the added benefit of household matching. The stringent matching criteria introduced by the household matching meant that not enough cases survived selection to be able to make any estimates of risk (see Supplementary Material for an attempt to overcome this issue). Comparing risk ratios with odds ratios derived from CLR, we find very little difference in the results, suggesting that there is little impact of matching bias, at least for individually matched cohorts. Supplementary analysis suggests there may be greater impact of matching bias when using household matching, but clear conclusions are difficult to draw (see Supplementary Material).

Despite the numbers of cases being small, the qualitative data collected for validation indicated that behaviours surrounding an event may be pertinent to transmission. Behavioural observations at some of the other test events in Wales have demonstrated that adherence to COVID-safe behaviours appears to break down in specific situations [27], such as where there is a lack of structured support (lack of signage, lack of stewards) or at physical crunch points where flow is disrupted and bottlenecks emerge (such as entry and exit). Since the test events took place, incidence of COVID-19 has risen in the UK, and a number of large events have been implicated in large numbers of cases. In particular, the Boardmasters Festival in August 2021 is believed to have contributed to approximately 4,700 COVID-19 cases [28]. The demographic of this event was largely young adults, many of whom were unvaccinated. In addition, the size and duration of the event, the living arrangements on the site and the shared modes of transport used are likely to contribute to additional risk associated with the event. The methods demonstrated in this paper would have high utility in quantifying the risk of COVID-19 at such events, to categorise risk across different types of events and (through use of contact tracing information) different types of exposures. This has the potential to provide important information which could be used alongside behavioural insights to inform public health advice to event organisers and potential or actual attendees.

We carried out work to explore possible study designs to inform this process, but also to test the feasibility of carrying out this type of analysis in response to outbreaks associated with future events, such as large sporting events or music festivals. In order to calculate risk it is necessary to be able to identify a denominator population and to be able to link cases to event attendance. This was possible for 'test events' as, with agreement from event organizers, a privacy statement was included with the terms and conditions of ticket purchase. The special arrangements for these events introduced a number of mitigations which would not be in place under normal circumstances. In order to repeat this work, for example in the context of an alert level zero (where many restrictions would be relaxed) or during a period of higher incidence, seeking a similar mechanism for obtaining consent for data sharing would be valuable. This method could also be applied to any infectious disease risk at a large event with known attendance.


Limitations

It is not possible to say whether cases associated with test events acquired or transmitted infection while attending a specific event. The events are likely to have incurred increased use of hospitality venues outside the event venue and increased social mixing due to the easing of lockdown measures during this period. The behaviour surrounding attendance at events, rather than attendance itself, may have contributed to Wales' rising rate in this period, and further research is needed in this area.

Post-event test results were only obtained for a very small percentage of total attendees. Our case definition included only cases who were tested with PCR tests. Testing by lateral flow devices was not considered. Hence, the majority of asymptomatic SARS-CoV-2 infections would not be detected by the present method.

Another limitation is the consistency of the attendee data. The organizers of event one collected details of ticket buyers only, whereas the organizers of event two collected details of all ticket holders. Furthermore, the event one organizers also collected information on staff and hospitality, whereas no such details were recorded for event two. The practice of selling tickets on may also result in misclassification of individuals as having attended or not attended an event, if data are based on ticket sales. Furthermore, for both events, recorded ticket holders may not have attended due to a positive test result. If the presented method is to be scaled up to cover a wider range of events, harmonization of attendee data will become a significant issue.

Practical limitations are also a major consideration. To run test events, personal identifiers need to be gathered routinely to allow effective epidemiological investigation. This represents a significant operational investment and cost on the part of the event organizers, in addition to privacy concerns, which present barriers to carrying out routine surveillance of this nature. Despite these limitations, the methodology presented here could be usefully applied to events in periods of rising incidence, and to events with a higher risk, for example music festivals, so that with increased application and further improvements to the methodology, advice could be honed to specific event types.

Mass events provide an excellent opportunity for operational public health research [29]. While the present analysis cannot identify whether the events directly contributed to increased risk, future work could explore this question further through mediation analysis [30]. Propensity scores will also be useful in this context, to control for the probability of certain demographic characteristics being associated with attendance at an event [31]. Identifying matched controls using propensity scores may improve identification of controls and possibly overcome the limitations experienced when using the household group matching criteria. There is also potential to combine behavioural observations with this analytical epidemiological approach to give a more precise understanding of real risk at mass events. The present analysis does not differentiate between infection risk from attendance at the event and that of associated activities around the event.


Conclusions

We demonstrate the potential for using population data science methods to inform policy. We conclude that, at the point in the pandemic when this analysis was carried out, and with the mitigations that were in place, attending large sporting events did not significantly increase risk of COVID-19. However, these analyses were carried out between epidemic waves when background incidence was low, and need to be repeated during periods of higher incidence. This method could be applied to any infectious disease risk at large events with available attendee details.


Statement on conflicts of interest

Daniel Rhys Thomas and Ashley Gould were members of a Welsh Government Advisory Group on Test Events. Daniel Rhys Thomas attended a test event as a guest, although this was not one of the test events in this analysis.


Comparisons between exposed and non-exposed cohorts

The two non-exposed cohorts were tested and their distributions visually inspected to ensure they are equivalent to each other on all match variables. χ² tests comparing the distribution of cohorts across the five matching variables (age, sex, MSOA, deprivation quintile and household cluster size) showed that neither non-exposed cohort was significantly different from the exposed cohort (Table S1).
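As an illustration of the kind of comparison described above, the following minimal sketch computes a Pearson χ² statistic for one matching variable across an exposed and a non-exposed cohort. It is not the code used in this analysis: the function name `chi_squared_homogeneity` and the age-band counts are hypothetical, and only the test statistic and degrees of freedom are computed.

```python
from collections import Counter


def chi_squared_homogeneity(exposed, non_exposed):
    """Pearson chi-squared statistic for one categorical matching variable.

    Compares the category distribution in two cohorts and returns
    (statistic, degrees_of_freedom).
    """
    counts_e = Counter(exposed)
    counts_n = Counter(non_exposed)
    categories = sorted(set(counts_e) | set(counts_n))
    n_e, n_n = len(exposed), len(non_exposed)
    total = n_e + n_n

    statistic = 0.0
    for cat in categories:
        col_total = counts_e[cat] + counts_n[cat]
        for observed, row_total in ((counts_e[cat], n_e), (counts_n[cat], n_n)):
            # Expected count if both cohorts shared one distribution.
            expected = row_total * col_total / total
            statistic += (observed - expected) ** 2 / expected
    # Two cohorts, so degrees of freedom = (2 - 1) * (categories - 1).
    return statistic, len(categories) - 1


# Hypothetical age-band data, for illustration only.
exposed = ["18-29"] * 30 + ["30-49"] * 50 + ["50+"] * 20
well_matched = ["18-29"] * 30 + ["30-49"] * 50 + ["50+"] * 20
poorly_matched = ["18-29"] * 10 + ["30-49"] * 40 + ["50+"] * 50

stat_matched, df = chi_squared_homogeneity(exposed, well_matched)
stat_unmatched, _ = chi_squared_homogeneity(exposed, poorly_matched)
print(f"well matched:   chi2 = {stat_matched:.2f}, df = {df}")
print(f"poorly matched: chi2 = {stat_unmatched:.2f}, df = {df}")
```

A p-value would then be obtained from the χ² distribution with the returned degrees of freedom, for example with a statistics library. A statistic near zero, as for the well-matched cohort here, corresponds to the "not significantly different" result sought when checking matching quality.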

Visual inspection of the distributions of age, sex, deprivation quintile, MSOA and size of household cluster shows they are equivalent to those of the exposed cohorts (Figure S3). Note that non-exposed cohort A was not matched on household cluster, hence its distribution of household cluster size noticeably deviates from that of the exposed cohort, compared to non-exposed cohort B.

Figure S3: Distributions of matching variables for each cohort, for each event


Risk analysis without sufficient cases in unexposed cohort B

Using the unexposed cohort B was more problematic due to failure to find cases to include in all groups. A risk ratio could not be calculated for event 1 due to the absence of any non-exposed cases. Similarly, CLR could not be calculated for unexposed cohort B for either event. An attempt was therefore made to estimate these test statistics using pair-substitutions.

For event 1, RR and CLR were not possible due to no non-exposed cases. A paired substitution was performed, swapping the one case from unexposed cohort A for a non-case in unexposed cohort B that was matched on all possible matching variables (age, sex, MSOA and deprivation quintile). An approximate risk ratio of 2.47 (0.15-39.5) was obtained. For event 2, CLR was not possible due to no matchable exposed cases. A paired substitution was performed, swapping the one unmatchable exposed case for a non-case that is matchable to unexposed cohort B, which was matched on all possible matching variables (age, sex, MSOA and deprivation quintile).

Table S2: Results of RR and CLR analysis of unexposed cohort B using paired-substitution approximations

Column headings: Event; Risk ratio and risk difference. Surviving entry: OR† = 1.50 (0.14-16.5); p = 0.74

* No non-exposed cases were identified for unexposed cohort B. Therefore, an approximation of the test statistic was obtained by finding the closest matching non-exposed non-case in unexposed cohort B, on all possible matching variables, to the 1 non-exposed case in unexposed cohort A, and re-labelling them as a case.

† No exposed cases were matchable on all matching variables in unexposed cohort B. Therefore, an approximation of the test statistic was obtained by finding the closest matching exposed non-case, on all possible matching variables, to the 1 exposed case, and re-labelling them as a case.

The estimated ORs deviated slightly more from the RRs, compared to unexposed cohort A (see main text): 3.00 (95% CI: 0.19-48.0; p = 0.44) for event 1 and 1.50 (95% CI: 0.14-16.5; p = 0.74) for event 2. The difference in CLR results may indicate that risk ratios are more subject to matching bias in unexposed cohort B, because of the additional household clustering strategy. However, these results are confounded by the paired-substitution approximation, so direct comparisons between the two sets of results are difficult.
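To make concrete why a risk ratio cannot be computed when one group has no cases, and what re-labelling a single non-case does, the sketch below computes a risk ratio with a Wald confidence interval on the log scale. This is an illustrative assumption: the source does not state which interval method was used, and the cohort sizes (`N_EXPOSED`, `N_UNEXPOSED`) are hypothetical values chosen only so that the ratio equals 2.47.

```python
import math


def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed, z=1.959964):
    """Risk ratio with a Wald confidence interval on the log scale.

    Raises ValueError when either group has no cases, because the ratio
    (or its logarithm) is then undefined.
    """
    if cases_exposed == 0 or cases_unexposed == 0:
        raise ValueError("risk ratio is undefined with zero cases in a group")
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    se_log_rr = math.sqrt(
        1 / cases_exposed - 1 / n_exposed + 1 / cases_unexposed - 1 / n_unexposed
    )
    margin = z * se_log_rr
    return rr, rr * math.exp(-margin), rr * math.exp(margin)


# Hypothetical cohort sizes, not taken from the study.
N_EXPOSED, N_UNEXPOSED = 1000, 2470

# With no cases in the unexposed cohort the ratio cannot be computed.
try:
    risk_ratio(1, N_EXPOSED, 0, N_UNEXPOSED)
    undefined_without_substitution = False
except ValueError:
    undefined_without_substitution = True

# Paired substitution: one unexposed non-case is re-labelled as a case.
rr, lower, upper = risk_ratio(1, N_EXPOSED, 0 + 1, N_UNEXPOSED)
print(f"RR = {rr:.2f} ({lower:.2f}-{upper:.1f})")
```

With a single case in each group the standard error of log(RR) is close to √2, so the interval spans roughly a factor of 16 either side of the point estimate. Under these assumptions the result is close to the width of the approximate interval reported above, which illustrates how little information a single re-labelled case provides.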



Drakesmith, M et al. International Journal of Population Data Science (2022) 6:3:08


Figure 1: (a) Age and sex distributions of event attendees. (b) Geographic distribution of event attendees


Figure 2: SARS-CoV-2 tests by date sample taken, in the period 14 days before to 14 days after each event


Figure S2: Flow chart of cohort selection for event 2


Table 1: Description of the two test events

| Event | Date | Venue | Capacity | Mitigations in place | Attendees recorded |
| 1 Premiership football play-off | 22 May 2021 | Liberty Stadium, Swansea | 3,000 | Social distancing at 2m, wearing face masks, temperature checks on arrival, controlled ingress and egress and selling tickets to household bubbles. No testing requirement. | Ticket buyers (not necessarily all ticket holders), plus staff and hospitality |
| 2 International men's football | 5 June 2021 | Cardiff City Stadium | 6,500 | As above plus PCR test up to 5 days prior to the event or home LFD test up to 24 hours prior to the event | All ticket holders. No staff or hospitality. |

Linking to SARS-CoV-2 test data

Using NHS number, individual event attendees were linked with Public Health Wales records of all people in Wales who have received a polymerase chain reaction (PCR) test for SARS-CoV-2. In the first instance, matching was performed using the NHS number obtained from stage 1. Where not available, matching was performed iteratively on combinations of other identifiers where available, in the following order: name, postcode and date of birth or age, where available.
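The stepwise linkage described here can be sketched as a deterministic match that tries NHS number first and then falls back to combinations of other identifiers. The sketch below is an assumption-laden illustration rather than the linkage code used in the study: the field names, the two fallback combinations, the normalisation rule and the example records are all hypothetical, and the test data are assumed to hold one record per person.

```python
def normalise(value):
    """Lower-case strings and drop whitespace so 'CF10 1AA' equals 'cf101aa'."""
    if isinstance(value, str):
        return "".join(value.lower().split())
    return value


def link_attendees(attendees, test_records):
    """Link attendees to person-level test records.

    Tries NHS number first, then each fallback combination in order.
    Returns {attendee index: (matched record, key combination used)}.
    """
    # Assumed key combinations; the source gives only the identifier order.
    key_sets = [
        ("nhs_number",),
        ("name", "postcode", "date_of_birth"),
        ("name", "postcode", "age"),
    ]

    # Build one lookup per key combination, skipping incomplete records.
    indexes = []
    for keys in key_sets:
        index = {}
        for record in test_records:
            values = tuple(normalise(record.get(k)) for k in keys)
            if all(v is not None for v in values):
                index.setdefault(values, []).append(record)
        indexes.append(index)

    links = {}
    for i, attendee in enumerate(attendees):
        for keys, index in zip(key_sets, indexes):
            values = tuple(normalise(attendee.get(k)) for k in keys)
            if any(v is None for v in values):
                continue  # identifier missing, try the next combination
            candidates = index.get(values, [])
            if len(candidates) == 1:  # accept only unambiguous matches
                links[i] = (candidates[0], keys)
                break
    return links


# Entirely fictitious records, for illustration only.
attendees = [
    {"nhs_number": "A1", "name": "Alex Jones", "postcode": "CF10 1AA"},
    {"nhs_number": None, "name": "sam evans", "postcode": "sa1 2bb",
     "date_of_birth": "1990-05-05"},
    {"nhs_number": None, "name": "Chris Lee", "postcode": "LL1 3CC", "age": 40},
    {"nhs_number": None, "name": "No Match", "postcode": "NP1 4DD"},
]
test_records = [
    {"nhs_number": "A1", "name": "Alex Jones", "postcode": "CF10 1AA",
     "date_of_birth": "1980-01-01"},
    {"nhs_number": "B2", "name": "Sam Evans", "postcode": "SA1 2BB",
     "date_of_birth": "1990-05-05"},
    {"nhs_number": "C3", "name": "Chris Lee", "postcode": "LL13CC", "age": 40},
]

links = link_attendees(attendees, test_records)
for i, (record, keys) in sorted(links.items()):
    print(i, record["nhs_number"], "matched on", keys)
```

Accepting only unambiguous matches is a conservative design choice in this sketch: an attendee whose identifiers match more than one person is left unlinked rather than risk a false link.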




Government Test Event Series. Your personal data may be shared with Public Health Wales and linked to routine data on COVID-19 test results for the purposes of evaluating the test event. Public Health Wales is a NHS Trust and data will be held in accordance with the General Data Protection Regulations and the usual NHS information governance conditions. For further information on how we use your data please see Public Health Wales privacy notice: https://phw.nhs.wales/use-ofsite/privacy-notice"


Table 2: Numbers of linkable attendees, proportions of positive SARS-CoV-2 test in exposed and non-exposed groups and associated risk ratios following each event
Event AttendeesAttendeesNumber ofNumber ofProportion with a positive SARS-Risk r