A new phylodynamic model of Mycobacterium bovis transmission in a multi-host system uncovers the role of the unobserved reservoir

Multi-host pathogens are particularly difficult to control, especially when at least one of the hosts acts as a hidden reservoir. Deep sequencing of densely sampled pathogens has the potential to transform this understanding, but requires analytical approaches that jointly consider epidemiological and genetic data to best address this problem. While there has been considerable success in analyses of single species systems, the hidden reservoir problem is relatively under-studied. A well-known exemplar of this problem is bovine Tuberculosis, a disease found in British and Irish cattle caused by Mycobacterium bovis, where the Eurasian badger has long been believed to act as a reservoir but remains of poorly quantified importance except in very specific locations. As a result, the effort that should be directed at controlling disease in badgers is unclear. Here, we analyse densely collected epidemiological and genetic data from a cattle population but do not explicitly consider any data from badgers. We use a simulation modelling approach to show that, in our system, a model that exploits available cattle demographic and herd-to-herd movement data, but only considers the ability of a hidden reservoir to generate pathogen diversity, can be used to choose between different epidemiological scenarios. In our analysis, a model where the reservoir does not generate any diversity but contributes to new infections at a local farm scale are significantly preferred over models which generate diversity and/or spread disease at broader spatial scales. While we cannot directly attribute the role of the reservoir to badgers based on this analysis alone, the result supports the hypothesis that under current cattle control regimes, infected cattle alone cannot sustain M. bovis circulation. Given the observed close phylogenetic relationship for the bacteria taken from cattle and badgers sampled near to each other, the most parsimonious hypothesis is that the reservoir is the infected badger population. More broadly, our approach demonstrates that carefully constructed bespoke models can exploit the combination of genetic and epidemiological data to overcome issues of extreme data bias, and uncover important general characteristics of transmission in multi-host pathogen systems.


Author summary
For single host pathogens, pathogen genetic data have been transformative for understanding the transmission and control of many diseases, particuarly rapidly evolving RNA viruses. However garnering similar insights where pathogens are multi-host is more challenging, particularly when the evolution of the pathogen is slower and pathogen sampling often heavily biased. This is the case for Mycobacterium bovis, the causative agent of bovine Tuberculosis (bTB) and for which the Eurasian badger plays an as yet poorly understood role in transmission and spread. Here we have developed a computational model that incorporates M. bovis genetic data from cattle only with a highly abstracted model of an unobserved reservoir. Our research shows that a model in which the reservoir does not contribute to pathogen diversity, but is a source of infection in spatially localised areas around each farm, better describes the patterns of outbreaks observed in a population-level sample of a single M. bovis genotype in Northern Ireland over a period of 15 years, compared to models in which either the reservoir has no role, disease spread is spatially extensive, or where they generate considerable diversity on their own. While this reservoir model is not explicitly a model of badgers, its characteristics are consistent with other data that would suggest a reservoir consisting of infected badgers that contribute substantially to cattle infection, but could not maintain disease on their own.

Introduction 1
The analysis of high throughput genome sequence data for Mycobacterium bovis has 2 already generated important insights into the relative roles of direct transmission and 3 other mechanisms in the maintenance of bTB in cattle [1]. Central to this problem is 4 the well-documented involvement of the Eurasian badger (Meles meles) in the 5 persistence and spread of bTB. While it is known that badgers contribute to infection in 6 cattle, the relatively poor and biased data available regarding their contribution mean 7 that their importance to the problem remains poorly understood, a problem shared 8 with many other multi-host pathogen systems. Previous analyses have built on small 9 datasets, or used analytical tools based on evolutionary models (e.g. [2]) which, while 10 providing useful insight [3][4][5][6], have only limited ability to exploit the much richer data 11 available on the contact patterns recorded for the cattle population involved. 12 Here, we exploit these data in a agent-based simulation, using a partial-likelihood 13 fitting approach based on a measure previously developed to fit animal-level 14 transmission patterns to summary measures of herd outbreaks [7]. We compare models 15 where the diversity patterns are generated (i) by cattle only, (ii) by cattle with a passive 16 reservoir that produces minimal additional phylogenetic diversity and (iii) cattle plus an 17 active reservoir, generating diversity consistent with the observed bacterial mutation 18 rates. Considering also the spatial extent of reservoirs (locally around each farm or 19 across all farms), we fit the models to a previously described M. bovis dataset [3]. Our 20 analyses show a substantial preference for a model that includes a reservoir with only 21 short range interactions, and consistent with (ii) above, transmits the most recently 22 available genetic type back to the cattle population.   [11,12]. Isolates are stored frozen and are available for re-culture to 29 extract further DNA for sequencing. As described previously [3] The data describing the location of the herds consist of the (anonymised via rotation, 43 scaling and translation so that relative distances remained intact) x-y locations of 52 44 farms from which cultured samples were taken. The movement of cattle into or out of 45 this network extended the number of farms in our dataset to 21,012.

46
Because herd population data were not available for the entire period over which 47 bacterial samples were taken, we develop a parameterisation of the herd population 48 relevant to our study and use this to populate our simulations. The size of each herd 49 was recorded on the 1st of January and 1st of July of each year from 2003 to 2012 and 50 we draw the initial herd size for our simulations from this distribution of herd sizes. The 51 herd sizes in our available data were not found to change substantially over time (data 52 not shown) so this simplifying assumption is not thought to substantially affect the 53 results. Each herd is subjected to an annual whole herd test, and failing herds are  The phylogenetic dataset has previously been described in [3]. In brief, high-density 66 whole genome sequencing (WGS) was performed on 145 (139 cattle and 6 badger) 67 VNTR-10 isolates. Some badger isolates were available from a survey of badgers killed 68 on the roads in Northern Ireland [8] [9]. However as these few isolates are likely to be 69 only a small proportion of the total infected badgers in the area, are unlikely to be 70 representative, and the locations of where the badger carcasses were found could not be 71 verified, they were deemed unsuitable for inclusion in any further population-wide 72 analysis (though we note the close phylogenetic relationship between the badger-derived 73 samples and nearby ones from cattle [3]). Pairwise single nucleotide polymorphism 74 (SNP) differences between sequenced samples were recorded and a histogram of these 75 SNP differences generated as the basis of our further analysis, Fig. 1.

77
As in our previous study [7], we consider a simple four state model for the transmission 78 of bTB where cattle are either susceptible (S C ), exposed (E C ), test-sensitive (T C ), or 79 Fig 1. Phylogeny of VNTR-1 and -10 isolates (from [3]) showing the distribution of SNPs in sampled cattle. In this paper we have ignored the samples from badgers but show them here to indicate the close phylogenetic relatonship to the cattle isolates.
infectious (I C ). The tuberculin test, used in the UK, is based on a cow's response to the 80 invading M. bovis and it is assumed that animals in the exposed stage have not yet 81 mounted a sufficient response to be detected. Thus for the purposes of this model, any 82 test on animals in this stage will return a negative result. Infections in both the 83 test-sensitive and infectious stages are detectable. Once an animal becomes infectious, it 84 remains so until it is detected, at which point the animal would be culled. In addition 85 we create an infectious 'reservoir' population which generates an infection pressure and 86 (potentially) generates additional genetic diversity in the bacterial population. Further 87 we extend this to also consider susceptible (S r ) and infected (I r ) reservoir locations.

88
Once infected, animals progress from the exposed to test-sensitive stage at a rate σ 89 and from the test-sensitive state to infectious at a rate γ. Cattle-to-cattle transmission 90 is at a rate βS C I C (where S C ,I C , are the numbers of cattle in these states) and we 91 separate the cattle-reservoir and reservoir-cattle transmission rates as β Cr and β rC 92 respectively. 93 We can write the model excluding cattle movements between farms as a set of Ordinary Differential Equations (ODEs) in the forṁ where the sums are over all reservoirs connected to each farm or all the farms connected 94 to each reservoir. We assume that reservoir populations cannot infect each other. Also, 95 once infected, reservoirs remain infected for the duration of the simulation. This could

100
We solve the model using Gillespie's τ -leap method [13] with a fixed τ of 14 days. In 101 each 14 day step we calculate the numbers of animals whose disease status will have 102 changed and update them accordingly, perform any whole herd tests (WHT), maintain 103 a record of scheduled WHTs (including any follow-up tests from failed WHTs in the 104 current period), move animals between farms, and update the transition kernel for the 105 Gillespie algorithm in the next time step.

106
Movements To reduce computational costs, we only create agents for infected cows, 107 as the number of cattle moved between farms is typically few compared to the herd size, 108 and the susceptible fraction in a herd is high. We assume that animals born into the 109 herd, replacing animals moved out of the herd and thus keeping the herd size constant, 110 are susceptible and thus not tracked so we do not explicitly simulate birth and death 111 processes. This simplifying assumption did not give results that differed significantly 112 from less computationally efficient simulations where the size of each herd, births and 113 deaths were explicitly tracked (not shown). As we do not know the the force of infection 114 associated with each reservoir we apply the assumption that it is constant once the 115 reservoir begins to harbour infection. We keep a record of each farm (herd) and its 116 associated reservoirs in memory.

117
Every movement of cattle that passes through the VNTR-10 herds was recorded and 118 used to create a distribution of farm-to-farm movements as well as a distribution of the 119 number of animals moved for each farm. The number of animals moved in any 14 day 120 period is uniformly distributed over the VNTR 10 outbreak (as consistent with what is 121 observed in the dataset of animal movements in Northern Ireland over the period 122 investigated) creating a total of 7749685 movements. In each time step we move a 123 constant number of cattle (19108) so that the total moved in the simulation is 7749685 124 using the following algorithm:     Here we assume that the patterns do not change over timescales relevant to 139 transmission and evolution or in a way that substantially influences the metrics used (i.e. 140 the actual individual moves will be different, but not the overall pattern characteristics   Seeding the model We use the test histories of all herds observed to be part of the 163 VNTR-10 outbreak and determine those animals that had a probability of being in an 164 infected state at the start of the outbreak using the methods outlined in [1]. This  observed 60 days apart. We ignore animal births as we assume that calves are born free 178 of the disease and enter the susceptible population that is not (explicitly) tracked.

179
Hidden reservoir We incorporated three different models for the hidden reservoir; 180 none (there was no reservoir and the epidemic was driven by cattle to cattle 181 transmission and movements only), considering both a single reservoir that is connected 182 to every farm in the network, and reservoirs that had a radius of 2km (similar to the 183 expected home ranges of badgers) so farms that were less than 4km apart could share 184 one or more reservoirs (see Fig 5). Our dataset did not contain location data for those 185 uninfected farms that were connected by movements so we assigned a reservoir to each 186 of these but did not include overlaps with other reservoirs in the simulations. This 187 resulted in a network with a subset of farms that were connected via a reservoir. For 188 comparison, we also created a second network where the distribution of connected farms 189 mimicked the distribution of connections that include farms for which we did not have 190 locations. We refer to this as our "synthetic"network.   simulations. The distribution p 1 , p 2 , . . . , p n is shown in fig. 2[left]. We can write our

233
[partial] likelihood function as where n is to the total number of observed SNP differences. estimates [15][16][17][18] where they existed. The substitution or SNP creation rate is informed 246 by our previous estimate [3]. For the transmission rates, rate of the pathogen and test 247 sensitivity, we used non-informative priors, i.e. uniform priors with a large range. 248 We calculated the AIC score of each of our 5 models to determine which model best 249 describes the observed diversity (for our final set of models presented here, each model 250 contains the same number of parameters so in comparing the AIC scores of each model 251 we are indirectly comparing their log-likelihoods). According to the AIC scores the 252 preferred model that best described the observed data was one that generated no 253 additional diversity within the reservoir population (the minimal diversity model) (Table 254 3).

255
Considering the univariate parameter distributions in the posterior, the length of the 256 exposed stage (i.e. 1/σ) was estimated to be 14.3 days (with lower and upper quartiles 257 5.5-16.1 days) in line with previous estimates [1,7]. This is towards the longer duration 258 of the prior for this value. The length of the test-sensitive stage (i.e. 1/γ) was estimated 259 to be 155.7 days (with lower and upper quartiles 147.1 and 169.5 days, respectively) 260 also similar to previously published estimates [1,7,14,15]. Estimates for the sensitivities 261 of the whole herd test test, 59% (with a 95% credible interval of 52-63%), are also 262 consistent with previous observations [16][17][18] and simulations [7,14]. Neither the 263 Table 1. Summary of the priors used in the model. In the cases of β, α RC , α CR , µ we used non-informative priors whereas in the case of σ, γ, Ω the priors were chosen on the basis of existing field and experimental estimates [3,16,18] The best fit model is compared to the SNP distance histogram in illustrated in Fig. 266  2(a,b), showing considerable fidelity of the model to these data; the simulated 267 distribution displays both the bi-modal nature of the differences and matches the 268 observed distribution. Though there is evidence of a small amount of overdispersion for 269 large SNP differences while underestimating lower numbers of SNP differences, results 270 lie within the 95% CI's for most SNP differences compared. By comparison, a visual 271 inspection for the other models shows that, despite very similar univariate parameter 272 estimates (table 4), SNP distributions were clearly inferior when comparing to the data 273 (see Fig 3(a-d)).    approach [19,20]. R 0 was found to have a mean value of 1.89 with a standard deviation 278 of 1.04, as in Fig. 8 factor predicting future outbreaks [23,24], implying the existence of important farm-level 317 risk factors not captured in our data, and thus more consistent with a model where only 318 herds that have been infected at some points in our records have this localised risk.

319
Our metric is based solely on observations of genetic diversity and do not 320 incorporate epidemiological observations. While our results show that this alone is 321 sufficient to distinguish between importantly different models of transmission, our 322 estimate of transmission parameters depends on the ability to generate diversity.

323
Because in our preferred (minimum diversity) model the reservoir generates no 324 independent diversity, the ability of the model to infer transmission rates from the 325 reservoir is limited (see Fig. 8). Ongoing work involves extending the partial likelihood 326 approach to include these epidemiological observations, better exploiting the detailed 327  . All herd locations are translated to anonymise them. Herds within a 2km radius are assumed to be connected by a wildlife reservoir (left) while the other herds are joined, indirectly, through animal movements (uniformly distributed throughout the length of the simulation and between farms) or genetically (right).The determination of those herds that had a seperation of 4km was made before the location data was transformed. S4 Posterior kernel density estimates for the parameter distributions for the model. Here β, σ, γ are the transition rates in our model, Ω is the sensitivity of the routine and abattoir tests, β CR , β RC are the transmission rates from cattle to reservoir and reservoir to cattle respectively and µ is the number of new SNPs generated in the model per day. The horizontal scale on each figure corresponds to the priors used and were taken from existing field and experimental estimates [16,18]. In the case of the σ parameter our modelling suggests that the length of time an animal is in this stage is towards the longer prior estimate.