Introduction

Incidence and prevalence are standard epidemiological indicators, monitored to understand disease dynamic within society [1, 2]. In the case of infectious diseases, it is customary to measure how far an epidemic is from eradication by calculating yet another epidemiological parameter, the basic reproduction number, denoted by R0 (e.g.,, pp. 4–5). By definition, R0 represents the average number of secondary cases caused by a typical infectious individual in a fully susceptible population. If R0 is larger than 1, then an outbreak becomes an epidemic; otherwise, it goes extinct. If the population is not fully susceptible, then one calculates the effective reproduction number, denoted by R [4, 5]. However, neither R0 nor R is regularly monitored by public-health authorities (e.g.,, pp. 39–65).

R0 depends on a number of factors pertaining to the pathogen–host biology, such as pathogen transmissibility and the natural history of disease. In addition, R0 may depend on factors pertaining to host sociology, such as population density and social awareness about epidemics. All these factors may change with time. Changes in R0 may signal that the pathogen has become more transmissible, virulent or persistent in the population. They may also signal societal re-organization in response to the epidemic dynamic. Particularly important are changes in R, which, in addition to changes in R0, may also reflect changes in the susceptibility of the population. Monitoring changes in R may thus be instrumental in determining the success of public-health interventions such as mass vaccination [4, 7, 8].

An epidemiological situation that would benefit from R0 and/or R monitoring is that of a zoonotic pathogen repeatedly introduced in a population where it undergoes subcritical transmission. As the pathogen explores more and more hosts, the opportunity for mutations increases, with increasing chance for pandemic strains to occur. Possible applications of this scenario may be the cases of the pre-pandemic severe acute respiratory syndrome and Middle East respiratory syndrome.

Another application of R0 and/or R monitoring is to assessing the success of mass vaccination by surveying disease outbreaks before and after a major epidemiological event, such as implementation of mass vaccination and loss of confidence in vaccination programs. Surveillance and contact tracing provides a means of monitoring outbreaks by permanent registration of new cases. For example, monkeypox and non-zoonotic measles are potential candidates for eradication through vaccination. Monkeypox remains worrisome for the possibility that repeated introductions in the human population may yield novel strains of human poxes. Following vaccine licensing in 1963, measles incidence in the United States decreased by more than 95%. Nevertheless, recent R analysis shows the potential for measles to re-emerge and emphasizes the importance of continued surveillance. In this work, we discuss monitoring R using outbreak-size surveillance data; only minor modifications are needed to apply our methodology to monitoring R0 of emerging infectious diseases.

There exist two major approaches to estimate R from outbreak-size data. In the first approach, R is estimated using the maximum likelihood method. Blumberg et al. [8, 15] used Galton–Watson branching processes to construct the likelihood of observed data consisting of outbreak sizes. By maximizing the likelihood function, they calculated R for monkeypox in the Democratic Republic of Congo (1980–1984) and measles in the United States (2001–2011). Jansen et al. used continuous-time Markov chains to construct the likelihood of observed data. By maximizing the likelihood function, they calculated R for measles outbreaks in the United Kingdom (1995–2002), advocating for an increase in R due to loss of confidence in the vaccination program.

The second approach is based on Bayesian inference, which requires a prior distribution describing the current knowledge on the parameter of interest (e.g., R) and the likelihood of observing the data set according to a model of choice. As a result, one computes a posterior distribution, representing an upgrade of the prior, according to the data. Hence Bayesian inference follows closely the principle of surveillance and learning processes.

Farrington et al. proposed two alternative ways to constructing the posterior probability of R based on (1) outbreak size and (2) outbreak duration. They estimated R for measles (1997–1999) outbreaks in the United States using data from 41 outbreaks caused by a single introduction. Angelov et al. estimated R from a set of clusters sizes, where each cluster consisted from a known number of outbreaks. Following a Bayesian approach, they calculated R for mumps for three different regions of Bulgaria (2005–2008). Yanev et al. described different Bayesian estimators under two different families of loss functions to study multiple outbreaks. They calculated R to describe transmission of smallpox in Europe (1960–1970). Prior knowledge on R was obtained from past data covering the period 1951–1960.

Previous literature focused on extracting a single R value from a given epidemiological dataset. In this work, we describe a methodology to estimate two time-ordered R values from the same set of surveillance data, in the regime of subcritical transmission (R<1); however, the generalization to multiple values is straightforward.

General theory

We first briefly present how to compute a single value of R from an epidemiological dataset, in the Bayesian paradigm. To express lack of knowledge about R, we use a non-informative improper prior π(R) = 1, where R is uniformly distributed from zero to infinity. We construct the likelihood L(T|R) that the dataset T is observed, assuming that the effective reproduction number is R. Given that the detected trees have sizes n1,…,nN, ordered according to the date of infection of index patients, we have [15, 19]

L(Τ|R)=∏i=1Np(ni,R)(1)

where p(ni,R) is the probability that the single transmission tree i has size ni. The posterior distribution π^(R|Τ) of R for the observed dataset T is calculated according to Bayes’ rule

π^(R|Τ)∝L(Τ|R)π(R)(2)

which yields

π^(R|Τ)=L(Τ|R)/∫0∞L(Τ|R)dR.(3)

The posterior distribution π^(R|Τ) was used to calculate the average effective reproduction number 〈R〉 and its 95% credible interval (CI).

The quality of R statistics depends not only on the number of trees but also on their sizes. In short, we summarize the amount of data using the concept of information. According to the definition (e.g.,, pp. 32), information is given by the logarithm of the probability to observe the given dataset T. In our case, the information, denoted by IT, is given by the natural logarithm of the likelihood L(T|R) (c.f. Eq 1)

IΤ=−lnL(Τ|R)(4)

so that information is measured in natural units (i.e., nat). Hence, information is presented as the sum of N contributions, one for each tree in T (i.e., an extensive quantity over the number of trees). For a numerical estimate of IT corresponding to the dataset, we used the average of R (i.e., 〈R〉) obtained from the Bayesian framework.

Monitoring the pandemic risk of emerging infectious diseases, one extracts at least two time-ordered R values from the same dataset T. Hence, the dataset T is divided into two time-ordered subsets Ta and Tb = T\Ta. The Bayesian approach described above (c.f. Eq 3) is then applied independently to each subset Ta,b to estimate two effective reproduction numbers Ra,b. Obviously, the estimates Ra,b depend on the selection of Ta,b. For example, if Ta is nearly all T then Rb is badly estimated.

To justify the choice of extracting two R values from T, we proceed as follows. We denote by HT the model with a single R estimate and by Ha,b the model with two R estimates. Given Ra,b, the likelihood for the dataset Ta∪Tb is given by L(Ta|Ra)L(Tb|Rb) [c.f., Eq 1 for L(T|R)]. We evaluate whether Ha,b is more plausible than HT by calculating the Bayes factor, using the corresponding likelihoods. When our initial beliefs are a priori equally probable, pr(Ha,b) = pr(HT), the Bayes factor

B(Τa)=L(Τa|Ra)L(Τb|Rb)L(Τ|R)(5)

expresses how well the observed data were predicted by Ha,b, compared to HT; i.e., the higher the value of B(Ta), the more is justified to extract two R values rather than one from T. Kaas et al. provided an interpretation of the strength of the second model Ha,b in terms of four categories according to the gradation of 2lnB(Ta). They suggested a very strong preference for Ha,b if 2lnB(Ta)>10. In our model, we make the same decision. We extract two Ra,b values when 2lnB(Ta)>10; otherwise, we extract a single value of R from T.

Our R analysis is performed according to the following steps. For every i = 1,…,(N−1):

Finally, we denote by Τa,b* the sets Ta,b which yield the largest value of 2lnB(Ta). If 2lnB(Τa*)>10, we accept 〈R〉a,b as the best estimates of effective reproduction numbers on Τa,b* and we denote them by 〈R〉*a,b; otherwise, we calculate a single R value on T using Eq 3.

Numerical tests using synthetic data

Synthetic data consisting of arrays of sizes of transmission trees were simulated, assuming that the number of cases caused by each infected individual is a Poisson deviate with average R. Tree sizes are random integers given by a distribution p(ni,R), obtained from the probability-generation function for the Galton–Watson branching processes according to Pitman. For the Poisson offspring distribution, we obtain

p(ni,R)=(niR)ni−1exp(−niR)ni!(6)

The synthetic dataset, consisting of sizes of N transmission trees with known R, was used to validate our methodology of estimating Ra,b. The analytical approach according to Eq 3 yields

〈R〉=1−1/n¯+1/(n¯N)(7)

where n¯ is the average size of the trees in T. The corresponding standard deviation

std(R)=〈R〉n¯N(8)

becomes negligible when N≫1. Eq 7 gives R = 1 if T consists of a single tree. For large tree number in T, the third term goes to zero and Eq 7 becomes identical to the formula derived from the maximum likelihood L(T|R) method.

Using Eq 7, we estimate the change of R in real time, as new epidemiological data become available. Suppose we add a new tree of size n to T. The corresponding change in R,

δR=n(1−1/N)−n¯n¯(n¯N+n)(9)

shows that, for a sufficiently large tree number in the dataset N≫1, the sign of the change in R is determined by the difference between the size n of the newly added tree and n¯.

We now proceed with the discussion of our simulations. In the first example, we generated a homogeneous dataset T, consisting of 100 trees with R equal to 0.6. Fig 1(a) shows 2lnB(Ta) as a function of information (c.f. Eq 4) in the first subset Ta. The observed maximum is ~3, indicating weak justification for calculating two R values. Hence we concluded that this dataset should be assigned a single R estimate.

In the second example, we assumed a stepwise increase of R, modeling adaptation of the pathogen to human-to-human transmission. We generated a non-homogeneous synthetic dataset where 50 trees were generated with Ra = 0.6, while the remaining 50 trees were generated with Rb = 0.85. The parameter 2lnB(Ta) (c.f. Fig 1(b)) reaches its maximum at the 55th tree. Contrary to the previous example, the observed maximum of 2lnB(Ta) is ~14, suggesting strong justification for calculating Ra,b. The best Ra,b estimates, denoted by R*a,b, have non-overlapping error bars (95% CI) and are in satisfactory agreement with the numerical choices for the parameters when 2lnB(Ta)>10 (c.f. Fig 1(c)).

In order to clarify the minimal size of the initial dataset required to get reliable estimations of Ra,b, we performed evaluations of R*a,b for statistically independent synthetic datasets T, containing from 2 to 200 trees. For each size of T, we averaged 120 independent realizations to alleviate stochastic effects. Fig 2(a) shows the average 2lnB(Τa*) and its 95% confidence interval as a function of the total number of trees in the homogeneous datasets, when R equal to 0.6. The values are below 10 and the featured dependence is not particularly sensitive to the number of trees in T.

The situation is quite different for non-homogeneous datasets. As an example, we generated a set of trees T with R equal to 0.6 and 0.85 for the first and second halves, respectively. The results are presented in Fig 2(b), where the average 2lnB(Τa*) increases with the number of trees in T. For a low number of trees (up to ~100 trees in T), the average of 2lnB(Τa*) is less than 10, indicating no preference to estimate two R values. For a high number of trees, the average of 2lnB(Τa*) is larger than 10, demonstrating that the model with two R values is superior. For small sizes of T (c.f. Fig 2(c)), average estimates of R*a,b display strong fluctuations. Starting with ~50 trees per dataset T, R estimates are neater. With further increasing the size of T, R estimates are more grouped around the exact values and the error bars (95% CI) are smaller.

Application to epidemiological data

We applied our method of evaluating two R values from an epidemiological dataset. We aimed to reproduce recent results obtained by Jansen et al., concerning the transmission dynamics of measles in the UK during 1995–2002. Although measles-elimination programs were set worldwide, measles eradication has not yet been achieved. In the late nineties, the safety of a combined measles-mumps-rubella (MMR) vaccine became controversial, which resulted in decreased uptake of the MMR vaccine after 1998. As a consequence, measles outbreaks increased in sizes.

Jansen et al. calculated two effective reproduction numbers R for the UK, regarding the periods 1995–1998 and 1999–2002, respectively. They found that R increased significantly, from 0.47 to 0.82. The epidemiological data consisted of measles cases grouped into outbreaks of size 2 or more. We accounted for the left censoring of the outbreak-size data by renormalizing the probability of observing outbreaks p(ni,R) as p(ni,R)/(1−p(1,R)), where Eq 6 provided the probability model. The results are presented in Fig 3. The parameter 2lnB(Ta) shows a marked maximum (c.f., Fig 3(a)) at the 35th outbreak, the latest one of 1998. The magnitude of 2lnB(Ta) is ~14, justifying the split of the 1995–2002 dataset into two separate subsets for the periods 1995–1998 and 1999–2002. Jansen et al. used the same split of the dataset, based on information about measles-vaccination coverage and the MMR controversy. Our analysis yielded the same conclusion, based on the measles outbreak data alone.

Fig 3(b) shows Ra,b (95% CI) as a function of 2lnB(Ta). Both Ra,b have nearly constant values for 2lnB(Ta)>10. For the maximum value of 2lnB(Ta)>10, we evaluated R*a,b as 0.54 (90%CI 0.43–0.66) and 0.86 (90%CI 0.79–0.93), respectively. The quantitative difference with the results by Jansen et al. [7, 21] (Ra = 0.47 (90%CI 0.36–0.55) and Rb = 0.82 (90%CI 0.71–0.87)) is not significant (pa = 0.92 and pb = 0.96 for Ra,b, respectively) and is explained by the choice of the offspring distribution (i.e. the Poisson or the geometric offspring distribution) rather than by the difference in the methods of data analysis.

Discussion

We propose a method to monitor the effective reproduction number of infectious diseases with sub-threshold transmission. The method may apply to alert and surveillance systems of diseases emerging and/or re-emerging from natural reservoirs. In this case, monitoring an increase in R0 and/or R may be used to determine the implementation of public-health intervention. The method may also be used to assess the effectiveness of public-health programs designed for disease elimination. In this case, evaluating R before and after implementing intervention may confirm the performance of the public-health program.

We validated our method using synthetic data, of which we presented a few simulations. We also reanalyzed data previously studied by Jansen et al. [7, 21], concerning the transmission of measles in the UK during 1995–2002. While our R findings are similar, we also extracted the time when rumors on the MMR vaccine started to affect measles vaccination from the epidemiological data. Of note, in a previous publication, Blumberg et al. analyzed the transmissibility of measles in the US (1997–1999) and Canada (1998–2001), estimating two values of R from two distinct datasets. However, further epidemiological assumptions are required for a direct comparison of R resulting from the two analyses.

Our model has several limitations. Epidemiological data are often meant for estimating incidence or prevalence (e.g., [1, 2], pp. 39–65). However, our study requires a particular type of longitudinal data, consisting of outbreak sizes where the outbreaks are ordered according to the date of infection of index patients. In addition, our model implies the appearance of every new index case only after the termination of an outbreak caused by the previous index case (i.e., we assume a sequence of non-overlapping in time transmission trees). These data may result from close surveillance of emerging or re-emerging infections. However, outbreaks may be clustered in time and space and difficult to tell apart. If the number of index patients in each cluster is determined, the model could be amended by using the Borel–Tanner distribution for p(ni,R) in Eq 6.

Our model provides a scheme to monitor R in the regime of subcritical transmission (i.e., R<1); other methods may be used for estimating R in the regime of supercritical transmission. The quality of R estimation in our model depends on the number of outbreaks; see Fig 3c. Therefore the number of changes in R that can be estimated from a dataset depends on the number of outbreaks. We addressed the case of detecting a single change in R by solving a one-dimensional maximization problem for the Bayes factor B(Ta) (c.f., Eq 5). It is no coincidence that (global) maximization problems are divided between one- and multi-dimensional, the first class of problems being significantly easier. However, a number of numerical algorithms are readily available to address maximization in several dimensions. In our case, such algorithms would have to face additional difficulties, inherent to the stochastic nature of the data modeling.

In conclusion, our work proposes a novel method to monitor changes in the effective reproductive number from an epidemiological dataset consisting solely of outbreak sizes.