This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Parameter Estimation from ICC curves

Joceline Lega Department of Mathematics, University of Arizona, 617 N. Santa Rita Avenue, Tucson, AZ 85721
Abstract.

Incidence vs Cumulative Cases (ICC) curves are introduced and shown to provide a simple framework for parameter identification in the case of the most elementary epidemiological model, consisting of susceptible, infected, and removed compartments. This novel methodology is used to estimate the basic reproduction ratio of recent outbreaks, including the ongoing COVID-19 epidemic.

Key words and phrases:
Outbreak; Parameter Identification; Compartmental Models

1. Overview

This article introduces the concept of ICC (Incidence vs Cumulative Cases) curves, which are nonlinear transformations of the traditional ‘EPI curves’ (plots of disease incidence versus time) commonly used by epidemiologists. The main message of the present work is that describing an outbreak in terms of its associated ICC curve is not only natural from a dynamical point of view, but also advantageous for parameter identification and forecasting purposes. Specifically, we provide a method for the practical identification of epidemiological parameters of the SIR (Susceptible - Infected - Removed) compartmental model, directly from its associated ICC curve. Because the result is robust to under-reporting, this approach gives a simple way to estimate the basic reproduction ratio of a disease from reported incidence data. The present analysis also provides a theoretical justification for the forecasting methodology put forward in [1], which was shown to lead to useful estimates of the characteristics (expected duration, final number of cases, and peak intensity) of ongoing outbreaks.

We develop the concept of ICC curves in the context of the classical SIR model [2, 3], which is the simplest compartmental model that captures the basic phenomenon of disease transmission: before recovering (or being removed through disease-induced death), infected individuals transmit the disease to susceptible individuals, who in turn become infected. The process repeats itself until the epidemic has followed its course and no additional infections occur. Since many outbreaks have a time scale that is much shorter than the scale at which the total population NN changes, a common simplification is to assume that NN is constant. It is then natural to define the time-dependent quantities s=S/N,s=S/N, i=I/N,i=I/N, and r=R/Nr=R/N as the expected sizes of the susceptible (SS), infected (II), and recovered (RR) compartments relative to N,N, so that s+i+r=1.s+i+r=1. The resulting SIR model is then given by

dsdt=βsi;didt=βsiγi;drdt=γi,\frac{ds}{dt}=-\beta si;\quad\frac{di}{dt}=\beta si-\gamma i;\quad\frac{dr}{dt}=\gamma i,

where β\beta is the contact rate of the disease and γ\gamma is its recovery rate. Scaling time by τ=1/γ\tau=1/\gamma turns the above equations into a one-parameter (R0R_{0}) model, where R0=β/γR_{0}=\beta/\gamma is the basic reproduction ratio [4, 5, 6] of the modeled epidemic. The disease spreads if R0>1R_{0}>1 and dies out when R0<1.R_{0}<1. It is also clear from the right-hand side of the second SIR equation that for i0,i\neq 0, the proportion of infected individuals grows as a function of time as long as s>γ/β=1/R0.s>\gamma/\beta=1/R_{0}.

In a compartmental model, an outbreak is a trajectory that starts in the vicinity of the disease-free equilibrium and ends at the endemic equilibrium, when the number of infected individuals approaches zero. We define an ICC curve, where ICC stands for Incidence vs. Cumulative Cases, as a representation of the dynamics associated with an outbreak, in the plane of coordinates CC (cumulative cases) and \mathcal{I} (incidence). The following theorem, established in Section 2, gives an exact formulation of the ICC curves of the SIR model.

Theorem 1.

For a total susceptible population of size NN and each initial condition κ=S(0)N=1C(0)N\displaystyle\kappa=\frac{S(0)}{N}=1-\frac{C(0)}{N}, the ICC curve of the SIR model is given by

(1) Gκ,N(C)=β(C+NR0ln(1CN)NR0ln(κ))(1CN),G_{\kappa,N}(C)=\beta\left(C+\frac{N}{R_{0}}\ln\left(1-\frac{C}{N}\right)-\frac{N}{R_{0}}\ln(\kappa)\right)\left(1-\frac{C}{N}\right),

where C=I+R[0,C]C=I+R\in[0,C_{\infty}] and C,C_{\infty}, the final number of cases, is the positive solution of the transcendental equation

(2) C+NR0ln(1CN)NR0ln(κ)=0.C_{\infty}+\frac{N}{R_{0}}\ln\left(1-\frac{C_{\infty}}{N}\right)-\frac{N}{R_{0}}\ln(\kappa)=0.

Details of the derivation of this theorem are given in Section 2.1. Its significance is illustrated in Figure 1, which shows a numerical simulation of the SIR model and a plot of the corresponding ICC curve for R0=2R_{0}=2. Instead of considering the time dependence of SS, II, and RR as shown in the left panel of Figure 1, the course of the outbreak is captured by a single curve: a plot of incidence versus cumulative cases. As explained in Section 2.1, knowledge of the ICC curve is sufficient to reconstruct the temporal behavior of SS, II, and RR. In other words, the ICC curve contains the entire information associated with any outbreak described by the SIR model.

Refer to caption
Figure 1. (Color online). Simulation of the SIR model showing the time-dependence of the S, I, and R compartments (left), for R0=2R_{0}=2. The corresponding ICC curve is shown on the right. The exact expression given by Equation (1) is plotted as a solid black line; the numerically estimated ICC curve is shown as a thick cyan line.

An important motivation for studying ICC curves is that they can easily be fitted to case data. Specifically, we show in Section 3.1 that, for fixed NN, there exists a single set of parameters (β,γ,κ)(\beta,\gamma,\kappa) that minimizes the root mean square error between (noisy) simulated outbreak data and the ICC curve =Gκ,N(C){\mathcal{I}}=G_{\kappa,N}(C) given by Equation (1). Additionally, a numerical investigation indicates that such an estimation is robust to noise. As a consequence, the present approach provides a method to estimate the parameters of the SIR model, thereby making this model practically identifiable.

The rest of this manuscript is organized as follows. Section 2 presents the derivation of Equations (1) and 2, and discusses the robustness of ICC curves to systemic under- (or over-) reporting. Section 3 explains how model parameters may be retrieved from ICC curves and assesses the robustness of the proposed procedure to reporting noise. Section 4 briefly discusses parameter estimation as an outbreak unfolds. Section 5 shows a few examples of application to real data, for outbreaks of gastroenteritis and COVID-19. Future extensions and applications of the methodology proposed in this article are reviewed in Section 6.

2. ICC curves of the SIR Model

2.1. Proof of Theorem 1.1

Epidemiologists use the epidemiological (EPI) curve, a plot of incidence as a function of time, to visualize the course of an epidemic. In the SIR model, incidence is defined as

=βSIN=dCdt,{\mathcal{I}}=\beta S\frac{I}{N}=\frac{dC}{dt},

where C=R+IC=R+I is the cumulative number of cases at time t.t. Note that CC includes those who have recovered (and thus were a case at some point in time), as well as those who are currently infected. To derive Equation (1), write ds/dr=βs/γ=R0s,ds/dr=-\beta s/\gamma=-R_{0}\,s, which can be integrated to give s=κexp(R0r),s=\kappa\exp(-R_{0}\,r), where κ(0,1)\kappa\in(0,1) is a constant that depends on initial conditions. Typically, κ1\kappa\simeq 1 (see e.g. [7]) since for an outbreak in a naive population we normally have r(0)=0,r(0)=0, i(0)<<1,i(0)<<1, and thus κ=s(0)1.\kappa=s(0)\simeq 1. However here we do not make this assumption and keep the parameter κ\kappa in the following discussion. With c=C/Nc=C/N and s=1(i+r)=1c,s=1-(i+r)=1-c, we obtain

r=1R0ln(s/κ)=1R0ln((1c)/κ),r=-\frac{1}{R_{0}}\ln(s/\kappa)=-\frac{1}{R_{0}}\ln((1-c)/\kappa),

and consequently

dcdt=βis=β(cr)(1c)=β(c+1R0ln((1c)/κ))(1c).\frac{dc}{dt}=\beta is=\beta(c-r)(1-c)=\beta(c+\frac{1}{R_{0}}\ln((1-c)/\kappa))(1-c).

Moving back to dimensional variables gives the expression in Equation (1), i.e. =dC/dt=Gκ,N(C){\mathcal{I}}=dC/dt=G_{\kappa,N}(C).

An approximate expression of the SIR ICC curve, assuming κ=1\kappa=1 and C/NC/N sufficiently small was given in [1]. Since Gκ,N0G_{\kappa,N}\geq 0 on [0,C][0,C_{\infty}], for a given initial condition κ=1C(0)/N\kappa=1-C(0)/N, the part of the ICC curve traced out as the outbreak unfolds will be such that C(t)C(0)>0C(t)\geq C(0)>0. However, the ICC curve is well defined for CC in the range [0,C][0,C_{\infty}], which is the definition we adopt here. Even though the time variable does not appear in the ICC curve, the value of κ\kappa is not arbitrary: C(0)C(0) is the value of CC when R=0R=0, where RR represents the removed compartment.

ICC curves are particularly useful for forecasting the course of an outbreak because they contain information on quantities that are of interest to public health practitioners, namely the final number of cases (CC_{\infty}) and peak incidence (max{\mathcal{I}}_{max}). The latter is simply the maximum of Gκ,NG_{\kappa,N}; the former may be found as follows. Since CC is the cumulative number of cases, it is a monotonic function of time and its derivative, Gκ,N(C),G_{\kappa,N}(C), must remain non-negative. For C/NC/N small,

Gκ,N(C)=Gκ,N(0)+C(βγ+γln(κ))+𝒪(C2)G_{\kappa,N}(C)=G_{\kappa,N}(0)+C(\beta-\gamma+\gamma\ln(\kappa))+{\mathcal{O}}(C^{2})

where Gκ,N(0)=Nγln(κ)>0.G_{\kappa,N}(0)=-N\gamma\ln(\kappa)>0. By continuity, Gκ,N(C)G_{\kappa,N}(C) will remain non-negative on the interval [0,C],[0,\ C_{\infty}], where CC_{\infty} is the unique positive solution of Equation (2). By writing Equation (2) as N=C+κNexp(R0C/N),N=C_{\infty}+\kappa N\exp(-R_{0}C_{\infty}/N), where the right-hand-side is the sum of two non-negative monotonic functions of CC_{\infty}, one increasing and the other decreasing, and since 0<κ<1,0<\kappa<1, one can see that such a solution exists and is unique. Asymptotically, when the disease has followed its course and all infections have occurred, I=0I=0 and C=R.C=R. Then, S=κNexp(R0R/N)=κNexp(R0C/N)S=\kappa N\exp(-R_{0}R/N)=\kappa N\exp(-R_{0}C/N) and the conservation law N=S+I+RN=S+I+R becomes N=κNexp(R0C/N)+C.N=\kappa N\exp(-R_{0}C/N)+C. In other words, CC_{\infty} defined in Equation (2) is simply the final (asymptotic) number of cases associated with the outbreak. This concludes the proof of Theorem 1.

To see that the entire dynamics of an outbreak in the SIR model is captured by the associated ICC curve, note that knowledge of C(t)C(t) leads to knowledge of

S(t)=NC(t),R(t)=NR0ln(1κ(1C(t)N)),andI(t)=C(t)R(t).S(t)=N-C(t),\quad R(t)=-\frac{N}{R_{0}}\ln\left(\frac{1}{\kappa}\left(1-\frac{C(t)}{N}\right)\right),\quad\text{and}\quad I(t)=C(t)-R(t).

In other words, the time course of SS, II, and RR may be obtained from the solution of the differential equation dC/dt=Gκ,N(C)dC/dt=G_{\kappa,N}(C) prescribed by the ICC curve, with appropriate initial conditions (which in turn define the value of κ\kappa).

2.2. Robustness to under- or over-reporting

The quantity NN appearing in Equation (1) refers to the total population taken into account in the model, which is conserved by the dynamics. In case of over- or under-reporting, the observed incidence expectation o{\mathcal{I}}_{o} is a fraction of the actual incidence \mathcal{I}, so that o=α,{\mathcal{I}}_{o}=\alpha{\mathcal{I}}, where we assume that α\alpha is constant or varying slowly enough for its derivative to be negligible. The ICC curve estimated from case report data will then be a graph of o{\mathcal{I}}_{o} as a function of Co=0to(s)𝑑s=αC.\displaystyle C_{o}=\int_{0}^{t}{\mathcal{I}}_{o}(s)ds=\alpha\,C. We thus have

dodt\displaystyle\frac{d{\mathcal{I}}_{o}}{dt} =αddt=αGκ,N(C)=αβ(C+NR0ln(1CN)NR0ln(κ))(1CN)\displaystyle=\alpha\frac{d\mathcal{I}}{dt}=\alpha\,G_{\kappa,N}(C)=\alpha\beta\left(C+\frac{N}{R_{0}}\ln\left(1-\frac{C}{N}\right)-\frac{N}{R_{0}}\ln(\kappa)\right)\left(1-\frac{C}{N}\right)
=β(Co+αNR0ln(1CN)αNR0ln(κ))(1CN)\displaystyle=\beta\left(C_{o}+\frac{\alpha N}{R_{0}}\ln\left(1-\frac{C}{N}\right)-\frac{\alpha N}{R_{0}}\ln(\kappa)\right)\left(1-\frac{C}{N}\right)
=β(Co+NoR0ln(1CoNo)NoR0ln(κ))(1CoNo)=Gκ,No(Co),\displaystyle=\beta\left(C_{o}+\frac{N_{o}}{R_{0}}\ln\left(1-\frac{C_{o}}{N_{o}}\right)-\frac{N_{o}}{R_{0}}\ln(\kappa)\right)\left(1-\frac{C_{o}}{N_{o}}\right)=G_{\kappa,N_{o}}(C_{o}),

where No=αNN_{o}=\alpha N. Therefore, the ICC curve in the presence of over-reporting (α>1\alpha>1) or under-reporting (α<1\alpha<1) has the same functional form, with the same parameter values, as the true ICC curve, provided \mathcal{I}, CC, and NN are rescaled by the factor α\alpha. For the same reason, κ=S(0)/N\kappa=S(0)/N is also independent of the value of α\alpha.

While systematic under- or over-reportding does not affect the ICC curve provided the size of each compartment is appropriately rescaled, the reproduction ratio depends on NN through the ratio C/NC_{\infty}/N, as can be seen from Equation (2), in which setting κ=1\kappa=1 leads to

(3) R0CN+ln(1CN)=0R0=ln(1C/N)C/N.R_{0}\frac{C_{\infty}}{N}+\ln\left(1-\frac{C_{\infty}}{N}\right)=0\Longrightarrow R_{0}=-\frac{\ln\left(1-C_{\infty}/N\right)}{C_{\infty}/N}.

Finally, rescaling time amounts to rescaling β\beta and γ\gamma, and consequently the ICC curve.

3. Parameter identification

Structural identifiability (see e.g. [8] for a review) relates to the ability of uniquely identifying model parameters from knowledge of the model output. In this context, NN is assumed to be known since it represents the population of the system to which the model is applied. Given C(t),C(t), one can calculate (t)=dC/dt=Gκ,N(C(t)).{\mathcal{I}}(t)=dC/dt=G_{\kappa,N}(C(t)). From Equation (1), it is clear that knowledge of C(t)C(t) and (t){\mathcal{I}}(t) uniquely determines β,\beta, γ,\gamma, and κ.\kappa. Indeed, if two sets of parameters, (β(1),γ(1),κ(1))(\beta^{(1)},\gamma^{(1)},\kappa^{(1)}) and (β(2),γ(2),κ(2))(\beta^{(2)},\gamma^{(2)},\kappa^{(2)}), led to the same output C(t)C(t) (and thus to the same values of its derivative (t){\mathcal{I}}(t)), we would have

0=(β(1)β(2))CN+(γ(1)γ(2))ln(1CN)(γ(1)ln(κ(1))γ(2)ln(κ(2))),0=\left(\beta^{(1)}-\beta^{(2)}\right)\frac{C}{N}+\left(\gamma^{(1)}-\gamma^{(2)}\right)\ln\left(1-\frac{C}{N}\right)-\left(\gamma^{(1)}\ln(\kappa^{(1)})-\gamma^{(2)}\ln(\kappa^{(2)})\right),

for all values of CN[0,CN]\frac{C}{N}\in\left[0,\frac{C_{\infty}}{N}\right], which leads to β(1)=β(2)\beta^{(1)}=\beta^{(2)}, R0(1)=R0(2)R_{0}^{(1)}=R_{0}^{(2)}, and κ(1)=κ(2).\kappa^{(1)}=\kappa^{(2)}. Therefore, all of the parameters of the SIR model, including initial conditions, are structurally identifiable from knowledge of C(t),C(t), the cumulative number of cases, and NN, the size of the susceptible population. Similar results were obtained in [9, 10, 11]. If NN is unknown, setting

0=β(1)CN(1)\displaystyle 0=\beta^{(1)}\frac{C}{N^{(1)}} β(2)CN(2)+γ(1)ln(1CN(1))γ(2)ln(1CN(2))\displaystyle-\beta^{(2)}\frac{C}{N^{(2)}}+\gamma^{(1)}\ln\left(1-\frac{C}{N^{(1)}}\right)-\gamma^{(2)}\ln\left(1-\frac{C}{N^{(2)}}\right)
(γ(1)ln(κ(1))γ(2)ln(κ(2))),\displaystyle-\left(\gamma^{(1)}\ln(\kappa^{(1)})-\gamma^{(2)}\ln(\kappa^{(2)})\right),

for all values of CN[0,CN]\frac{C}{N}\in\left[0,\frac{C_{\infty}}{N}\right], additionally leads to N(1)=N(2)N^{(1)}=N^{(2)}, making the size of the population NN a structurally identifiable parameter as well.

Whereas structural identifiability depends on the nature of the model itself, practical identifiability relates to parameter identifiability in the presence of measurement noise, as well as (hopefully small) deviations between model and reality (see e.g. [8]). This question may be addressed numerically by simulating an exact solution of the model, adding noise to its output, and estimating the model parameters that best fit the perturbed output. Two recent articles [11, 12] discuss the practical identifiability of a variety of compartmental models and reach different conclusions regarding parameter identification of the SEIR model from cumulative data. In [11], the cumulative data is multiplied by 1+ϵ,1+\epsilon, where ϵ\epsilon is normally distributed with zero mean, whereas in [12], Poisson-distributed noise of mean equal to the current incidence value [13] is used as a substitute for the incidence data. Whether or not parameters are identifiable in practice therefore depends on the type of observational noise that affects surveillance reports and thus the EPI curve.

Adding independently-distributed noise to the incidence rather than the cumulative data is more realistic since errors on CC are considered to accumulate and not be independent [14]. On the one hand, assuming Poisson-type of uncertainty is natural for data that represent counts, but the effect of replacing incidence data by a Poisson random variable of same mean will lead to significant relative deviations only if NN is small (a few hundreds). For larger values of NN and thus of ,{\mathcal{I}}, the relative change δ=|Poisson()|/\delta{\mathcal{I}}=|{\mathcal{I}}-\hbox{Poisson}({\mathcal{I}})|/{\mathcal{I}} scales like 1/.1/\sqrt{\mathcal{I}}. Indeed,

δ\displaystyle\langle\delta{\mathcal{I}}\rangle =k=0kek!k+k=+1kek!k\displaystyle=\sum_{k=0}^{\mathcal{I}}\frac{{\mathcal{I}}^{k}e^{-{\mathcal{I}}}}{k!}\frac{{\mathcal{I}}-k}{\mathcal{I}}+\sum_{k={\mathcal{I}}+1}^{\infty}\frac{{\mathcal{I}}^{k}e^{-{\mathcal{I}}}}{k!}\frac{k-{\mathcal{I}}}{\mathcal{I}}
=k=0kek!k=01kek!+k=kek!k=+1kek!=2e!.\displaystyle=\sum_{k=0}^{\mathcal{I}}\frac{{\mathcal{I}}^{k}e^{-\mathcal{I}}}{k!}-\sum_{k=0}^{{\mathcal{I}}-1}\frac{{\mathcal{I}}^{k}e^{-\mathcal{I}}}{k!}+\sum_{k={\mathcal{I}}}^{\infty}\frac{{\mathcal{I}}^{k}e^{-\mathcal{I}}}{k!}-\sum_{k={\mathcal{I}}+1}^{\infty}\frac{{\mathcal{I}}^{k}e^{-\mathcal{I}}}{k!}=2\frac{{\mathcal{I}}^{\mathcal{I}}e^{-\mathcal{I}}}{{\mathcal{I}}!}.

With Stirling’s formula, we obtain

δ2e(2πe)1=2π as .\langle\delta{\mathcal{I}}\rangle\sim 2{\mathcal{I}}^{\mathcal{I}}e^{-\mathcal{I}}\left(\sqrt{2\pi{\mathcal{I}}}{\mathcal{I}}^{\mathcal{I}}e^{-\mathcal{I}}\right)^{-1}=\sqrt{\frac{2}{\pi{\mathcal{I}}}}\qquad\hbox{ as }{\mathcal{I}}\to\infty.

If it is believed that the diminishing relative influence of the Poisson-distributed noise for large values of NN is not realistic for a specific epidemic, then multiplying \mathcal{I} by (1+ϵ)(1+\epsilon) where ϵ\epsilon is normally distributed with mean zero, circumvents this issue. We consider both types of noise below.

Refer to caption
Figure 2. (Color online). Empirical ICC curves for two hypothetical outbreaks described by the SIR model with Poisson-distributed reporting noise. Incidence information is assumed to be available every δt\delta_{t}, with δt=1\delta_{t}=1 for R0=2R_{0}=2 and δt=2\delta_{t}=2 for R0=10.5R_{0}=10.5. In each panel, stars correspond to the simulated outbreak data plotted in the (CC, \mathcal{I}) plane. Data points without added noise (circles) and the exact ICC curve (solid line) are also shown.

To test practical identifiability, we simulate the SIR model and generate a discrete time series for the cumulative number of cases of the simulated outbreak. This time series is used as a reference and represents the unperturbed (or true) data. We then add noise to the associated discrete incidence data (defined as the difference of cumulative values between consecutive time points), following the procedure described in [13], with the additional realistic constraint that the final number of cases is the same for all noisy realizations of the same outbreak. Two examples of simulated outbreaks generated in this fashion are shown in Figure 2 for different values of R0R_{0} (associated with different orders of magnitude of the incidence \mathcal{I}) and different discretization time steps δt\delta_{t}. For R0=2R_{0}=2, we use δt=1\delta_{t}=1. For R0=10.5R_{0}=10.5, we use δt=2.\delta_{t}=2. As shown in Figure 2, this latter choice leads to a discretized ICC curve with very few points in regions of large incidence, and simulates a situation where incidence data is of poorer frequency. Each time series, consisting of noisy incidence data, is used to calculate an associated time series for the corresponding cumulative data. The resulting set of points is plotted in the (CC, \mathcal{I}) plane, and defines an empirical ICC curve. In both panels of Figure 2, CC and \mathcal{I} are estimated from discrete incidence data as described in Equation (5) below.

3.1. Fitting ICC curves to data

Given a simulated time series for the number of cases of an outbreak in a population of size NN, we find the parameters of the ICC curve that best fits the data by least square minimization. This approach is different from parameter identification methods that have been proposed in the literature until now. Indeed, traditional methods aim to find the best combination of model parameters consistent with observed time series; they therefore require numerical integration of the compartmental model whose parameters need to be identified, in order to estimate and then minimize an associated cost function (see for instance [13, 11, 12]). Instead, the approach we put forward here consists in finding model parameters by directly fitting the relevant ICC curve to the data. We claim that this is a more robust approach because the functional that needs to be minimized has a unique extremizer in parameter space, which can be calculated exactly from the data points.

Given a time series of discrete incidences {k,k=0,,M},\{{\mathcal{I}}_{k},k=0,\cdots,M\}, or of discrete cumulative cases {𝒞k=j=0kj,k=0,,M}\{{\mathcal{C}}_{k}=\sum_{j=0}^{k}{\mathcal{I}}_{j},k=0,\cdots,M\} recorded every δt\delta_{t} units of time, we introduce a cost function that measures how much the data points depart from the ICC curve of the SIR model parametrized by β,\beta, γ,\gamma, and p=ln(κ).p=\ln(\kappa). This function, e{\mathcal{E}}_{e}, is defined by

(4) e(β,γ,p)=i=1M(1N(ti12)G(β,γ,p,C(ti12)))2,{\mathcal{E}}_{e}(\beta,\gamma,p)=\sum_{i=1}^{M}\left(\frac{1}{N}{\mathcal{I}}\big{(}t_{i-\frac{1}{2}}\big{)}-G\big{(}\beta,\gamma,p,C\big{(}t_{i-\frac{1}{2}}\big{)}\big{)}\right)^{2},

where

G(β,γ,p,C)=(βCN+γln(1CN)γp)(1CN),G(\beta,\gamma,p,C)=\left(\beta\frac{C}{N}+\gamma\ln\left(1-\frac{C}{N}\right)-\gamma\,p\right)\left(1-\frac{C}{N}\right),

and

(5) (ti12)=𝒞i𝒞i1δt=Ii,C(ti12)=12(𝒞i+𝒞i1)=Ci.{\mathcal{I}}\big{(}t_{i-\frac{1}{2}}\big{)}=\frac{{\mathcal{C}}_{i}-{\mathcal{C}}_{i-1}}{\delta_{t}}=I_{i},\quad C\big{(}t_{i-\frac{1}{2}}\big{)}=\frac{1}{2}\big{(}{\mathcal{C}}_{i}+{\mathcal{C}}_{i-1}\big{)}=C_{i}.

A Maple [15] calculation shows that the cost function e{\mathcal{E}}_{e} has a unique critical point, whose expression is given in Appendix A. For outbreaks that start in a naive population, the formulas are hugely simplified by setting p=0p=0. The resulting cost functional then has a unique minimum, whose value may easily be calculated. With the following notation,

𝒥i=IiN,Ui=CiN(1CiN),Vi=(1CiN)ln(1CiN),{\mathcal{J}}_{i}=\frac{I_{i}}{N},\quad U_{i}=\frac{C_{i}}{N}\left(1-\frac{C_{i}}{N}\right),\quad V_{i}=-\left(1-\frac{C_{i}}{N}\right)\ln\left(1-\frac{C_{i}}{N}\right),

we have (after setting p=0p=0),

eβ=2i=1MUi(𝒥iβUi+γVi);eγ=2i=1MVi(𝒥iβUi+γVi).\frac{\partial{\mathcal{E}}_{e}}{\partial\beta}=-2\sum_{i=1}^{M}U_{i}\left({\mathcal{J}}_{i}-\beta U_{i}+\gamma V_{i}\right);\quad\frac{\partial{\mathcal{E}}_{e}}{\partial\gamma}=2\sum_{i=1}^{M}V_{i}\left({\mathcal{J}}_{i}-\beta U_{i}+\gamma V_{i}\right).

Setting the right-hand sides to zero and solving the resulting linear system for β\beta and γ\gamma leads to the following expression for the unique global minimizer of e{\mathcal{E}}_{e} when p=0p=0:

βm\displaystyle\beta_{m} =1h((i=1MVi2)(i=1MUii)(i=1MVii)(i=1MUiVi)),\displaystyle=\displaystyle\frac{1}{h}\left(\left(\sum_{i=1}^{M}V_{i}^{2}\right)\left(\sum_{i=1}^{M}U_{i}{\mathcal{I}}_{i}\right)-\left(\sum_{i=1}^{M}V_{i}{\mathcal{I}}_{i}\right)\left(\sum_{i=1}^{M}U_{i}V_{i}\right)\right),
γm\displaystyle\gamma_{m} =1h((i=1MUiVi)(i=1MUii)(i=1MUi2)(i=1MVii)),\displaystyle=\frac{1}{h}\left(\left(\sum_{i=1}^{M}U_{i}V_{i}\right)\left(\sum_{i=1}^{M}U_{i}{\mathcal{I}}_{i}\right)-\left(\sum_{i=1}^{M}U_{i}^{2}\right)\left(\sum_{i=1}^{M}V_{i}{\mathcal{I}}_{i}\right)\right),

where

h=(i=1MUi2)(i=1MVi2)(i=1MUiVi)2=12i,j=1M(UiVjViUj)2>0.h=\left(\sum_{i=1}^{M}U_{i}^{2}\right)\left(\sum_{i=1}^{M}V_{i}^{2}\right)-\left(\sum_{i=1}^{M}U_{i}V_{i}\right)^{2}=\frac{1}{2}\,\sum_{i,j=1}^{M}\big{(}U_{i}V_{j}-V_{i}U_{j}\big{)}^{2}>0.

In what follows, we use the expressions given in Appendix A, which can easily be coded up. We checked that for time series generated from an SIR model with κ1\kappa\simeq 1, the simplified expressions above provide good estimates of the actual values of β\beta and γ\gamma that were used to generate the reference outbreak. Importantly, we now have a means of associating a single set of values for the parameters β\beta, γ\gamma, and p=ln(κ)p=\ln(\kappa) to each outbreak time series.

3.2. Parameter identification in the presence of noise

We now use the expressions found in Section 3.1 to assess the robustness of the proposed parameter identification method to noise. As discussed above, the SIR model will be declared practically identifiable provided good estimates of the parameters may be obtained from noisy data. We therefore simulate 100,000 noisy time series from a ‘reference’ numerical integration of the SIR model, use the formulas of Appendix A to calculate pcp_{c}, βc\beta_{c}, and γc\gamma_{c} for each time series, and plot histograms of the estimated values of β\beta and γ\gamma. We consider the two types of noise discussed above, both affecting the reported discrete incidence Ii=(ti12).I_{i}={\mathcal{I}}(t_{i-\frac{1}{2}}). In Case A, we assume that at each time point ti12,t_{i-\frac{1}{2}}, the reported value of \mathcal{I} has a Poisson distribution of mean ref(ti12).{\mathcal{I}}_{ref}(t_{i-\frac{1}{2}}). In Case B, it is assumed that the reported value of IiI_{i} at each time point is of the form (1+ϵsi)ref(ti12),(1+\epsilon s_{i})\,{\mathcal{I}}_{ref}(t_{i-\frac{1}{2}}), where ref{\mathcal{I}}_{ref} is the discrete incidence of the reference outbreak, ϵ\epsilon is the noise amplitude, and sis_{i} is standard normal.

3.2.1. Case A: Poisson-distributed incidence

Figure 3 shows the distributions (in the form of properly scaled histograms) of βc,\beta_{c}, γc,\gamma_{c}, and R0=βcγcR_{0}=\frac{\beta_{c}}{\gamma_{c}} obtained from 100,000 noisy realization from a reference simulation of an SIR model with parameters β=0.5\beta=0.5 and γ=0.25\gamma=0.25 (R0=2R_{0}=2). Figure 4 shows similar plots for β=0.5\beta=0.5 and γ0.0476\gamma\simeq 0.0476, so that R0=10.5R_{0}=10.5. In each case, the size of the total population is N=10000N=10000 individuals. The corresponding sample means and standard deviations are displayed in Table 1. These results indicate that the respective means of βc\beta_{c} and γc\gamma_{c} provide very good estimates of the parameters β\beta and γ\gamma. Accuracy is not as good for the larger value of R0R_{0}, whose distribution shows a fairly long tail. This is to be expected because the actual value of γ\gamma in this case is close to zero. Moreover, the number of data points away from the end points of the outbreak is much smaller for R0=10.5R_{0}=10.5 than for R0=2R_{0}=2 (see Figure 2), leading to a potential loss of information.

Refer to caption
Figure 3. (Color online). Numerically estimated distributions of the parameter values βc\beta_{c}, γc\gamma_{c}, and R0=βc/γcR_{0}=\beta_{c}/\gamma_{c} for 100,000 outbreaks corresponding to noisy (Poisson-distributed) realizations of a reference outbreak with parameters β=0.5\beta=0.5 and γ=0.25\gamma=0.25 (R0=2R_{0}=2). Corresponding means and standard deviations are displayed in Table 1.
R0=2R_{0}=2 R0=10.5R_{0}=10.5
μ\mu σ\sigma μ\mu σ\sigma
βc\beta_{c} 0.4994 0.0060 0.4936 0.0120
γc\gamma_{c} 0.2497 0.0030 0.0495 0.0070
R0R_{0} 2.0001 0.0017 10.15 1.29
Table 1. Empirical mean (μ\mu) and standard deviation (σ\sigma) of estimated parameter values in the presence of Poisson-distributed noise, for two values of R0R_{0}.
Refer to caption
Figure 4. (Color online). Same as Figure 3, but with parameters β=0.5\beta=0.5 and R0=10.5R_{0}=10.5 (γ0.0476\gamma\simeq 0.0476). Corresponding means and standard deviations are displayed in Table 1.

3.2.2. Case B: Normally-distributed noise

At each time point, we now multiply the reference incidence by 1+ϵs1+\epsilon s, where ss is standard normal and the amplitude ϵ\epsilon takes on one of 3 values: 0.05, 0.15, or 0.25. This type of noise is less realistic for incidence reports, whose variability is expected to be well described by a Poisson distribution, as in Case A above. We however perform the present tests to illustrate the robustness of the parameter estimation method introduced in this article.

R0=2R_{0}=2
ϵ=0.05\epsilon=0.05 ϵ=0.15\epsilon=0.15 ϵ=0.25\epsilon=0.25
μ\mu σ\sigma μ\mu σ\sigma μ\mu σ\sigma
βc\beta_{c} 0.4994 0.0048 0.5001 0.0152 0.5013 0.0267
γc\gamma_{c} 0.2497 0.0024 0.2500 0.0076 0.2506 0.0135
R0R_{0} 2.0001 0.0009 2.0002 0.0028 2.0002 0.0050
R0=10.5R_{0}=10.5
ϵ=0.05\epsilon=0.05 ϵ=0.15\epsilon=0.15 ϵ=0.25\epsilon=0.25
μ\mu σ\sigma μ\mu σ\sigma μ\mu σ\sigma
βc\beta_{c} 0.4939 0.0201 0.5053 0.0577 0.5396 0.0938
γc\gamma_{c} 0.04961 0.0104 0.0550 0.0293 0.0717 0.0465
R0R_{0} 10.3682 2.14 27.0950 769.77 29.9396 863.86
Table 2. Empirical mean (μ\mu) and standard deviation (σ\sigma) of estimated parameter values in the presence of normally-distributed noise of mean zero and amplitude ϵ\epsilon, for two values of R0R_{0}.

For R0=2R_{0}=2, the parameter estimation procedure is robust at all values of ϵ\epsilon considered. Means and standard deviations of βc\beta_{c}, γc\gamma_{c}, and R0=βc/γcR_{0}=\beta_{c}/\gamma_{c} are displayed in Table 2. The corresponding distributions are shown in the top row of Figure 5. As expected, estimates of β\beta and γ\gamma are not as accurate for simulations with R0=10.5R_{0}=10.5, but return mean values for β\beta and γ\gamma that are close to the exact values. The distributions of βc/γc\beta_{c}/\gamma_{c} however shows very long tails (see bottom row of Figure 5), due to values of γc\gamma_{c} close to zero. In such a case, estimating R0R_{0} as the ratio <βc>/<γc><\beta_{c}>/<\gamma_{c}> however gives acceptable values, equal to 9.96, 9.19, and 7.53, for ϵ\epsilon equal to 0.05, 0.15, and 0.25 respectively.

Refer to caption
Refer to caption
Figure 5. (Color online). Similar to Figures 3 (top row) and 4 (bottom row) but for normally-distributed noise of variable amplitude, set at 0.05, 0.15, or 0.25. Corresponding means and standard deviations are displayed in Table 2.

In summary, the above numerical simulations indicate that the ICC-based parameter estimation method presented here provides good results in the presence of reporting noise. Not surprisingly, the quality of the estimates depends on the rate at which incidence is sampled and, in the case of normally distributed noise, on the amplitude of the noise.

4. Parameter identification as the outbreak unfolds

In principle, the formulas for βc\beta_{c} and γc\gamma_{c} given in Appendix A may be used with any number of points MM, as long as the data points represent the entire ICC curve. Is is natural to ask how the method performs when applied to a developing outbreak. To test this, we simulate 50,000 possible outbreaks using the same reference simulations as before, and with the two different types of noise introduced above to represent reporting error. We then follow the evolution of the distributions of the estimated parameters as time increases. For R0=2R_{0}=2, the outbreak has run its course after 60 units of time, and we include up to Mmax=40M_{max}=40 data points, with consecutive values separated by 1 unit of time (which could correspond to a day or a week depending on the disease). For R0=10.5R_{0}=10.5, we pick Mmax=80M_{max}=80, with data points separated by 2 units of time. Figure 6 shows estimated values of β\beta, γ\gamma, and R0=2R_{0}=2 for the two types of reporting noise. In both cases, estimates of β\beta and γ\gamma are close to their actual values near t=20t=20, which is before the peak of the outbreak (see left panel of Figure 1). The value of R0R_{0} converges more slowly (circles in the bottom row of Figure 6) but the ratio of the estimated values of β\beta and γ\gamma (stars) is close to the actual value of R0R_{0} near t=12t=12, fairly early in the outbreak.

Refer to caption
Refer to caption
Figure 6. (Color online). Estimates of β\beta, γ\gamma and R0R_{0} as the outbreak unfolds for Poisson-distributed noise (left) and normally-distributed noise of size 0.15 (right), in a situation where R0=2R_{0}=2. Estimates of R0R_{0} show the evolution of <β/γ><\beta/\gamma> (circles) and <β>/<γ><\beta>/<\gamma> (stars), where <><\cdot> indicates sample mean. Error bars correspond to one standard deviation on each side of the mean.

Similar results for R0=10.5R_{0}=10.5, together with plots showing the evolution of the distributions of βc\beta_{c}, γc\gamma_{c}, and R0R_{0}, are provided as supplementary material. Parameter estimates for β\beta and γ\gamma are close to their true values near t=22t=22, which is near the peak of the outbreak, with the ratio <β>/<γ><\beta>/<\gamma> giving a reasonable estimate of R0R_{0} starting at t=18t=18, even in the presence of large uncertainty.

5. Application to outbreak data

So far, we have assumed the size of the susceptible population, NN, to be known. In practice, the value of NN that should be used to fit the ICC curve to outbreak data is often unknown since it may reflect under-reporting (as discussed in Section 2.2) or spatial effects (existence of localized clusters to which the spread is restricted) associated with the outbreak. Moreover, it could be argued that the SIR model is too simplistic to represent real-life outbreaks. The examples of [1] and those discussed below however show there is merit to the present approach.

In what follows, we fit ICC curves to surveillance reports by picking a range of values of NN for which the root mean square error (RMSE) between the ICC curve and the data is within 2% of the minimum RMSE value. The data points used in the evaluation of the RMSE are found from the reported cumulative data after smoothing and interpolation. This latter procedure estimates missing incidence values and removes negative incidence reports, if any. The resulting time series is expected to be a better approximation of the ‘true’ (i.e. as close as possible to the output of a compartmental model) evolution of the cumulative data, and is used for the parameter estimation procedure discussed in this article. Our first example is a gastroenteritis outbreak in a nursing home in Mallorca, Spain, and is therefore spatially localized. The other examples are estimations of the basic reproduction ratio for COVID-19 outbreaks in Hubei Province (China), the Republic of Korea, and France, based on data available until 4/15/2020.

5.1. Gastroenteritis outbreak in Mallorca, Spain

We use the same data set as in [1], estimated from the epidemiological curve provided in [16]. We fit the ICC curve to the interpolated data (i.e. find βc(N)\beta_{c}(N), γc(N)\gamma_{c}(N), and pc(N)p_{c}(N)) for a range of values of NN, identify the value NmN_{m} of NN that best fits the data (i.e. for which the RMSE is minimum), and select a range of values of NN near NmN_{m} that give a RMSE within 2% of its minimum. Then, for each value of NN in the selected range, we apply the parameter estimation procedure of Section 3, with 10,000 oubreak realizations, obtained from adding Poisson-distributed noise to the smoothed and interpolated data. Figure 7 shows the resulting parameter distributions (top panel; N[93,186]N\in[93,186]) and a plot of the ICC curve for the optimal parameter values (bottom panel; N=120N=120, β=1.34\beta=1.34, and γ=0.88\gamma=0.88). The value of R0R_{0} for the ICC curve displayed is 1.521.52 and corresponds to an attack rate of 59.8%. This is near the upper boundary of the 95% confidence interval (38.5 % - 61.3 %) for the overall attack rate, but well within the 95% confidence interval of the attack rate among nursing home residents (42.1 % - 72.2 %) reported in [16].

Refer to caption
Refer to caption
Figure 7. (Color online). Parameter estimates for a gastroenteritis outbreak in a nursing home in Mallorca, Spain [16]. The top panel shows histograms (scaled to represent probability distributions) of βc\beta_{c}, γc\gamma_{c}, and R0R_{0}, for N[93,186]N\in[93,186]. The bottom panel shows the ICC curve for the optimal parameter values, corresponding to a value of R0=1.52R_{0}=1.52.

5.2. COVID-19

Figure 8 shows ICC curves for COVID-19 outbreaks in a few countries. The data was obtained from the World Health Organization daily situation reports [17] as well as from the Wuhan Health Municipal Commission [18]. For Hubei Province, only laboratory confirmed cases were used. Consequently, daily increment values for 2/17-19/2020 (days 65 to 67 in the data set, when China combined laboratory and clinically confirmed cases in its reports for a brief period of time) were set to half of the reported number of confirmed cases; the number of cumulative cases was then obtained by summing daily reports of laboratory confirmed cases. For the Republic of Korea, only the first 58 points in the data set were used, since they correspond to the first wave of the outbreak. The third example is for a country (France), where the outbreak was not over as of 4/15/2020. For each of these examples, we show the ICC curve for the set of parameters that best fits the interpolated data. The optimal values of NN are N=51160N=51160 for Hubei Province, N=10282N=10282 for the Republic of Korea, and N=161142N=161142 for France. The corresponding optimal basic reproduction ratio values are R0=2.74R_{0}=2.74 for Hubei Province, R0=2.13R_{0}=2.13 for the Republic of Korea, and R0=1.96R_{0}=1.96 for France. A histogram of estimated values of R0R_{0}, scaled to represent a probability distribution function, is shown in the right panel of each row of Figure 8. These were obtained as described above in the case of the outbreak in Mallorca, except that no Poisson-distributed noise was added to the data. This is because incidence values are generally large and that, as discussed in Section 3, the relative effect of the added noise is minimal, leading to a level of variability of R0R_{0} that is insignificant compared to the effect of changing NN. Values of R0R_{0} in the 2-3 range are consistent with early estimates of the basic reproduction ratio of COVID-19 published in the literature [19, 20].

Refer to caption
Refer to caption
Refer to caption
Figure 8. (Color online). Optimal ICC curves and estimated distributions of R0R_{0} values for the COVID-19 outbreaks in Hubei Province (top panel), the Republic of Korea (middle panel), and France (bottom panel).

The time course of the COVID-19 outbreak in Hubei Province, as described by the SIR model with optimal parameters identified by the present method (β=0.401\beta=0.401, γ=0.146\gamma=0.146 (R0=2.74R_{0}=2.74), and N=51160N=51160) is shown in Figure 9, together with the reported data. The solid curves are obtained by integration of the ICC curve, with initial conditions corresponding to day 10 of the outbreak, and alignment with the data at day 45. Similar plots for the other outbreaks discussed in this section are provided as supplementary material. The good visual agreement between simulation and data indicates that the SIR model, and therefore the ICC curve approach presented in this manuscript, are able to capture the overal dynamics of real-life outbreaks.

Refer to caption
Figure 9. (Color online). Cumulative cases (left) and incidence (right) as functions of time for the COVID-19 outbreak in Hubei Province, China. The solid curves represent the predictions of the optimal ICC curve and the open circles are the data points. Time is measured in days from 12/14/2019 (day 0).

6. Conclusions

This article introduces a parameter estimation method for disease outbreaks that bypasses the numerical integration of epidemiological models. The approach relies on the use of ICC curves, also introduced here. ICC curves relate incidence to the cumulative number of cases CC and may be viewed as nonlinear transformations of the traditional epidemiological curves, in which the time variable is replaced by CC, a monotonically increasing but nonlinear function of time. For each single-wave outbreak, the ICC curve has a simple concave-down shape that crosses the horizontal axis at the origin and at the final value of CC.

The formulas presented in Section 2, which extend the parabolic approximation suggested in [1], are exact for the SIR model. The numerical experiments of Sections 3 and 4 show the method is robust to noise and may be used for parameter estimation as an outbreak unfolds, with the understanding that reasonably accurate estimates can only be reached shortly before, or after, incidence peaks. This is not a shortcoming of the present approach and was also observed in [11] when fitting synthetic prevalence time series.

Because of the simplicity of Equation 1, and the existence of a unique set of parameters that best fits any collection of data points, parameter distributions may be generated directly from epidemiological data. In the case of the SIR model, this methodology presents a fast and novel alternative to more traditional parameter estimation strategies, such as particle Makov Chain Monte Carlo methods (PMCMC; see [21] for a review). This may be beneficial in pandemic situations where epidemiological estimates need to be updated daily and in many locations simultaneously. For instance, recent work on COVID-19 data from Wuhan suggests that an SIR model can better capture the information contained in case reports than an SEIR model [22]. For more complex compartmental models, distributions generated by the present approach may be used as priors for PMCMC estimations of the contact and recovery rates of a disease.

When applying the proposed method to surveillance data, an estimate of NN may not always be available. The examples of Section 5 suggest that this parameter may be identified by optimizing the fit between the ICC curve and a smoothed and interpolated version of the reported data. However, any variability in the selected value of NN will be associated with variability in estimates of the basic reproduction number R0R_{0}. Indeed, for an outbreak that has completed its course, any good fit of the reported data by an ICC curve will produce consistent values of CC_{\infty}, the final number of cases for the outbreak (see for instance the plots of Figure 8). Because of Equation 3, if CC_{\infty} is known, the estimated value of R0R_{0} is only a function of NN. This uncertainty is inherent to the SIR model and cannot be avoided. The existence of a value NmN_{m} of NN that best fits the data is encouraging, but more robust estimates are likely to be obtained if additional information is available. In the absence thereof, a range of values of NN close to the optimal value NmN_{m} should be used to produce credible intervals for the basic reproduction ratio R0R_{0}.

As previously mentioned, empirical (see [1]) and numerical observations by the author suggest the concave-down shape of the ICC curve is ubiquitous in outbreak data and in compartmental epidemiological models. It would therefore be interesting to explore methods which, like the discussion of Section 2, lead to exact formulations of ICC curves. In particular, knowledge of how model parameters affect the shape of ICC curves could provide simple means to visualize the consequences of mitigation effects. Combined with data assimilation, ICC curves may thus present a convenient paradigm for forecasting the course of an outbreak.

Declaration of competing interests.
No potential competing interest was reported by the author.


Data availability statement.
For codes and data sets used in this study, please see https://jocelinelega.github.io/EpiGro/.


Appendix A Critical parameter values

The function e{\mathcal{E}}_{e} defined in Equation (4) has a unique extremizer (βc,γc,pc)(\beta_{c},\gamma_{c},p_{c}) given by the following expressions:

βc\displaystyle\beta_{c} =(BsOs+FsAs)pc2+(2OsDsFsNsAsQs)pc+(NsQsOsPs)(As2BsLs)pc2+2(AsNs+DsLs)pcLsPs+Ns2,\displaystyle=\frac{(-B_{s}O_{s}+F_{s}A_{s})p_{c}^{2}+(2O_{s}D_{s}-F_{s}N_{s}-A_{s}Q_{s})p_{c}+(N_{s}Q_{s}-O_{s}P_{s})}{(A_{s}^{2}-B_{s}L_{s})p_{c}^{2}+2(-A_{s}N_{s}+D_{s}L_{s})p_{c}-L_{s}P_{s}+N_{s}^{2}},
γc\displaystyle\gamma_{c} =(AsOs+FsLs)pcLsQs+NsOs(As2BsLs)pc2+2(AsNs+DsLs)pcLsPs+Ns2,\displaystyle=\frac{(-A_{s}O_{s}+F_{s}L_{s})p_{c}-L_{s}Q_{s}+N_{s}O_{s}}{(A_{s}^{2}-B_{s}L_{s})p_{c}^{2}+2(-A_{s}N_{s}+D_{s}L_{s})p_{c}-L_{s}P_{s}+N_{s}^{2}},
pc\displaystyle p_{c} =(NsQs+OsPs)As+(LsQsNsOs)Ds+(LsPs+Ns2)Fs(AsOsFsLs)Ds(As2BsLs)QsNs(BsOsFsAs),\displaystyle=\frac{(-N_{s}Q_{s}+O_{s}P_{s})A_{s}+(L_{s}Q_{s}-N_{s}O_{s})D_{s}+(-L_{s}P_{s}+N_{s}^{2})F_{s}}{(A_{s}O_{s}-F_{s}L_{s})D_{s}-(A_{s}^{2}-B_{s}L_{s})Q_{s}-N_{s}(B_{s}O_{s}-F_{s}A_{s})},

where

As\displaystyle A_{s} =i=1MCiN(1CiN)2=i=1MUiPi,Bs=i=1M(1CiN)2=i=1MPi2,\displaystyle=\sum_{i=1}^{M}\frac{C_{i}}{N}\left(1-\frac{C_{i}}{N}\right)^{2}=\sum_{i=1}^{M}U_{i}\,P_{i},\qquad B_{s}=\sum_{i=1}^{M}\left(1-\frac{C_{i}}{N}\right)^{2}=\sum_{i=1}^{M}P_{i}^{2},
Ds\displaystyle D_{s} =i=1Mln(1CiN)(1CiN)2=i=1MViPi,Fs=i=1MIiN(1CiN)=i=1M𝒥iPi,\displaystyle=\sum_{i=1}^{M}\ln\left(1-\frac{C_{i}}{N}\right)\left(1-\frac{C_{i}}{N}\right)^{2}=-\sum_{i=1}^{M}V_{i}P_{i},\qquad F_{s}=\sum_{i=1}^{M}\frac{I_{i}}{N}\left(1-\frac{C_{i}}{N}\right)=\sum_{i=1}^{M}{\mathcal{J}}_{i}\,P_{i},
Ls\displaystyle L_{s} =i=1M(CiN)2(1CiN)2=i=1MUi2,Ns=i=1M(1CiN)2ln(1CiN)CiN=i=1MUiVi,\displaystyle=\sum_{i=1}^{M}\left(\frac{C_{i}}{N}\right)^{2}\left(1-\frac{C_{i}}{N}\right)^{2}=\sum_{i=1}^{M}U_{i}^{2},\qquad N_{s}=\sum_{i=1}^{M}\left(1-\frac{C_{i}}{N}\right)^{2}\ln\left(1-\frac{C_{i}}{N}\right)\frac{C_{i}}{N}=-\sum_{i=1}^{M}U_{i}\,V_{i},
Os\displaystyle O_{s} =i=1MIiN(1CiN)CiN=i=1M𝒥iUi,Ps=i=1M(ln(1CiN))2(1CiN)2=i=1MVi2,\displaystyle=\sum_{i=1}^{M}\frac{I_{i}}{N}\left(1-\frac{C_{i}}{N}\right)\frac{C_{i}}{N}=\sum_{i=1}^{M}{\mathcal{J}}_{i}\,U_{i},\qquad P_{s}=\displaystyle\sum_{i=1}^{M}\left(\ln\left(1-\frac{C_{i}}{N}\right)\right)^{2}\left(1-\frac{C_{i}}{N}\right)^{2}=\sum_{i=1}^{M}V_{i}^{2},
Qs\displaystyle Q_{s} =i=1MIiNln(1CiN)(1CiN)=i=1M𝒥iVi,\displaystyle=\sum_{i=1}^{M}\frac{I_{i}}{N}\ln\left(1-\frac{C_{i}}{N}\right)\left(1-\frac{C_{i}}{N}\right)=-\sum_{i=1}^{M}{\mathcal{J}}_{i}\,V_{i},

and we have used the notation

𝒥i=IiN,Pi=1CiN,Ui=CiN(1CiN),Vi=ln(1CiN)(1CiN).{\mathcal{J}}_{i}=\frac{I_{i}}{N},\quad P_{i}=1-\frac{C_{i}}{N},\quad U_{i}=\frac{C_{i}}{N}\left(1-\frac{C_{i}}{N}\right),\quad V_{i}=-\ln\left(1-\frac{C_{i}}{N}\right)\,\left(1-\frac{C_{i}}{N}\right).

References