
Statistical analysis of event classification
in experimental data

Rudolf Frühwirth (rudolf.fruehwirth@oeaw.ac.at) and Winfried Mitaroff
Institute of High Energy Physics,
Austrian Academy of Sciences, Vienna
(28 June 2023)
Abstract

The paper addresses general aspects of experimental data analysis, dealing with the separation of “signal vs. background”. It consists of two parts.

Part I is a tutorial on statistical event classification, Bayesian inference, and test optimization. Aspects of a base data sample created by Poisson processes are discussed, and a method for estimating the unknown numbers of signal and background events is presented. Data quality of the selected event sample is assessed by the expected purity and background contamination.

Part II contains a rigorous statistical analysis of the methods discussed in Part I. Both Bayesian and frequentist estimators of the unknown signal/background content are investigated. The estimates and their stochastic uncertainties are calculated for various conjugate priors in the Bayesian case, and for three choices of the virtual parent population in the frequentist case.

1 Introduction

The analysis of experimental data – e.g. events created at a particle collider and recorded in a detector – often requires selecting those rare events with a signature of interest (the “signal”) and rejecting the bulk of “background”. This is performed by some “test”, modelled according to the signature, in order to make a decision (positive or negative) of whether the hypothesis “event is signal” is true or false.

In practice, such a test is never perfect: some signal events may wrongly be rejected (error type I), and some background may wrongly be selected (error type II). It is therefore essential to know the fraction of “wrong events” in the selected and rejected data samples, which are characteristic values of the test being used.

These values are not known a priori, but may be obtained by applying the very same test to two samples of Monte Carlo generated data, simulating as accurately as possible pure signal and pure background, respectively. The test's "characteristic values" (error rates of types I and II and their complements: sensitivity and specificity) are conditional probabilities, represented by the $2\times 2$ "contingency matrix".

The test criteria depend on parameters which influence the selection and rejection characteristics. Optimization may be achieved by defining an appropriate “figure of merit”, to be maximized by varying the test parameters.

The conditional posterior probability of a selected event being genuine signal, provided the test result is positive, can be calculated by Bayesian inference. It depends on the prior probability (prevalence), which is in general unknown but can be estimated by using Monte Carlo information together with real data.

Part I presents a tutorial on the basics (Section 2), the signal and background simulations for obtaining the test characteristics as encoded in the contingency matrix (Section 3), and their use in Bayesian inference (Section 4). The scenario of the base sample being created by two independent Poisson processes for signal and background is considered in Section 5. Postulating these numbers to be fixed, estimates for the number of selected and rejected events are given (Subsection 5.1); test optimization and the choice of a "figure of merit" are discussed in Subsection 5.2. Section 6 presents a method for estimating the unknown numbers of signal and background events from the known numbers of selected and rejected events, using Monte Carlo information; estimates of the Bayesian prior probabilities follow straightforwardly. Section 7 shows how to assess the quality of selected data by its purity or its fraction of background contamination.

Part II follows up with a comprehensive and rigorous statistical analysis. Section 8 recapitulates the basic assumptions and the notation. Bayesian estimation is the topic of Section 9. The posterior distribution of the fraction of events classified as signal is computed for various conjugate priors, both "uninformative" and informative. A simple affine transformation yields the posterior mean and the posterior variance of the unknown signal content of the sample. Frequentist estimation, on the other hand, presupposes a virtual parent population from which the observed sample is drawn (Section 10). The frequentist estimators and their covariance matrices are calculated for three choices of the parent population: fixed sample size (Subsection 10.1), Poisson-distributed sample size (Subsection 10.2), and sample size drawn from a mixture of Poisson distributions (Subsection 10.3). The pros and cons of the two approaches are summarized in the final Section 11.


Part I: The tutorial

2 The basics

Given a finite data sample $\mathfrak{D}$ of $N$ random events (aspects of $\mathfrak{D}$ having emerged from two Poisson processes are discussed briefly in Section 5, and a thorough analysis is presented in Part II of this paper), where each element $\mathfrak{e}_i$ is theoretically classified as being either a "signal event" belonging to sub-sample $\mathfrak{S}$, or a "background event" belonging to sub-sample $\mathfrak{B} = \mathfrak{D}\setminus\mathfrak{S}$, with the numbers of events being $N_S$ and $N_B$, respectively. Then, trivially

$N_S + N_B = N.$  (1)

For a particular element $\mathfrak{e}_i$, define the "null hypothesis" $\mathcal{H}\equiv\mathfrak{e}_i\in\mathfrak{S}$ as the event being signal, and the "alternative hypothesis" $\bar{\mathcal{H}}\equiv\mathfrak{e}_i\in\mathfrak{B}$ as it being background. The probabilities (in general unknown) of $\mathfrak{e}_i$ being either "signal" ($\mathcal{H}$ = true) or "background" ($\mathcal{H}$ = false) are the fractions

$p_S \equiv \mathsf{P}(\mathcal{H}) = \frac{N_S}{N} \leq 1$ … prevalence, $\qquad p_B \equiv \mathsf{P}(\bar{\mathcal{H}}) = \frac{N_B}{N} = 1 - p_S,$  (2)

with their ratio being

$R_{S/B} = \frac{N_S}{N_B} = \frac{p_S}{p_B}$ … signal to background ratio (SBR).  (3)

If the SBR is a priori known (either from theory or from previous experiments) then the prior probabilities are trivially

$p_S = \frac{R_{S/B}}{1+R_{S/B}} \quad\text{and}\quad p_B = \frac{1}{1+R_{S/B}},$  (4)

but often the SBR may only be guessed to within an order of magnitude. If the signal consists of rare events, $R_{S/B}\ll 1$.

It is assumed that some test is available which aims at determining, for each element $\mathfrak{e}_i\in\mathfrak{D}$, whether it complies with hypothesis $\mathcal{H}$; if so, it is selected and put into sub-sample $\mathfrak{S}^\ast$, otherwise it is rejected and put into sub-sample $\mathfrak{B}^\ast = \mathfrak{D}\setminus\mathfrak{S}^\ast$. The "positive test result" $\mathcal{T}\equiv\mathfrak{e}_i\in\mathfrak{S}^\ast$ and the "negative test result" $\bar{\mathcal{T}}\equiv\mathfrak{e}_i\in\mathfrak{B}^\ast$ are defined accordingly.

However, no test is perfect: it is prone to errors of types I and II, the rates of which must be evaluated.

3 MC simulation

In order to achieve this, two separate samples are created by Monte Carlo simulation: $\mathfrak{S}'$ containing $N'_S$ pure signal events, and $\mathfrak{B}'$ containing $N'_B$ pure background events. Their ratio may be chosen arbitrarily; often $N'_S/N'_B \gg R_{S/B}$.

The simulated to real background ratio is

$r_B = \frac{N'_B}{N_B} = \frac{N'_B}{N}\cdot\left(1+R_{S/B}\right) \approx \frac{N'_B}{N}$ (approximation if $R_{S/B}\ll 1$),  (5)

and its inverse $1/r_B$ may serve as the normalization factor for scaling simulated to real background, as used in Section 7.

The test criteria are described by a set of parameters $\vec{\theta}$. Applying the test to the $\mathfrak{S}'$ signal and $\mathfrak{B}'$ background MC samples yields (deliberately using the terms "right" and "wrong" instead of "true" and "false"):

  • $n_{S+}$ = no. of signal MC events selected = right positives,
  • $n_{S-}$ = no. of signal MC events rejected = wrong negatives,
  • $n_{B+}$ = no. of background MC events selected = wrong positives,
  • $n_{B-}$ = no. of background MC events rejected = right negatives,

$n_{S+} + n_{S-} = N'_S, \qquad n_{B+} + n_{B-} = N'_B,$  (6)
$n_{S+} + n_{B+} = N'_+, \qquad n_{S-} + n_{B-} = N'_-,$  (7)

defining the conditional probabilities (the term purity, which is often defined ambiguously in the literature, will be used in Section 7):

$\varepsilon = \frac{n_{S+}}{N'_S} = \mathsf{P}(\mathcal{T}\,|\,\mathcal{H})$ … efficiency, sensitivity, select/hit rate,  (8)
$\alpha = \frac{n_{S-}}{N'_S} = \mathsf{P}(\bar{\mathcal{T}}\,|\,\mathcal{H})$ … error type I, significance, loss/miss rate,  (9)
$\beta = \frac{n_{B+}}{N'_B} = \mathsf{P}(\mathcal{T}\,|\,\bar{\mathcal{H}})$ … error type II, contamination/fake rate,  (10)
$\eta = \frac{n_{B-}}{N'_B} = \mathsf{P}(\bar{\mathcal{T}}\,|\,\bar{\mathcal{H}})$ … test power, specificity, reject rate, and  (11)
$\ell_+ = \frac{\varepsilon}{\beta}$ … positive likelihood ratio,  (12)
$\ell_- = \frac{\alpha}{\eta}$ … negative likelihood ratio,  (13)
trivially obeying $\varepsilon + \alpha = 1$, $\beta + \eta = 1$ $\;\Longrightarrow\;$ $\varepsilon\,\eta - \alpha\,\beta = \varepsilon - \beta$.  (14)

The Monte Carlo samples $\mathfrak{S}'$ and $\mathfrak{B}'$ contain randomly simulated events, hence the test's characteristic values $[\varepsilon,\alpha,\beta,\eta]$ in Eqs. (8–11) are stochastic and represent empirical means. Their statistical errors $\sigma(\varepsilon),\sigma(\alpha)\propto 1/\sqrt{N'_S}$ and $\sigma(\beta),\sigma(\eta)\propto 1/\sqrt{N'_B}$ decrease with increasing sample sizes $N'_S$ and $N'_B$, and may thus be neglected for sufficiently large samples. However, improper simulation of reality can introduce systematic errors which are difficult to detect and may bias subsequent results.

The characteristic values (8–11) resulting from the above test may be arranged in the $2\times 2$ contingency/confusion/error matrix (note that in the literature these names are sometimes used for the transposed matrix $\bm{A}^T$ instead):

$\bm{A} = \begin{pmatrix} \varepsilon = \mathsf{P}(\mathcal{T}\,|\,\mathcal{H}) & \beta = \mathsf{P}(\mathcal{T}\,|\,\bar{\mathcal{H}}) \\ \alpha = \mathsf{P}(\bar{\mathcal{T}}\,|\,\mathcal{H}) & \eta = \mathsf{P}(\bar{\mathcal{T}}\,|\,\bar{\mathcal{H}}) \end{pmatrix}$, with $\begin{pmatrix} N'_+ \\ N'_- \end{pmatrix} = \bm{A}\cdot\begin{pmatrix} N'_S \\ N'_B \end{pmatrix}$ and $\det(\bm{A}) = \delta = \varepsilon - \beta$.  (15)

A meaningful test should be designed such that

α+β<1δ=εβ>0+=ε/β>1,=α/η<1.\alpha+\beta<1\quad\Longrightarrow\quad\delta=\varepsilon-\beta>0\quad\Longrightarrow\quad\ell_{+}=\varepsilon/\beta>1,\quad\ell_{-}=\alpha/\eta<1. (16)

4 Bayesian inference

The values $[\varepsilon,\alpha,\beta,\eta]$, defined in (8–11), represent the conditional probabilities of the test being positive or negative, given that the hypothesis $\mathcal{H}$ is true or false. Asking inversely: what are the conditional posterior probabilities of hypothesis $\mathcal{H}$ being true or false, given that this test has been positive or negative?

$p_{S+} = \mathsf{P}(\mathcal{H}\,|\,\mathcal{T}),$  (17)
$p_{S-} = \mathsf{P}(\mathcal{H}\,|\,\bar{\mathcal{T}}),$  (18)
$p_{B+} = \mathsf{P}(\bar{\mathcal{H}}\,|\,\mathcal{T}) = 1 - p_{S+},$  (19)
$p_{B-} = \mathsf{P}(\bar{\mathcal{H}}\,|\,\bar{\mathcal{T}}) = 1 - p_{S-}.$  (20)

Focusing on Eq. (17), the above question may be formulated as: when applying the test to a particular event $\mathfrak{e}_i\in\mathfrak{D}$ of the real data sample, what is the conditional posterior probability $p_{S+}$ of $\mathfrak{e}_i$ being signal ($\mathcal{H}$ = true) if the test result is positive ($\mathcal{T}$ = true)? The answer is provided by Bayes' theorem [3, 4]

$\mathsf{P}(\mathcal{H}\,|\,\mathcal{T}) = \frac{\mathsf{P}(\mathcal{T}\,|\,\mathcal{H})}{\mathsf{P}(\mathcal{T})}\cdot\mathsf{P}(\mathcal{H}),$  (21)

with the positive evidence being calculated by the law of total probability:

$p_+ \equiv \mathsf{P}(\mathcal{T}) = \mathsf{P}(\mathcal{T}\,|\,\mathcal{H})\cdot\mathsf{P}(\mathcal{H}) + \mathsf{P}(\mathcal{T}\,|\,\bar{\mathcal{H}})\cdot\mathsf{P}(\bar{\mathcal{H}}) = \varepsilon\,p_S + \beta\,p_B,$  (22)

using the variables of Eqs. (2, 8, 10). Hence, from Eqs. (12, 21)

$p_{S+} = \frac{\varepsilon}{\varepsilon\,p_S + \beta\,p_B}\cdot p_S = \frac{p_S}{p_S + (1-p_S)/\ell_+}.$  (23)

Clearly the posterior probability $p_{S+}$ depends only on the prior probability $p_S$ and on the positive likelihood ratio $\ell_+ = \varepsilon/\beta$. Fig. 1 plots $p_{S+}(p_S)$ for $0 \leq \beta/\varepsilon \leq 1$.

Figure 1: Bayesian posterior $p_{S+}$ as a function of $p_S$ for different $1/\ell_+ = \beta/\varepsilon$.
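A minimal MATLAB sketch (the chosen values of $1/\ell_+ = \beta/\varepsilon$ are assumptions) evaluates Eq. (23) and reproduces curves like those of Fig. 1:

pS=linspace(0,1,201);         % prior probability (prevalence)
for r=[0.5 0.1 0.01]          % assumed values of 1/l+ = beta/eps
    pSp=pS./(pS+(1-pS)*r);    % posterior probability, Eq. (23)
    plot(pS,pSp); hold on
end
xlabel('p_S'); ylabel('p_{S+}'); % cf. Fig. 1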

Similarly, Eq. (18) asks: when applying the test, what is the conditional posterior probability $p_{S-}$ of $\mathfrak{e}_i$ being signal ($\mathcal{H}$ = true) if the test result is negative ($\mathcal{T}$ = false)? Applying Bayes' theorem to this case yields:

$\mathsf{P}(\mathcal{H}\,|\,\bar{\mathcal{T}}) = \frac{\mathsf{P}(\bar{\mathcal{T}}\,|\,\mathcal{H})}{\mathsf{P}(\bar{\mathcal{T}})}\cdot\mathsf{P}(\mathcal{H}),$ and the negative evidence  (24)
$p_- \equiv \mathsf{P}(\bar{\mathcal{T}}) = \mathsf{P}(\bar{\mathcal{T}}\,|\,\mathcal{H})\cdot\mathsf{P}(\mathcal{H}) + \mathsf{P}(\bar{\mathcal{T}}\,|\,\bar{\mathcal{H}})\cdot\mathsf{P}(\bar{\mathcal{H}}) = \alpha\,p_S + \eta\,p_B = 1 - p_+.$  (25)

Reverting to the variables of Eqs. (2, 9, 11, 13), Eq. (24) becomes

$p_{S-} = \frac{\alpha}{\alpha\,p_S + \eta\,p_B}\cdot p_S = \frac{p_S}{p_S + (1-p_S)/\ell_-}.$  (26)

Similarly to the above, the posterior probability $p_{S-}$ depends only on the prior probability $p_S$ and on the negative likelihood ratio $\ell_- = \alpha/\eta$.

Calculation of the remaining conditional posterior probabilities $p_{B+}$ and $p_{B-}$ of $\mathfrak{e}_i$ being background ($\mathcal{H}$ = false) if the test result is positive ($\mathcal{T}$ = true) or negative ($\mathcal{T}$ = false), respectively, is trivial by using Eqs. (19–20).

The a priori information consists of the likelihood ratios $\ell_+$ and $\ell_-$, which are obtained from the Monte Carlo simulations, see Section 3. The prior probabilities $p_S$ and $p_B$ are the signal and background fractions, Eq. (2). They are determined by Eq. (4) if the signal to background ratio $R_{S/B}$ is known; if not, estimates can be calculated from the real data sample by a method presented in Section 6.

5 Event selection

In many experiments, notably those performed at particle colliders, the real data sample $\mathfrak{D}$ has been stochastically accumulated by two independent Poisson processes for signal and background events, respectively. (In real experiments, the raw data are further affected by data-acquisition triggers, the detector acceptance, reconstruction efficiencies, pre-selection for the analysis proper, etc. These complex aspects, which are beyond the scope of this paper, are discussed e.g. in [5, 6].) Thus, their numbers $N_S$ and $N_B$ behave as random variables with expectation values

$\langle N_S\rangle = p_S\,N, \qquad \langle N_B\rangle = p_B\,N,$  (27)

where $p_S$ and $p_B$ are fixed values, determined only by the physics laws underlying the production of signal and background events; they are a priori unknown, but their values can be estimated as shown in Section 6.

The base sample $\mathfrak{D} = \mathfrak{S}\cup\mathfrak{B}$ thus obtained may be regarded, from a frequentist point of view, as one stochastic draw from a virtual population consisting of many repetitions of the experiment under the same conditions.

The marginal standard deviations and the correlation of the bivariate distribution of $N_S$ and $N_B$ are as follows. (The correlation $\rho(N_S,N_B)$ is a characteristic of these distributions, hence a constant. It must not be confused with the empirical correlation $r(N_S,N_B)$, which is a random variable depending on the actual draw, with expectation value $\langle r\rangle = \rho(N_S,N_B) = 0$. As can be shown by simulation with an increasing number of draws, plots of the $r$-distribution approach the Dirac function $\delta(r)$, illustrating that $r(N_S,N_B)$ is a consistent estimator of $\rho(N_S,N_B)$.)

$\sigma(N_S) = \sqrt{p_S\,N}, \qquad \sigma(N_B) = \sqrt{p_B\,N}, \qquad \rho(N_S,N_B) = 0.$  (28)

Hence, for the total number $N = N_S + N_B$:

$\langle N_S + N_B\rangle = N, \qquad \sigma(N_S + N_B) = \sqrt{N}.$  (29)

However, for dealing with the following selection, $\mathfrak{D}$ is postulated to be a base sample of fixed albeit unknown sizes $N_S$ of $\mathfrak{S}$ (signal) and $N_B$ of $\mathfrak{B}$ (background), their sum $N = N_S + N_B$ being the fixed and known size of $\mathfrak{D} = \mathfrak{S}\cup\mathfrak{B}$.

A rigorous analysis is deferred to Part II of this study.

5.1 Selection procedure

After having determined, with sufficient precision, the characteristic values $[\varepsilon,\alpha,\beta,\eta]$ (8–11) from the Monte Carlo samples $\mathfrak{S}'$ and $\mathfrak{B}'$, the same test criteria $\vec{\theta}$ are now applied to the full real data sample $\mathfrak{D}$, yielding $E_S$ selected events entering sub-sample $\mathfrak{S}^\ast$ and $E_B$ rejected events entering sub-sample $\mathfrak{B}^\ast$.

Since $\varepsilon,\beta$ and $\alpha,\eta$ are probabilities of selection and rejection, respectively, $E_S$ and $E_B$ are random variables resulting from two independent Bernoulli processes, with expectation values

$\langle E_S\rangle = \varepsilon\,N_S + \beta\,N_B$ = expected no. of events to be selected,  (30)
$\langle E_B\rangle = \alpha\,N_S + \eta\,N_B$ = expected no. of events to be rejected.  (31)

Hence, with Eq. (14):

$\sigma(E_S) = \sigma(E_B) = \sqrt{N_S\,\varepsilon\,\alpha + N_B\,\beta\,\eta}.$  (32)

The postulated fixed event numbers "per draw", $N_S$ and $N_B$, imply maximal anti-correlation of selected and rejected events (a common mistake is tacitly assuming $\rho(E_S,E_B) = 0$),

$\langle E_S\rangle + \langle E_B\rangle = N_S + N_B \quad\Longrightarrow\quad \rho(E_S,E_B) = -1,$  (33)

and error propagation yields $\sigma(E_S + E_B) = |\,\sigma(E_S) - \sigma(E_B)\,| = 0$.

5.2 Test optimization

Optimizing the test parameters $\vec{\theta}$ requires a compromise between minimizing both $\alpha$ (loss of genuine signal) and $\beta$ (contamination by genuine background) with regard to the sample $\mathfrak{S}^\ast$ of selected events. This can be formalized by defining an appropriate figure of merit (FOM) as a function $F(\vec{\theta})$, implicitly affecting the test characteristics $\alpha(\vec{\theta})$ and $\beta(\vec{\theta})$, and tuning $\vec{\theta}$ with the goal of maximizing the FOM function $F$.

One might be tempted to base the FOM on real data, e.g. by using the purity $x_S$, Eq. (51) of Section 7. However, this poses the danger of artificially enhancing a possible statistical fluctuation, and it must therefore strictly be avoided [7].

Therefore, the FOM should be based on simulated data. If the signal to background ratio of real data, $R_{S/B}$ (Eq. (3)), is approximately known and not too small, and the simulated signal and background sample sizes are scaled such that $N'_S/N'_B \approx R_{S/B}$, then a simple heuristic FOM ansatz may be (see Eq. (7)):

$F(\vec{\theta}) = \frac{n_{S+}}{\sqrt{N'_+}} = \frac{n_{S+}}{\sqrt{n_{S+} + n_{B+}}} \;\rightarrow\; \max.$  (34)

For more sophisticated figures of merit see e.g. [7, 8].
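For illustration, a minimal sketch of such an optimization for a single cut parameter $\theta$ could look as follows; the Gaussian toy test statistics stand in for real MC samples and are assumptions:

xs=randn(1e4,1)+1; xb=randn(1e5,1)-1;  % toy MC test statistics (assumed)
thetas=linspace(-2,2,81); F=zeros(size(thetas));
for k=1:numel(thetas)
    nSp=sum(xs>thetas(k));     % signal MC events passing the cut
    nBp=sum(xb>thetas(k));     % background MC events passing the cut
    F(k)=nSp/sqrt(nSp+nBp);    % heuristic FOM, Eq. (34)
end
[~,kmax]=max(F); theta_opt=thetas(kmax); % cut value maximizing the FOM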

6 Estimation of $p_S$ and $p_B$

The fractions $p_S$ and $p_B = 1-p_S$ must be obtained from independent prior knowledge. If the signal to background ratio (SBR) is known, i.e. $R_{S/B} = \langle R_{S/B}\rangle \pm \sigma(R_{S/B})$, their values are determined by Eq. (4), and error propagation yields

$\sigma(\tilde{p}_S) = \sigma(\tilde{p}_B) = \frac{\sigma(R_{S/B})}{(1+R_{S/B})^2} \qquad\text{and}\qquad \rho(\tilde{p}_S,\tilde{p}_B) = -1.$  (35)

Estimates of $N_S$ and $N_B$ follow straightforwardly from Eq. (2).

In the following we assume that the SBR is unknown. In this non-trivial case the fractions $p_S$ and $p_B$ are unknown parameters; however, they can be estimated by using as prior knowledge the test's characteristic values $[\varepsilon,\alpha,\beta,\eta]$, applied to the real data sample $\mathfrak{D}$ at hand. The method is described in detail below.

The numbers of real data events selected and rejected are given by $E_S = \langle E_S\rangle \pm \sigma(E_S)$ and $E_B = \langle E_B\rangle \pm \sigma(E_B)$, respectively, with $E_S + E_B = N$. These numbers can be used for calculating estimates $\tilde{N}_S$ and $\tilde{N}_B$ of the true numbers of signal and background events by re-writing Eqs. (30–31) as

$\varepsilon\,\tilde{N}_S + \beta\,\tilde{N}_B = E_S,$  (36)
$\alpha\,\tilde{N}_S + \eta\,\tilde{N}_B = E_B,$  (37)

or in matrix notation (see Eq. (15)) as

$\bm{A}\cdot\begin{pmatrix}\tilde{N}_S \\ \tilde{N}_B\end{pmatrix} = \begin{pmatrix}E_S \\ E_B\end{pmatrix}, \qquad \bm{A} = \begin{pmatrix}\varepsilon & \beta \\ \alpha & \eta\end{pmatrix}.$  (38)

Using Eqs. (14, 16) yields the solutions (non-singularity is ensured by $\det(\bm{A}) = \delta = \varepsilon - \beta > 0$, see Eq. (16); the simpler formulae $\tilde{N}_S \approx E_S/\varepsilon$, $\tilde{N}_B \approx N - E_S/\varepsilon$, often used instead, are valid only for a negligible contamination rate $\beta\ll 1$)

$\tilde{N}_S = \frac{(1-\beta)\,E_S - \beta\,E_B}{\delta} = \frac{E_S - \beta\,N}{\delta} \geq 0,$  (39)
$\tilde{N}_B = \frac{\varepsilon\,E_B - (1-\varepsilon)\,E_S}{\delta} = \frac{\varepsilon\,N - E_S}{\delta} \geq 0,$  (40)

obeying $\tilde{N}_S + \tilde{N}_B = N$. Error propagation, observing $\rho(E_S,E_B) = -1$ (see Eq. (33)), yields the corresponding standard deviations as

$\sigma(\tilde{N}_S) = \frac{(1-\beta)\,\sigma(E_S) + \beta\,\sigma(E_B)}{\delta} = \frac{\sigma(E_S)}{\delta},$  (41)
$\sigma(\tilde{N}_B) = \frac{\varepsilon\,\sigma(E_B) + (1-\varepsilon)\,\sigma(E_S)}{\delta} = \frac{\sigma(E_B)}{\delta} = \sigma(\tilde{N}_S),$  (42)

with σ(ES)=σ(EB)\sigma(E_{S})=\sigma(E_{B}) given by Eq. (32) and approximately setting NSN~SN_{S}\approx\tilde{N}_{S} and NBN~BN_{B}\approx\tilde{N}_{B}. N~S+N~B=N\tilde{N}_{S}+\tilde{N}_{B}=N implies maximal anti-correlation:

$\rho(\tilde{N}_S,\tilde{N}_B) = -1,$  (43)

and error propagation yields $\sigma(\tilde{N}_S + \tilde{N}_B) = |\,\sigma(\tilde{N}_S) - \sigma(\tilde{N}_B)\,| = 0$.

Note that the values of Eqs. (41–43) are derived empirically from the observed numbers $E_S$ and $E_B$. They must not be confused with Eq. (28), which represents characteristics of the distributions determined by the data acquisition.

Estimates of the signal and background fractions, $p_S$ and $p_B = 1-p_S$, follow straightforwardly from Eqs. (39–40) with Eq. (2) as

$\tilde{p}_S = \frac{\tilde{N}_S}{N} = \frac{E_S/N - \beta}{\delta}, \qquad \tilde{p}_B = \frac{\tilde{N}_B}{N} = \frac{\varepsilon - E_S/N}{\delta},$  (44)

with $\tilde{p}_S + \tilde{p}_B = 1$, implying maximal anti-correlation:

$\rho(\tilde{p}_S,\tilde{p}_B) = -1.$  (45)

The estimated values of Eq. (44) can be used as prior probabilities in the Bayesian inference formulae, Eqs. (23, 26) of Section 4. Standard deviations are

$\sigma(\tilde{p}_S) = \frac{\sigma(\tilde{N}_S)}{N}, \qquad \sigma(\tilde{p}_B) = \frac{\sigma(\tilde{N}_B)}{N} = \sigma(\tilde{p}_S),$  (46)

with σ(N~S)=σ(N~B)\sigma(\tilde{N}_{S})=\sigma(\tilde{N}_{B}) given by Eqs. (4142).

Accurate estimates of the true numbers of signal and background events in the real data sample $\mathfrak{D}$, see Eqs. (39–42), are essential for data analyses based on event counting, e.g. the measurement of particle decay branching fractions.
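A compact sketch of the whole procedure in MATLAB; the numbers match Example 1 of Part II and are assumptions here:

eff=0.9; bet=0.05; delta=eff-bet;  % from MC, Eqs. (8), (10)
alp=1-eff; eta=1-bet;              % Eq. (14)
N=1100; Es=140; Eb=N-Es;           % observed selection result (assumed)
Ns=(Es-bet*N)/delta;               % = 100, Eq. (39)
Nb=(eff*N-Es)/delta;               % = 1000, Eq. (40)
ps=Ns/N; pb=Nb/N;                  % estimated fractions, Eq. (44)
sigE=sqrt(Ns*eff*alp+Nb*bet*eta);  % Eq. (32), with Ns, Nb inserted
sigN=sigE/delta;                   % Eqs. (41)-(42)
sigp=sigN/N;                       % Eq. (46)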

7 Data quality

Investigation of the sample $\mathfrak{S}^\ast$, containing $E_S$ selected real events, requires accurate knowledge of the expected numbers $\langle E_{S/S}\rangle$ of pure signal events and $\langle E_{B/S}\rangle$ of contaminating background events within this sample. (Such an investigation would typically be summarized as "analysis performed on $E_S$ events, $\langle E_{B/S}\rangle$ of which are expected to be background".)

Their fractions w.r.t. all selected events, the purity (a term defined ambiguously in the literature) and the fraction of contamination, are relevant quantities for judging the data quality:

$x_S = \frac{\langle E_{S/S}\rangle}{E_S}, \qquad x_B = \frac{\langle E_{B/S}\rangle}{E_S}, \qquad\text{obeying}\qquad x_S + x_B = 1.$  (47)

These fractions $x_S$ and $x_B = 1-x_S$ are the probabilities of a selected event being signal or background, respectively. The number $n_S$ ($n_B$) of signal (background) events within the sample $\mathfrak{S}^\ast$ of size $E_S$ is distributed according to the binomial pdf:

$\mathsf{P}(n_S|x_S) = \binom{E_S}{n_S}\,x_S^{n_S}\,(1-x_S)^{E_S - n_S}, \quad 0\leq n_S\leq E_S,$  (48)
$\mathsf{P}(n_B|x_B) = \binom{E_S}{n_B}\,x_B^{n_B}\,(1-x_B)^{E_S - n_B}, \quad 0\leq n_B\leq E_S.$  (49)

7.1 Purity

An estimate of the expected number of pure signal events within the selected sample $\mathfrak{S}^\ast$ can be derived from Eqs. (30, 39), yielding

$\langle E_{S/S}\rangle = \varepsilon\,N_S \approx \varepsilon\,\tilde{N}_S = \frac{\varepsilon}{\delta}\left(E_S - \beta\,N\right).$  (50)

Its fraction w.r.t. all selected events, the purity, is

$x_S = \frac{\langle E_{S/S}\rangle}{E_S} \approx \frac{\varepsilon}{\delta}\left(1 - \beta\,\frac{N}{E_S}\right).$  (51)

7.2 Fraction of contamination

An estimate of the expected number of contaminating background events within the selected sample $\mathfrak{S}^\ast$, under the assumption of a non-zero contamination rate $\beta > 0$, can be derived from Eqs. (30, 40), which yields

$\langle E_{B/S}\rangle = \beta\,N_B \approx \beta\,\tilde{N}_B = \frac{\beta}{\delta}\left(\varepsilon\,N - E_S\right).$  (52)

If the normalization factor $1/r_B$, Eq. (5), is known, together with the expected number $\langle n_{B+}\rangle \approx n_{B+}$ of wrong positives (simulated background passing the test, see Eq. (10)), the above Eq. (52) is equivalent to

$\langle E_{B/S}\rangle = \frac{\langle n_{B+}\rangle}{r_B} \approx \frac{n_{B+}}{r_B} = \beta\cdot\frac{N'_B}{r_B} = \frac{\beta\,N}{1+R_{S/B}},$  (53)

with the signal to background ratio of real data given by Eqs. (3, 39–40):

$R_{S/B} = \frac{N_S}{N_B} \approx \frac{\tilde{N}_S}{\tilde{N}_B} = \frac{E_S - \beta\,N}{\varepsilon\,N - E_S}.$  (54)

The fraction of contamination w.r.t. all selected events is

$x_B = \frac{\langle E_{B/S}\rangle}{E_S} = \frac{\langle n_{B+}\rangle}{r_B\,E_S} = \beta\cdot\frac{N_B}{E_S} \approx \frac{\beta}{\delta}\left(\varepsilon\,\frac{N}{E_S} - 1\right) = 1 - x_S.$  (55)
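Continuing the numerical sketch of Section 6 (same assumed numbers), the purity and the contamination fraction follow as:

eff=0.9; bet=0.05; delta=eff-bet; N=1100; Es=140; % assumed, as before
xS=eff/delta*(1-bet*N/Es);   % purity, Eq. (51): about 0.643
xB=bet/delta*(eff*N/Es-1);   % contamination fraction, Eq. (55): about 0.357
% xS+xB = 1, Eq. (47)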

7.3 Zero “wrong positives”

If the test criteria are so tight that no simulated background event survives the test, i.e. $n_{B+} = 0$ implying $\beta = 0$, then the expectation value $\langle n_{B+}\rangle \neq n_{B+}$, and Eq. (52) and the approximate part of Eq. (53) are no longer valid.

In this case, an estimate of the expectation value $\langle E_{B/S}\rangle$ follows from Bayesian reasoning. Given the sample $\mathfrak{B}'$ of $N'_B$ simulated background events, and assuming a probability $\beta' > 0$ of such an event passing the test, the probability $\mathsf{P}(n_{B+}|\beta')$ of observing $n_{B+}$ events passing this test is given by the binomial pdf

$\mathsf{P}(n_{B+}|\beta') = \binom{N'_B}{n_{B+}}\,(\beta')^{n_{B+}}\,(1-\beta')^{N'_B - n_{B+}}.$  (56)

Now we observe $n_{B+} = 0$ events to have passed the test, i.e. $\mathsf{P}(0|\beta') = (1-\beta')^{N'_B}$. With a flat prior on $\beta'$, this gives the posterior pdf $f(\beta') \propto \mathsf{P}(0|\beta')$. Normalization to 1 yields:

$f(\beta') = \frac{(1-\beta')^{N'_B}}{\int_0^1 (1-b)^{N'_B}\,\mathrm{d}b} = (N'_B + 1)\,(1-\beta')^{N'_B},$  (57)

with the expectation value

$\langle\beta'\rangle = \int_0^1 \beta'\,f(\beta')\,\mathrm{d}\beta' = \frac{1}{N'_B + 2} \approx \frac{1}{N'_B} > 0$  (58)

being the probability of a background event passing the test.

Regarding the sample $\mathfrak{B}$ of $N_B$ real background events, Eq. (52) suggests

$\langle E_{B/S}\rangle = \langle\beta'\rangle\cdot N_B = \frac{N_B}{N'_B} = \frac{1}{r_B},$  (59)

hence in the sample $\mathfrak{S}^\ast$ of real selected events, the expected number of background events equals the normalization factor. Inserting into Eq. (53) yields

$\langle n_{B+}\rangle = \langle E_{B/S}\rangle\cdot r_B = 1,$  (60)

hence the posterior expectation value of the number of simulated background events wrongly passing the test is one, while their observed number is $n_{B+} = 0$.
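The posterior of Eq. (57) and its mean, Eq. (58), can be checked numerically; a minimal sketch with assumed sample sizes $N'_B$ and $N_B$:

NBp=1000;                       % N'_B: simulated background events (assumed)
f=@(b) (NBp+1)*(1-b).^NBp;      % posterior pdf of beta', Eq. (57)
chk=integral(f,0,1);            % = 1: normalization check
mb=integral(@(b) b.*f(b),0,1);  % = 1/(NBp+2), Eq. (58)
NBr=200;                        % assumed number of real background events
EBS=mb*NBr;                     % expected contamination, approx. 1/rB, Eq. (59)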


Part II: Statistical analysis

8 Assumptions and notation

Let $\mathfrak{D}$ be a data set consisting of $N$ recorded events $\mathfrak{e}_i,\ i = 1,\ldots,N$. $\mathfrak{D}$ can be partitioned into the subsets $\mathfrak{S}$ containing the "signal" events and $\mathfrak{B}$ containing the "background" events:

$\mathfrak{S}\cup\mathfrak{B} = \mathfrak{D}, \quad \mathfrak{S}\cap\mathfrak{B} = \emptyset, \quad |\mathfrak{S}| = N_S, \quad |\mathfrak{B}| = N_B, \quad N_S + N_B = N.$  (61)

The respective fractions of "signal" and "background" events are denoted by:

$p_S = N_S/N, \quad p_B = N_B/N = 1 - p_S.$  (62)

A binary indicator $I_i\in\{0,1\}$ is attached to each event $i$ such that

$I_i = 1 \iff \mathfrak{e}_i\in\mathfrak{S}.$  (63)

The actual values of the indicators are unknown a priori.

A binary classifier is a function $C$ that computes a value $c_i = C(\mathfrak{e}_i)\in\{0,1\}$. If $c_i = 1$, event $\mathfrak{e}_i$ is classified as "signal"; if $c_i = 0$, event $\mathfrak{e}_i$ is classified as "background". Such a classifier cannot be expected to perform perfectly. The probabilities of correct and wrong decisions can be summarized as follows:

$\mathsf{P}(c_i=1|I_i=1) = \varepsilon = 1-\alpha,$
$\mathsf{P}(c_i=0|I_i=1) = \alpha,$
$\mathsf{P}(c_i=1|I_i=0) = \beta,$
$\mathsf{P}(c_i=0|I_i=0) = \eta = 1-\beta,$  (64)

where $\alpha\geq 0$ is the probability of classifying a "signal" event as "background" (type I error), and $\beta\geq 0$ is the probability of classifying a "background" event as "signal" (type II error). It is required that $\alpha + \beta < 1$, see Eq. (16). The difference $\varepsilon - \beta = 1 - \alpha - \beta > 0$ will be denoted by $\delta$.

The probabilities $\alpha$ and $\beta$ are unknown a priori, but can be estimated to arbitrary precision from an independent, sufficiently large simulated sample $\mathfrak{D}'$, for which the classification of events into "signal" and "background" is known, see Section 3. In the following, $\alpha$ and $\beta$ will be considered known constants.

The respective probabilities $q_S, q_B$ of classifying an arbitrary event as "signal" or "background" are given by:

$\begin{pmatrix}q_S \\ q_B\end{pmatrix} = \bm{A}\begin{pmatrix}p_S \\ p_B\end{pmatrix}$, with $\bm{A} = \begin{pmatrix}\varepsilon & \beta \\ \alpha & \eta\end{pmatrix}$ and $q_S + q_B = 1$.  (65)

The respective numbers $E_S$ and $E_B$ of events classified by $C$ as "signal" and "background" are given by:

$E_S = \sum_{i=1}^N c_i, \quad E_B = N - E_S.$  (66)

9 Bayesian estimation

Conditional on $q_S$, $E_S$ is distributed according to the binomial distribution Bino($N, q_S$) with the probability function:

$\mathsf{P}(E_S = k|q_S) = \binom{N}{k}\,q_S^k\,(1-q_S)^{N-k}, \quad 0\leq k\leq N.$  (67)

The posterior pdf of $q_S$, given an observation $E_S$, can be written in closed form if the prior pdf $\pi(q_S)$ is taken from the conjugate family of priors, which in this case is the Beta family [9].

If the prior distribution of $q_S$ is Beta($a,b$), the posterior pdf is given by:

$p(q_S|E_S = k) \propto q_S^{k+a-1}\,(1-q_S)^{N-k+b-1} \propto \textsf{Beta}(E_S+a,\,E_B+b).$  (68)

The Bayes estimator, i.e. the posterior mean, and its variance are equal to:

$\langle q_S|E_S\rangle = \frac{E_S+a}{N+a+b}, \quad \sigma^2(q_S|E_S) = \frac{(E_S+a)(E_B+b)}{(N+a+b)^2(N+a+b+1)},$  (69)

where $\langle X\rangle$ is the expectation of the random variable $X$ and $\sigma^2(X)$ is its variance.
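In MATLAB, the posterior moments of Eq. (69) are one-liners; a minimal sketch with the observation of the Fig. 2 snippet ($N = 100$, $E_S = 30$, both assumptions) and a uniform prior:

N=100; Es=30; Eb=N-Es;   % assumed observation
a=1; b=1;                % maximum-entropy prior Beta(1,1)
qmean=(Es+a)/(N+a+b);                      % posterior mean, Eq. (69)
qvar=(Es+a)*(Eb+b)/((N+a+b)^2*(N+a+b+1));  % posterior variance, Eq. (69)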

If no prior information on $q_S$ is available, there are three popular "uninformative" priors in the unit interval, listed below. (Strictly speaking, the maximum-entropy prior and Jeffreys' prior are not really uninformative, as they assign a specific probability to any interval $q_1\leq q_S\leq q_2$ contained in $(0,1)$, see [10]. Neither is Haldane's improper prior, as it can be considered the limit of Beta($\varepsilon,\varepsilon$) for $\varepsilon\to 0$, thus in the limit assigning probability 0.5 to each of $q_S = 0$ and $q_S = 1$.)

  1. the maximum-entropy prior Beta($1,1$) = Unif($0,1$): $\pi(q_S) = 1,\ 0\leq q_S\leq 1$;

  2. Jeffreys' prior Beta($\frac{1}{2},\frac{1}{2}$): $\pi(q_S)\propto q_S^{-1/2}(1-q_S)^{-1/2},\ 0 < q_S < 1$;

  3. Haldane's improper prior Beta($0,0$): $\pi(q_S)\propto q_S^{-1}(1-q_S)^{-1},\ 0 < q_S < 1$.

Prior knowledge about $q_S$ can be incorporated via an informative prior. For example, assume that it is known that $q_S$ is about 0.4, with a standard uncertainty of about 0.1. This information can be encoded by a Beta prior Beta($a,b$) with mean 0.4 and variance 0.01:

$\frac{a}{a+b} = 0.4, \quad \frac{ab}{(a+b)^2(a+b+1)} = 0.01 \quad\Longrightarrow\quad a = 9.2,\ b = 13.8.$  (70)
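The system of Eq. (70) has a closed-form solution, as the following sketch shows for a general prior mean $m$ and variance $v$:

m=0.4; v=0.01;       % prior mean and variance of qS
s=m*(1-m)/v-1;       % a+b, from the variance equation
a=m*s; b=(1-m)*s;    % a = 9.2, b = 13.8, Eq. (70)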

Credible intervals for $q_S$ can be computed via quantiles of the posterior distribution. A symmetric credible interval $C$ with probability content $1-\alpha$ is given by $C = [c_1, c_2]$, where $c_1$ is the $\alpha/2$-quantile of the posterior pdf in Eq. (68) and $c_2$ is the $(1-\alpha/2)$-quantile.

An HPD interval $[h_1,h_2]$ can be computed by solving the following system of non-linear equations:

$f(h_2) - f(h_1) = 0,$  (71)
$F(h_2) - F(h_1) = 1-\alpha,$  (72)

where $f(h) = p(h|E_S = k)$ is the posterior pdf in Eq. (68) and $F(h)$ is its cumulative distribution function. A MATLAB code snippet that solves the system is shown in Fig. 2.

% function that computes the left-hand side of the system
betahpd=@(h,a,b,alpha) [betapdf(h(2),a,b)-betapdf(h(1),a,b);...
         betacdf(h(2),a,b)-betacdf(h(1),a,b)-1+alpha];
N=100; % sample size
Es=30; % observed signal events
alpha=0.05; % probability outside the interval
a=1.5; b=2.3; % parameters of Beta prior
A=Es+a; B=N-Es+b; % parameters of Beta posterior
% credible interval, starting point for fsolve
hcred=[betainv(alpha/2,A,B);betainv(1-alpha/2,A,B)];
% solve the system
h=fsolve(@(par) betahpd(par,A,B,alpha),hcred);
% h(1)=lower bound, h(2)=upper bound
Figure 2: MATLAB code snippet that computes the bounds of an HPD interval.

9.1 Transformation to $p_S$ and $N_S$

Solving Eq. (65) for $p_S$ yields:

$p_S = \frac{q_S - \beta}{\delta}.$  (73)

As this is an affine transformation, the posterior mean and variance of $p_S$ and $N_S = N\,p_S$ are given by:

$\langle p_S|E_S\rangle = \frac{\langle q_S|E_S\rangle - \beta}{\delta}, \quad \sigma^2(p_S|E_S) = \frac{\sigma^2(q_S|E_S)}{\delta^2},$  (74)
$\langle N_S|E_S\rangle = N\,\langle p_S|E_S\rangle, \quad \sigma^2(N_S|E_S) = N^2\,\sigma^2(p_S|E_S).$  (75)

The bounds of credible and HPD intervals are transformed in the same way.
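A short sketch of the transformation, continuing the assumed numbers ($N = 100$, $E_S = 30$, uniform prior) with assumed test characteristics $\alpha = 0.1$, $\beta = 0.05$:

N=100; Es=30; Eb=N-Es; a=1; b=1;            % assumed, as above
qmean=(Es+a)/(N+a+b);                       % posterior mean of qS, Eq. (69)
qvar=(Es+a)*(Eb+b)/((N+a+b)^2*(N+a+b+1));   % posterior variance of qS
alp=0.1; bet=0.05; delta=1-alp-bet;         % assumed test characteristics
pmean=(qmean-bet)/delta;                    % posterior mean of pS, Eq. (74)
pvar=qvar/delta^2;                          % posterior variance of pS, Eq. (74)
Nmean=N*pmean; Nvar=N^2*pvar;               % posterior moments of NS, Eq. (75)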

Example 1

Assume $N = 1100$, $p_S = 1/11$, $p_B = 10/11$, $\alpha = 0.1$, $\beta = 0.05$. Then $\varepsilon = 0.9$, $\eta = 0.95$, $q_S = 7/55\approx 0.1273$, $q_B = 48/55\approx 0.8727$. Figure 3 shows histograms of the difference $\langle N_S|E_S\rangle - N_{S,\mathrm{sim}}$ in $M = 10^5$ simulated samples, with three non-informative priors and an informative prior. The informative prior Beta($7.5, 42.5$) in subplot (d) encodes the prior information $q_S = 0.15$ with a standard uncertainty of 0.05 ($a = 7.5$, $b = 42.5$). The estimates in (d) have the largest bias, but the smallest rms error.

10 Frequentist estimation

From a frequentist point of view, the observed data set $\mathfrak{D}$ is drawn at random from a (virtual or fictitious) parent population that has to be specified by the person(s) performing the data analysis. From now on, $N$ is a random variable denoting the size of the elements of the population; the quantities $N_S, N_B, E_S, E_B$ are random variables, too. In contrast to the Bayesian viewpoint, the fraction $p_S$ of "signal" events is an unknown fixed parameter to be estimated. Of course, it is assumed that all data sets in the parent population have the same average fraction $p_S$ of "signal" events. The size of the observed data set $\mathfrak{D}$ is denoted by the constant $N_{\mathfrak{D}}$.

The specification of the parent population is neither unique nor always obvious, and each choice leads to a different assessment of the unconditional moments of the estimates $\tilde{N}_S$ and $\tilde{N}_B$. However, conditional on the observed size $N$ of the data set, the moments are always the same. This point is illustrated by three choices of parent population: the first with fixed $N$ (Subsection 10.1), the second with Poisson-distributed $N$ (Subsection 10.2), and the third with $N$ drawn from a mixture of Poisson distributions (Subsection 10.3).

10.1 Data set size $N$ is fixed

If $N$ is fixed, the unconditional distribution of $N_S$ is the same as the conditional distribution, i.e. Bino($N, p_S$), and the distribution of $N_B$ is Bino($N, p_B$), with $p_B = 1-p_S$. It follows that:

$\langle N_S\rangle = Np_S, \quad \langle N_B\rangle = Np_B,$  (76)
$\sigma^2(N_S) = \sigma^2(N_B) = -\textsf{cov}(N_S,N_B) = Np_Sp_B, \quad \rho(N_S,N_B) = -1,$  (77)

where $\textsf{cov}(X,Y)$ is the covariance of $X$ and $Y$ and $\rho(X,Y)$ is their correlation. It follows that:

$\langle E_S\rangle = Nq_S, \quad \langle E_B\rangle = Nq_B,$  (78)
$\sigma^2(E_S) = \sigma^2(E_B) = -\textsf{cov}(E_S,E_B) = Nq_Sq_B, \quad \rho(E_S,E_B) = -1.$  (79)

The relation between $\langle N_S\rangle, \langle N_B\rangle$ and $\langle E_S\rangle, \langle E_B\rangle$ can be written in the following way (see Eq. (65)):

$\begin{pmatrix}\langle E_S\rangle \\ \langle E_B\rangle\end{pmatrix} = \bm{A}\begin{pmatrix}\langle N_S\rangle \\ \langle N_B\rangle\end{pmatrix}$, with $\bm{A} = \begin{pmatrix}\varepsilon & \beta \\ \alpha & \eta\end{pmatrix}$.  (80)

Using Eq. (80), it is easy to show that

$\begin{pmatrix}\tilde{N}_S \\ \tilde{N}_B\end{pmatrix} = \bm{H}\begin{pmatrix}E_S \\ E_B\end{pmatrix}$, with $\bm{H} = \bm{A}^{-1} = \frac{1}{\delta}\begin{pmatrix}\eta & -\beta \\ -\alpha & \varepsilon\end{pmatrix},$  (81)

is an unbiased estimator of $N_S, N_B$:

$\langle\tilde{N}_S\rangle = Np_S, \quad \langle\tilde{N}_B\rangle = Np_B.$  (82)

The joint variance-covariance matrix of $\tilde{N}_S, \tilde{N}_B$ can be computed from Eq. (81) by linear error propagation:

$\sigma^2(\tilde{N}_S) = \sigma^2(\tilde{N}_B) = -\textsf{cov}(\tilde{N}_S,\tilde{N}_B) = Nq_Sq_B/\delta^2, \quad \rho(\tilde{N}_S,\tilde{N}_B) = -1.$  (83)

The relative frequencies $p_S$ and $p_B$ are estimated by:

$\tilde{p}_S = \tilde{N}_S/N, \quad \tilde{p}_B = \tilde{N}_B/N.$  (84)

These estimators are unbiased, and their expectations and variance-covariance matrix are equal to:

$\langle\tilde{p}_S\rangle = p_S, \quad \langle\tilde{p}_B\rangle = p_B,$  (85)
$\sigma^2(\tilde{p}_S) = \sigma^2(\tilde{p}_B) = q_Sq_B/(N\delta^2),$  (86)
$\textsf{cov}(\tilde{p}_S,\tilde{p}_B) = -q_Sq_B/(N\delta^2) \;\Longrightarrow\; \rho(\tilde{p}_S,\tilde{p}_B) = -1.$  (87)

$\tilde{N}_S$ can also be written as:

$\tilde{N}_S = \frac{E_S - N\beta}{\delta}.$  (88)

Conditional on $N_S$, the observable $E_S$ is the sum of two random variables $X$ and $Y$, where $X$ ($Y$) is the number of signal (background) events classified as "signal". The distribution of $X$ is therefore Bino($N_S, 1-\alpha$), and the distribution of $Y$ is Bino($N_B, \beta$). It follows that:

$\langle E_S|N_S\rangle = N_S(1-\alpha) + N_B\beta,$  (89)
$\langle E_S - N_S|N_S\rangle = -N_S\alpha + N_B\beta,$  (90)
$\sigma^2(E_S|N_S) = N_S\alpha(1-\alpha) + N_B\beta(1-\beta).$  (91)

Still conditional on $N_S$, we have:

$\langle\tilde{N}_S|N_S\rangle = \langle E_S/\delta - N\beta/\delta\,|\,N_S\rangle = (1/\delta)\left(N_S(1-\alpha) + (N-N_S)\beta\right) - N\beta/\delta = (1/\delta)\,N_S\delta + N\beta/\delta - N\beta/\delta = N_S,$  (92)
$\langle\tilde{N}_S - N_S|N_S\rangle = 0,$  (93)
$\sigma^2(\tilde{N}_S|N_S) = (1/\delta^2)\,\sigma^2(E_S|N_S) = (1/\delta^2)\left(N_S\alpha(1-\alpha) + N_B\beta(1-\beta)\right).$  (94)

$\tilde{N}_S$ and $N_S$ are correlated. From Eq. (89) it follows that:

$\langle E_SN_S|N_S\rangle = N_S^2(1-\alpha) + N_SN_B\beta \;\Longrightarrow$
$\langle E_SN_S\rangle = \langle N_S^2\rangle(1-\alpha) + \langle N_S(N-N_S)\rangle\beta = \delta\langle N_S^2\rangle + N\beta\langle N_S\rangle = \delta\left(Np_S(1-p_S) + N^2p_S^2\right) + N^2\beta p_S \;\Longrightarrow$
$\textsf{cov}(E_S,N_S) = \delta\,Np_S(1-p_S).$  (95)

As $\tilde{N}_S = (E_S - N\beta)/\delta$, we finally obtain:

$\textsf{cov}(\tilde{N}_S,N_S) = Np_S(1-p_S),$  (96)
$\rho(\tilde{N}_S,N_S) = \delta\,\frac{\sqrt{p_S(1-p_S)}}{\sqrt{q_S(1-q_S)}}.$  (97)

It follows from Eq. (96) that $\tilde{N}_S - N_S$ and $N_S$ are uncorrelated:

$\textsf{cov}(\tilde{N}_S - N_S,\,N_S) = \textsf{cov}(\tilde{N}_S,N_S) - \sigma^2(N_S) = 0.$  (98)
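The statements of Eqs. (93), (97) and (98) can be verified by simulation; a minimal sketch with the values of Example 1 (binornd and corr are from the Statistics Toolbox, as in Fig. 2):

M=1e5; N=1100; pS=1/11;                 % Example 1 values
alp=0.1; bet=0.05; delta=1-alp-bet;
NS=binornd(N,pS,M,1);                   % signal content of each data set
ES=binornd(NS,1-alp)+binornd(N-NS,bet); % events classified as signal
NSt=(ES-N*bet)/delta;                   % estimator, Eq. (88)
bias=mean(NSt-NS);                      % approx. 0, Eq. (93)
r1=corr(NSt,NS);                        % approx. 0.733, Eq. (97)
r2=corr(NSt-NS,NS);                     % approx. 0, Eq. (98)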

Example 2

Using the numerical values of Example 1, Table 1 shows the expectations, standard deviations and correlation of $\tilde{N}_S, \tilde{N}_B$, averaged over a simulated population of $10^5$ data sets. The exact values are shown as well. Fig. 4 shows histograms and scatter plots of the estimates $\tilde{N}_S - N_S$ and $\tilde{N}_S$.

Table 1: Empirical and exact expectations, standard deviations and correlation of $\tilde{N}_S, \tilde{N}_B$ with fixed data size $N = 1100$.

  $N = 1100$                              Empirical      Exact
  $\langle\tilde{N}_S - N_S\rangle$          -0.019          0
  $\sigma(\tilde{N}_S - N_S)$ (*)             8.87           8.84
  $\langle\tilde{N}_S\rangle$                99.94         100.00
  $\langle\tilde{N}_B\rangle$              1000.06        1000.00
  $\sigma(\tilde{N}_S)$                      13.02          13.00
  $\sigma(\tilde{N}_B)$                      13.02          13.00
  $\rho(\tilde{N}_S,\tilde{N}_B)$            -1.00          -1.00
  $\rho(\tilde{N}_S,N_S)$ (†)                 0.732          0.733
  $\rho(\tilde{N}_S - N_S,N_S)$               0.001          0

(*) The exact value is the average of the exact standard deviation as given by Eq. (94) over the $10^5$ simulated data sets.
(†) The exact value is the average of the exact correlation as given by Eq. (97) over the $10^5$ simulated data sets.

The variances and covariances derived in this subsection ultimately depend on the unknown parameter $p_S$. In practice, $p_S$ has to be replaced by its estimated value, leading to a deviation from the exact theoretical value. The size of the deviations is illustrated in Fig. 5, which shows histograms of $\sigma(\tilde{N}_S)$, $\sigma(\tilde{N}_S - N_S)$ and $\rho(\tilde{N}_S,N_S)$. The entries are the approximate standard deviations that use $\tilde{p}_S$ and $\tilde{q}_S$ instead of $p_S$ and $q_S$.

10.2 Data set size $N$ is Poisson-distributed

Next it is assumed that

  • each data set in the population is the union of a set $\mathfrak{S}$ of size $N_S$ and a set $\mathfrak{B}$ of size $N_B$;

  • $N_S$ and $N_B$ are independent Poisson-distributed random variables with mean values $\lambda p_S$ and $\lambda p_B$, respectively.

The data set size $N = N_S + N_B$ is then distributed according to Pois($\lambda$). In the absence of further external information, the obvious choice of the Poisson parameter $\lambda$ is $\lambda = N_{\mathfrak{D}}$.

Conditional on $N$, we have:

$\langle E_S|N\rangle = Nq_S, \quad \langle E_B|N\rangle = Nq_B,$  (99)
$\sigma^2(E_S|N) = \sigma^2(E_B|N) = Nq_Sq_B,$  (100)
$\textsf{cov}(E_S,E_B|N) = -Nq_Sq_B, \quad \rho(E_S,E_B|N) = -1.$  (101)

If $N_S, N_B$ are estimated via Eq. (81), their conditional expectations and variance-covariance matrix are given by:

$\langle\tilde{N}_S|N\rangle = Np_S, \quad \langle\tilde{N}_B|N\rangle = Np_B,$  (102)
$\sigma^2(\tilde{N}_S|N) = \sigma^2(\tilde{N}_B|N) = Nq_Sq_B/\delta^2,$  (103)
$\textsf{cov}(\tilde{N}_S,\tilde{N}_B|N) = -Nq_Sq_B/\delta^2 \;\Longrightarrow\; \rho(\tilde{N}_S,\tilde{N}_B|N) = -1.$  (104)

The unknown fractions $p_S, p_B$ are estimated by:

$\tilde{p}_S = \tilde{N}_S/N, \quad \tilde{p}_B = \tilde{N}_B/N.$  (105)

Conditional on $N$, these estimators are unbiased. Their expectations and conditional variance-covariance matrix are equal to:

$\langle\tilde{p}_S|N\rangle = p_S, \quad \langle\tilde{p}_B|N\rangle = p_B,$  (106)
$\sigma^2(\tilde{p}_S|N) = \sigma^2(\tilde{p}_B|N) = q_Sq_B/(N\delta^2),$  (107)
$\textsf{cov}(\tilde{p}_S,\tilde{p}_B|N) = -q_Sq_B/(N\delta^2) \;\Longrightarrow\; \rho(\tilde{p}_S,\tilde{p}_B|N) = -1.$  (108)

The unconditional expectations and the elements of the variance-covariance matrix can be calculated by taking the expectation with respect to the distribution of $N$, which is Pois($\lambda$). First, the conditional second moments around 0 (i.e. raw moments) are computed:

$\langle\tilde{N}_S^2|N\rangle = Nq_Sq_B/\delta^2 + N^2p_S^2,$  (109)
$\langle\tilde{N}_B^2|N\rangle = Nq_Sq_B/\delta^2 + N^2p_B^2,$  (110)
$\langle\tilde{N}_S\tilde{N}_B|N\rangle = -Nq_Sq_B/\delta^2 + N^2p_Sp_B.$  (111)

Taking the expectation of $N$ over Pois($\lambda$) yields:

$\langle\tilde{N}_S\rangle = \lambda p_S, \quad \langle\tilde{N}_B\rangle = \lambda p_B,$  (112)
$\langle\tilde{N}_S^2\rangle = \lambda q_Sq_B/\delta^2 + (\lambda^2+\lambda)p_S^2,$  (113)
$\langle\tilde{N}_B^2\rangle = \lambda q_Sq_B/\delta^2 + (\lambda^2+\lambda)p_B^2,$  (114)
$\langle\tilde{N}_S\tilde{N}_B\rangle = -\lambda q_Sq_B/\delta^2 + (\lambda^2+\lambda)p_Sp_B.$  (115)

The unconditional variance-covariance matrix is then given by:

$\sigma^2(\tilde{N}_S) = \lambda(\eta^2q_S + \beta^2q_B)/\delta^2,$  (116)
$\sigma^2(\tilde{N}_B) = \lambda(\alpha^2q_S + \varepsilon^2q_B)/\delta^2,$  (117)
$\textsf{cov}(\tilde{N}_S,\tilde{N}_B) = -\lambda(\alpha\eta\,q_S + \beta\varepsilon\,q_B)/\delta^2,$  (118)
$\rho(\tilde{N}_S,\tilde{N}_B) = -\frac{\alpha\eta\,q_S + \beta\varepsilon\,q_B}{\sqrt{(\eta^2q_S + \beta^2q_B)(\alpha^2q_S + \varepsilon^2q_B)}}.$  (119)
Table 2: Unconditional empirical and exact expectations, standard deviations and correlation of $\tilde{N}_S, \tilde{N}_B$ with data size $N$ distributed according to Pois($\lambda$) with $\lambda = 1100$.

  $\lambda = 1100$                        Empirical      Exact
  $\langle\tilde{N}_S\rangle$                99.94         100.00
  $\langle\tilde{N}_B\rangle$              1000.08        1000.00
  $\sigma(\tilde{N}_S)$                      13.31          13.35
  $\sigma(\tilde{N}_B)$                      32.92          32.84
  $\rho(\tilde{N}_S,\tilde{N}_B)$            -0.18          -0.18

Example 3

Using the numerical values of Example 1, the exact conditional expectations, standard deviations and correlation of $\tilde{N}_{S},\tilde{N}_{B}$ are identical to those shown in Table 1. Table 2 shows the corresponding unconditional values with $\lambda=N_{\mathfrak{D}}$. Fig. 6 shows scatter plots of $N_{S}$ versus $N_{B}$, $E_{S}$ versus $E_{B}$, $N_{S}-Np_{S}$ versus $N_{B}-Np_{B}$, and $\tilde{N}_{S}$ versus $\tilde{N}_{B}$.
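The structure of the comparison in Table 2 can be reproduced by the following sketch, which draws $N\sim\textsf{Pois}(\lambda)$ and compares empirical moments of $\tilde{N}_{S},\tilde{N}_{B}$ with the exact values of Eqs. (112) and (116)–(119). The estimator form is again the reconstruction of Eq. (81), and the test characteristics are hypothetical, so the numbers differ from those of Example 1.

# Minimal sketch of the empirical-vs-exact comparison of Table 2.
# Hypothetical test characteristics; the estimator form is a reconstruction
# of Eq. (81): N~_S = (eta*E_S - beta*E_B)/delta, N~_B = (eps*E_B - alpha*E_S)/delta.
import numpy as np

rng = np.random.default_rng(1)

eps, beta = 0.90, 0.10
alpha, eta = 1.0 - eps, 1.0 - beta
delta = eps * eta - alpha * beta
pS, pB = 1.0 / 11.0, 10.0 / 11.0
qS = eps * pS + beta * pB
qB = 1.0 - qS
lam, M = 1100.0, 100_000

N = rng.poisson(lam, size=M)      # Poisson-distributed data set sizes
ES = rng.binomial(N, qS)          # selected events, given N
EB = N - ES                       # rejected events
NtS = (eta * ES - beta * EB) / delta
NtB = (eps * EB - alpha * ES) / delta

# Exact unconditional moments from Eqs. (112) and (116)-(119)
varS = lam * (eta**2 * qS + beta**2 * qB) / delta**2
varB = lam * (alpha**2 * qS + eps**2 * qB) / delta**2
covSB = -lam * (alpha * eta * qS + beta * eps * qB) / delta**2

print("<N~_S>  :", NtS.mean(), " exact:", lam * pS)
print("sigma_S :", NtS.std(),  " exact:", varS**0.5)
print("rho     :", np.corrcoef(NtS, NtB)[0, 1], " exact:", covSB / (varS * varB)**0.5)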

10.3 Data set size $N$ is drawn from a mixture of Poisson distributions

An even more general parent distribution is obtained by assuming that the size of the data sets in the population is distributed according to a mixture of Poisson distributions. This means that the data set size $N$ is drawn from a Poisson distribution with mean $\lambda$, and $\lambda$ is in turn drawn from a mixing distribution $G$ with mean $\mu$ and variance $\tau^{2}$. The results of the previous subsection show that, conditional on $\lambda$:

\langle E_{S}|\lambda\rangle=\lambda q_{S}, \quad \langle E_{B}|\lambda\rangle=\lambda q_{B},
\sigma^{2}(E_{S}|\lambda)=\lambda q_{S}, \quad \sigma^{2}(E_{B}|\lambda)=\lambda q_{B}, \quad \textsf{cov}(E_{S},E_{B}|\lambda)=0. \qquad (120)

This yields the following second raw moments:

\langle E_{S}^{2}|\lambda\rangle=\lambda^{2}q_{S}^{2}+\lambda q_{S},
\langle E_{B}^{2}|\lambda\rangle=\lambda^{2}q_{B}^{2}+\lambda q_{B},
\langle E_{S}E_{B}|\lambda\rangle=\langle E_{S}|\lambda\rangle\,\langle E_{B}|\lambda\rangle=\lambda^{2}q_{S}q_{B}. \qquad (121)

The computation of the unconditional moments relies on the fact that the raw moments of a mixture are the corresponding mixture of the raw moments of its components, i.e., the conditional raw moments of Eq. (121) averaged over the mixing distribution $G$. It follows that:

\langle E_{S}\rangle=\langle\lambda q_{S}\rangle_{G}=\mu q_{S},
\langle E_{B}\rangle=\langle\lambda q_{B}\rangle_{G}=\mu q_{B},
\langle E_{S}^{2}\rangle=\langle\lambda^{2}q_{S}^{2}+\lambda q_{S}\rangle_{G}=(\mu^{2}+\tau^{2})q_{S}^{2}+\mu q_{S}, \qquad (122)
\langle E_{B}^{2}\rangle=\langle\lambda^{2}q_{B}^{2}+\lambda q_{B}\rangle_{G}=(\mu^{2}+\tau^{2})q_{B}^{2}+\mu q_{B},
\langle E_{S}E_{B}\rangle=\langle\lambda^{2}q_{S}q_{B}\rangle_{G}=(\mu^{2}+\tau^{2})q_{S}q_{B},

where $\langle\cdot\rangle_{G}$ denotes the expectation with respect to the mixing distribution $G$. The final step is the computation of the full unconditional variance-covariance matrix of $E_{S},E_{B}$:

\sigma^{2}(E_{S})=\langle E_{S}^{2}\rangle-\langle E_{S}\rangle^{2}=\tau^{2}q_{S}^{2}+\mu q_{S},
\sigma^{2}(E_{B})=\langle E_{B}^{2}\rangle-\langle E_{B}\rangle^{2}=\tau^{2}q_{B}^{2}+\mu q_{B}, \qquad (123)
\textsf{cov}(E_{S},E_{B})=\langle E_{S}E_{B}\rangle-\langle E_{S}\rangle\langle E_{B}\rangle=\tau^{2}q_{S}q_{B}.
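As a cross-check of Eq. (123) — a sketch with hypothetical parameters, not part of the paper's analysis — one can choose a concrete mixing distribution, e.g. $G=\textsf{Gamma}(k,\theta)$ with $\mu=k\theta$ and $\tau^{2}=k\theta^{2}$ (the resulting counts are then negative binomial), and compare empirical and exact moments:

# Minimal sketch: Monte Carlo check of Eq. (123) for a Gamma mixing
# distribution. qS is a hypothetical selection probability; conditional on
# lambda, E_S and E_B are independent Poisson counts as in Eq. (120).
import numpy as np

rng = np.random.default_rng(7)

qS = 0.17
qB = 1.0 - qS
k, theta = 110.0, 10.0            # G = Gamma(k, theta): mu = 1100, tau^2 = 11000
mu, tau2 = k * theta, k * theta**2
M = 200_000

lam = rng.gamma(k, theta, size=M) # lambda drawn from G
ES = rng.poisson(lam * qS)        # selected events, Pois(lambda*qS)
EB = rng.poisson(lam * qB)        # rejected events, Pois(lambda*qB)

print("var(E_S):", ES.var(),            " exact:", tau2 * qS**2 + mu * qS)
print("var(E_B):", EB.var(),            " exact:", tau2 * qB**2 + mu * qB)
print("cov     :", np.cov(ES, EB)[0, 1], " exact:", tau2 * qS * qB)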

Using Eq. (81) and linear error propagation, the estimates $\tilde{N}_{S},\tilde{N}_{B}$, their expectations and their joint variance-covariance matrix can be computed; see Subsection 3.2. The explicit formulas are too complicated to be spelled out here. Note that two mixing distributions with the same mean and variance give the same estimates.

The special case where $G$ is a Poisson distribution $\textsf{Pois}(\mu)$ is of particular interest. As $\tau^{2}=\mu$ in this case, the following variance-covariance matrix of $E_{S},E_{B}$ is obtained:

\sigma^{2}(E_{S})=\mu(q_{S}^{2}+q_{S}), \quad \sigma^{2}(E_{B})=\mu(q_{B}^{2}+q_{B}), \quad \textsf{cov}(E_{S},E_{B})=\mu q_{S}q_{B}. \qquad (124)

Linear error propagation then gives the following joint variance-covariance matrix of $\tilde{N}_{S},\tilde{N}_{B}$:

\sigma^{2}(\tilde{N}_{S})=\frac{\mu(2\beta^{2}-4\beta q_{S}+q_{S}^{2}+q_{S})}{\delta^{2}},
\sigma^{2}(\tilde{N}_{B})=\frac{\mu(2\alpha^{2}-4\alpha q_{B}+q_{B}^{2}+q_{B})}{\delta^{2}}, \qquad (125)
\textsf{cov}(\tilde{N}_{S},\tilde{N}_{B})=-\frac{\mu(2\alpha-q_{B}-2\alpha\beta-2\alpha q_{B}+2\beta q_{B}+q_{B}^{2})}{\delta^{2}}. \qquad (126)
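Although the intermediate algebra is suppressed here, Eqs. (125)–(126) can be verified symbolically: propagating the covariance matrix of Eq. (124) through the Jacobian of the (reconstructed) estimator of Eq. (81), with $\alpha=1-\varepsilon$, $\eta=1-\beta$ and $q_{B}=1-q_{S}$, reproduces both formulas. A sketch using sympy:

# Symbolic check that linear error propagation of Eq. (124) reproduces
# Eqs. (125)-(126). The Jacobian is that of the reconstructed estimator
# N~_S = (eta*E_S - beta*E_B)/delta, N~_B = (eps*E_B - alpha*E_S)/delta.
import sympy as sp

eps, beta, qS, mu = sp.symbols('varepsilon beta q_S mu', positive=True)
alpha, eta = 1 - eps, 1 - beta    # error type I, specificity
qB = 1 - qS
delta = eps * eta - alpha * beta

J = sp.Matrix([[eta, -beta], [-alpha, eps]]) / delta   # d(N~_S,N~_B)/d(E_S,E_B)
C = mu * sp.Matrix([[qS**2 + qS, qS * qB],             # covariance of E_S, E_B,
                    [qS * qB, qB**2 + qB]])            # Eq. (124)

V = sp.simplify(J * C * J.T)      # propagated covariance of N~_S, N~_B

rhs_SS = mu * (2*beta**2 - 4*beta*qS + qS**2 + qS) / delta**2          # Eq. (125)
rhs_SB = -mu * (2*alpha - qB - 2*alpha*beta - 2*alpha*qB
                + 2*beta*qB + qB**2) / delta**2                        # Eq. (126)

print(sp.simplify(V[0, 0] - rhs_SS))   # 0
print(sp.simplify(V[0, 1] - rhs_SB))   # 0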

In the absence of further external information, the obvious choice of the Poisson parameter $\mu$ is $\mu=N_{\mathfrak{D}}$.

11 Discussion

Conditional on the data set size $N$, the distribution of the estimates $\tilde{N}_{S},\tilde{N}_{B}$ is always the same. The conditional estimates are identical to the Bayes estimates with Haldane's prior; their conditional variance, however, differs slightly from the posterior variance of the Bayes estimator. The unconditional moments of $\tilde{N}_{S},\tilde{N}_{B}$ depend, of course, on the assumptions about the parent population, so different data analysts may come to different conclusions. Moreover, if only a single data set has been observed, the conditional moments are the more meaningful ones.

An advantage of the Bayesian approach is that prior information about the “signal” content of the data set can easily be incorporated into the Bayes estimator via a conjugate Beta prior. On the other hand, with the exception of Haldane's prior, the Bayes estimators show a small bias with respect to the “truth” value used in the simulation, and different data analysts may prefer different priors, whether uninformative or informative.

Figure 3: Histograms of the difference $\langle N_{S}|E_{S}\rangle-N_{\mathrm{S,sim}}$ in $M=10^{5}$ simulated samples with four priors: (a) uniform prior, (b) Jeffreys prior, (c) Haldane's prior, (d) Beta($7.5,\,42.5$) prior.
Figure 4: Fixed size $N$ of the data sets. (a) Histogram of $\tilde{N}_{S}-N_{S}$, (b) scatter plot of $N_{S}$ versus $\tilde{N}_{S}-N_{S}$, (c) histogram of $N_{S}$, (d) scatter plot of $N_{S}$ versus $\tilde{N}_{S}$. Note that $\tilde{N}_{S}$ takes only 91 discrete values.
Figure 5: Fixed size $N$ of the data sets. Histogram of (a) $\sigma(\tilde{N}_{S}-N_{S})$, (b) $\sigma(\tilde{N}_{S}-N_{S})$, (c) $\rho(\tilde{N}_{S},N_{S})$. The entries are calculated with the estimated values of $\tilde{p}_{S}$ and $\tilde{q}_{S}$.
Figure 6: Poisson-distributed size $N$ of the data sets. Scatter plots of (a) $N_{S}$ versus $N_{B}$, (b) $E_{S}$ versus $E_{B}$, (c) $N_{S}-Np_{S}$ versus $N_{B}-Np_{B}$, (d) $\tilde{N}_{S}$ versus $\tilde{N}_{B}$. $N$ is distributed according to $\textsf{Pois}(\lambda)$ with $\lambda=1100$.
