
Minimising quantifier variance under prior probability shift

Dirk Tasche Independent researcher, e-mail: dirk.tasche@gmx.net
Abstract

For the binary prevalence quantification problem under prior probability shift, we determine the asymptotic variance of the maximum likelihood estimator. We find that it is a function of the Brier score for the regression of the class label on the features under the test data set distribution. This observation suggests that optimising the accuracy of a base classifier, as measured by the Brier score, on the training data set helps to reduce the variance of the related quantifier on the test data set. Therefore, we also point out training criteria for the base classifier that imply optimisation of the Brier scores on both the training and the test data sets.
Keywords: Prior probability shift, quantifier, class distribution estimation, Cramér-Rao bound, maximum likelihood estimator, Brier score.

1 Introduction

The survey paper [8] described the problem of estimating prior class probabilities (also called prevalences) on a test set whose distribution differs from that of the training set (the quantification problem) as follows: “Given a labelled training set, induce a quantifier that takes an unlabelled test set as input and returns its best estimate of the class distribution.” As [8] and more recent work on the problem make clear, it has been widely investigated over the past twenty years.

Many different approaches to the quantification of prior class probabilities have been proposed and analysed (see, e.g., [8, 10, 16]), but the following question appears not to have received much attention so far:

Is it worth the effort to try to train a good (accurate) hard (or soft or probabilistic) classifier as the ‘base classifier’ for the task of quantification if the class labels of individual instances are unimportant and only the aggregate prior class probabilities are of interest?

In principle, there is a clear answer to this question. The accuracy of the classifier matters at least in the extreme cases:

  • If a classifier is least accurate because its predictions and the true class labels are stochastically independent, then quantification is not feasible.

  • If a classifier is most accurate in the sense of making perfect predictions then perfect quantification is easy by applying Classify & Count [4].

But if no perfect classifier is available, can we be content to deploy a moderately accurate classifier for quantification, or should we rather strive to develop an optimal classifier, possibly based on a comprehensive feature selection process?

Some researchers indeed suggest that the accuracy of the base classifiers is less important for quantification than for classification. [5] made the following statements:

  • From the abstract of [5]: “These strengths can make quantification practical for business use, even where classification accuracy is poor.”

  • P. 165 of [5]: “The effort to develop special purpose features or classifiers could increase the cost significantly, with no guarantee of an accurate classifier. Thus, an imperfect classifier is often all that is available.”

  • P. 166 of [5]: “It is sufficient but not necessary to have a perfect classifier in order to estimate the class distribution well. If the number of false positives balances against false negatives, then the overall count of predicted positives is nonetheless correct. Intuitively, the estimation task is easier for not having to deliver accurate predictions on individual cases.”

The point about the mutual cancellation of false positives and false negatives is also made by a number of other researchers, for instance [3]. On p. 74, [3] wrote: “Equation 1 [with the definition of the F-measure] shows that $F_1$ deteriorates with $(FP+FN)$ and not with $|FP-FN|$, as would instead be required of a function that truly optimizes quantification.”

There are also researchers who hold the contrary position, at least as far as quantification under an assumption of prior probability shift (see (3.1) below for the formal definition) is concerned:

  • [25] noted for the class of ‘ratio estimators’ they introduced that it was both desirable and feasible to construct estimators with small asymptotic variances.

  • [22] demonstrated by a simulation study that estimating the prior class probabilities by means of a more accurate base classifier may entail much shorter confidence intervals for the estimates.

  • [1, 7] pointed out the efficiency of the maximum likelihood estimator (MLE) for the quantification task and made the case for its application.

In the following, we revisit the question of the usefulness of accurate classifiers for quantification:

After giving an overview of related research in Section 2 and specifying the setting and assumptions for the binary quantification problem in Section 3, we recall the technical details of the definition of the MLE for the positive class prevalence in Section 4. In particular, we show that the MLE is well-defined under the mild condition that the test sample consists of at least two different points, see (4.4) below.

In Section 5, we describe, based on its representation in terms of the Fisher information, the Cramér-Rao lower bound for the variances of unbiased estimators of the positive class prevalence, see (5.1) below. This lower bound, at the same time, is the large sample variance of the MLE defined in Section 4. Thus, the Cramér-Rao lower bound is achievable in theory by MLEs – which might explain to some extent the superiority of MLEs as observed in [1, 7].

In Section 6, we show for the test distribution of the feature vector that its associated Fisher information with respect to the positive class prevalence is – up to a factor depending only on the true prevalence – just the variance of the posterior positive class probability (Proposition 6.1 below). The variance of the posterior probability is closely related to the Brier score for the regression of the class label on the feature vector. While the Brier score decreases when the information content of the feature vector increases, the variance of the posterior probability decreases with shrinking information content of the features. In any case, these observations imply that the large sample variance of the MLE (or the Cramér-Rao lower bound for the variances of unbiased estimators) may be reduced when a feature vector with larger information content is selected. Section 6 concludes with two suggestions of how this can be achieved in practice (Brier curves, ROC analysis).

In Section 7, we illustrate the observations of Sections 5 and 6 with a numerical example. Table 1 below demonstrates variance reduction through more powerful features both for the ML quantifier and a non-ML quantifier. Figure 1 below suggests that differences in efficiency between the ML quantifier and non-ML quantifiers may depend both on the information content (power) of the feature vector and on the true value of the positive class prevalence.

2 Related work

Prior probability shift is a special type of data set shift; see [15] for background information and a taxonomy of data set shift. In the literature, other terms are also used for prior probability shift, for instance ‘global drift’ [12] or ‘label shift’ [13].

The problem of estimating the test set prior class probabilities can also be interpreted as the problem of estimating the parameters of a ‘mixture model’ [6] whose component distributions are learnt on a training set. See [17] for an early work on the properties of the maximum likelihood (ML) estimator in this case. [20] revived the interest in the ML estimator for the unknown prior class probabilities in the test set by specifying the associated ‘expectation maximisation’ (EM) algorithm.

[21] proposed to take recourse to the notion of Fisher consistency as a criterion for identifying approaches to the quantification problem that are completely unsuitable because they lack this property. [21] then proved Fisher consistency of the ML estimator under prior probability shift.

The ML approach has been criticised for its sometimes moderate performance and the effort and amount of training data needed to implement it. However, recently some researchers [1, 7] began to vindicate the ML approach. They focussed on the need to properly calibrate the posterior class probability estimate in order to improve the efficiency of the MLE for the positive class prevalence on the test set. Complementing the work of [1, 7], in this paper we study the role that constructing more powerful (or accurate) classifiers by selection of more appropriate features may play to reduce the variances of the related quantifiers.

In the following, we revisit the well-known asymptotic efficiency property of ML estimators in the special case of the MLE for prior class probabilities in the binary setting and investigate how it is impacted by the power (accuracy) of the classifier on which the MLE is based.

3 Setting

We consider the binary prevalence quantification problem in the following setting:

  • There is a training (or source) data set $(x_1, y_1), \ldots, (x_m, y_m) \in \mathfrak{X} \times \{0,1\}$. It is assumed to be an i.i.d. sample of a random vector $(X, Y)$ with values in $\mathfrak{X} \times \{0,1\}$. The vector $(X, Y)$ is defined on a probability space $(\Omega, P)$, the training (or source) domain. The elements $\omega$ of $\Omega$ are the instances (or objects). Each instance $\omega$ belongs to one of the classes $0$ and $1$, and its class label is $Y(\omega) \in \{0,1\}$. In addition, each instance $\omega$ has features $X(\omega) \in \mathfrak{X}$. Often, $\mathfrak{X}$ is the $d$-dimensional Euclidean space, such that accordingly $X$ is a real-valued random vector. See Appendix B.1 of [27] for more detailed comments on how this setting avoids the logical problems that arise when feature vectors and instances are considered to be the same thing.

  • Under the training distribution $P$, both the features $X(\omega)$ and the class labels $Y(\omega)$ of the instances are observed in a series of $m$ independent experiments, resulting in the sample $(x_1, y_1), \ldots, (x_m, y_m)$. The sample can be used to infer the joint distribution of $X$ and $Y$ under $P$, and hence, in particular, also the distribution of $Y$ (the class distribution) under $P$.

  • There is a test (or target) data set $z_1, \ldots, z_n \in \mathfrak{X}$. It is assumed to be an i.i.d. sample of the random vector $X$ with values in $\mathfrak{X}$, under a probability measure $Q$ on $\Omega$ that may differ from the training distribution $P$.

  • Under the test distribution $Q$, only the features $X(\omega)$ of the instances are observed in a series of $n$ independent experiments, resulting in the sample $z_1, \ldots, z_n$. The sample can be used to infer the distribution of $X$ under $Q$.

  • The goal of quantification is to infer the distribution of $Y$ under $Q$, based on the sample of features $z_1, \ldots, z_n$ generated under $Q$ and on the joint sample of features and class labels $(x_1, y_1), \ldots, (x_m, y_m)$ generated under $P$. It is not possible to design a method for this inference without any assumption on the relation of $P$ and $Q$.

  • In this paper, we assume that $P$ and $Q$ are related by prior probability shift, in the sense that the class-conditional feature distributions are the same under $P$ and $Q$, i.e. it holds that

    P[X\in M\,|\,Y=i] = Q[X\in M\,|\,Y=i] (3.1)

    for $i \in \{0,1\}$ and all measurable subsets $M$ of $\mathfrak{X}$.

Denoting $P[Y=1] = p$ and $Q[Y=1] = q$, (3.1) implies that the distribution of the features $X$ under $P$ and $Q$ respectively can be represented as

P[X\in M] = p\,P[X\in M\,|\,Y=1] + (1-p)\,P[X\in M\,|\,Y=0], (3.2a)
Q[X\in M] = q\,P[X\in M\,|\,Y=1] + (1-q)\,P[X\in M\,|\,Y=0], (3.2b)

for $M \subset \mathfrak{X}$. In the following, we assume that the components $p$, $P[X\in M\,|\,Y=1]$ and $P[X\in M\,|\,Y=0]$ can be perfectly estimated from the training sample $(x_1, y_1), \ldots, (x_m, y_m)$.

Basically, this means letting $m = \infty$, which obviously is infeasible. The assumption helps, however, to shed light on the importance of both maximum likelihood estimation and accurate classifiers for the efficient estimation of the unknown positive class prevalence $q$ in the test data set.
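
As an illustration of this setting, the following R sketch simulates a training and a test sample under prior probability shift. The binormal class-conditional distributions (anticipating Section 7) and the prevalence values 0.5 and 0.2 are illustrative assumptions, not part of the formal setting.

```r
## Sketch: simulate data under prior probability shift (3.1)-(3.2). The
## class-conditional feature distributions are identical under P and Q;
## only the prevalence of class 1 changes between the two domains.
sample_domain <- function(n, prev, mu1 = 2) {
  y <- rbinom(n, size = 1, prob = prev)   # class labels
  x <- rnorm(n, mean = mu1 * y, sd = 1)   # features given the class
  data.frame(x = x, y = y)
}

set.seed(1)
train <- sample_domain(10000, prev = 0.5)  # features and labels observed
test  <- sample_domain(1000,  prev = 0.2)  # only test$x may be used
```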

4 The ML estimator for the positive class prevalence

Assume that the conditional distributions in (3.1) have positive densities $f_i$, $i = 0, 1$. Then the unconditional density of the feature vector $X$ under $Q$ is

f^{(q)}(x) = (1-q)\,f_0(x) + q\,f_1(x), \qquad x \in \mathfrak{X}. (4.1)

Hence the likelihood function

L_n(q) = L_n(q; z_1, \ldots, z_n)

for the sample $z_1, \ldots, z_n$ is given by

L_n(q) = \prod_{i=1}^n \bigl( q\,(f_1(z_i) - f_0(z_i)) + f_0(z_i) \bigr). (4.2)

This implies for the first two derivatives of the log-likelihood with respect to $q$:

\frac{\partial \log L_n}{\partial q}(q) = \sum_{i=1}^n \frac{f_1(z_i) - f_0(z_i)}{q\,(f_1(z_i) - f_0(z_i)) + f_0(z_i)}, (4.3a)
\frac{\partial^2 \log L_n}{\partial q^2}(q) = -\sum_{i=1}^n \left( \frac{f_1(z_i) - f_0(z_i)}{q\,(f_1(z_i) - f_0(z_i)) + f_0(z_i)} \right)^2 \leq 0. (4.3b)

We assume that there is at least one $j \in \{1, \ldots, n\}$ such that

f_1(z_j) \neq f_0(z_j). (4.4)

Under (4.4), $q \mapsto \log L_n(q)$ is strictly concave in $[0,1]$. Hence (see Example 4.3.1 of [24]) the equation

\frac{\partial \log L_n}{\partial q}(q) = 0 (4.5)

has a solution $0 < q < 1$ if and only if

\frac{1}{n}\sum_{i=1}^n \frac{f_1(z_i)}{f_0(z_i)} > 1 \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^n \frac{f_0(z_i)}{f_1(z_i)} > 1. (4.6a)

This solution is then the unique point in $[0,1]$ where $L_n(q)$ takes its absolute maximum value. By strict concavity of $\log L_n$, if (4.6a) is not true then either

\frac{1}{n}\sum_{i=1}^n \frac{f_1(z_i)}{f_0(z_i)} \leq 1 \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^n \frac{f_0(z_i)}{f_1(z_i)} > 1 (4.6b)

applies or

\frac{1}{n}\sum_{i=1}^n \frac{f_1(z_i)}{f_0(z_i)} > 1 \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^n \frac{f_0(z_i)}{f_1(z_i)} \leq 1 (4.6c)

holds. Under (4.6b), the unique maximum of $\log L_n$ in $[0,1]$ lies at $q = 0$, while under (4.6c), the unique maximum of $\log L_n$ in $[0,1]$ is taken at $q = 1$.

In summary, under the natural assumption (4.4), the likelihood function $L_n$ of (4.2) has an absolute maximum in $[0,1]$ at a unique point $q^\ast \in [0,1]$. As a consequence, the maximum likelihood (ML) estimate $\hat{q}_n$ of the test set positive class prevalence $q$ is well-defined by setting $\hat{q}_n = q^\ast$.
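
For illustration, here is a minimal R sketch of this estimator: the boundary criteria (4.6b) and (4.6c) are checked first, then the strictly concave log-likelihood is maximised over $[0,1]$. The binormal densities $f_0$, $f_1$ are assumptions made for the sake of the example.

```r
## Sketch of the ML prevalence estimator of Section 4: maximise the
## log-likelihood of (4.2) over [0, 1]. Binormal densities are assumed.
f0 <- function(x) dnorm(x, mean = 0)
f1 <- function(x) dnorm(x, mean = 2)

ml_prevalence <- function(z) {
  if (mean(f1(z) / f0(z)) <= 1) return(0)   # boundary case (4.6b)
  if (mean(f0(z) / f1(z)) <= 1) return(1)   # boundary case (4.6c)
  loglik <- function(q) sum(log(q * f1(z) + (1 - q) * f0(z)))
  optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum
}

set.seed(2)
z <- rnorm(1000, mean = 2 * rbinom(1000, 1, 0.2))  # test sample, true q = 0.2
ml_prevalence(z)                                   # should be close to 0.2
```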

5 The Cramér-Rao bound for unbiased estimators

In the setting of Section 3, let $\widetilde{q}_n$ be any unbiased estimator of the positive class prevalence $q$ under the test distribution $Q$, i.e. $\widetilde{q}_n = W_n(X_1, \ldots, X_n)$ for some function $W_n : \mathfrak{X}^n \to \mathbb{R}$ such that

E_Q[\widetilde{q}_n] = Q[Y=1] = q,

where $E_Q$ denotes the expected value under the mixture probability measure $Q$ of (3.2b). Assume additionally that there are positive densities $f_0$ and $f_1$ of the class-conditional distributions of the feature vector $X$, such that the density $f^{(q)}$ of $X$ under $Q$ is given by (4.1).

If $(X_1, \ldots, X_n)$ is an i.i.d. sample from the distribution of the feature vector under the test distribution $Q$, then the variance of $\widetilde{q}_n$ is bounded from below by the inverse of the product of the Fisher information of the test distribution with respect to $q$ and the size of the sample (Cramér-Rao bound, see, e.g., Corollary 7.3.10 of [2]):

\mathrm{var}[\widetilde{q}_n] \geq \frac{1}{n\,E_Q\left[\left(\frac{\partial \log f^{(q)}}{\partial q}(X)\right)^2\right]} = \frac{1}{n\,E_Q\left[\left(\frac{f_1(X) - f_0(X)}{f^{(q)}(X)}\right)^2\right]}. (5.1)

Actually, since $\hat{q}_n$ from Section 4 is the ML estimator of $q$, (5.1) presents not only a lower bound for the variances of unbiased estimators of $q$ but also the large sample variance of $\hat{q}_n$, in the sense that $\sqrt{n}\,(\hat{q}_n - q)$ converges in distribution to the normal distribution $\mathcal{N}\bigl(0, E_Q\bigl[\bigl(\frac{f_1(X)-f_0(X)}{f^{(q)}(X)}\bigr)^2\bigr]^{-1}\bigr)$, whose variance is the asymptotic variance of $\hat{q}_n$ (see, e.g., Theorem 10.1.12 of [2]). Note that the lower bound of (5.1) may not hold for $\hat{q}_n$ itself because it need not be unbiased.

In summary, if the densities of the class-conditional feature distributions are known, the ML estimator of $q$ asymptotically has the smallest variance among all unbiased estimators of $q$ on i.i.d. samples from the test distribution of the features. This is demonstrated in the two upper panels of Table 4 of [22], which show on simulated data that confidence intervals for $q$ – which are primarily driven by the standard deviations of the estimators – based on the ML estimator are the shortest if the training sample is infinite and the test sample is large.
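
This efficiency statement can also be checked by simulation. The following sketch, reusing f0, f1 and ml_prevalence from the sketch in Section 4, compares the empirical standard deviation of the ML estimates with the square root of the Cramér-Rao bound (5.1); the parameter values are illustrative.

```r
## Monte Carlo sketch: empirical standard deviation of the MLE versus the
## Cramér-Rao bound (5.1); reuses f0, f1 and ml_prevalence from Section 4.
set.seed(3)
q <- 0.2; n <- 1000
est <- replicate(500, ml_prevalence(rnorm(n, mean = 2 * rbinom(n, 1, q))))
sd(est)                  # empirical standard deviation of the ML estimates

fq <- function(x) (1 - q) * f0(x) + q * f1(x)
fisher <- integrate(function(x) (f1(x) - f0(x))^2 / fq(x), -Inf, Inf)$value
sqrt(1 / (n * fisher))   # Cramér-Rao bound; close to sd(est) for large n
```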

6 The asymptotic variance of the ML estimator

Denote by $\eta_Q(x)$ the posterior positive class probability given $X = x$ under $Q$. Assume that the feature vector $X$ under $Q$ has a density that is given by (4.1). Then it holds that

\eta_Q(x) = \frac{q\,f_1(x)}{f^{(q)}(x)}. (6.1)

This representation immediately implies the following result, which expresses the Fisher information mentioned above in the context of the Cramér-Rao bound in terms of the variance of $\eta_Q(X)$.

Proposition 6.1

If the feature vector $X$ under $Q$ has a density as specified by (4.1), then the Fisher information $E_Q\left[\left(\frac{\partial \log f^{(q)}}{\partial q}(X)\right)^2\right]$ of the distribution of $X$ under $Q$ with respect to $q$ can be represented as follows:

E_Q\left[\left(\frac{\partial \log f^{(q)}}{\partial q}(X)\right)^2\right] = \frac{\mathrm{var}_Q[\eta_Q(X)]}{q^2\,(1-q)^2}.

From Proposition 6.1, we obtain the following representation of the asymptotic variance of the ML estimator $\hat{q}_n$:

E_Q\left[\left(\frac{f_1(X) - f_0(X)}{f^{(q)}(X)}\right)^2\right]^{-1} = \frac{q^2\,(1-q)^2}{\mathrm{var}_Q[\eta_Q(X)]}. (6.2)

Recall the following decomposition of the optimal Brier score $BS_Q(X)$ for the problem of predicting the class variable $Y$ from the features $X$ (under the test distribution $Q$):

BS_Q(X) = E_Q\left[(Y - \eta_Q(X))^2\right] = \mathrm{var}_Q[Y] - \mathrm{var}_Q[\eta_Q(X)] = q\,(1-q) - \mathrm{var}_Q[\eta_Q(X)]. (6.3)

In (6.3), the optimal Brier score $BS_Q(X)$ is also called refinement loss, while $\mathrm{var}_Q[Y]$ and $\mathrm{var}_Q[\eta_Q(X)]$ are known as uncertainty and resolution, respectively [9].
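
Both Proposition 6.1 and the decomposition (6.3) can be verified numerically. The following sketch does so for the binormal model of Section 7; the values $q = 0.2$ and $\mu_1 = 2$ are illustrative assumptions.

```r
## Numerical check of Proposition 6.1 and of the decomposition (6.3) in
## the binormal model; q = 0.2 and mu1 = 2 are illustrative choices.
q <- 0.2; mu1 <- 2
f0  <- function(x) dnorm(x, 0)
f1  <- function(x) dnorm(x, mu1)
fq  <- function(x) (1 - q) * f0(x) + q * f1(x)
eta <- function(x) q * f1(x) / fq(x)            # posterior probability (6.1)

fisher  <- integrate(function(x) (f1(x) - f0(x))^2 / fq(x), -Inf, Inf)$value
var_eta <- integrate(function(x) (eta(x) - q)^2 * fq(x), -Inf, Inf)$value

fisher                        # Fisher information, direct definition
var_eta / (q^2 * (1 - q)^2)   # Proposition 6.1: the same value
q * (1 - q) - var_eta         # refinement loss BS_Q(X) by (6.3)
```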

By (6.2), the asymptotic variance of the ML estimator $\hat{q}_n$ is reduced if a feature vector $X$ with greater variance of $\eta_Q(X)$ is found (or if the Brier score with respect to $X$ decreases). Consider, for instance, a feature vector $X'$ that is a function of the feature vector $X$, i.e. it holds that $X' = F(X)$ for some function $F : \mathfrak{X} \to \mathfrak{X}'$. Since $F$ potentially maps different values $x \in \mathfrak{X}$ onto the same value $x' \in \mathfrak{X}'$, the amount of information carried by $X'$ is reduced compared to the amount of information carried by $X$. Therefore, the approximation of the class label $Y$ by regression on $X'$ is less close than the approximation of $Y$ by regression on $X$.

From this observation, it follows that $BS_Q(X') \geq BS_Q(X)$. This in turn implies by (6.3) and (6.2) for the asymptotic variances of the ML estimators $\hat{q}_n(X')$ and $\hat{q}_n(X)$ that

\mathrm{var}_Q[\hat{q}_n(X)] \leq \mathrm{var}_Q[\hat{q}_n(X')]. (6.4)

Observe that $X' = F(X)$ also implies $BS_P(X') \geq BS_P(X)$, i.e. also under the training distribution $P$, the posterior positive class probability $\eta_P(X)$ based on $X$ is a better predictor of $Y$ than $\eta_P(X')$, which is based on $X'$. By the assumption underlying this paper, $BS_P(X')$ and $BS_P(X)$ are observable while $BS_Q(X')$ and $BS_Q(X)$ are not, because the class label $Y$ is not observed in the test data set.

But does $BS_P(X') \geq BS_P(X)$ always imply $BS_Q(X') \geq BS_Q(X)$ and therefore also (6.4), thus generalising the implication “$X' = F(X)$ $\Rightarrow$ (6.4)”?

This paper has no fully general answer to this question. Instead, we can only point to alternative conditions on $X'$ and $X$ that imply both $BS_P(X') \geq BS_P(X)$ and $BS_Q(X') \geq BS_Q(X)$, but are weaker than $X' = F(X)$.

Brier curves:

Recall the notion of Brier curve from [11] (with the slightly modified definition of [23]). If the Brier curve for $\eta_P(X')$ dominates the Brier curve for $\eta_P(X)$, then by item 6) of Proposition 5.2 and Proposition 4.1 of [23], it follows that $BS_P(X') \geq BS_P(X)$ and $BS_Q(X') \geq BS_Q(X)$ hold.

ROC analysis:

Recall the notion of Receiver Operating Characteristic (ROC) as defined, for instance, in [19]. If the ROC for the density ratio associated with $X$ dominates the ROC for the density ratio associated with $X'$, then by Remark 5.4 and Proposition 4.1 of [23], it follows that $BS_P(X') \geq BS_P(X)$ and $BS_Q(X') \geq BS_Q(X)$ hold.

7 Example: Binormal model

In this section, we numerically compare the variance of the Sample Mean Matching (SMM) estimator $\hat{q}_{n,SMM}$ of class prevalences [10] and the Cramér-Rao bound of (5.1) (which is also the large sample variance of the ML estimator $\hat{q}_{n,ML}$ as specified in Section 4 above). In order to be able to do this, we take recourse to the univariate binormal model with equal variances of the class-conditional distributions: the two normal class-conditional distributions of the feature variable $X$ are given by

X \,|\, Y=i \sim \mathcal{N}(\mu_i, \sigma^2), \qquad i = 0, 1, (7.1a)

for conditional means $\mu_0 < \mu_1$ and some $\sigma > 0$. For the sake of simplicity, we choose

\mu_0 = 0, \quad \mu_1 > 0, \quad \sigma = 1. (7.1b)

Greater values of $\mu_1$ imply less overlap of the class-conditional feature distributions, corresponding to more powerful (or accurate) models. Or in other words, for greater values of $\mu_1$, the feature variable $X$ carries more information on the class label $Y$.

As stated in Section 3, we assume we are dealing with an infinitely large training sample and a test sample of size $n$. By Section 6, for large $n$ the variance of $\hat{q}_{n,ML}$ is then approximately

\frac{1}{n}\,E_Q\left[\left(\frac{f_1(X) - f_0(X)}{f^{(q)}(X)}\right)^2\right]^{-1}, (7.2)

where $Q$ denotes the distribution underlying the test sample and $f_0$ and $f_1$ are the class-conditional feature densities – which are given for the purpose of this section by (7.1a). We evaluate the term given by (7.2) by means of one-dimensional numerical integration, making use of the R-function ‘integrate’ [18].
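
A minimal sketch of this computation (the function name sigma_ml and the argument values are ours):

```r
## Large sample standard deviation of the ML estimator via (7.2), evaluated
## with the R function 'integrate' for the binormal model (7.1a)-(7.1b).
sigma_ml <- function(n, q, mu1) {
  f0 <- function(x) dnorm(x, 0)
  f1 <- function(x) dnorm(x, mu1)
  fq <- function(x) (1 - q) * f0(x) + q * f1(x)
  fisher <- integrate(function(x) (f1(x) - f0(x))^2 / fq(x),
                      lower = -Inf, upper = Inf)$value
  sqrt(1 / (n * fisher))
}
sigma_ml(n = 100, q = 0.2, mu1 = 1)   # approx. 0.1017, cf. Table 1 below
```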

By Eq. (2) of [10], in the setting of this paper as specified in Section 3 above, the estimator $\hat{q}_{n,SMM}$ is given by the following explicit formula:

\hat{q}_{n,SMM} = \frac{\frac{1}{n}\sum_{i=1}^n X_i - \mu_0}{\mu_1 - \mu_0} = \frac{1}{n\,\mu_1}\sum_{i=1}^n X_i. (7.3)

By (7.3), $\hat{q}_{n,SMM}$ is an unbiased estimator of the positive class prevalence $q$. For the variance of $\hat{q}_{n,SMM}$, we can refer to Theorem 3 of [25], case $n_{tr} = \infty$ in the notation of [25]. Observe that in the case of SMM and $n_{tr} = \infty$, the representation of the variance is exact, not only approximate. Hence we obtain

\mathrm{var}_Q[\hat{q}_{n,SMM}] = \frac{\sigma^2 + q\,(1-q)\,(\mu_1 - \mu_0)^2}{n\,(\mu_1 - \mu_0)^2} = \frac{1}{n}\left(\frac{1}{\mu_1^2} + q\,(1-q)\right). (7.4)

In the model specified by (7.1a) and (7.1b), the classification power (or accuracy) is driven by the difference of the conditional means, i.e. by the mean $\mu_1$ conditional on the positive class. If we measure the power by the Area under the Curve (AUC, see for instance Section 6.1 of [19]) in order to obtain a measure that is independent of the class prevalences, the AUC is a simple function of $\mu_1 = \mu_1 - \mu_0$:

AUC(\mu_1) = \Phi\left(\frac{\mu_1}{\sqrt{2}}\right), (7.5)

with $\Phi$ denoting the standard normal distribution function.
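
The closed forms (7.4) and (7.5) translate directly into code; the following companion sketch to sigma_ml above should reproduce the remaining columns of Table 1 below.

```r
## Standard deviation of the SMM estimator by (7.4) and AUC by (7.5),
## for the binormal model with mu0 = 0 and sigma = 1.
sigma_smm <- function(n, q, mu1) sqrt((1 / mu1^2 + q * (1 - q)) / n)
auc <- function(mu1) pnorm(mu1 / sqrt(2))

sigma_smm(n = 100, q = 0.2, mu1 = 1)  # approx. 0.1077, cf. Table 1 below
auc(1)                                # approx. 0.7602
```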

Table 1: Illustration of the relations between model power, the standard deviation of the SMM quantifier and the large sample standard deviation of the ML quantifier. The sample size is 100, the positive class prevalence in the test set is 0.2. The parameter $\mu_1$ is the expected sample mean conditional on the positive class, see (7.1a) and (7.1b).
$\mu_1$   $AUC(\mu_1)$   $\sigma_{SMM}$   $\sigma_{ML}$
0.01 0.5028 10.0001 10.0000
0.05 0.5141 2.0004 2.0000
0.10 0.5282 1.0008 0.9999
0.25 0.5702 0.4020 0.3998
0.50 0.6382 0.2040 0.2000
1.00 0.7602 0.1077 0.1017
1.50 0.8556 0.0777 0.0710
2.00 0.9214 0.0640 0.0571
2.50 0.9615 0.0566 0.0498
3.00 0.9831 0.0521 0.0456
3.50 0.9933 0.0492 0.0432
4.00 0.9977 0.0472 0.0418
5.00 0.9998 0.0447 0.0405

From (7.5) and (7.4), it is clear that for fixed test sample size $n$ the variance of $\hat{q}_{n,SMM}$ decreases when the power of the model increases. This is less obvious from (7.2) for the asymptotic variance of $\hat{q}_{n,ML}$, but in that case it follows from (6.2).

Table 1 above illustrates these observations. In the table, we use the notation

\sigma_{SMM} = \sqrt{\mathrm{var}_Q[\hat{q}_{n,SMM}]}

and

\sigma_{ML} = \sqrt{\mathrm{var}_Q[\hat{q}_{n,ML}]}.

Table 1 suggests the following observations:

  • At low levels of model power, small increases of power entail huge reductions of both the SMM variance and the ML large sample variance.

  • The variance reductions become moderate or even low for moderate and high levels of model power (higher than 75% AUC).

  • The efficiency disadvantage of the SMM estimator (as expressed by its variance) compared to the large sample ML variance varies and is much larger at higher levels of model power.

Figure 1 below shows that the efficiency gain of the ML estimator compared to the SMM estimator – assuming an infinitely large training sample – depends not only on the model power but also on the positive class prevalence.

Indeed, Figure 1 suggests that a large efficiency gain is possible in the presence of a large difference in the prevalences of the two classes while the gain is rather moderate in the case of almost equal class prevalences.

Figure 1: Ratios of standard deviations of SMM and ML estimates as a function of AUC and positive class prevalence.
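
The ratio plotted in Figure 1 can be recomputed with the functions sigma_ml and sigma_smm from the sketches in Section 7; note that the ratio does not depend on the test sample size n. The grid of AUC values and prevalences below is an illustrative choice.

```r
## Sketch of the computation behind Figure 1: ratio of the SMM and ML
## standard deviations as a function of AUC and prevalence; reuses
## sigma_ml and sigma_smm from the sketches in Section 7.
aucs <- seq(0.55, 0.99, by = 0.02)
mu1s <- sqrt(2) * qnorm(aucs)          # invert (7.5)
qs   <- c(0.05, 0.2, 0.5)              # illustrative prevalences
rat  <- sapply(qs, function(q)
  sapply(mu1s, function(m) sigma_smm(100, q, m) / sigma_ml(100, q, m)))
matplot(aucs, rat, type = "l", lty = 1, xlab = "AUC",
        ylab = "sd(SMM) / sd(ML)")     # one curve per prevalence
legend("topleft", legend = paste("q =", qs), lty = 1, col = 1:3)
```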

8 Conclusions

In this paper, we have revisited the binary quantification problem, i.e. the problem of estimating a binary prior class distribution on the test data set when training and test distributions are different.

  • Specifically, under the assumption of prior probability shift we have looked at the asymptotic variance of the maximum likelihood estimator (MLE).

  • We have found that this asymptotic variance is closely related to the Brier score for the regression of the class label variable $Y$ on the feature vector $X$ under the test set distribution. In particular, the asymptotic variance can be reduced by selecting a more powerful feature vector.

  • At the end of Section 6, we have pointed out sufficient conditions and associated training criteria (Brier curves and ROC analysis) for minimising both the Brier score on the training data set and the Brier score on the test data set.

  • These findings suggest methods to reduce the variance of the ML estimator of the prior class probabilities (or prevalences) on the test data set. Due to the statistical consistency of ML estimators, reducing the variance of the estimator also reduces its mean squared error.

  • The large sample variance of the MLE, i.e. its asymptotic variance, is identical to the Cramér-Rao lower bound for the variances of unbiased estimators of the prior positive class probability. Therefore, it seems likely that other estimators, too, benefit from improved performance of the underlying classifiers or feature vectors.

Indeed, results of a simulation study in [22] and theoretical findings in [25, 26] suggest that improving the accuracy of the base classifiers used for quantification helps to reduce not only the variances of ML estimators but also those of other estimators. The example of the Sample Mean Matching (SMM) estimator discussed in Section 7 supports this conclusion.

However, these findings must be qualified insofar as the observations made in this paper apply only to the case where both the training data set and the test data set are large. This is a severe restriction indeed, as [5] pointed out the importance of quantification methods specifically in the case of small training data sets, for reasons of cost efficiency.

Further research on developing efficient quantifiers for small or moderate training and test data set sizes is therefore highly desirable. A promising step in this direction has already been taken by [14], with a proposal for the selection of the most suitable quantifiers for problems on data sets with widely varying sizes.

References

  • [1] A. Alexandari, A. Kundaje, and A. Shrikumar. Maximum Likelihood with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation. In International Conference on Machine Learning, pages 222–232. PMLR, 2020.
  • [2] G. Casella and R.L. Berger. Statistical Inference. Duxbury Press, second edition, 2002.
  • [3] A. Esuli, F. Sebastiani, and A. Abbasi. Sentiment quantification. IEEE Intelligent Systems, 25(4):72–79, 2010.
  • [4] G. Forman. Counting Positives Accurately Despite Inaccurate Classification. In European Conference on Machine Learning (ECML 2005), pages 564–575. Springer, 2005.
  • [5] G. Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, 2008.
  • [6] S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models: Modeling and Applications to Random Processes. Springer, 2006.
  • [7] S. Garg, Y. Wu, S. Balakrishnan, and Z. Lipton. A Unified View of Label Shift Estimation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 3290–3300. Curran Associates, Inc., 2020.
  • [8] P. González, A. Castaño, N.V. Chawla, and J.J. Del Coz. A Review on Quantification Learning. ACM Comput. Surv., 50(5):74:1–74:40, 2017.
  • [9] D.J. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons, Chichester, 1997.
  • [10] W. Hassan, A. Maletzke, and G. Batista. Accurately Quantifying a Billion Instances per Second. In 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10, 2020.
  • [11] J. Hernández-Orallo, P. Flach, and C. Ferri. Brier Curves: A New Cost-Based Visualisation of Classifier Performance. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pages 585–592. International Machine Learning Society, 2011.
  • [12] V. Hofer and G. Krempl. Drift mining in data: A framework for addressing drift in classification. Computational Statistics & Data Analysis, 57(1):377–391, 2013.
  • [13] Z. Lipton, Y.-X. Wang, and A. Smola. Detecting and Correcting for Label Shift with Black Box Predictors. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3122–3130. PMLR, 10–15 Jul 2018.
  • [14] A.G. Maletzke, W. Hassan, D.M. dos Reis, and G. Batista. The Importance of the Test Set Size in Quantification Assessment. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), pages 2640–2646, 2020.
  • [15] J.G. Moreno-Torres, T. Raeder, R. Alaiz-Rodriguez, N.V. Chawla, and F. Herrera. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.
  • [16] A. Moreo, A. Esuli, and F. Sebastiani. QuaPy: A Python-Based Framework for Quantification. arXiv preprint arXiv:2106.11057, 2021.
  • [17] C. Peters and W.A. Coberly. The numerical evaluation of the maximum-likelihood estimate of mixture proportions. Communications in Statistics – Theory and Methods, 5(12):1127–1135, 1976.
  • [18] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.
  • [19] M. Reid and R.C. Williamson. Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research, 12:731–817, 2011.
  • [20] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 14(1):21–41, 2001.
  • [21] D. Tasche. Fisher Consistency for Prior Probability Shift. Journal of Machine Learning Research, 18(95):1–32, 2017.
  • [22] D. Tasche. Confidence intervals for class prevalences under prior probability shift. Machine Learning and Knowledge Extraction, 1(3):805–831, 2019.
  • [23] D. Tasche. Calibrating sufficiently. arXiv preprint arXiv:2105.07283, 2021.
  • [24] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, New York, 1985.
  • [25] A. Vaz, R. Izbicki, and R.B. Stern. Prior Shift Using the Ratio Estimator. In Adriano Polpo, Julio Stern, Francisco Louzada, Rafael Izbicki, and Hellinton Takada, editors, International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, pages 25–35. Springer, 2017.
  • [26] A. Vaz, R. Izbicki, and R.B. Stern. Quantification Under Prior Probability Shift: the Ratio Estimator and its Extensions. Journal of Machine Learning Research, 20(79):1–33, 2019.
  • [27] M.-J. Zhao, N. Edakunni, A. Pocock, and G. Brown. Beyond Fano’s Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications. The Journal of Machine Learning Research, 14(1):1033–1090, 2013.