
How many classifiers do we need?

Hyunsuk Kim
Department of Statistics
University of California, Berkeley
hyskim7@berkeley.edu
Liam Hodgkinson
School of Mathematics and Statistics
University of Melbourne, Australia
lhodgkinson@unimelb.edu.au

Ryan Theisen
Harmonic Discovery
ryan@harmonicdiscovery.com

Michael W. Mahoney
ICSI, LBNL, and Dept. of Statistics
University of California, Berkeley
mmahoney@stat.berkeley.edu
Abstract

As performance gains through scaling data and/or model size experience diminishing returns, it is becoming increasingly popular to turn to ensembling, where the predictions of multiple models are combined to improve accuracy. In this paper, we provide a detailed analysis of how the disagreement and the polarization (a notion we introduce and define in this paper) among classifiers relate to the performance gain achieved by aggregating individual classifiers, for majority vote strategies in classification tasks. We address these questions in the following ways. (1) An upper bound for polarization is derived, and we propose what we call a neural polarization law: most interpolating neural network models are 4/3-polarized. Our empirical results not only support this conjecture but also show that polarization is nearly constant for a dataset, regardless of hyperparameters or architectures of classifiers. (2) The error of the majority vote classifier is considered under restricted entropy conditions, and we present a tight upper bound that indicates that the majority vote error is linearly correlated with the disagreement, with a slope that is linear in the polarization. (3) We prove results for the asymptotic behavior of the disagreement in terms of the number of classifiers, which we show can help in predicting the performance for a larger number of classifiers from that of a smaller number. Our theories and claims are supported by empirical results on several image classification tasks with various types of neural networks.

1 Introduction

As performance gains through scaling data and/or model size experience diminishing returns, it is becoming increasingly popular to turn to ensembling, where the predictions of multiple models are combined, both to improve accuracy and to form more robust conclusions than any individual model alone can provide. In some cases, ensembling can produce substantial benefits, particularly when increasing model size becomes prohibitive. In particular, for large neural network models, deep ensembles LPB (17) are especially popular. These ensembles consist of models independently trained on the same dataset, often using the same hyperparameters, but starting from different initializations.

The cost of producing new classifiers can be steep, and it is often unclear whether the additional performance gains are worth the cost. Assuming that constructing two or three classifiers is relatively cheap, procedures capable of deciding whether to continue producing more classifiers are needed. To do so requires a precise understanding of how to predict ensemble performance. Of particular interest are majority vote strategies in classification tasks, noting that regression tasks can also be formulated in this way by clustering outputs. In this case, one of the most effective avenues for predicting performance is the disagreement JNBK (22); BJRK (22): measuring the degree to which classifiers provide different conclusions over a given dataset. Disagreement is concrete, easy to compute, and strongly linearly correlated with majority vote prediction accuracy, leading to its use in many applications. However, a priori, the precise linear relationship between disagreement and accuracy is unclear, preventing the use of disagreement for predicting ensemble performance.

Our goal in this paper is to go beyond disagreement-based analysis to provide a more quantitative understanding of the number of classifiers one should use to achieve a desired level of performance in modern practical applications, in particular for neural network models. In more detail, our contributions are as follows.

  (i) We introduce and define the concept of polarization, a notion that measures the higher-order dispersion of the error rates at each data point, and which indicates how polarized the ensemble is from the ground truth. We state and prove an upper bound for polarization (Theorem 1). Inspired by the theorem, we propose what we call a neural polarization law (Conjecture 1): most interpolating (Definition 2) neural network models are 4/3-polarized. We provide empirical results supporting the conjecture (Figures 1 and 2).

  (ii) Using the notion of polarization, we develop a refined set of bounds on the majority vote test error rate. First, we provide a sharpened bound for any ensemble with a finite number of classifiers (Corollary 1). Second, we offer an even tighter bound under an additional condition on the entropy of the ensemble (Theorem 4). We provide empirical results demonstrating that our new bounds perform significantly better than existing bounds on the majority vote test error (Figure 3).

  (iii) We determine the asymptotic behavior of the majority vote error rate as the number of classifiers increases (Theorem 5). Consequently, we show that the performance of a larger number of classifiers can be predicted from that of a smaller number. We provide empirical results showing that such predictions are remarkably accurate across various pairs of model architecture and dataset (Figure 4).

In Section 2, we define the notation that will be used throughout the paper, and we introduce upper bounds for the error rate of the majority vote from previous work. The next three sections form the main part of the paper. In Section 3, we introduce the notion of polarization, \eta_{\rho}, which plays a fundamental role in relating the majority vote error rate to the average error rate and the disagreement. We explore the properties of the polarization and present empirical results that corroborate our claims. In Section 4, we present tight upper bounds for the error rate of the majority vote for ensembles that satisfy certain conditions; and in Section 5, we prove how the disagreement behaves in terms of the number of classifiers. All of these ingredients are put together to estimate the error rate of the majority vote for a large number of classifiers using information from only three sampled classifiers. In Section 6, we provide a brief discussion and conclusion. Additional material is presented in the appendices.

2 Preliminaries

In this section, we introduce notation that we use throughout the paper, and we summarize previous bounds on the majority vote error rate.

2.1 Notations

We focus on K-class classification problems, with features X \in \mathcal{X}, labels Y \in [K] = \{1, 2, \ldots, K\}, and feature-label pairs (X, Y) \sim \mathcal{D}. A classifier h : \mathcal{X} \to [K] is a function that maps a feature to a label. We define the error rate of a single classifier h, and the disagreement and the tandem loss MLIS (20) between two classifiers, h and h', as the following:

Error rate: L(h) = \mathbb{E}_{\mathcal{D}}[\mathds{1}(h(X) \neq Y)]
Disagreement: D(h, h') = \mathbb{E}_{\mathcal{D}}[\mathds{1}(h(X) \neq h'(X))]
Tandem loss: L(h, h') = \mathbb{E}_{\mathcal{D}}[\mathds{1}(h(X) \neq Y)\,\mathds{1}(h'(X) \neq Y)],

where the expectation \mathbb{E}_{\mathcal{D}} denotes \mathbb{E}_{(X,Y) \sim \mathcal{D}}. Next, we consider a distribution of classifiers, \rho, which may be viewed as an ensemble of classifiers. This distribution can represent a variety of different cases. Examples include: (1) a discrete distribution over a finite number of classifiers h_i, e.g., a weighted sum of h_i; and (2) a distribution over a parametric family h_\theta, e.g., a distribution of classifiers resulting from one or multiple trained neural networks. Given the ensemble \rho, the (weighted) majority vote h_\rho^{\mathrm{MV}} : \mathcal{X} \to [K] is defined as

h_\rho^{\mathrm{MV}}(x) = \operatorname*{arg\,max}_{y \in [K]}\, \mathbb{E}_{\rho}[\mathds{1}(h(x) = y)].

Again, \mathbb{E}_{\rho} denotes \mathbb{E}_{h \sim \rho}, and we use \mathbb{E}_{\rho}, \mathbb{E}_{\rho^{2}}, \mathbb{P}_{\rho} for \mathbb{E}_{h \sim \rho}, \mathbb{E}_{(h, h') \sim \rho^{2}}, \mathbb{P}_{h \sim \rho}, respectively, throughout the paper. In this sense, \mathbb{E}_{\rho}[L(h)] represents the average error rate under a distribution of classifiers \rho, and \mathbb{E}_{\rho^{2}}[D(h, h')] represents the average disagreement between classifiers under \rho. Hereafter, we refer to \mathbb{E}_{\rho}[L(h)], \mathbb{E}_{\rho^{2}}[D(h, h')], and L(h_\rho^{\mathrm{MV}}) as the average error rate, the disagreement, and the majority vote error rate, respectively, with

L(h_\rho^{\mathrm{MV}}) = \mathbb{E}_{\mathcal{D}}[\mathds{1}(h_\rho^{\mathrm{MV}}(X) \neq Y)].

Lastly, we define the point-wise error rate, W_{\rho}(X, Y), which will play a very important role in this paper (for brevity, we will denote W_{\rho}(X, Y) by W_{\rho} unless otherwise necessary):

W_{\rho}(X, Y) = \mathbb{E}_{\rho}[\mathds{1}(h(X) \neq Y)].   (1)
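To make these definitions concrete, here is a minimal NumPy sketch (ours, not from the paper) that computes the average error rate, disagreement, tandem loss, majority vote error, and point-wise error rate W_{\rho} for a finite, equally weighted ensemble; the arrays preds and labels are hypothetical inputs.

```python
import numpy as np

# Illustration only; array names are hypothetical, not from the paper.
# preds: (N, m) integer array, preds[i, j] = prediction of classifier i on example j
# labels: (m,) integer array of true labels
def ensemble_statistics(preds, labels):
    N, m = preds.shape
    wrong = (preds != labels[None, :])                 # indicator 1(h_i(X_j) != Y_j)

    avg_error = wrong.mean()                           # E_rho[L(h)] for the empirical ensemble
    W = wrong.mean(axis=0)                             # point-wise error rate W_rho(X_j, Y_j), eq. (1)

    # disagreement E_{rho^2}[D(h, h')]: average over all ordered pairs (i, j), including i = j
    disagree = np.mean([(preds[i] != preds[j]).mean()
                        for i in range(N) for j in range(N)])

    # tandem loss E_{rho^2}[L(h, h')]
    tandem = np.mean([(wrong[i] & wrong[j]).mean()
                      for i in range(N) for j in range(N)])

    # unweighted majority vote and its error rate L(h_rho^MV);
    # argmax breaks ties toward the smallest label (a deterministic tie-breaking rule, cf. Appendix A)
    K = int(preds.max()) + 1
    votes = np.apply_along_axis(np.bincount, 0, preds, minlength=K)   # (K, m) vote counts
    mv_error = (votes.argmax(axis=0) != labels).mean()

    return avg_error, disagree, tandem, mv_error, W
```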

2.2 Bounds on the majority vote error rate

The simplest relationship between the majority vote error L(h_{\rho}^{\mathrm{MV}}) and the average error rate \mathbb{E}_{\rho}[L(h)] was introduced in McA (98). It states that the error in the majority vote classifier cannot exceed twice the average error rate:

L(h_{\rho}^{\mathrm{MV}}) \leq 2\,\mathbb{E}_{\rho}[L(h)].   (2)

A simple proof of this relationship, using Markov's inequality, can be found in MLIS (20). Although (2) does not provide useful information in practice, it is worth noting that this bound is, in fact, tight: there exist pathological examples where h_{\rho}^{\mathrm{MV}} exhibits twice the average error rate (see Appendix C in TKY+ (24)). This suggests that we can hardly obtain a useful or tighter bound by relying only on the “first-order” term, \mathbb{E}_{\rho}[L(h)].

Accordingly, more recent work constructed bounds in terms of “second-order” quantities, \mathbb{E}_{\rho^{2}}[L(h,h')] and \mathbb{E}_{\rho^{2}}[D(h,h')]. In particular, LMRR (17) and MLIS (20) designed a so-called C-bound using the Chebyshev-Cantelli inequality, establishing that, if \mathbb{E}_{\rho}[L(h)] < 1/2, then

L(h_{\rho}^{\mathrm{MV}}) \leq \frac{\mathbb{E}_{\rho^{2}}[L(h,h')] - \mathbb{E}_{\rho}[L(h)]^{2}}{\mathbb{E}_{\rho^{2}}[L(h,h')] - \mathbb{E}_{\rho}[L(h)] + \frac{1}{4}}.   (3)

As an alternative approach, MLIS (20) incorporated the disagreement \mathbb{E}_{\rho^{2}}[D(h,h')] into the bound as well, albeit restricted to the binary classification problem, to obtain:

L(h_{\rho}^{\mathrm{MV}}) \leq 4\,\mathbb{E}_{\rho}[L(h)] - 2\,\mathbb{E}_{\rho^{2}}[D(h,h')].   (4)

While (3) and (4) may be tighter in some cases, once again, there exist pathological examples where these bounds are as uninformative as the first-order bound (2). Motivated by these weak results, TKY+ (24) take a new approach by restricting \rho to be a “good ensemble,” introducing the competence condition (see Definition 3 in our Appendix A). Informally, competent ensembles are those for which it is more likely, on average across the data, that more classifiers are correct than not. Based on this notion, TKY+ (24) prove that competent ensembles are guaranteed to have weighted majority vote error smaller than the weighted average error of individual classifiers:

L(h_{\rho}^{\mathrm{MV}}) \leq \mathbb{E}_{\rho}[L(h)].   (5)

That is, the majority vote classifier is always beneficial. Moreover, TKY+ (24) prove that any competent ensemble \rho of K-class classifiers satisfies the following inequality.

L(h_{\rho}^{\mathrm{MV}}) \leq \frac{4(K-1)}{K}\left(\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right).   (6)

We defer further discussion of competence to Appendix A, where we introduce simple cases for which competence does not hold. In these cases, we show how one can overcome this issue so that the bounds (5) and (6) still hold. In particular, in Appendix A.3, we provide an example to show the bound (6) is tight.
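For reference, here is a small sketch (ours, not from the paper) that evaluates the right-hand sides of the bounds (2), (3), (4), and (6) from the ensemble summary statistics produced by the earlier sketch; the C-bound (3) assumes the average error is below 1/2, and (4) applies only to binary classification.

```python
def classical_bounds(avg_error, disagree, tandem, K):
    """Right-hand sides of the first- and second-order bounds from Section 2.2.
    Inputs are the quantities returned by the (hypothetical) ensemble_statistics sketch."""
    first_order = 2 * avg_error                                    # bound (2)
    c_bound = ((tandem - avg_error ** 2)
               / (tandem - avg_error + 0.25))                      # C-bound (3); needs avg_error < 1/2
    binary = 4 * avg_error - 2 * disagree                          # bound (4); binary classification only
    competent = 4 * (K - 1) / K * (avg_error - 0.5 * disagree)     # bound (6); competent ensembles
    return first_order, c_bound, binary, competent
```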

3 The Polarization of an Ensemble

In this section, we introduce a new quantity, \eta_{\rho}, which we refer to as the polarization of an ensemble \rho. First, we provide examples of what this quantity represents and draw a connection to previous studies. Then, we present theoretical and empirical results showing that this quantity plays a fundamental role in relating the majority vote error rate to the average error rate and the disagreement. In Theorem 1, we prove an upper bound on the polarization \eta_{\rho}, which highlights a fundamental relationship between the polarization and the constant \frac{4}{3}. Inspired by the theorem, we propose Conjecture 1, which we call a neural polarization law. Figures 1 and 2 present empirical results on an image recognition task that corroborate the conjecture.

We start by defining the polarization of an ensemble. In essence, the polarization is an improved (smaller) coefficient in Markov's inequality applied to \mathbb{P}_{\mathcal{D}}(W_{\rho} > 1/2), where W_{\rho} is the point-wise error rate defined in equation (1). It measures how much the ensemble is “polarized” away from the truth, taking into account the distribution of W_{\rho}.

Definition 1 (Polarization).

An ensemble \rho is \eta-polarized if

\eta\,\mathbb{E}_{\mathcal{D}}[W_{\rho}^{2}] \geq \mathbb{P}_{\mathcal{D}}(W_{\rho} > 1/2).   (7)

The polarization of an ensemble \rho is

\eta_{\rho} := \frac{\mathbb{P}_{\mathcal{D}}(W_{\rho} > 1/2)}{\mathbb{E}_{\mathcal{D}}[W_{\rho}^{2}]},   (8)

which is the smallest value of \eta that satisfies inequality (7).

Note that the polarization always takes a value in [0, 4], due to the positivity constraint and Markov's inequality. Also note that an ensemble \rho with polarization \eta_{\rho} is \eta-polarized for any \eta \geq \eta_{\rho}.

To better understand what this quantity represents, consider the following examples. The first example demonstrates that polarization increases as the majority vote becomes more polarized from the truth, while the second demonstrates how polarization increases when the constituent classifiers are more evenly split. (A short numerical check of these examples is sketched after Example 2.)

Example 1.

Consider an ensemble \rho where 75% of the classifiers output Label 1 with probability one, and the other 25% of the classifiers output Label 2 with probability one.

  - Case 1. The true label is Label 1 for the whole dataset.
    In this case, the majority vote in \rho has zero error rate. The point-wise error rate W_{\rho} is 0.25 on the entire dataset, and thus \mathbb{P}_{\mathcal{D}}(W_{\rho} > 0.5) = 0. The polarization \eta_{\rho} is 0.

  - Case 2. The true label is Label 1 for half of the data and Label 2 for the other half.
    In this case, the majority vote is correct only for half of the data. The point-wise error rate W_{\rho} is 0.25 for this half and 0.75 for the other half. The polarization \eta_{\rho} is 0.5/0.3125 = 1.6.

  - Case 3. The true label is Label 2 for the whole dataset.
    In this case, the majority vote in \rho is wrong on every data point. The point-wise error rate W_{\rho} is 0.75 on the entire dataset, and thus \mathbb{P}_{\mathcal{D}}(W_{\rho} > 0.5) = 1. The polarization \eta_{\rho} is 1/0.5625 \approx 1.78.

Example 2.

Now consider an ensemble \rho in which 51% of the classifiers always output Label 1, and the other 49% of the classifiers always output Label 2. Under the same three cases as in Example 1:

  - Case 1. The polarization \eta_{\rho} is again 0, the same as in Example 1.

  - Case 2. The polarization \eta_{\rho} is 0.5/0.2501 \approx 2, which is larger than the 1.6 in Example 1.

  - Case 3. The polarization \eta_{\rho} is 1/0.2601 \approx 3.84, which is larger than the 1.78 in Example 1.
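As a quick numerical check of these examples, the following is a minimal NumPy sketch (ours, not from the paper) that evaluates the empirical polarization (8) from an array of point-wise error rates and reproduces Example 1, Case 2.

```python
import numpy as np

def polarization(W):
    """Empirical polarization eta_rho = P_D(W_rho > 1/2) / E_D[W_rho^2], as in eq. (8)."""
    W = np.asarray(W, dtype=float)
    second_moment = np.mean(W ** 2)        # E_D[W_rho^2]; assumed nonzero
    tail = np.mean(W > 0.5)                # P_D(W_rho > 1/2)
    return tail / second_moment

# Example 1, Case 2: W_rho = 0.25 on half of the data and 0.75 on the other half
W_case2 = np.array([0.25] * 50 + [0.75] * 50)
print(polarization(W_case2))               # 0.5 / 0.3125 = 1.6
```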

In addition, the following proposition draws a connection between polarization and the competence condition mentioned in Section 2.2. It states that the polarization of competent ensembles cannot be very large. The proof is deferred to Appendix A.2.

Proposition 1.

Competent ensembles are 2-polarized.

Now we delve more into this new quantity. We introduce Theorem 1, which establishes (by means of concentration inequalities) an upper bound on the polarization \eta_{\rho}. The proof of Theorem 1 is deferred to Appendix B.1.

Theorem 1.

Let \{(X_{i}, Y_{i})\}_{i=1}^{m} be independent and identically distributed samples from \mathcal{D} that are independent of an ensemble \rho. Then the polarization of the ensemble, \eta_{\rho}, satisfies

\eta_{\rho} \leq \max\left\{\frac{4}{3},\ \left(\frac{\sqrt{\frac{3}{8m}\log\frac{1}{\delta}} + \sqrt{\frac{3}{8m}\log\frac{1}{\delta} + 4SP}}{2S}\right)^{2}\right\},   (9)

with probability at least 1-\delta, where S = \frac{1}{m}\sum_{i=1}^{m} W_{\rho}^{2}(X_{i}, Y_{i}) and P = \frac{1}{m}\sum_{i=1}^{m}\mathds{1}(W_{\rho}(X_{i}, Y_{i}) > 1/2).
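For a sense of how the high-probability bound (9) behaves for a finite sample, here is a small sketch (ours; the inputs W and delta are hypothetical) that evaluates its right-hand side from the empirical quantities S and P.

```python
import numpy as np

def polarization_upper_bound(W, delta=0.05):
    """Right-hand side of the bound (9), given point-wise error rates W on m test points."""
    W = np.asarray(W, dtype=float)
    m = W.size
    S = np.mean(W ** 2)                    # S = (1/m) sum_i W_rho^2(X_i, Y_i); assumed > 0
    P = np.mean(W > 0.5)                   # P = (1/m) sum_i 1(W_rho(X_i, Y_i) > 1/2)
    c = np.sqrt(3.0 / (8.0 * m) * np.log(1.0 / delta))
    return max(4.0 / 3.0, ((c + np.sqrt(c ** 2 + 4.0 * S * P)) / (2.0 * S)) ** 2)
```

For large m and small P relative to S, the second term in the maximum shrinks and the bound reduces to the constant 4/3.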

Surprisingly, in practice, \eta_{\rho} = \frac{4}{3} appears to be a good choice for a wide variety of cases. See Figure 1 and Figure 2, which show the polarization \eta_{\rho} obtained from VGG11 SZ (14), DenseNet40 HLVDMW (17), and ResNet18, ResNet50, and ResNet101 HZRS (16) trained on CIFAR-10 Kri (09) with various hyperparameter choices. The trend does not change even when evaluated on an out-of-distribution dataset, CIFAR-10.1 RRSS (18); TFF (08). For more details on these empirical results, see Appendix C.

Figure 1: Polarization \eta_{\rho} obtained from ResNet18 trained on CIFAR-10 with various sets of hyperparameters, tested on (a) the out-of-sample CIFAR-10 test set and (b) an out-of-distribution dataset, CIFAR-10.1. The red dashed line indicates y = 4/3, the value of the polarization suggested by Theorem 1 and Conjecture 1.
Figure 2: Polarization \eta_{\rho} obtained (a) from various architectures trained on CIFAR-10 and (b) only from interpolating classifiers trained on various datasets. The red dashed line indicates y = 4/3. In subplot (b), we observe that the polarizations of all interpolating models except one are smaller than 4/3, which aligns with Conjecture 1.

Remark.

We emphasize that values of \eta_{\rho} larger than \frac{4}{3} do not contradict Theorem 1. This happens when the non-constant second term in (9) is larger than \frac{4}{3}, which is often the case for classifiers that are not interpolating (or, indeed, that underfit or perform poorly).

Definition 2 (Interpolating, BHMM (19)).

A classifier is interpolating if it achieves an accuracy of 100% on the training data.

Putting Theorem 1 and the consistent empirical trend shown in Figure 2(b) together, we propose the following conjecture.

Conjecture 1 (Neural Polarization Law).

The polarization of ensembles comprised of independently trained interpolating neural networks is smaller than \frac{4}{3}.

4 Entropy-Restricted Ensembles

In this section, we first present an upper bound on the majority vote error rate, L(h_{\rho}^{\mathrm{MV}}), in Theorem 2, using the notion of polarization \eta_{\rho} introduced and defined in the previous section. Then, we present Theorems 3 and 4, which are the main elements in obtaining tighter upper bounds on L(h_{\rho}^{\mathrm{MV}}). Figure 3 shows that our proposed bound offers a significant improvement over state-of-the-art results. The new upper bounds are inspired by the fact that classifier prediction probabilities tend to concentrate on a small number of labels, rather than being uniformly spread over all possible labels. This is analogous to the phenomenon of neural collapse Kot (22). As an example, in the context of a computer vision model, when presented with a photo of a dog, one might expect that a large portion of reasonable models might classify the photo as an animal other than a dog, but not as a car or an airplane.

We start by stating an upper bound on the majority vote error, L(h_{\rho}^{\mathrm{MV}}), as a function of the polarization \eta_{\rho}. This upper bound is tighter (smaller) than the previous bound in inequality (6) when the polarization is smaller than 2, which is the case for competent ensembles. The proof is deferred to Appendix B.2.

Theorem 2.

For an ensemble \rho of K-class classifiers,

L(h_{\rho}^{\mathrm{MV}}) \leq \frac{2\eta_{\rho}(K-1)}{K}\left(\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right),

where \eta_{\rho} is the polarization of the ensemble \rho.

Based on the upper bound stated in Theorem 2, we add a restriction on the entropy of the constituent classifiers to obtain Theorem 3. The theorem provides a tighter, scalable bound that has no explicit dependence on the total number of labels, at a small cost in terms of the entropy of the constituent classifiers. The proof of Theorem 3 is deferred to Appendix B.3.

Theorem 3.

Let \rho be any \eta-polarized ensemble of K-class classifiers that satisfies \mathbb{P}_{\rho}(h(x) \notin A(x)) \leq \Delta, where y \in A(x) \subset [K] and |A(x)| \leq M, for all data points (x, y) \in \mathcal{D}. Then, we have

L(h_{\rho}^{\mathrm{MV}}) \leq \frac{2\eta(M-1)}{M}\left[\left(1 + \frac{\Delta}{M-1}\right)\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right].

While Theorem 3 might provide a tighter bound than prior work, coming up with pairs (M, \Delta) that satisfy the constraint is not an easy task. This is not an issue for a discrete ensemble, however: if \rho is a discrete distribution over N classifiers, then the assumption of Theorem 3 always holds with (M, \Delta) = (N+1, 0). We state this as the following corollary.

Corollary 1 (Finite Ensemble).

For an ensemble \rho that is a weighted sum of N classifiers, we have

L(h_{\rho}^{\mathrm{MV}}) \leq \frac{2\eta_{\rho} N}{N+1}\left(\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right),   (10)

where \eta_{\rho} is the polarization of the ensemble \rho.
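To show how Corollary 1 would be evaluated in practice, here is a sketch (ours, with hypothetical inputs preds and labels as in the earlier sketch) that computes the right-hand side of (10) for an equally weighted ensemble of N classifiers; it assumes the ensemble makes at least some errors so that the polarization is well defined.

```python
import numpy as np

def corollary1_bound(preds, labels):
    """Right-hand side of (10) for an equally weighted ensemble given (N, m) predictions."""
    N, m = preds.shape
    wrong = (preds != labels[None, :])
    W = wrong.mean(axis=0)                                   # point-wise error rate W_rho
    avg_error = wrong.mean()                                 # E_rho[L(h)]
    disagree = np.mean([(preds[i] != preds[j]).mean()        # E_{rho^2}[D(h, h')]
                        for i in range(N) for j in range(N)])
    eta = np.mean(W > 0.5) / np.mean(W ** 2)                 # polarization eta_rho, eq. (8)
    return 2.0 * eta * N / (N + 1) * (avg_error - 0.5 * disagree)
```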

Figure 3: Comparison of our new bound from Corollary 1 (black), which is the right-hand side of inequality (10), with bounds from previous studies. Green corresponds to the C-bound in inequality (3), and blue corresponds to the right-hand side of inequality (6). ResNet18, ResNet50, and ResNet101 models with various sets of hyperparameters are trained on CIFAR-10 and then tested on (a) the out-of-sample CIFAR-10 test set and (b) an out-of-distribution dataset, CIFAR-10.1.

See Figure 3, which provides empirical results comparing the bound in Corollary 1 with the C-bound in inequality (3) and with inequality (6) proposed in TKY+ (24). We observe that the new bound in Corollary 1 is strictly tighter than the others. For more details on these empirical results, see Appendix C.

Although the bound in Corollary 1 is tighter than the bounds from previous studies, it is still not tight enough to be used as an estimator for L(h_{\rho}^{\mathrm{MV}}). In the following theorem, we use a stronger condition on the entropy of an ensemble to obtain a tighter bound. The proof is deferred to Appendix B.4.

Theorem 4.

For any \eta-polarized ensemble \rho that satisfies

\frac{1}{2}\mathbb{E}_{\mathcal{D}}\left[\mathbb{P}_{\rho^{2}}\left(h(X)\neq Y,\, h'(X)\neq Y,\, h(X)\neq h'(X)\right)\right] \leq \varepsilon\,\mathbb{E}_{\mathcal{D}}\left[\mathbb{P}_{\rho}\left(h(X)\neq Y\right)\right],   (11)

we have

L(h_{\rho}^{\mathrm{MV}}) \leq \eta\left[\left(1+\varepsilon\right)\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right].

The condition (11) can be rephrased as follows: compared to the error \mathbb{P}_{\rho}(h(x) \neq y), the entropy of the distribution of wrong predictions is small, i.e., the wrong predictions are concentrated on a small number of labels. A potential problem is that one must know or estimate the smallest possible value of \varepsilon in advance. At the very least, we can prove that \varepsilon = \frac{K-2}{2(K-1)} always satisfies condition (11) for an ensemble of K-class classifiers. The proof is deferred to Appendix B.4.

Corollary 2.

For any \eta-polarized ensemble \rho of K-class classifiers, we have

L(h_{\rho}^{\mathrm{MV}}) \leq \eta\left[\left(1 + \frac{K-2}{2(K-1)}\right)\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right].

Naturally, this choice of \varepsilon is not good enough for our goal. We discuss how to estimate the smallest possible value of \varepsilon in the following section.

5 A Universal Law for Ensembling

In this section, our goal is to predict the majority vote error rate of an ensemble with a large number of classifiers using only information obtained from an ensemble with a small number of classifiers, e.g., three. Among the elements in the bound of Theorem 4,

\eta\left[\left(1+\varepsilon\right)\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right],

we plug in \eta = \frac{4}{3} as a result of Theorem 1; and since \mathbb{E}_{\rho}[L(h)] is invariant to the number of classifiers, it remains to predict the behavior of \mathbb{E}_{\rho^{2}}[D(h,h')] and of the smallest possible value of \varepsilon, namely \varepsilon_{\rho} = \frac{\mathbb{E}_{\mathcal{D}}\left[\mathbb{P}_{\rho^{2}}\left(h(X)\neq Y,\, h'(X)\neq Y,\, h(X)\neq h'(X)\right)\right]}{2\,\mathbb{E}_{\mathcal{D}}\left[\mathbb{P}_{\rho}\left(h(X)\neq Y\right)\right]}. Since the denominator \mathbb{E}_{\mathcal{D}}\left[\mathbb{P}_{\rho}\left(h(X)\neq Y\right)\right] = \mathbb{E}_{\rho}[L(h)] is invariant to the number of classifiers, and the numerator resembles the disagreement between classifiers, \varepsilon_{\rho} is expected to follow a similar pattern to \mathbb{E}_{\rho^{2}}[D(h,h')]. Note that the numerator of \varepsilon_{\rho} has the same form as the disagreement, differing only in that the true label is excluded. Both are V-statistics that can be expressed as a multiple of a U-statistic, as shown in equation (12). In the next theorem, we show that the disagreement for a finite number of classifiers can be expressed as the sum of a hyperbolic curve and an unbiased random walk. Here, [x] denotes the greatest integer less than or equal to x, and \mathcal{D}[0,1] is the Skorokhod space on [0,1] (see Appendix B.5).

Theorem 5.

Let \rho_{N} denote an empirical distribution of N independent classifiers \{h_{i}\}_{i=1}^{N} sampled from a distribution \rho, and let \sigma_{1}^{2} = \mathsf{Var}_{h\sim\rho}\left(\mathbb{E}_{h'\sim\rho}\,\mathbb{P}_{\mathcal{D}}(h(X)\neq h'(X))\right). Then, there exists D_{\infty} > 0 such that

\mathbb{E}_{(h,h')\sim\rho_{N}^{2}}[D(h,h')] = \left(1-\frac{1}{N}\right)\left(D_{\infty} + \frac{2}{\sqrt{N}}Z_{N}\right),

where \mathbb{E}Z_{N} = 0, \mathsf{Var}\,Z_{N} \to \sigma_{1}^{2}, and \{\frac{\sqrt{t}}{\sigma_{1}}Z_{[Nt]}\}_{t\in[0,1]} converges weakly to a standard Wiener process in \mathcal{D}[0,1] as N \to \infty.

Proof.

Let \Phi(h_{i}, h_{j}) = \mathbb{P}_{\mathcal{D}}(h_{i}(X)\neq h_{j}(X)). We observe that

\frac{N^{2}}{N(N-1)}\,\mathbb{E}_{(h,h')\sim\rho_{N}^{2}}[D(h,h')] = \frac{1}{N(N-1)}\sum_{i,j=1}^{N}\mathbb{P}_{\mathcal{D}}(h_{i}(X)\neq h_{j}(X))
= \frac{1}{N(N-1)}\sum_{i,j=1}^{N}\Phi(h_{i},h_{j}) \underset{\substack{\Phi\text{ symmetric}\\ \Phi(h_{i},h_{i})=0}}{=} \frac{2}{N(N-1)}\sum_{1\leq i<j\leq N}\Phi(h_{i},h_{j}) \eqqcolon U_{N},   (12)

which is a U-statistic with the kernel function \Phi. Let \Phi_{0} = \mathbb{E}_{(h,h')\sim\rho^{2}}\Phi(h,h').

The invariance principle for U-statistics (Theorem 7 in Appendix B.5) states that the process \xi_{N} = (\xi_{N}(t),\, t\in[0,1]), defined by \xi_{N}(\frac{k}{N}) = \frac{k}{2\sqrt{N\sigma_{1}^{2}}}(U_{k}-\Phi_{0}) and \xi_{N}(t) = \xi_{N}(\frac{[Nt]}{N}), converges weakly to a standard Wiener process in \mathcal{D}[0,1] as N\to\infty, since \sigma_{1}^{2} = \mathsf{Var}_{h\sim\rho}\mathbb{E}_{h'\sim\rho}\Phi(h,h'). Therefore, U_{N} converges in probability to D_{\infty} \coloneqq \Phi_{0} as N\to\infty.

Letting Z_{N} = \sigma_{1}\xi_{N}(1) = \frac{\sqrt{N}}{2}(U_{N}-D_{\infty}), we can express U_{N} as U_{N} = D_{\infty} + \frac{2}{\sqrt{N}}Z_{N}, with \mathbb{E}Z_{N} = 0 and \mathsf{Var}\,Z_{N} \to \sigma_{1}^{2}. Since \frac{\sqrt{t}}{\sigma_{1}}Z_{[Nt]} = \sqrt{\frac{Nt}{[Nt]}}\,\xi_{N}(\frac{[Nt]}{N}) = \sqrt{\frac{Nt}{[Nt]}}\,\xi_{N}(t), it follows from Slutsky's theorem that \{\frac{\sqrt{t}}{\sigma_{1}}Z_{[Nt]}\}_{t\in[0,1]} converges weakly to a standard Wiener process in \mathcal{D}[0,1] as N\to\infty. ∎

Theorem 5 suggests that the disagreement within N classifiers, \mathbb{E}_{\rho_{N}^{2}}[D(h,h')], can be approximated as \frac{N-1}{N}D_{\infty}. From the disagreement within M (\ll N) classifiers, D_{\infty} can be approximated as \frac{M}{M-1}\mathbb{E}_{\rho_{M}^{2}}[D(h,h')], and therefore we get

\mathbb{E}_{\rho_{N}^{2}}[D(h,h')] \approx \frac{N-1}{N}\cdot\frac{M}{M-1}\,\mathbb{E}_{\rho_{M}^{2}}[D(h,h')].   (13)
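As a quick numerical illustration of the scaling in (13), here is a toy simulation (ours, with synthetic classifiers rather than neural networks): each of N = 100 independent classifiers flips the true label of 20% of the examples, and we compare the disagreement of the full ensemble with the value extrapolated from its first M = 3 members.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, N, M = 10, 5000, 100, 3
labels = rng.integers(0, K, size=m)

# each synthetic classifier independently corrupts 20% of the examples with a random wrong label
def sample_classifier():
    preds = labels.copy()
    flip = rng.random(m) < 0.2
    preds[flip] = (labels[flip] + rng.integers(1, K, size=int(flip.sum()))) % K
    return preds

preds = np.stack([sample_classifier() for _ in range(N)])

def disagreement(P):
    n = P.shape[0]
    return np.mean([(P[i] != P[j]).mean() for i in range(n) for j in range(n)])

d_N = disagreement(preds)                      # E_{rho_N^2}[D(h, h')] over all N classifiers
d_M = disagreement(preds[:M])                  # same quantity from the first M classifiers
d_N_hat = (N - 1) / N * M / (M - 1) * d_M      # extrapolation (13)
print(d_N, d_N_hat)                            # the two values should be close
```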

Assume that we have three classifiers sampled from \rho. We denote the average error rate, the disagreement, and the value of \varepsilon_{\rho} computed from these three classifiers by \mathbb{E}_{3}[L(h)], \mathbb{E}_{3}[D(h,h')], and \varepsilon_{3}, respectively. Then, from Theorem 4 and approximation (13) (which applies to both the disagreement and \varepsilon_{\rho}), we estimate the majority vote error rate of N classifiers from \rho as follows:

L(h_{\rho}^{\mathrm{MV}}) \lessapprox \frac{4}{3}\left[\left(1 + \frac{N-1}{N}\cdot\frac{3}{2}\cdot\varepsilon_{3}\right)\mathbb{E}_{3}[L(h)] - \frac{N-1}{N}\cdot\frac{3}{2}\cdot\frac{1}{2}\mathbb{E}_{3}[D(h,h')]\right]
= \frac{4}{3}\left[\mathbb{E}_{3}[L(h)] + \frac{3(N-1)}{2N}\left(\varepsilon_{3}\,\mathbb{E}_{3}[L(h)] - \frac{1}{2}\mathbb{E}_{3}[D(h,h')]\right)\right].   (14)

Alternatively, we can use the polarization measured from the three classifiers, \eta_{3}, instead of \eta = \frac{4}{3}, to obtain:

L(h_{\rho}^{\mathrm{MV}}) \approx \eta_{3}\left[\mathbb{E}_{3}[L(h)] + \frac{3(N-1)}{2N}\left(\varepsilon_{3}\,\mathbb{E}_{3}[L(h)] - \frac{1}{2}\mathbb{E}_{3}[D(h,h')]\right)\right].   (15)
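To show how one would apply this recipe, here is a sketch (ours; preds3 is a hypothetical (3, m) array of predictions from three sampled classifiers, and labels the true labels) that computes \mathbb{E}_{3}[L(h)], \mathbb{E}_{3}[D(h,h')], \varepsilon_{3}, and \eta_{3}, and returns the extrapolated estimates (14) and (15) for an ensemble of N classifiers.

```python
import numpy as np

def extrapolate_mv_error(preds3, labels, N):
    """Estimates (14) and (15) of the N-classifier majority vote error from 3 classifiers."""
    n, m = preds3.shape                                   # n = 3 sampled classifiers
    wrong = (preds3 != labels[None, :])

    avg_error = wrong.mean()                              # E_3[L(h)]; assumed nonzero
    W = wrong.mean(axis=0)                                # point-wise error rate of the 3-ensemble
    eta3 = np.mean(W > 0.5) / np.mean(W ** 2)             # measured polarization eta_3

    pairs = [(i, j) for i in range(n) for j in range(n)]
    disagree = np.mean([(preds3[i] != preds3[j]).mean() for i, j in pairs])        # E_3[D(h, h')]
    both_wrong_diff = np.mean([(wrong[i] & wrong[j] & (preds3[i] != preds3[j])).mean()
                               for i, j in pairs])
    eps3 = both_wrong_diff / (2.0 * avg_error)            # smallest eps satisfying (11) for rho_3

    core = avg_error + 3.0 * (N - 1) / (2.0 * N) * (eps3 * avg_error - 0.5 * disagree)
    return (4.0 / 3.0) * core, eta3 * core                # estimates (14) and (15)
```

Under Conjecture 1, the first returned estimate (with \eta = 4/3) is the default choice; the second substitutes the measured polarization \eta_{3}, as in (15).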
Figure 4: Comparison of the estimated (extrapolated) majority vote error rates in equations (14) (blue dashed lines) and (15) (orange dashed lines) with the true majority vote error (green solid line) for each number of classifiers. The solid sky-blue line corresponds to the average error rate of the constituent classifiers. Subplots (a1), (b), (c), (d), and (e) show the results for different (classification model, dataset) pairs. Subplot (a2) overlays the right-hand side of inequality (3) (C-bound, red) and inequality (6) (TKY+ (24) bound, purple) on subplot (a1). These two quantities from previous studies are much larger than the average error rate. We see the same pattern for the other (architecture, dataset) pairs, which we therefore omit from the plot. For more details on these empirical results, see Appendix C.

Figure 4 presents empirical results that compare the estimated (extrapolated) majority vote error rates in equations (14) and (15) with the true majority vote error for each number of classifiers. ResNet18 models are tested on four different datasets: CIFAR-10, CIFAR-10.1, Fashion-MNIST XRV (17), and Kuzushiji-MNIST CBIK+ (18), where the models are trained on the corresponding training data. MobileNet How (17) is trained and tested on the MNIST Den (12) dataset. Not only do the estimators show significant improvement over the bounds introduced in Section 2.2, we also observe that the estimators are very close to the actual majority vote error rate; thus the estimators have practical uses, unlike the bounds from previous studies. In Figure 4(a2), the existing bounds (3) and (6) are much larger than the average error rate. This is also the case for the (architecture, dataset) pairs of the other subplots.

6 Discussion and Conclusion

This work addresses the question: how does the majority vote error rate change with the number of classifiers? While this is an age-old question, it is one that has received renewed interest in recent years. On the journey to answering the question, we introduce several new ideas of independent interest. (1) We introduced the polarization, \eta_{\rho}, of an ensemble of classifiers. This notion plays an important role throughout this paper and appears in every upper bound presented. Although Theorem 1 gives some insight into polarization, our conjectured neural polarization law (Conjecture 1) is yet to be proved or disproved, and it provides an exciting avenue for future work. (2) We proposed two classes of ensembles whose entropy is restricted in different ways. Without these constraints, there will always be examples that saturate even the least useful majority vote error bounds. We believe that accurately describing how models behave in terms of the entropy of their output is key to precisely characterizing the behavior of the majority vote, and likely of other ensembling methods.

Throughout this paper, we have theoretically and empirically demonstrated that the polarization is fairly invariant to the hyperparameters and architecture of the classifiers. We also proved a tight bound on the majority vote error, under an assumption involving another quantity, \varepsilon, and we showed how the components of this tight bound behave as the number of classifiers changes. Altogether, we have sharpened bounds on the majority vote error to the extent that we are able to identify the trend of the majority vote error rate in terms of the number of classifiers.

We close with one final remark regarding the metrics used to evaluate an ensemble. The majority vote error rate is the most common and popular metric used to measure the performance of an ensemble. However, it seems unlikely that a practitioner would consider an ensemble to have performed adequately if the majority vote conclusion was correct, but was only reached by a relatively small fraction of the classifiers. With the advent of large language models, it is worth considering whether the majority vote error rate is still as valuable. The natural alternative in this regard is the probability \mathbb{P}_{\mathcal{D}}(W_{\rho} > 1/2), that is, the probability that more than half of the classifiers (by \rho-weight) are incorrect. This quantity is especially well-behaved, and it frequently appears in our proofs. (Indeed, every bound presented in this work serves as an upper bound for \mathbb{P}_{\mathcal{D}}(W_{\rho} > 1/2).) We conjecture that this quantity is useful much more generally.

Acknowledgements.

We would like to thank the DOE, IARPA, NSF, and ONR for providing partial support of this work.

References

  • BHMM [19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • Bil [13] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2nd edition, 2013.
  • BJRK [22] Christina Baek, Yiding Jiang, Aditi Raghunathan, and Zico Kolter. Agreement-on-the-line: Predicting the performance of neural networks under distribution shift. Advances in Neural Information Processing Systems, 35:19274–19289, 2022.
  • CBIK+ [18] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718, 2018.
  • Den [12] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • HLVDMW [17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
  • How [17] Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • HZRS [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • JNBK [22] Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, and J Zico Kolter. Assessing generalization of SGD via disagreement. In International Conference on Learning Representations, 2022.
  • Kal [21] Olav Kallenberg. Foundations of modern probability. Springer, 3rd edition, 2021.
  • KB [13] Vladimir S. Korolyuk and Yu V. Borovskich. Theory of U-statistics. Springer Science & Business Media, 2013.
  • Kot [22] Vignesh Kothapalli. Neural collapse: A review on modelling principles and generalization. arXiv preprint arXiv:2206.04041, 2022.
  • Kri [09] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • LMRR [17] François Laviolette, Emilie Morvant, Liva Ralaivola, and Jean-Francis Roy. Risk upper bounds for general ensemble methods with an application to multiclass classification. Neurocomputing, 219:15–25, 2017.
  • LPB [17] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
  • McA [98] David A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on Computational Learning Theory, pages 230–234, 1998.
  • MLIS [20] Andrés Masegosa, Stephan Lorenzen, Christian Igel, and Yevgeny Seldin. Second order PAC-Bayesian bounds for the weighted majority vote. Advances in Neural Information Processing Systems, 33:5263–5273, 2020.
  • RRSS [18] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451, 2018.
  • SZ [14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • TFF [08] Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
  • TKY+ [24] Ryan Theisen, Hyunsuk Kim, Yaoqing Yang, Liam Hodgkinson, and Michael W. Mahoney. When are ensembles really effective? Advances in Neural Information Processing Systems, 36, 2024.
  • XRV [17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Appendix A More discussion on competence

In this section, we delve more into the competence condition that was introduced in [21]. We explore the cases in which the competence condition might not hold and how to overcome these issues. We discuss a few milder versions of competence that are sufficient for the bounds (5) and (6) to hold. Then we discuss how to check whether these weaker competence conditions hold in practice, with or without a separate validation set. We start by formally stating the original competence condition.

Definition 3 (Competence, [21]).

The ensemble \rho is competent if, for every 0 \leq t \leq 1/2,

\mathbb{P}_{\mathcal{D}}(W_{\rho} \in [t, 1/2)) \geq \mathbb{P}_{\mathcal{D}}(W_{\rho} \in [1/2, 1-t]).   (16)

A.1 Cases when competence fails

One tricky part of the definition of competence is that it requires inequality (16) to hold for every 0 \leq t \leq 1/2. In the case t = 1/2, the inequality becomes

0 \geq \mathbb{P}_{\mathcal{D}}(W_{\rho} = 1/2).

This is not a significant issue when \rho is a continuous distribution over classifiers, e.g., a Bayes posterior or a distribution over a parametric family h_{\theta}, as \{W_{\rho} = 1/2\} would be a measure-zero set. In the case that \rho is a discrete distribution over a finite number of classifiers, however, \mathbb{P}_{\mathcal{D}}(W_{\rho} = 1/2) is likely to be positive, in which case the competence condition can be violated.

That being said, \{(x, y) \mid W_{\rho}(x, y) = 1/2\} represents tricky data points that deserve separate attention. This event can be divided into two cases: (1) all the classifiers that made an incorrect prediction output the same label; or (2) the incorrect predictions consist of multiple labels, so that the majority vote outputs the true label. Of these two possibilities, the first case is troublesome. We denote the set of such data points by \text{TIE}(\rho, \mathcal{D}):

\text{TIE}(\rho, \mathcal{D}) := \{(x, y) \mid \mathbb{P}_{\rho}(h(x) = j) = \mathbb{P}_{\rho}(h(x) = y) = 1/2 \text{ for the true label } y \text{ and some incorrect label } j\}.

In this case, the true label and an incorrect label receive exactly the same \rho-weight of classifiers. An easy way to resolve this issue is to slightly tweak the weights. For instance, if \rho is an equally weighted sum of two classifiers, we can change their weights to (1/2 + \epsilon, 1/2 - \epsilon) instead of (1/2, 1/2). This change may seem manipulative, but it corresponds to a deterministic tie-breaking rule that prioritizes one classifier over the other, which is a commonly used tie-breaking rule.

Definition 4 (Tie-free ensemble).

An ensemble is tie-free if \mathbb{P}_{\mathcal{D}}(\text{TIE}(\rho, \mathcal{D})) = 0.

Proposition 2.

An ensemble with a deterministic tie-breaking rule is tie-free.

With such a tweak making the set \text{TIE}(\rho, \mathcal{D}) empty or of measure zero, we present a slightly milder condition that is sufficient for the bounds (5) and (6) to still hold.

Definition 5 (Semi-competence).

The ensemble \rho is semi-competent if, for every 0 \leq t < 1/2,

\mathbb{P}_{\mathcal{D}}(W_{\rho} \in [t, 1/2]) \geq \mathbb{P}_{\mathcal{D}}(W_{\rho} \in (1/2, 1-t]).   (17)

Note that inequality (17) is a strictly weaker condition than inequality (16); hence competence implies semi-competence. The converse is not true: an ensemble is semi-competent even if the point-wise error W_{\rho}(X, Y) = 1/2 at every data point, but such an ensemble is not competent.

Theorem 6.

For a tie-free and semi-competent ensemble \rho, we have L(h_{\rho}^{\mathrm{MV}}) \leq \mathbb{E}_{\rho}[L(h)] and

L(h_{\rho}^{\mathrm{MV}}) \leq \frac{4(K-1)}{K}\left(\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right)

in the K-class classification setting.

We provide the proof as a separate subsection below.

A.2 Proof of Theorem 6 and Proposition 1

We start with the following lemma, which is a semi-competence version of Lemma 2 from [21].

Lemma 1.

For a semi-competent ensemble \rho and any increasing function g satisfying g(0) = 0,

\mathbb{E}_{\mathcal{D}}[g(W_{\rho})\mathds{1}_{W_{\rho}\leq 1/2}] \;\geq\; \mathbb{E}_{\mathcal{D}}[g(\widetilde{W_{\rho}})\mathds{1}_{\widetilde{W_{\rho}}<1/2}],

where \widetilde{W_{\rho}} = 1 - W_{\rho}.

Proof.

For every x \in [0,1], it holds that

\mathbb{P}_{\mathcal{D}}(W_{\rho}\mathds{1}_{W_{\rho}\leq 1/2} \geq x) = \mathbb{P}_{\mathcal{D}}(W_{\rho}\in[x,1/2])\,\mathds{1}_{x\leq 1/2},
\mathbb{P}_{\mathcal{D}}(\widetilde{W_{\rho}}\mathds{1}_{\widetilde{W_{\rho}}<1/2} \geq x) = \mathbb{P}_{\mathcal{D}}(\widetilde{W_{\rho}}\in[x,1/2))\,\mathds{1}_{x\leq 1/2} = \mathbb{P}_{\mathcal{D}}(W_{\rho}\in(1/2,1-x])\,\mathds{1}_{x\leq 1/2}.

From the definition of semi-competence, this implies that \mathbb{P}_{\mathcal{D}}(W_{\rho}\mathds{1}_{W_{\rho}\leq 1/2} \geq x) \geq \mathbb{P}_{\mathcal{D}}(\widetilde{W_{\rho}}\mathds{1}_{\widetilde{W_{\rho}}<1/2} \geq x) for every x \in [0,1]. Using the fact that g(x\,\mathds{1}_{x\leq c}) = g(x)\mathds{1}_{x\leq c} for any increasing function g with g(0) = 0, we obtain

\mathbb{P}_{\mathcal{D}}(g(W_{\rho})\mathds{1}_{W_{\rho}\leq 1/2} \geq x) \geq \mathbb{P}_{\mathcal{D}}(g(\widetilde{W_{\rho}})\mathds{1}_{\widetilde{W_{\rho}}<1/2} \geq x).

Putting these together with the well-known identity \mathbb{E}X = \int_{0}^{\infty}\mathbb{P}(X\geq x)\,\mathrm{d}x for a non-negative random variable X proves the lemma. ∎

Now we use Lemma 1 and Theorem 2 to prove Theorem 6.

Proof of Theorem 6.

Applying Lemma 1 with g(x) = 2x^{2} gives

\mathbb{E}_{\mathcal{D}}[2W_{\rho}^{2}\mathds{1}_{W_{\rho}\leq 1/2}] \geq \mathbb{E}_{\mathcal{D}}[2\widetilde{W_{\rho}}^{2}\mathds{1}_{\widetilde{W_{\rho}}<1/2}] = \mathbb{E}_{\mathcal{D}}[(2-4W_{\rho}+2W_{\rho}^{2})\mathds{1}_{W_{\rho}>1/2}].   (18)

Putting this together with the following decomposition of \mathbb{E}_{\mathcal{D}}[2W_{\rho}^{2}] shows that the ensemble \rho is 2-polarized:

\mathbb{E}_{\mathcal{D}}[2W_{\rho}^{2}] \geq \mathbb{E}_{\mathcal{D}}[2W_{\rho}^{2}\mathds{1}_{W_{\rho}>1/2}] + \mathbb{E}_{\mathcal{D}}[2W_{\rho}^{2}\mathds{1}_{W_{\rho}\leq 1/2}]
\geq \mathbb{E}_{\mathcal{D}}[2W_{\rho}^{2}\mathds{1}_{W_{\rho}>1/2}] + \mathbb{E}_{\mathcal{D}}[(2-4W_{\rho}+2W_{\rho}^{2})\mathds{1}_{W_{\rho}>1/2}] \qquad \text{(by (18))}
\geq \mathbb{E}_{\mathcal{D}}[(1-2W_{\rho})^{2}\mathds{1}_{W_{\rho}>1/2}] + \mathbb{P}_{\mathcal{D}}(W_{\rho}>1/2)
\geq \mathbb{P}_{\mathcal{D}}(W_{\rho}>1/2).   (19)

Therefore, applying Theorem 2 with constant \eta = 2 concludes the proof. ∎

We also state the following proof of Proposition 1 for completeness.

Proof of Proposition 1.

Inequality (19) with Lemma 3 proves the proposition. ∎

A.3 Example that the bound (6) is tight

Here, we provide a combination (\rho, \mathcal{D}) for which L(h_{\rho}^{\mathrm{MV}}) is arbitrarily close to the bound.

Consider, for each feature x, that exactly a (1-\epsilon) fraction of classifiers predict the correct label, and that the remaining \epsilon fraction of classifiers all predict the same incorrect label. In this case, L(h_{\rho}^{\mathrm{MV}}) = 0, \mathbb{E}_{\rho}[L(h)] = \epsilon, and \mathbb{E}_{\rho^{2}}[D(h,h')] = 2\epsilon(1-\epsilon). Hence, the upper bound (6) equals \frac{4(K-1)}{K}\epsilon^{2}, which can be arbitrarily close to 0.

Appendix B Proofs of our main results

In this section, we provide proofs for our main results.

B.1 Proof of Theorem 1

We start with the following lemma, which shows the concentration of a linear combination of W_{\rho}^{2} and \mathds{1}(W_{\rho} > 1/2).

Lemma 2.

For sampled data points \{(X_{i}, Y_{i})\}_{i=1}^{m} \sim \mathcal{D}, define Z_{2} := \sum_{i=1}^{m} W_{\rho}^{2}(X_{i}, Y_{i}) and Z_{0} := \sum_{i=1}^{m}\mathds{1}(W_{\rho}(X_{i}, Y_{i}) > 1/2). The ensemble \rho is \eta-polarized with probability at least 1-\delta if

\frac{1}{m}(\eta Z_{2} - Z_{0}) > \sqrt{\frac{\max\{\frac{3\eta}{4},1\}}{2m}\log\frac{1}{\delta}}.   (20)
Proof.

Let Z_{2i} = W_{\rho}^{2}(X_{i}, Y_{i}) and Z_{0i} = \mathds{1}(W_{\rho}(X_{i}, Y_{i}) > 1/2). Observe that \eta Z_{2i} - Z_{0i} always takes a value in [\frac{\eta}{4}-1, \max\{\frac{\eta}{4}, \eta-1\}], since W_{\rho}(X_{i}, Y_{i}) \in [0,1]. This implies that the \eta Z_{2i} - Z_{0i} are i.i.d. sub-Gaussian random variables with parameter \sigma = \max\{\frac{3\eta}{4},1\}/2.
Letting A_{2} = \mathbb{E}[\eta W_{\rho}^{2} - \mathds{1}(W_{\rho} > 1/2)] and using Hoeffding's inequality, we obtain

\frac{1}{m}(\eta Z_{2} - Z_{0}) - A_{2} \leq \sqrt{\frac{\max\{\frac{3\eta}{4},1\}}{2m}\log\frac{1}{\delta}}

with probability at least 1-\delta.

Therefore, \rho is \eta-polarized with probability at least 1-\delta if

\frac{1}{m}(\eta Z_{2} - Z_{0}) > \sqrt{\frac{\max\{\frac{3\eta}{4},1\}}{2m}\log\frac{1}{\delta}}. ∎

Now we use Lemma 2 to prove Theorem 1.

Proof of Theorem 1.

Observe that S = \frac{1}{m}Z_{2}, P = \frac{1}{m}Z_{0}, and thus \frac{1}{m}(\eta Z_{2} - Z_{0}) = \eta S - P. For \eta \geq \frac{4}{3}, the lower bound in Lemma 2 is simply \sqrt{\frac{3\eta}{8m}\log\frac{1}{\delta}}, and inequality (20) can be viewed as a quadratic inequality in \sqrt{\eta}. From the quadratic formula, we know that

\text{if}\quad \sqrt{\eta} > \frac{\sqrt{\frac{3}{8m}\log\frac{1}{\delta}} + \sqrt{\frac{3}{8m}\log\frac{1}{\delta} + 4SP}}{2S}, \quad\text{then}\quad \eta S - P - \sqrt{\frac{3\eta}{8m}\log\frac{1}{\delta}} > 0.

Putting this together with Lemma 2 proves the theorem:

\eta \geq \max\left\{\frac{4}{3}, \left(\frac{\sqrt{\frac{3}{8m}\log\frac{1}{\delta}} + \sqrt{\frac{3}{8m}\log\frac{1}{\delta} + 4SP}}{2S}\right)^{2}\right\}   (21)
\quad\Rightarrow\quad \eta S - P > \sqrt{\frac{3\eta}{8m}\log\frac{1}{\delta}} \quad\overset{\text{Lemma 2}}{\Rightarrow}\quad \rho \text{ is } \eta\text{-polarized w.p. } 1-\delta,

and thus the polarization \eta_{\rho}, the smallest \eta such that \rho is \eta-polarized, is upper bounded by the right-hand side of inequality (21). ∎

B.2 Proof of Theorem 2

We start by proving the following lemma, which relates the error rate of the majority vote, L(h_{\rho}^{\mathrm{MV}}), to the point-wise error rate, W_{\rho}, using Markov's inequality. In general, L(h_{\rho}^{\mathrm{MV}}) \leq \mathbb{P}_{\mathcal{D}}(W_{\rho} \geq 1/2) holds for any ensemble \rho. We prove a tighter version of this. The difference between the two can be non-negligible when dealing with an ensemble with a finite number of classifiers. Refer to Appendix A.1 and Definition 4 for more details regarding this difference and tie-free ensembles.

Lemma 3.

For a tie-free ensemble \rho, we have the inequality L(h_{\rho}^{\mathrm{MV}}) \leq \mathbb{P}_{\mathcal{D}}(W_{\rho} > 1/2).

Proof.

For a given feature x, W_{\rho} \leq 1/2 implies that at least a \rho-weighted half of the classifiers output the true label. Since the ensemble \rho is tie-free, h_{\rho}^{\mathrm{MV}} outputs the true label whenever W_{\rho} \leq 1/2. Therefore, \{(x, y) \mid W_{\rho}(x, y) \leq 1/2\} \subset \{(x, y) \mid h_{\rho}^{\mathrm{MV}}(x) = y\}. Applying \mathbb{P}_{\mathcal{D}} to both sides proves the lemma. ∎

The following lemma appears as Lemma 2 in [17]. This lemma draws the connection between the point-wise error rate, W_{\rho}, and the tandem loss, \mathbb{E}_{\rho^{2}}[L(h,h')].

Lemma 4.

The equality \mathbb{E}_{\mathcal{D}}[W_{\rho}^{2}] = \mathbb{E}_{\rho^{2}}[L(h,h')] holds.

The next lemma appears as Lemma 4 in [21]. This lemma provides an upper bound on the tandem loss, \mathbb{E}_{\rho^{2}}[L(h,h')], in terms of the average error rate, \mathbb{E}_{\rho}[L(h)], and the average disagreement, \mathbb{E}_{\rho^{2}}[D(h,h')].

Lemma 5.

For the K-class problem,

\mathbb{E}_{\rho^{2}}[L(h,h')] \leq \frac{2(K-1)}{K}\left(\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right).

Now we use these results to prove Theorem 2.

Proof of Theorem 2.

Putting Lemmas 3, 4, and 5 and the definition of the polarization together proves the theorem:

L(h_{\rho}^{\mathrm{MV}}) \underset{\text{Lemma 3}}{\leq} \mathbb{P}_{\mathcal{D}}(W_{\rho} > 1/2) \underset{\text{polarization}}{\leq} \eta_{\rho}\,\mathbb{E}_{\mathcal{D}}[W_{\rho}^{2}]
\underset{\text{Lemma 4}}{=} \eta_{\rho}\,\mathbb{E}_{\rho^{2}}[L(h,h')] \underset{\text{Lemma 5}}{\leq} \frac{2\eta_{\rho}(K-1)}{K}\left(\mathbb{E}_{\rho}[L(h)] - \frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h')]\right). ∎

B.3 Proof of Theorem 3

We start with a lemma which is a corollary of Newton’s inequality.

Lemma 6.

For any collection of probabilities p_{1}, \dots, p_{n}, the following inequality holds:

\sum_{1\leq i<j\leq n} p_{i} p_{j} \,\leq\, \frac{n-1}{2n}\left(\sum_{i=1}^{n} p_{i}\right)^{2}.
Proof.

Newton’s inequality states that

\frac{e_{2}}{\binom{n}{2}} \leq \left(\frac{e_{1}}{n}\right)^{2}, \qquad\text{where}\quad e_{1} = \sum_{i=1}^{n} p_{i} \quad\text{and}\quad e_{2} = \sum_{1\leq i<j\leq n} p_{i}\,p_{j}.

Rearranging the terms gives the lemma. ∎

Now we use this and the previous lemmas to prove Theorem 3.

Proof of Theorem 3.

From Lemma 3, Lemma 4, and the definition of an \eta-polarized ensemble, we have the following relationship between L(h_{\rho}^{\mathrm{MV}}) and \mathbb{E}_{\rho^{2}}[L(h,h')]:

L(h_{\rho}^{\mathrm{MV}}) \underset{\text{Lemma 3}}{\leq} \mathbb{P}_{\mathcal{D}}(W_{\rho} > 1/2) \underset{\eta\text{-polarized}}{\leq} \eta\,\mathbb{E}_{\mathcal{D}}[W_{\rho}^{2}] \underset{\text{Lemma 4}}{=} \eta\,\mathbb{E}_{\rho^{2}}[L(h,h')].   (22)

From this, it suffices to prove that $\eta\,\mathbb{E}_{\rho^{2}}[L(h,h^{\prime})]$ is at most the upper bound stated in the theorem. First, observe the following decomposition of $\mathbb{E}_{\rho^{2}}[L(h,h^{\prime})]$:

\[
\mathbb{E}_{\rho^{2}}[L(h,h^{\prime})]
=\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho}(h(X)\neq Y)^{2}\right]
=\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho}(h(X)\neq Y)-\mathbb{P}_{\rho^{2}}(h(X)\neq Y,\,h^{\prime}(X)=Y)\right],
\tag{23}
\]

where the second equality uses that $h$ and $h^{\prime}$ are drawn independently from $\rho$, so that $\mathbb{P}_{\rho^{2}}(h(X)\neq Y,\,h^{\prime}(X)=Y)=\mathbb{P}_{\rho}(h(X)\neq Y)\bigl(1-\mathbb{P}_{\rho}(h(X)\neq Y)\bigr)$.

For any predictor mapping into $K$ classes, let $y$ denote the true label for an input $x$. Now we derive a lower bound on $\mathbb{P}_{\rho^{2}}(h(x)\neq y,\,h^{\prime}(x)=y)$ using the following decomposition of $\mathbb{P}_{\rho^{2}}(h(x)\neq h^{\prime}(x))$:

\[
\frac{1}{2}\,\mathbb{P}_{\rho^{2}}(h(x)\neq h^{\prime}(x))
=\mathbb{P}_{\rho^{2}}(h(x)\neq y,\,h^{\prime}(x)=y)
+\sum_{\begin{subarray}{c}i\notin A(x)\\ j\in A(x)\setminus\{y\}\end{subarray}}p_{i}p_{j}
+\sum_{\begin{subarray}{c}i,j\in A(x)\setminus\{y\}\\ i<j\end{subarray}}p_{i}p_{j}
+\sum_{\begin{subarray}{c}i,j\notin A(x)\\ i<j\end{subarray}}p_{i}p_{j},
\]

where $p_{i}:=p_{i}(x)=\mathbb{P}_{\rho}(h(x)=i)$. We let $\Delta_{x}:=\mathbb{P}_{\rho}(h(x)\notin A(x))$ and apply Lemma 6 to the last two terms:

\[
\begin{aligned}
\frac{1}{2}\,\mathbb{P}_{\rho^{2}}(h(x)\neq h^{\prime}(x))
&=\mathbb{P}_{\rho^{2}}(h(x)\neq y,\,h^{\prime}(x)=y)
+\Delta_{x}(1-p_{y}-\Delta_{x})
+\sum_{\begin{subarray}{c}i,j\in A(x)\setminus\{y\}\\ i<j\end{subarray}}p_{i}p_{j}
+\sum_{\begin{subarray}{c}i,j\notin A(x)\\ i<j\end{subarray}}p_{i}p_{j}\\
&\underset{\text{Lemma 6}}{\leq}\mathbb{P}_{\rho^{2}}(h(x)\neq y,\,h^{\prime}(x)=y)
+\Delta_{x}(1-p_{y}-\Delta_{x})
+\frac{M-2}{2(M-1)}\,(1-p_{y}-\Delta_{x})^{2}
+\frac{K-M-1}{2(K-M)}\,\Delta_{x}^{2}.
\end{aligned}
\]

Rearranging the terms and plugging in $1-p_{y}=\mathbb{P}_{\rho}(h(x)\neq y)$ gives

\[
\begin{aligned}
\mathbb{P}_{\rho^{2}}(h(x)\neq y,\,h^{\prime}(x)=y)
&\geq\frac{1}{2}\,\mathbb{P}_{\rho^{2}}(h(x)\neq h^{\prime}(x))
-\frac{\Delta_{x}}{M-1}\,\mathbb{P}_{\rho}(h(x)\neq y)
-\frac{M-2}{2(M-1)}\,\mathbb{P}_{\rho}(h(x)\neq y)^{2}
+\frac{K-1}{2(K-M)(M-1)}\,\Delta_{x}^{2}\\
&\geq\frac{1}{2}\,\mathbb{P}_{\rho^{2}}(h(x)\neq h^{\prime}(x))
-\frac{\Delta_{x}}{M-1}\,\mathbb{P}_{\rho}(h(x)\neq y)
-\frac{M-2}{2(M-1)}\,\mathbb{P}_{\rho}(h(x)\neq y)^{2}\\
&\geq\frac{1}{2}\,\mathbb{P}_{\rho^{2}}(h(x)\neq h^{\prime}(x))
-\frac{\Delta}{M-1}\,\mathbb{P}_{\rho}(h(x)\neq y)
-\frac{M-2}{2(M-1)}\,\mathbb{P}_{\rho}(h(x)\neq y)^{2},
\end{aligned}
\]

where the second inequality drops the nonnegative term $\frac{K-1}{2(K-M)(M-1)}\Delta_{x}^{2}$, and the last inequality comes from the condition $\Delta_{x}:=\mathbb{P}_{\rho}(h(x)\notin A(x))\leq\Delta$. Putting this together with the equality (23) gives

\[
\begin{aligned}
\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho}(h(X)\neq Y)^{2}\right]
&\leq\left(1+\frac{\Delta}{M-1}\right)\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho}(h(X)\neq Y)\right]
-\frac{1}{2}\,\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho^{2}}(h(X)\neq h^{\prime}(X))\right]\\
&\quad+\frac{M-2}{2(M-1)}\,\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho}(h(X)\neq Y)^{2}\right],
\end{aligned}
\]

which implies

\[
\begin{aligned}
\mathbb{E}_{\rho^{2}}[L(h,h^{\prime})]
&=\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho}(h(X)\neq Y)^{2}\right]\\
&\leq\frac{2(M-1)}{M}\left[\left(1+\frac{\Delta}{M-1}\right)\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho}(h(X)\neq Y)\right]-\frac{1}{2}\,\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho^{2}}(h(X)\neq h^{\prime}(X))\right]\right]\\
&=\frac{2(M-1)}{M}\left[\left(1+\frac{\Delta}{M-1}\right)\mathbb{E}_{\rho}[L(h)]-\frac{1}{2}\,\mathbb{E}_{\rho^{2}}[D(h,h^{\prime})]\right].
\end{aligned}
\]

Combining this with inequality (22) concludes the proof. ∎
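As a sanity check on the constants (taking $M=K$, so that $A(x)$ contains all $K$ labels and one may take $\Delta=0$), the bound above reduces to

\[
\mathbb{E}_{\rho^{2}}[L(h,h^{\prime})]\;\leq\;\frac{2(K-1)}{K}\left[\mathbb{E}_{\rho}[L(h)]-\frac{1}{2}\,\mathbb{E}_{\rho^{2}}[D(h,h^{\prime})]\right],
\]

which is exactly the bound of Lemma 5, so in this case Theorem 3 recovers Theorem 2.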

B.4 Proof of Theorem 4 and Corollary 2

First, we prove Theorem 4 by decomposing the point-wise disagreement between constituent classifiers.

Proof of Theorem 4.

The following decomposition of $\mathbb{P}_{\rho^{2}}(h(x)\neq h^{\prime}(x))$ holds:

\[
\frac{1}{2}\,\mathbb{P}_{\rho^{2}}(h(x)\neq h^{\prime}(x))
=\mathbb{P}_{\rho^{2}}(h(x)\neq y,\,h^{\prime}(x)=y)
+\frac{1}{2}\,\mathbb{P}_{\rho^{2}}(h(x)\neq y,\,h^{\prime}(x)\neq y,\,h(x)\neq h^{\prime}(x)).
\]

Applying $\mathbb{E}_{\mathcal{D}}$ to both sides and using the condition (11), we obtain

\[
\frac{1}{2}\,\mathbb{E}_{\mathcal{D}}[\mathbb{P}_{\rho^{2}}(h(X)\neq h^{\prime}(X))]
\leq\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho^{2}}(h(X)\neq Y,\,h^{\prime}(X)=Y)\right]
+\varepsilon\,\mathbb{E}_{\mathcal{D}}[\mathbb{P}_{\rho}(h(X)\neq Y)].
\]

The left-hand side equals $\frac{1}{2}\mathbb{E}_{\rho^{2}}[D(h,h^{\prime})]$, and the second term on the right-hand side is simply $\varepsilon\,\mathbb{E}_{\rho}[L(h)]$. Hence, the inequality above can be rephrased as follows:

\[
\frac{1}{2}\,\mathbb{E}_{\rho^{2}}[D(h,h^{\prime})]-\varepsilon\,\mathbb{E}_{\rho}[L(h)]
\leq\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho^{2}}(h(X)\neq Y,\,h^{\prime}(X)=Y)\right].
\tag{24}
\]

Putting this together with the inequality (22) and the equality (23) gives

\[
\begin{aligned}
L(h_{\rho}^{\mathrm{MV}})
&\underset{\text{Ineq. (22)}}{\leq}\eta\,\mathbb{E}_{\rho^{2}}[L(h,h^{\prime})]
\underset{\text{Eq. (23)}}{=}\eta\,\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho}(h(X)\neq Y)-\mathbb{P}_{\rho^{2}}(h(X)\neq Y,\,h^{\prime}(X)=Y)\right]\\
&=\eta\left[\mathbb{E}_{\rho}[L(h)]-\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{P}_{\rho^{2}}(h(X)\neq Y,\,h^{\prime}(X)=Y)\right]\right]\\
&\underset{\text{Ineq. (24)}}{\leq}\eta\left[(1+\varepsilon)\,\mathbb{E}_{\rho}[L(h)]-\frac{1}{2}\,\mathbb{E}_{\rho^{2}}[D(h,h^{\prime})]\right].
\end{aligned}
\]
∎

Next, we use Lemma 6 to prove Corollary 2.

Proof of Corollary 2.

Let $p_{i}:=\mathbb{P}_{\rho}(h(x)=i)$ for $i\in[K]$, and let $y=K$ be the true label, without loss of generality. Then, we observe

\[
\mathbb{P}_{\rho^{2}}\!\left(h(X)\neq Y,\,h^{\prime}(X)\neq Y,\,h(X)\neq h^{\prime}(X)\right)=\sum_{\begin{subarray}{c}i,j=1\\ i\neq j\end{subarray}}^{K-1}p_{i}p_{j}
\quad\text{and}\quad
\mathbb{P}_{\rho}\!\left(h(X)\neq Y\right)=\sum_{i=1}^{K-1}p_{i}.
\]

Lemma 6 gives us the following:

\[
\frac{\sum_{1\leq i\neq j\leq K-1}p_{i}p_{j}}{2\sum_{1\leq i\leq K-1}p_{i}}
\;\leq\;\frac{\sum_{1\leq i\neq j\leq K-1}p_{i}p_{j}}{2\left(\sum_{1\leq i\leq K-1}p_{i}\right)^{2}}
\;\leq\;\frac{\sum_{1\leq i<j\leq K-1}p_{i}p_{j}}{\left(\sum_{1\leq i\leq K-1}p_{i}\right)^{2}}
\underset{\text{Lemma 6}}{\leq}\frac{K-2}{2(K-1)},
\]

where the first inequality uses the fact that $\sum_{i=1}^{K-1}p_{i}\leq 1$. Thus, $\varepsilon=\frac{K-2}{2(K-1)}$ satisfies the condition (11), and the result follows from Theorem 4. ∎
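As a quick check of the constant, in the binary case $K=2$ we get $\varepsilon=\frac{K-2}{2(K-1)}=0$, and the bound of Theorem 4 reduces to

\[
L(h_{\rho}^{\mathrm{MV}})\;\leq\;\eta\left[\mathbb{E}_{\rho}[L(h)]-\frac{1}{2}\,\mathbb{E}_{\rho^{2}}[D(h,h^{\prime})]\right],
\]

which matches the bound of Theorem 2 for an $\eta$-polarized ensemble, since $2(K-1)/K=1$ when $K=2$.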

B.5 Invariance principle of $U$-statistics

In this subsection, we state the invariance principle for $U$-statistics, which plays a central role in the proof of Theorem 5. We note that this is a special case of an approximation of random walks (Theorem 23.14 in [10]) combined with the functional central limit theorem (Donsker’s theorem). Here, $\mathcal{D}[0,1]$ is the Skorokhod space on $[0,1]$, i.e., the space of all real-valued right-continuous functions on $[0,1]$ with left limits, equipped with the Skorokhod metric/topology (see Section 14 in [2]).

Theorem 7 (Theorem 5.2.1 in [11]).

Define a $U$-statistic $U_{k}=\binom{k}{2}^{-1}\sum_{1\leq i<j\leq k}\Phi(h_{i},h_{j})$, the expectation of the kernel $\Phi$ as $\Phi_{0}=\mathbb{E}_{(h,h^{\prime})\sim\rho^{2}}\Phi(h,h^{\prime})$, and the first-coordinate variance $\sigma_{1}^{2}=\mathsf{Var}_{h\sim\rho}(g_{1}(h))$, where $g_{1}(h)=\mathbb{E}_{h^{\prime}\sim\rho}\Phi(h^{\prime},h)$. Let $\xi_{n}=(\xi_{n}(t),\,t\in[0,1])$, where

\[
\xi_{n}\!\left(\frac{k}{n}\right)=\frac{k\,(U_{k}-\Phi_{0})}{2\sqrt{n\sigma_{1}^{2}}}\qquad\text{for}\quad k=0,1,\dots,n-1,
\]

and $\xi_{n}(t)=\xi_{n}([nt]/n)$, with $[x]$ denoting the greatest integer less than or equal to $x$. Then, $\xi_{n}$ converges weakly in $\mathcal{D}[0,1]$ to a standard Wiener process as $n\to\infty$.
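To make the scaling in Theorem 7 concrete, the following minimal NumPy sketch (not code from the paper; the toy ensemble model, sample sizes, and accuracy value are illustrative assumptions) builds the path $\xi_{n}$ for an ensemble in which each classifier is represented by its 0/1 correctness pattern on $m$ test points and the kernel $\Phi$ is the pairwise disagreement rate; $\Phi_{0}$ and $\sigma_{1}^{2}$ are estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensemble: each "classifier" is its 0/1 correctness pattern on m test points,
# correct on each point independently with probability acc (hypothetical numbers).
n, m, acc = 500, 200, 0.9

def sample_classifiers(k):
    return rng.random((k, m)) < acc                      # shape (k, m), boolean

def g1_estimate(h, ref):
    # g_1(h) = E_{h'}[Phi(h', h)], estimated against a reference sample of classifiers
    return np.mean(ref != h)

# Monte Carlo estimates of Phi_0 and sigma_1^2 (kernel Phi = disagreement rate);
# sigma_1^2 carries a small amount of estimator noise, which is fine for a sketch.
ref = sample_classifiers(500)
g1 = np.array([g1_estimate(h, ref) for h in sample_classifiers(1500)])
phi0, sigma1_sq = g1.mean(), g1.var()

# Build the rescaled U-statistic path xi_n(k/n) = k (U_k - Phi_0) / (2 sqrt(n sigma_1^2)).
H = sample_classifiers(n)
xi = np.zeros(n)                                         # xi[k] ~ xi_n(k/n); xi[0] = xi[1] = 0
pair_sum = 0.0                                           # running sum of Phi over all pairs seen
for k in range(2, n):
    new = H[k - 1]                                       # the k-th classifier
    pair_sum += float(np.sum(np.mean(H[:k - 1] != new, axis=1)))
    U_k = pair_sum / (k * (k - 1) / 2.0)                 # average over (k choose 2) pairs
    xi[k] = k * (U_k - phi0) / (2.0 * np.sqrt(n * sigma1_sq))

print(xi[n // 4], xi[n // 2], xi[-1])
```

For large $n$, plotting `xi` against $t=k/n$ should resemble a sample path of a standard Wiener process, which is the content of the theorem.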

Appendix C Details on our empirical results

In this section, we provide additional details on our empirical results.

C.1 Trained classifiers

On the CIFAR-10 [13] training set of size 50,000, the following models were trained for 100 epochs with an initial learning rate of 0.05. For models trained with learning rate decay, the learning rate was reduced to 0.005 after epoch 50 and to 0.0005 after epoch 75; a minimal sketch of this schedule is given after the list below. For each of the following models, 5 classifiers were trained per hyperparameter combination; the five classifiers differ in weight initialization and in the randomized batches used during training.

  • ResNet18, every combination (width, batch size) of
    - Width: 4, 8, 16, 32, 64, 128
    - Batch size: 16, 128, 256, 1024, with learning rate decay;
      additional batch sizes of 64, 180, 364 without learning rate decay

  • ResNet50 and ResNet101, every combination (width, batch size) of
    - Width: 8, 16
    - Batch size: 64, 256, without learning rate decay

  • VGG11, every combination (width, batch size) of
    - Width: 16, 64
    - Batch size: 64, 256, without learning rate decay

  • DenseNet40, every combination (width, batch size) of
    - Width: 5, 12, 40
    - Batch size: 64, 256, without learning rate decay
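The step-decay schedule above corresponds to multiplying the learning rate by 0.1 at epochs 50 and 75. Below is a minimal PyTorch sketch of that schedule (not the training code used for the paper; the model, batch, and loss are placeholders, and momentum/weight decay are omitted since they are not specified for these runs):

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model; the actual experiments use the ResNet/VGG/DenseNet variants
# listed above on CIFAR-10.
model = torch.nn.Linear(3 * 32 * 32, 10)

# Initial learning rate 0.05, multiplied by 0.1 at epoch 50 (-> 0.005) and again
# at epoch 75 (-> 0.0005), matching the step-decay schedule described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
scheduler = MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)

for epoch in range(100):
    # One training pass over the CIFAR-10 loader would go here; a random batch
    # stands in for it so the sketch runs end to end.
    x = torch.randn(16, 3 * 32 * 32)
    loss = model(x).square().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    # print(epoch, scheduler.get_last_lr())  # 0.05, then 0.005, then 0.0005
```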

For the models in Figure 4, more than 5 classifiers were trained. The classifiers differ in weight initialization and in the randomized batches used during training.

  • ResNet18 on CIFAR-10, width 16 and batch size 64, without learning rate decay (20 classifiers).
    The models below are trained with learning rate 0.05, momentum 0.9, and weight decay 5e-4,
    with cosine annealing.

  • MobileNet on MNIST, batch size 128 (10 classifiers)

  • ResNet18 on FMNIST, width 48 and batch size 128 (10 classifiers)

  • ResNet18 on KMNIST, every combination of the widths and batch sizes below (8 classifiers each)
    - Width: 48, 64
    - Batch size: 32, 64, 128

C.2 Majority vote and tie-free weights

For an ensemble with $N$ classifiers, we generated $N$ uniformly distributed random numbers $e_{1},\dots,e_{N}\in[0,0.0001]$. We then used $(\frac{1}{N}+e_{1},\dots,\frac{1}{N}+e_{N})$, after normalization, as the weights for the classifiers. This makes the ensemble tie-free (almost surely).
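Below is a minimal NumPy sketch of this weight construction and the resulting weighted majority vote (the function names and the small example are illustrative, not the paper's code):

```python
import numpy as np

def tie_free_weights(n_classifiers, eps=1e-4, seed=0):
    """Nearly uniform, randomly perturbed weights; the continuous perturbation
    makes exact majority-vote ties a probability-zero event."""
    rng = np.random.default_rng(seed)
    e = rng.uniform(0.0, eps, size=n_classifiers)   # e_1, ..., e_N in [0, 0.0001]
    w = 1.0 / n_classifiers + e                     # 1/N + e_i
    return w / w.sum()                              # normalize to sum to 1

def weighted_majority_vote(predictions, weights):
    """predictions: (n_classifiers, n_points) array of integer class labels."""
    n_classes = int(predictions.max()) + 1
    votes = np.zeros((n_classes, predictions.shape[1]))
    for w, preds in zip(weights, predictions):
        votes[preds, np.arange(predictions.shape[1])] += w
    return votes.argmax(axis=0)

# Example: 5 classifiers, 3 test points with labels in {0, 1, 2}.
preds = np.array([[0, 1, 2],
                  [0, 2, 2],
                  [1, 1, 2],
                  [0, 1, 0],
                  [2, 1, 2]])
w = tie_free_weights(len(preds))
print(weighted_majority_vote(preds, w))   # [0 1 2]
```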