
Selective Ensembling for Prediction Consistency

Emily Black and Matthew Fredrikson
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
emilybla@andrew.cmu.edu

Appendix A Proofs

A.1 Proof of Theorem 3.1

Figure 1: Intuitive illustration of how two models which predict identical classification labels can have arbitrary gradients. To show this, given a binary classifier $H$ and an arbitrary function $g$, we construct a classifier $H'$ that predicts the same labels as $H$, yet has gradients equal to those of $g$ almost everywhere. We formally state this result in Theorem 3.1.

Theorem 3.1.

Let $H : \mathbf{X} \rightarrow \{-1,1\}$, $H = \mathrm{sign}(h)$, be a binary classifier and $g : \mathbf{X} \rightarrow \mathbb{R}$ be an unrelated function that is bounded from above and below, continuous, and piecewise differentiable. Then there exists another binary classifier $\hat{H} = \mathrm{sign}(\hat{h})$ such that for any $\epsilon > 0$,

\[
\forall x \in \mathbf{X}.\qquad \text{1.}~~\hat{H}(x) = H(x) \qquad \text{2.}~~\inf_{x' : H(x') \neq H(x)}\big\{\|x - x'\|\big\} > \epsilon/2 ~~\Longrightarrow~~ \nabla\hat{h}(x) = \nabla g(x)
\]
Proof.

We partition $\mathbf{X}$ into regions $\{I_1, \ldots, I_k\}$ determined by the decision boundaries of $H$. That is, each $I_i$ represents a maximal contiguous region for which each $x \in I_i$ receives the same label from $H$.

Recall that we are given a function $g : \mathbf{X} \rightarrow \mathbb{R}$ which is bounded from above and below. We create a set of functions $\hat{g}_{I_i} : I_i \rightarrow \mathbb{R}$ such that

\[
\hat{g}_{I_i}(x) = \begin{cases} g(x) - \inf_{x} g(x) + c & \text{if } H(I_i) = 1 \\ g(x) - \sup_{x} g(x) - c & \text{if } H(I_i) = -1 \end{cases}
\]

where $c$ is some small constant greater than zero. Additionally, let $d(x)$ be the $\ell_2$ distance from $x$ to the nearest decision boundary of $h$, i.e., $d(x) = \inf_{x' : H(x') \neq H(x)}\big\{\|x - x'\|\big\}$. Then, we define $\hat{h}$ to be:

\[
\hat{h}(x) = \begin{cases} \hat{g}_{I_i}(x) & \text{for } x \in I_i, \text{ if } d(x) > \frac{\epsilon}{2} \\ \hat{g}_{I_i}(x) \cdot \frac{2d(x)}{\epsilon} & \text{for } x \in I_i, \text{ if } d(x) \leq \frac{\epsilon}{2} \end{cases}
\]

As described above, we define $\hat{H} = \mathrm{sign}(\hat{h})$. First, we show that $\hat{H}(x) = H(x)$ for all $x \in \mathbf{X}$. Without loss of generality, consider some $I_i$ where $H(x) = 1$ for every $x \in I_i$. We first consider the case where $d(x) > \frac{\epsilon}{2}$.

By construction, for $x \in I_i$, $\hat{H}(x) = \mathrm{sign}(\hat{h}(x)) = \mathrm{sign}(\hat{g}_{I_i}(x)) = \mathrm{sign}(g(x) - \inf_{x} g(x) + c)$. By definition of the infimum, $g(x) - \inf_{x} g(x) \geq 0$, and thus $\mathrm{sign}(g(x) - \inf_{x} g(x) + c) = 1$, so $\hat{H}(x) = 1 = H(x)$.

Note that in the case where $d(x) \leq \frac{\epsilon}{2}$, we can follow the same argument, as multiplication by a positive constant does not affect the sign. A symmetric argument handles the case where $H(x) = -1$ for $x \in I_i$; thus, $\hat{H}(x) = H(x)$ for all $x \in \mathbf{X}$.

Second, we show that $\nabla\hat{h}(x) = \nabla g(x)$ for all $x$ where $d(x) > \frac{\epsilon}{2}$. Consider the case where $H(x) = 1$. By construction, $\hat{h}(x) = \hat{g}_{I_i}(x) = g(x) - \inf_{x} g(x) + c$. Note that the infimum and $c$ are constants, so their gradients are zero. Thus, $\nabla\hat{h}(x) = \nabla g(x)$. A symmetric argument holds for the case where $H(x) = -1$.

It remains to prove that $\hat{h}$ is continuous and piecewise differentiable, in order to be realizable as a ReLU network. By assumption, $g$ is piecewise differentiable, which means that each $\hat{g}_{I_i}$ is piecewise differentiable as well, as is $\hat{g}_{I_i}(x) \cdot \frac{2d(x)}{\epsilon}$. Thus, $\hat{h}$ is piecewise differentiable. To see that $\hat{h}$ is continuous, consider the case where $d(x) = \epsilon/2$ for some $x$. Then $\hat{g}_{I_i}(x) \cdot \frac{2d(x)}{\epsilon} = \hat{g}_{I_i}(x) \cdot \frac{\epsilon}{\epsilon} = \hat{g}_{I_i}(x)$, so the two cases in the definition of $\hat{h}$ agree. Additionally, consider the case where $d(x) = 0$, i.e., $x$ is on a decision boundary of $h$, between two regions $I_i, I_j$. Then $\hat{h}(x) = \hat{g}_{I_i}(x) \cdot \frac{2d(x)}{\epsilon} = \hat{g}_{I_i}(x) \cdot 0 = 0 = \hat{g}_{I_j}(x) \cdot 0$. This shows that the piecewise components of $\hat{h}$ come to the same value at their intersection. Further, each piecewise component of $\hat{h}$ is equal to some continuous function, as $g(x)$ is continuous by assumption. Thus, $\hat{h}$ is continuous, and we conclude our proof. ∎

We include a visual intuition of the proof in Figure 1.
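To make the construction concrete, the following is a minimal numerical sketch of $\hat{h}$ for a one-dimensional input space. It is an illustration only: the function name, the grid used to approximate the infimum and supremum of $g$, and the default constants are our own assumptions, not part of the formal result.

```python
import numpy as np

def make_h_hat(h, g, boundaries, eps, c=1e-3, grid=None):
    """Sketch of the Theorem 3.1 construction for 1-D inputs (illustrative only)."""
    # Approximate the global inf/sup of g on a finite grid; the proof only
    # requires that g is bounded, so these quantities exist.
    if grid is None:
        grid = np.linspace(-10.0, 10.0, 10001)
    g_vals = np.array([g(x) for x in grid])
    g_inf, g_sup = g_vals.min(), g_vals.max()
    boundaries = np.asarray(boundaries, dtype=float)

    def h_hat(x):
        # d(x): distance from x to the nearest decision boundary of h.
        d = np.abs(boundaries - x).min()
        # Shift g so its sign agrees with H(x) = sign(h(x)) on this region.
        if h(x) >= 0:
            shifted = g(x) - g_inf + c
        else:
            shifted = g(x) - g_sup - c
        # Within eps/2 of a boundary, scale toward zero so h_hat stays continuous.
        return shifted if d > eps / 2 else shifted * (2 * d / eps)

    return h_hat

# Example: H = sign(x) (boundary at 0) and g(x) = sin(x). Away from the boundary,
# h_hat has the same gradient as g, yet always predicts the same label as H.
h_hat = make_h_hat(h=lambda x: x, g=np.sin, boundaries=[0.0], eps=0.2)
```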

A.2 Proof of Theorem 4.1

Theorem 4.1.

Let $\mathcal{P}$ be a learning pipeline, and let $\mathcal{S}$ be a distribution over random states. Further, let $g_{\mathcal{P},\mathcal{S}}$ be the mode predictor, let $\hat{g}_n(\mathcal{P}, S)$ for $S \sim \mathcal{S}^n$ be a selective ensemble, and let $\alpha \geq 0$. Then,

\[
\forall x \in \mathbf{X}.\quad \mathop{\mathbf{Pr}}_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \mathrel{\neq^{\mathtt{ABS}}} g_{\mathcal{P},\mathcal{S}}(x)\Big] \leq \alpha
\]
Proof.

$\hat{g}_n(\mathcal{P}, S)$ is an ensemble of $n$ models. By the definition of the selective prediction algorithm (Algorithm \ref{prediction_algo} in the main paper), $\hat{g}_n(\mathcal{P}, S)$ gathers a vector of class counts of the predictions for $x$ from each model in the ensemble. Let the class with the highest count be $c_A$, with count $n_A$, and the class with the second-highest count be $c_B$, with count $n_B$. $\hat{g}_n(\mathcal{P}, S)$ runs a two-sided hypothesis test to ensure that $\Pr[n_A \sim \mathrm{Binomial}(n_A + n_B, 0.5)] < \alpha$, i.e., that $c_A$ is the true mode prediction over $\mathcal{S}$. See that

\begin{align}
&\Pr\Big[g_{\mathcal{P},\mathcal{S}}(x) \neq c_A ~~\land~~ \hat{g}_n(\mathcal{P}, S;\alpha, x) = c_A\Big] \\
=~&\Pr\Big[g_{\mathcal{P},\mathcal{S}}(x) \neq c_A\Big] \cdot \Pr\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \neq \mathtt{ABSTAIN} ~\big|~ g_{\mathcal{P},\mathcal{S}}(x) \neq c_A\Big] \\
\leq~&\Pr\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \neq \mathtt{ABSTAIN} ~\big|~ g_{\mathcal{P},\mathcal{S}}(x) \neq c_A\Big] \\
\leq~&\alpha \qquad \text{by the validity of the two-sided test [hung2019rank]}
\end{align}

Thus,

\[
\Pr\Big[g_{\mathcal{P},\mathcal{S}}(x) \neq c_A ~~\land~~ \hat{g}_n(\mathcal{P}, S;\alpha, x) = c_A\Big] \leq \alpha
\]
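As a concrete illustration of the abstention rule analyzed above, the following is a minimal sketch of selective ensemble prediction using a two-sided binomial test via scipy. The function name, the assumption that each constituent model exposes a `predict` method returning a class label, and the use of `scipy.stats.binomtest` are our own illustrative choices, not the authors' exact implementation.

```python
from collections import Counter
from scipy.stats import binomtest

ABSTAIN = "ABSTAIN"

def selective_predict(models, x, alpha):
    """Predict with a selective ensemble: return the plurality class only if a
    two-sided binomial test rejects a tie between the top two classes."""
    votes = Counter(m.predict(x) for m in models)
    ranked = votes.most_common()
    c_a, n_a = ranked[0]                           # plurality class and its count
    n_b = ranked[1][1] if len(ranked) > 1 else 0   # runner-up count
    # Under the null hypothesis, the top two classes are equally likely (p = 0.5).
    p_value = binomtest(n_a, n_a + n_b, 0.5, alternative="two-sided").pvalue
    return c_a if p_value < alpha else ABSTAIN
```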

A.3 Proof of Corollary 4.2

Corollary 4.2.

Let $\mathcal{P}$ be a learning pipeline, and let $\mathcal{S}$ be a distribution over random states. Further, let $g_{\mathcal{P},\mathcal{S}}$ be the mode predictor, and let $\hat{g}_n(\mathcal{P}, S)$ for $S \sim \mathcal{S}^n$ be a selective ensemble. Finally, let $\alpha \geq 0$, and let $\beta \geq 0$ be an upper bound on the expected abstention rate of $\hat{g}_n(\mathcal{P}, S)$. Then, the expected loss variance, $V(x)$, over inputs, $x$, is bounded by $\alpha + \beta$. That is,

\[
\mathop{\mathbb{E}}_{x \sim \mathcal{D}}\Big[V(x)\Big] = \mathop{\mathbb{E}}_{x \sim \mathcal{D}}\bigg[\Pr_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S; x) \neq g_{\mathcal{P},\mathcal{S}}(x)\Big]\bigg] \leq \alpha + \beta
\]
Proof.

Since $g_{\mathcal{P},\mathcal{S}}$ never abstains, we have by the law of total probability that

\begin{align*}
\Pr_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \neq g_{\mathcal{P},\mathcal{S}}(x)\Big] &= \Pr_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \mathrel{\neq^{\mathtt{ABS}}} g_{\mathcal{P},\mathcal{S}}(x) ~~\lor~~ \hat{g}_n(\mathcal{P}, S;\alpha, x) = \mathtt{ABSTAIN}\Big] \\
&= \Pr_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \mathrel{\neq^{\mathtt{ABS}}} g_{\mathcal{P},\mathcal{S}}(x)\Big] + \Pr_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) = \mathtt{ABSTAIN}\Big]
\end{align*}

By Theorem 4.1, we have that $\Pr_{S \sim \mathcal{S}^n}\big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \mathrel{\neq^{\mathtt{ABS}}} g_{\mathcal{P},\mathcal{S}}(x)\big] \leq \alpha$; thus

\[
\mathop{\mathbb{E}}_{x \sim \mathcal{D}}\bigg[\Pr_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \neq g_{\mathcal{P},\mathcal{S}}(x)\Big]\bigg] \leq \alpha + \mathop{\mathbb{E}}_{x \sim \mathcal{D}}\bigg[\Pr_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) = \mathtt{ABSTAIN}\Big]\bigg]
\]

Finally, since $\beta$ is an upper bound on the expected abstention rate of $\hat{g}_n(\mathcal{P}, S)$, we conclude that

\[
\mathop{\mathbb{E}}_{x \sim \mathcal{D}}\bigg[\Pr_{S \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S;\alpha, x) \neq g_{\mathcal{P},\mathcal{S}}(x)\Big]\bigg] \leq \alpha + \beta
\]

A.4 Proof of Corollary 4.3

Corollary 4.3.

Let $\mathcal{P}$ be a learning pipeline, and let $\mathcal{S}$ be a distribution over random states. Further, let $\hat{g}_n(\mathcal{P}, S)$ for $S \sim \mathcal{S}^n$ be a selective ensemble. Finally, let $\alpha \geq 0$, and let $\beta \geq 0$ be an upper bound on the expected abstention rate of $\hat{g}_n(\mathcal{P}, S)$. Then,

\[
\mathop{\mathbb{E}}_{x \sim \mathcal{D}}\bigg[\mathop{\mathbf{Pr}}_{S^1, S^2 \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S^1;\alpha, x) \neq \hat{g}_n(\mathcal{P}, S^2;\alpha, x)\Big]\bigg] \leq 2(\alpha + \beta)
\]
Proof.

For $i \in \{1,2\}$, let $A^i$ be the event that $\hat{g}_n(\mathcal{P}, S^i;\alpha, x) = \mathtt{ABSTAIN}$, and let $N^i$ be the event that $\hat{g}_n(\mathcal{P}, S^i;\alpha, x) \mathrel{\neq^{\mathtt{ABS}}} g_{\mathcal{P},\mathcal{S}}(x)$. In the worst case, $A^1$ and $A^2$, and $N^1$ and $N^2$, are disjoint; that is, if, e.g., $\hat{g}_n(\mathcal{P}, S^i)$ abstains on $x$, then $\hat{g}_n(\mathcal{P}, S^1;\alpha, x) \neq \hat{g}_n(\mathcal{P}, S^2;\alpha, x)$. In other words, we have that

\[
\mathop{\mathbf{Pr}}_{S^1, S^2 \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S^1;\alpha, x) \neq \hat{g}_n(\mathcal{P}, S^2;\alpha, x)\Big] \leq \mathop{\mathbf{Pr}}\Big[A^1 ~\lor~ A^2 ~\lor~ N^1 ~\lor~ N^2\Big]
\]

which, by the union bound, implies that

\[
\mathop{\mathbf{Pr}}_{S^1, S^2 \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S^1;\alpha, x) \neq \hat{g}_n(\mathcal{P}, S^2;\alpha, x)\Big] \leq \mathop{\mathbf{Pr}}\big[A^1\big] + \mathop{\mathbf{Pr}}\big[A^2\big] + \mathop{\mathbf{Pr}}\big[N^1\big] + \mathop{\mathbf{Pr}}\big[N^2\big].
\]

By Theorem 4.1, $\Pr\big[N^i\big] \leq \alpha$. Thus we have

\[
\mathop{\mathbb{E}}_{x \sim \mathcal{D}}\bigg[\mathop{\mathbf{Pr}}_{S^1, S^2 \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S^1;\alpha, x) \neq \hat{g}_n(\mathcal{P}, S^2;\alpha, x)\Big]\bigg] \leq 2\alpha + \mathop{\mathbb{E}}_{x \sim \mathcal{D}}\Big[\mathop{\mathbf{Pr}}\big[A^1\big]\Big] + \mathop{\mathbb{E}}_{x \sim \mathcal{D}}\Big[\mathop{\mathbf{Pr}}\big[A^2\big]\Big].
\]

Finally, since $\beta$ is an upper bound on the expected abstention rate of $\hat{g}_n(\mathcal{P}, S)$, we conclude that

\[
\mathop{\mathbb{E}}_{x \sim \mathcal{D}}\bigg[\mathop{\mathbf{Pr}}_{S^1, S^2 \sim \mathcal{S}^n}\Big[\hat{g}_n(\mathcal{P}, S^1;\alpha, x) \neq \hat{g}_n(\mathcal{P}, S^2;\alpha, x)\Big]\bigg] \leq 2(\alpha + \beta)
\]

Appendix B Datasets

The German Credit and Taiwanese credit datasets consist of individuals' financial data, with a binary response indicating their creditworthiness. The German Credit dataset has 1,000 points and 20 attributes. We one-hot encode the data to get 61 features, and standardize the data to zero mean and unit variance using scikit-learn's StandardScaler. We partitioned the data into a training set of 700 and a test set of 200. The Taiwanese credit dataset has 30,000 instances with 24 attributes. We one-hot encode the data to get 32 features and normalize the data to be between zero and one. We partitioned the data into a training set of 22,500 and a test set of 7,500.
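For reference, the German Credit preprocessing described above could be reproduced roughly as follows; the file name, label column, and exact split call are illustrative assumptions rather than the authors' exact code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical column layout: categorical attributes are one-hot encoded,
# then all features are standardized to zero mean and unit variance.
df = pd.read_csv("german_credit.csv")
X = pd.get_dummies(df.drop(columns=["creditworthy"]))  # one-hot encode -> ~61 features
y = df["creditworthy"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=700, test_size=200)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```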

The Adult dataset consists of a subset of publicly available US Census data, with a binary response indicating annual income above \$50K. There are 14 attributes, which we one-hot encode to get 96 features. We normalize the numerical features to have values between 0 and 1. After removing instances with missing values, there are 30,162 examples, which we split into a training set of 14,891, a leave-one-out set of 100, and a test set of 1,501 examples.

The Seizure dataset comprises time-series EEG recordings for 500 individuals, with a binary response indicating the occurrence of a seizure. This is represented as 11,500 rows with 178 features each. We split this into 7,950 training points and 3,550 test points. We standardize the numeric features to zero mean and unit variance.

The Warfarin dataset was collected by the International Warfarin Pharmacogenetics Consortium [nejm-warfarin] and describes patients who were prescribed warfarin. After removing rows with missing values, 4,819 patients remained in the dataset. The inputs to the model are demographic (age, height, weight, race), medical (use of amiodarone, use of enzyme inducer), and genetic (VKORC1, CYP2C9) attributes. Age, height, and weight are real-valued and were scaled to zero mean and unit variance. The medical attributes take binary values, and the remaining attributes were one-hot encoded. The output is the weekly dose of warfarin in milligrams, which we encode as "low", "medium", or "high", following the recommendations set by [nejm-warfarin].

Fashion MNIST contains images of clothing items, labeled with one of 10 classes. There are 60,000 training examples and 10,000 test examples. We pre-process the data by normalizing the numerical values in the image array to be between 0 and 1.

The colorectal histology dataset contains images of human colorectal cancer, labeled with one of 8 classes. There are 5,000 images, which we divide into a training set of 3,750 and a validation set of 1,250. We pre-process the data by normalizing the numerical values in the image array to be between 0 and 1.

The UCI datasets as well as FMNIST are available under an MIT license; the colorectal histology and Warfarin datasets are under a Creative Commons license [uci, colon_license, nejm-warfarin].

Appendix C Model Architecture and Hyper-Parameters

The German Credit and Seizure models have three hidden layers of size 128, 64, and 16. Models on the Adult dataset have one hidden layer of 200 neurons. Models on the Taiwanese dataset have two hidden layers of 32 and 16 neurons. The Warfarin models have one hidden layer of 100 neurons. The FMNIST model is a modified LeNet architecture [lecun1995learning]; this model is trained with dropout. The Colon models use a modified ResNet50 [he2016deep], pre-trained on ImageNet [deng2009imagenet], available from Keras. German Credit, Adult, Seizure, Taiwanese, and Warfarin models are trained for 100 epochs; FMNIST models for 50; and Colon models for 20 epochs. German Credit models are trained with a batch size of 32; FMNIST with 64; Adult, Seizure, and Warfarin models with batch sizes of 128; and Colon and Taiwanese Credit models with batch sizes of 512. German Credit, Adult, Seizure, Taiwanese Credit, Warfarin, and Colon models are trained with Keras' Adam optimizer with the default parameters. FMNIST models are trained with Keras' SGD optimizer with the default parameters.
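As an illustration, a tabular model matching the German Credit / Seizure architecture and optimizer settings described above might be built as follows. This is a sketch under the stated hyper-parameters, with our own function name and loss choice, not the authors' exact training code.

```python
import tensorflow as tf

def build_tabular_model(n_features, hidden=(128, 64, 16), n_classes=2):
    # Three hidden layers of 128, 64, and 16 units, as used for German Credit and Seizure.
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(n_features,)))
    for units in hidden:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    # Adam optimizer with default parameters, as described above.
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g., model = build_tabular_model(61)
#       model.fit(X_train, y_train, epochs=100, batch_size=32)
```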

Note that we discuss train-test splits and data preprocessing in Appendix B. We prepare the different models for each dataset using TensorFlow 2.3.0, and all computations are performed using a Titan RTX accelerator on a machine with 64 gigabytes of memory.

Appendix D Metrics

We report similarity between feature attributions with Spearman's ranking correlation ($\rho$), Pearson's correlation coefficient ($r$), top-$k$ intersection, $\ell_2$ distance, and, for image datasets, SSIM. We use the standard implementations of Spearman's ranking correlation ($\rho$) and Pearson's correlation coefficient ($r$) from scipy, and implement the $\ell_2$ distance and top-$k$ intersection using numpy functions.

Note that $r$ and $\rho$ vary from -1 to 1, denoting negative, zero, and positive correlation. We display top-$k$ intersection for $k=5$, and compute it by taking the number of features in the intersection of the top 5 between two models and dividing by 5. Thus top-$k$ is between 0 and 1, indicating low and high agreement respectively.

The $\ell_2$ distance has a minimum of 0 but is unbounded from above, and SSIM varies from -1 to 1, indicating no correlation to exact correlation respectively.
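For concreteness, the tabular similarity metrics above could be computed between two attribution vectors as sketched below; the function name is our own, and the choice to rank the top-$k$ features by attribution value is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def attribution_similarity(a, b, k=5):
    """Compare two feature-attribution vectors computed for the same input point."""
    a, b = np.asarray(a), np.asarray(b)
    rho, _ = spearmanr(a, b)            # Spearman's ranking correlation
    r, _ = pearsonr(a, b)               # Pearson's correlation coefficient
    # Top-k intersection: overlap of the k highest-attribution features, divided by k.
    top_a = set(np.argsort(a)[-k:])
    top_b = set(np.argsort(b)[-k:])
    top_k = len(top_a & top_b) / k
    l2 = np.linalg.norm(a - b)          # l2 distance (unbounded above)
    return {"spearman": rho, "pearson": r, "top_k": top_k, "l2": l2}
```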

Note that we compute these metrics between two different models on the same point, for every point in the test set, over 276 different pairs of models for tabular datasets and over 40 pairs of models for image datasets. We average this result over the points in the test set and over the comparisons to get the numbers displayed in the tables and graphs throughout the paper.

D.1 SSIM

Explanations for image models can be interpreted as an image (as there is an attribution for each pixel), and are often evaluated visually [leino18influence, simonyan2014deep, sundararajan2017axiomatic]. However, pixel-wise indicators of similarity between images (such as top-$k$ similarity between pixel values, Spearman's ranking coefficient, or mean squared error) often do not capture how similar images are visually, in aggregate. To give an indication of whether the entire explanation for an image model, i.e., the explanatory image produced, is similar, we use the structural similarity index (SSIM) [wang2004image]. We use the implementation from \texttt{scikit-image} [structural_similarity_index]. SSIM varies from -1 to 1, indicating no correlation to exact correlation respectively.
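For instance, the SSIM between two explanatory images could be computed as below. The helper name and the choice of `data_range` are our own assumptions, made because attribution maps are not constrained to a fixed pixel range.

```python
import numpy as np
from skimage.metrics import structural_similarity

def attribution_ssim(expl_a, expl_b):
    """SSIM between two attribution maps for the same image (2-D arrays)."""
    # SSIM needs an explicit data range; here we use the joint range of the two maps.
    both = np.stack([expl_a, expl_b])
    data_range = both.max() - both.min()
    return structural_similarity(expl_a, expl_b, data_range=data_range)
```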

Appendix E Experimental Results for $\alpha=0.01$

We include results on the predictions of selective ensemble models for $\alpha=0.01$ as well. Table 1 shows the percentage of points with disagreement between at least one pair of models ($p_{\text{flip}}>0$) trained with different random seeds (RS) or leave-one-out differences in training data, for singleton models ($n=1$) and selective ensembles ($n>1$). Notice that the number of points with $p_{\text{flip}}>0$ is again zero. We also include the mean and standard deviation of accuracy and abstention rate for $\alpha=0.01$ in Table 2.
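For reference, the disagreement statistic reported in these tables can be computed as sketched below: a point counts toward $p_{\text{flip}}>0$ if at least one pair of models (or ensembles) assigns it different labels. The function name is our own.

```python
from itertools import combinations
import numpy as np

def fraction_with_disagreement(predictions):
    """predictions: array of shape (n_models, n_points) of predicted labels.
    Returns the fraction of test points on which at least one pair of models disagrees."""
    predictions = np.asarray(predictions)
    disagree = np.zeros(predictions.shape[1], dtype=bool)
    for i, j in combinations(range(predictions.shape[0]), 2):
        disagree |= predictions[i] != predictions[j]
    return disagree.mean()
```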

mean ± std. dev. of portion of test data with $p_{\text{flip}}>0$
Randomness | n | Ger. Credit | Adult | Seizure | Tai. Credit | Warfarin | FMNIST | Colon
RS | 1 | .570 ± .020 | .087 ± .001 | .060 ± .01 | .082 ± .002 | .098 ± .003 | .061 ± .008 | .037 ± .005
RS | (5, 10, 15, 20) | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0
LOO | 1 | .262 ± .014 | .063 ± .001 | .031 ± .001 | .031 ± .001 | .033 ± .003 | .034 ± .004 | .042 ± .005
LOO | (5, 10, 15, 20) | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0
Table 1: The percentage of points with disagreement between at least one pair of models ($p_{\text{flip}}>0$) trained with different random seeds (RS) or leave-one-out differences in training data, for singleton models ($n=1$) and selective ensembles ($n>1$). Results for all selective ensembles are shown together, as they all have no disagreement. Note that these results are for $\alpha=0.01$; this choice of $\alpha$ also leads to zero disagreement between predicted points.
mean accuracy (abstain as error) / std. dev.
$\mathcal{S}$ | n | Ger. Credit | Adult | Seizure | Warfarin | Tai. Credit | FMNIST | Colon
RS | 5 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0
RS | 10 | .461 ± .016 | .807 ± 1e-3 | .945 ± 2e-3 | .646 ± 3e-3 | .788 ± 2e-3 | .870 ± 5e-3 | .902 ± 2e-3
RS | 15 | .589 ± .015 | .822 ± 8e-4 | .961 ± 1e-3 | .661 ± 3e-3 | .802 ± 9e-4 | .890 ± 2e-3 | .915 ± 1e-3
RS | 20 | .593 ± .011 | .822 ± 7e-4 | .961 ± 8e-4 | .662 ± 1e-3 | .803 ± 9e-4 | .991 ± 1e-3 | .926 ± 1e-3
LOO | 5 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0
LOO | 10 | .618 ± .017 | .818 ± 1e-3 | .947 ± 4e-3 | .674 ± 2e-3 | .807 ± 1e-3 | .904 ± 6e-4 | .901 ± 2e-3
LOO | 15 | .656 ± .017 | .828 ± 1e-3 | .963 ± 1e-3 | .678 ± 9e-4 | .812 ± 9e-4 | .908 ± 1e-3 | .912 ± 2e-3
LOO | 20 | .661 ± .018 | .829 ± 7e-4 | .964 ± 1e-3 | .678 ± 7e-4 | .812 ± 8e-4 | .909 ± 6e-4 | .912 ± 2e-3
mean abstention rate / std. dev.
$\mathcal{S}$ | n | Ger. Credit | Adult | Seizure | Warfarin | Tai. Credit | FMNIST | Colon
RS | 5 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0
RS | 10 | .449 ± .021 | .068 ± 2e-3 | .045 ± 2e-3 | .078 ± 5e-3 | .063 ± 2e-3 | .087 ± 8e-3 | .050 ± 3e-3
RS | 15 | .278 ± .017 | .041 ± 1e-3 | .025 ± 1e-3 | .049 ± 3e-3 | .037 ± 1e-3 | .055 ± 2e-3 | .030 ± 2e-3
RS | 20 | .270 ± .015 | .040 ± 1e-3 | .024 ± 1e-3 | .047 ± 2e-3 | .036 ± 1e-3 | .054 ± 9e-4 | .038 ± 1e-3
LOO | 5 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0
LOO | 10 | .215 ± .030 | .049 ± 2e-3 | .045 ± 5e-3 | .027 ± 2e-3 | .025 ± 1e-3 | .029 ± 1e-3 | .054 ± 2e-3
LOO | 15 | .144 ± .040 | .030 ± 2e-3 | .026 ± 1e-3 | .017 ± 2e-3 | .017 ± 2e-3 | .021 ± 3e-3 | .035 ± 2e-3
LOO | 20 | .135 ± .040 | .029 ± 1e-3 | .025 ± 1e-3 | .017 ± 1e-3 | .017 ± 2e-3 | .019 ± 1e-3 | .035 ± 3e-3
Table 2: Accuracy (above) and abstention rate (below) of selective ensembles with $n \in \{5, 10, 15, 20\}$ constituents. Results are averaged over 24 models; standard deviations are shown. Note that these results are for $\alpha=0.01$.

Appendix F Selective Ensembling Full Results

We include the full results from the evaluation section, including error bars on the disagreement, accuracy, and abstention rates of selective ensembles, in Table 3 and Table 4 respectively. We also include results for all datasets on non-selective ensembles' ability to mitigate disagreement and on their accuracy, in Figure 2 and Figure 3 respectively.

mean ± std. dev. of portion of test data with $p_{\text{flip}}>0$
Randomness | n | Ger. Credit | Adult | Seizure | Tai. Credit | Warfarin | FMNIST | Colon
RS | 1 | .570 ± .020 | .087 ± .001 | .060 ± .01 | .082 ± .002 | .098 ± .003 | .061 ± .008 | .037 ± .005
RS | (5, 10, 15, 20) | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0
LOO | 1 | .262 ± .014 | .063 ± .001 | .031 ± .001 | .031 ± .001 | .033 ± .003 | .034 ± .004 | .042 ± .005
LOO | (5, 10, 15, 20) | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0
Table 3: Percentage of points with disagreement between at least one pair of models ($p_{\text{flip}}>0$) trained with different random seeds (RS) or leave-one-out differences in training data, for singleton models ($n=1$) and selective ensembles ($n>1$). We present the mean and standard deviation of this percentage over 10 runs of re-sampling ensemble models. Note that these results are for $\alpha=0.05$, the setting presented in the main paper.
mean accuracy (abstain as error) / std. dev.
$\mathcal{S}$ | n | Ger. Credit | Adult | Seizure | Warfarin | Tai. Credit | FMNIST | Colon
RS | 5 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0
RS | 10 | .576 ± .013 | .820 ± 8e-4 | .960 ± 1e-3 | .660 ± 2e-3 | .800 ± 1e-3 | .888 ± 2e-3 | .914 ± 1e-3
RS | 15 | .636 ± .017 | .827 ± 5e-4 | .965 ± 1e-3 | .668 ± 2e-3 | .807 ± 9e-4 | .897 ± 2e-3 | .919 ± 1e-3
RS | 20 | .664 ± .014 | .830 ± 5e-4 | .967 ± 9e-4 | .670 ± 3e-3 | .810 ± 8e-4 | .902 ± 1e-3 | .921 ± 1e-3
LOO | 5 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0
LOO | 10 | .653 ± .017 | .827 ± 1e-3 | .962 ± 2e-3 | .677 ± 1e-3 | .812 ± 1e-3 | .909 ± 4e-4 | .912 ± 1e-3
LOO | 15 | .678 ± .014 | .832 ± 7e-4 | .968 ± 9e-4 | .679 ± 9e-4 | .814 ± 9e-4 | .910 ± 1e-3 | .916 ± 2e-3
LOO | 20 | .689 ± .014 | .834 ± 7e-4 | .970 ± 1e-3 | .680 ± 7e-4 | .815 ± 8e-4 | .911 ± 4e-4 | .918 ± 8e-4
mean abstention rate / std. dev.
$\mathcal{S}$ | n | Ger. Credit | Adult | Seizure | Warfarin | Tai. Credit | FMNIST | Colon
RS | 5 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0
RS | 10 | .291 ± .014 | .043 ± 1e-3 | .02 ± 1e-3 | .050 ± 3e-3 | .039 ± 2e-3 | .059 ± 2e-3 | .032 ± 3e-3
RS | 15 | .205 ± .020 | .032 ± 1e-3 | .018 ± 1e-3 | .037 ± 3e-3 | .028 ± 1e-3 | .042 ± 2e-3 | .023 ± 2e-3
RS | 20 | .165 ± .015 | .024 ± 7e-4 | .014 ± 7e-4 | .031 ± 4e-3 | .023 ± 8e-4 | .036 ± 1e-3 | .019 ± 2e-3
LOO | 5 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0
LOO | 10 | .151 ± .041 | .032 ± 2e-3 | .027 ± 2e-3 | .018 ± 2e-3 | .017 ± 2e-3 | .020 ± 5e-4 | .036 ± 3e-3
LOO | 15 | .105 ± .034 | .022 ± 1e-3 | .019 ± 1e-3 | .013 ± 2e-3 | .013 ± 2e-3 | .016 ± 2e-3 | .027 ± 2e-3
LOO | 20 | .079 ± .029 | .018 ± 1e-3 | .015 ± 1e-3 | .011 ± 2e-3 | .010 ± 1e-3 | .012 ± 8e-4 | .023 ± 2e-3
Table 4: Accuracy (above) and abstention rate (below) of selective ensembles with $n \in \{5, 10, 15, 20\}$ constituents. Results are averaged over 24 models; standard deviations are shown. Note that these results are for $\alpha=0.05$, the setting presented in the main paper.
disagreement of non-abstaining ensembles
$\mathcal{S}$ | n | Ger. Credit | Adult | Seizure | Tai. Credit | Warfarin | FMNIST | Colon
RS | 1 | .570 ± .020 | .087 ± .001 | .060 ± .01 | .082 ± .002 | .098 ± .003 | 0.113 ± .005 | .066 ± .002
RS | 5 | .305 ± .017 | .045 ± .001 | .028 ± .001 | .082 ± .002 | .054 ± .003 | .046 ± .002 | .022 ± .001
RS | 10 | .234 ± .014 | .031 ± .001 | .019 ± .001 | .041 ± .001 | .040 ± .002 | .032 ± .002 | .014 ± .002
RS | 15 | .185 ± .012 | .026 ± .001 | .015 ± .001 | .030 ± .000 | .033 ± .002 | .028 ± .002 | .012 ± .001
RS | 20 | .155 ± .010 | .022 ± .001 | .013 ± .001 | .021 ± .001 | .030 ± .002 | .026 ± .001 | .010 ± .001
LOO | 1 | .262 ± .014 | .063 ± .001 | .031 ± .001 | .031 ± .001 | .033 ± .003 | .056 ± .004 | .068 ± .003
LOO | 5 | .142 ± .037 | .033 ± .001 | .028 ± .001 | .019 ± .001 | .018 ± .001 | .032 ± .002 | .030 ± .003
LOO | 10 | .111 ± .020 | .023 ± .001 | .020 ± .001 | .014 ± .001 | .016 ± .001 | .034 ± .002 | .016 ± .003
LOO | 15 | .074 ± .020 | .019 ± .001 | .017 ± .001 | .011 ± .001 | .012 ± .001 | .029 ± .001 | .014 ± .002
LOO | 20 | .067 ± .013 | .016 ± .001 | .015 ± .001 | .010 ± .000 | .011 ± .001 | .027 ± .001 | .010 ± .001
Figure 2: Mean and standard deviation of the percentage of test data with non-zero disagreement over 24 normal (i.e., not selective) ensembles. The mean and standard deviation are taken over ten re-samplings of 24 ensembles. While ensembling alone mitigates much of the prediction instability, it is unable to eliminate it as selective ensembles do.
accuracy of non-abstaining ensembles
$\mathcal{S}$ | n | Ger. Credit | Adult | Seizure | Warfarin | Tai. Credit | FMNIST | Colon
RS | 5 | 0.745 ± 0.013 | 0.842 ± 0.001 | 0.975 ± 0.001 | 0.688 ± 0.0 | 0.822 ± 0.001 | 0.919 ± 0.001 | 0.927 ± 0.001
RS | 10 | 0.747 ± 0.014 | 0.843 ± 0.001 | 0.975 ± 0.001 | 0.688 ± 0.0 | 0.822 ± 0.001 | 0.92 ± 0.001 | 0.928 ± 0.001
RS | 15 | 0.75 ± 0.01 | 0.842 ± 0.001 | 0.975 ± 0.001 | 0.688 ± 0.0 | 0.822 ± 0.001 | 0.92 ± 0.001 | 0.928 ± 0.001
RS | 20 | 0.747 ± 0.01 | 0.842 ± 0.0 | 0.975 ± 0.001 | 0.688 ± 0.0 | 0.822 ± 0.001 | 0.92 ± 0.001 | 0.928 ± 0.0
LOO | 5 | 0.728 ± 0.011 | 0.844 ± 0.0 | 0.979 ± 0.001 | 0.685 ± 0.002 | 0.821 ± 0.001 | 0.918 ± 0.0 | 0.927 ± 0.002
LOO | 10 | 0.728 ± 0.008 | 0.844 ± 0.001 | 0.978 ± 0.001 | 0.686 ± 0.002 | 0.821 ± 0.001 | 0.918 ± 0.0 | 0.927 ± 0.002
LOO | 15 | 0.733 ± 0.008 | 0.844 ± 0.0 | 0.979 ± 0.001 | 0.685 ± 0.001 | 0.821 ± 0.0 | 0.917 ± 0.0 | 0.927 ± 0.001
LOO | 20 | 0.73 ± 0.008 | 0.843 ± 0.0 | 0.979 ± 0.001 | 0.685 ± 0.002 | 0.821 ± 0.0 | 0.918 ± 0.001 | 0.927 ± 0.001
Figure 3: Accuracy of non-selective (regular) ensembles with $n \in \{5, 10, 15, 20\}$ constituents. Results are averaged over 24 models; standard deviations are shown.

Appendix G Selective Ensembles and Disparity in Selective Prediction

Prior work has raised the possibility that selective prediction can exacerbate model accuracy disparity between demographic groups [jones2020selective]. In light of this, we present the selective ensemble accuracy and abstention rate group by group for several demographic groups across four datasets: Adult, German Credit, Taiwanese Credit, and Warfarin Dosing. Results are in Table 5.

accuracy (abstain as error) / abstention rate
$\mathcal{S}$ | n | Adult Male | Adult Fem. | Ger. Cred. Young | Ger. Cred. Old | Tai. Cred. Male | Tai. Cred. Fem. | Warf. Black | Warf. White | Warf. Asian
Base | 1 | .804 / - | .923 / - | .677 / - | .777 / - | .814 / - | .825 / - | .665 / - | .688 / - | .689 / -
RS | 5 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0
RS | 10 | .777 / .053 | .912 / .023 | .507 / .334 | .636 / .254 | .791 / .048 | .807 / .035 | .659 / .009 | .681 / .002 | .683 / .007
RS | 15 | .786 / .037 | .915 / .015 | .559 / .248 | .705 / .168 | .798 / .033 | .812 / .025 | .664 / .010 | .683 / .002 | .688 / .006
RS | 20 | .789 / .030 | .917 / .013 | .586 / .205 | .733 / .130 | .802 / .028 | .814 / .020 | .667 / .009 | .683 / .002 | .689 / .006
Base | 1 | .806 / - | .922 / - | .697 / - | .757 / - | .815 / - | .825 / - | .665 / - | .687 / - | .688 / -
LOO | 5 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0 | 0.0 / 1.0
LOO | 10 | .787 / .038 | .913 / .018 | .612 / .166 | .689 / .138 | .802 / .023 | .817 / .014 | .655 / .020 | .680 / .019 | .680 / .019
LOO | 15 | .793 / .026 | .916 / .012 | .646 / .101 | .704 / .107 | .806 / .017 | .819 / .011 | .658 / .014 | .681 / .013 | .682 / .013
LOO | 20 | .796 / .022 | .917 / .010 | .661 / .071 | .714 / .084 | .808 / .014 | .820 / .009 | .659 / .011 | .682 / .011 | .683 / .011
Table 5: Selective ensemble accuracy and abstention rate, group by group, for several demographic groups across four datasets: Adult, German Credit, Taiwanese Credit, and Warfarin Dosing. We note that, by and large, using selective ensembles did not exacerbate accuracy disparity by much (within roughly 1% of the original disparity), although they did not ameliorate disparities in accuracy that already existed in the algorithm's performance. The only exception was German Credit, where we note, as in the remainder of our results, that the entire dataset is only about 1,000 points, so results may differ in this regime. Overall, we note that subgroup abstention rates can vary by dataset, and so they should be studied whenever selective ensembles are used in a sensitive setting.

Appendix H Explanation Consistency Full Results

We give full results for selective and non-selective ensembling’s mitigation of inconsistency in feature attributions.

H.1 Attributions

We pictorially show the inconsistency of individual models' feature attributions versus the consistency of attributions from ensembles of 15 models, for each tabular dataset, in Figure 4 and Figure 5. The former shows inconsistency over differences in random initialization; the latter shows inconsistency over one-point changes to the training set.

[Figure 4: five pairs of attribution bar plots, panels (a)–(e); see caption below.]
Figure 4: Inconsistency of attributions on the same point across an individual (left) and ensembled (right) model ($n=15$), for all datasets, over differences in the random seed chosen for initializing parameters before training. The height of each bar on the horizontal axis represents the attribution score of a distinct feature, and each color represents a different model. Features are ordered according to the attribution scores of one randomly-selected model. Figure 4(a) depicts the German Credit dataset, Figure 4(b) Adult, Figure 4(c) Seizure, Figure 4(d) Taiwanese, and Figure 4(e) Warfarin. We do not include feature attributions for image datasets, as individual pixels are less meaningful than the feature attributions in a tabular dataset.
[Figure 5: five pairs of attribution bar plots, panels (a)–(e); see caption below.]
Figure 5: Inconsistency of attributions on the same point across an individual (left) and ensembled (right) model ($n=15$), for all datasets, over leave-one-out differences in the training set. The height of each bar on the horizontal axis represents the attribution score of a distinct feature, and each color represents a different model. Features are ordered according to the attribution scores of one randomly-selected model. Figure 5(a) depicts the German Credit dataset, Figure 5(b) Adult, Figure 5(c) Seizure, Figure 5(d) Taiwanese, and Figure 5(e) Warfarin. We do not include feature attributions for image datasets, as individual pixels are less meaningful than the feature attributions in a tabular dataset.

H.2 Similarity Metrics of Attributions

We display how Spearman's ranking coefficient ($\rho$), Pearson's correlation coefficient ($r$), top-5 intersection, and $\ell_2$ distance between feature attributions over the same point become more and more similar with increasing numbers of ensemble models. While each similarity score is computed between two models on the same point, the result is averaged over the entire test set. We further average over 276 comparisons between different models. In cases where abstention is high, indicating inconsistency of the training pipeline on the dataset, selective ensembling can further improve the stability of attributions by not considering unstable points (see, e.g., German Credit). We present the expanded results from the main paper, for all datasets, on all four metrics (SSIM is computed only for image datasets, and $\rho$ is not computed for image datasets). We display error bars indicating the standard deviation over the 276 comparisons between two models for tabular datasets, and 40 comparisons for image datasets.
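Concretely, the averaging procedure described above can be sketched as follows (here for Spearman's $\rho$); the function name is our own, and the 276 comparisons correspond to all pairs among 24 models.

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def mean_pairwise_spearman(attributions):
    """attributions: array of shape (n_models, n_points, n_features).
    Average Spearman correlation over all model pairs and all test points."""
    attributions = np.asarray(attributions)
    scores = []
    for i, j in combinations(range(attributions.shape[0]), 2):  # 276 pairs for 24 models
        per_point = []
        for a, b in zip(attributions[i], attributions[j]):
            rho, _ = spearmanr(a, b)  # similarity of the two models' attributions on one point
            per_point.append(rho)
        scores.append(np.mean(per_point))
    return float(np.mean(scores)), float(np.std(scores))
```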

Figure 6: We plot the average similarity between feature attributions for an individual point, averaged over 276 comparisons of feature attributions from two different models. This is aggregated across the entire validation split. The error bars represent the standard deviation over the 276 comparisons between models. Each row of plots corresponds to a given dataset, noted on the far left, and each column of plots to a given metric, noted at the top. Note that for the image datasets (FMNIST and Colon), we plot SSIM instead of Spearman's ranking coefficient ($\rho$). The x-axis is the number of models in the ensemble, starting with one, and the y-axis indicates the value of the similarity metric averaged over all 276 comparisons of individual points' attributions in the validation split. The red and orange lines depict regular ensembles, and the green and blue lines represent selective ensembles.