
Approximate Conditional Coverage & Calibration via Neural Model Approximations

Allen Schmaltz
Reexpress AI
allen@re.express
&Danielle Rasooly
Harvard University
drasooly@bwh.harvard.edu
Abstract

A typical desideratum for quantifying the uncertainty from a classification model as a prediction set is class-conditional singleton set calibration. That is, such sets should map to the output of well-calibrated selective classifiers, matching the observed frequencies of similar instances. Recent works proposing adaptive and localized conformal p-values for deep networks do not guarantee this behavior, nor do they achieve it empirically. Instead, we use the strong signals for prediction reliability from KNN-based approximations of Transformer networks to construct data-driven partitions for Mondrian Conformal Predictors, which are treated as weak selective classifiers that are then calibrated via a new Inductive Venn Predictor, the Venn-ADMIT Predictor. The resulting selective classifiers are well-calibrated, in a conservative but practically useful sense for a given threshold. They are inherently robust to changes in the proportions of the data partitions, and straightforward conservative heuristics provide additional robustness to covariate shifts. We compare and contrast to the quantities produced by recent Conformal Predictors on several representative and challenging natural language processing classification tasks, including class-imbalanced and distribution-shifted settings.

1 Introduction

Uncertainty quantification is challenging because the problem of the reference class (see, e.g., Vovk et al., 2005, p. 159) necessitates task-specific care in interpreting even well-calibrated probabilities that agree with the observed frequencies. It is made even more complex in practice with deep neural networks, for which the otherwise strong blackbox point predictions are typically not well-calibrated and can unexpectedly under-perform over distribution shifts. And it becomes even more involved for classification, given that the promising distribution-free approach of split-conformal inference (Vovk et al., 2005; Papadopoulos et al., 2002), an assumption-light frequentist approach suitable when sample sizes are sufficiently large, produces a counterintuitive p-value quantity in the case of classification (cf. regression).

Setting.

In a typical natural language processing (NLP) binary or multi-class classification task, we have access to a computationally expensive blackbox neural model, $F$; a training dataset, $\mathcal{D}_{\rm tr}=\{Z_i\}_{i=1}^{I}=\{(X_i,Y_i)\}_{i=1}^{I}$, of $|\mathcal{D}_{\rm tr}|=I$ instances paired with their corresponding ground-truth discrete labels, $Y_i\in\mathcal{Y}=\{1,\ldots,C\}$; and a held-out labeled calibration dataset, $\mathcal{D}_{\rm ca}=\{Z_j\}_{j=I+1}^{N=I+J}$, of $|\mathcal{D}_{\rm ca}|=J$ instances. We are then given a new test instance, $X_{N+1}$, from an unlabeled test set, $\mathcal{D}_{\rm te}$. One approach to convey uncertainty in the predictions is to construct a prediction set, produced by some set-valued function $\hat{\mathcal{C}}(X_{N+1})\in 2^{C}$, containing the true unseen label with a specified level $1-\alpha\in(0,1)$ on average. We consider two distinct interpretations: as coverage and as a conservatively coarsened calibrated probability (after conversion to selective classification), both from a frequentist perspective.

Desiderata.

For such prediction sets to be of interest for typical classification settings with deep networks, we seek class-conditional singleton set calibration (ccs). We are willing to accept noise in other size stratifications, but the singleton sets, $|\hat{\mathcal{C}}|=1$, must contain the true value with a proportion $\geq 1-\alpha$, at least on average per class. We further seek singleton set sharpness; that is, to maximize the number of singleton sets. We seek reasonable robustness to distribution shifts. Finally, we seek informative sets that avoid the trivial solution of full cardinality.

If we are willing to fully dispense with specificity in the non-singleton-set stratifications for tasks with $|\mathcal{Y}|>2$, our desiderata can be achieved, in principle, with selective classifiers.¹

¹ The calibrated probabilities of the Venn-ADMIT Predictor, introduced here, can be presented as empirical probabilities, prediction sets, and/or selective classifications. We emphasize the latter in the present work, as it is useful as a minimal required quantity in typical classification settings, and provides a clear contrast (and means of comparison) to alternative distribution-free approaches that may be considered by practitioners.

Definition 1 (Classification with reject option).

A selective classifier, $g:\mathcal{X}\rightarrow\mathcal{Y}\cup\{\bot\}$, maps from the input to either a single class or the reject option (represented here with the falsum symbol).

Remark 1 (Prediction sets are selective classifications).

The output of any set-valued function $\hat{\mathcal{C}}(X_{N+1})\in 2^{C}$ corresponds to that of a selective classifier: map non-singleton sets, $|\hat{\mathcal{C}}(X_{N+1})|\neq 1$, to $\bot$; map all singleton sets to the corresponding class in $\mathcal{Y}$.

To date, the typical approach for constructing prediction sets in a distribution-free setting is not via methods for calibrating probabilities, but rather in the hypothesis testing framework of Conformal Predictors, which carry a PAC-style $(\alpha,\delta)$-valid coverage guarantee. In the inductive (or "split") conformal formulation (Vovk, 2012; Papadopoulos et al., 2002, inter alia), the p-value corresponds to confidence that a new point is as or more conforming than a held-out set with known labels. More specifically, we require a measurable function $A:\mathcal{Z}^{I}\times\mathcal{Z}\rightarrow\mathbb{R}$, which measures the conformity between $z$ and other instances. For example, given the softmax output of a neural network for $x$, $\hat{\pi}\in\mathbb{R}^{C}$, with $\hat{\pi}^{y}$ as the output of the true class, $A((z_1,\ldots,z_I),(x,y))\coloneqq\hat{\pi}^{y}$ is a typical choice. We construct a p-value, $v^{\hat{y}}$, as follows: $v^{\hat{y}}\coloneqq\frac{|\{j=I+1,\ldots,N~|~\tau_j\leq\tau_{N+1}\}|+1}{N+1}$, where $\tau_j\coloneqq A((z_1,\ldots,z_I),z_j)$ for all $z_j\in\mathcal{D}_{\rm ca}$, and $\tau_{N+1}\coloneqq A((z_1,\ldots,z_I),(x_{N+1},\hat{y}_{N+1}))$, where we suppose the true label is $\hat{y}$. We then construct the prediction set: $\hat{\mathcal{C}}(x_{N+1})=\{\hat{y}:v^{\hat{y}}>\alpha\}$. This is accompanied by a finite-sample, distribution-free coverage guarantee, which we state informally here.²

² We omit the randomness component (for tie-breaking), given the large sample sizes considered here.

Theorem 1 (Marginal Coverage of Conformal Predictors (Vovk et al., 2005)).

Provided the points of $\mathcal{D}_{\rm ca}$ and $\mathcal{D}_{\rm te}$ are drawn exchangeably from the same distribution $P_{XY}$ (which need not be further specified), the following marginal guarantee holds for a given $\alpha$: $\mathbb{P}\left\{Y_{N+1}\in\hat{\mathcal{C}}(X_{N+1})\right\}\geq 1-\alpha$.
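As a concrete illustration of the split-conformal construction above, the following is a minimal sketch of our own (the function names and toy scores are ours, not the paper's implementation); the conformity score $\tau$ is the softmax output of the hypothesized class, as in the text:

```python
import numpy as np

def conformal_pvalue(cal_scores, tau_test):
    """(|{j : tau_j <= tau_test}| + 1) / (J + 1), per the text."""
    cal_scores = np.asarray(cal_scores)
    return (np.sum(cal_scores <= tau_test) + 1) / (len(cal_scores) + 1)

def prediction_set(cal_scores, candidate_scores, alpha):
    """Keep every candidate label whose p-value exceeds alpha."""
    return [c for c, tau in enumerate(candidate_scores)
            if conformal_pvalue(cal_scores, tau) > alpha]

# Nine calibration scores (softmax outputs of the true class); the model
# assigns the test point a high score for class 0 and a low score for class 1.
cal = [0.85, 0.9, 0.92, 0.88, 0.95, 0.8, 0.91, 0.87, 0.93]
print(prediction_set(cal, candidate_scores=[0.96, 0.05], alpha=0.1))  # -> [0]
```

Here class 1 is excluded because its p-value, $1/10$, does not exceed $\alpha=0.1$.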

Split-conformal coverage follows a Beta distribution (Vovk, 2012), from which a PAC-style $(\alpha,\delta)$-validity guarantee can be obtained, and from which we can determine a suitable sample size to achieve this coverage in expectation. Unfortunately, this does not guarantee singleton set coverage (the hypothesis-testing analogue of our ccs desideratum), a known, but under-appreciated, negative result that motivates the present work:

Corollary 1.

Conformal Predictors do not guarantee singleton set coverage. If they did, it would imply a stronger than marginal coverage guarantee.

Existing approaches.

Empirically, Conformal Predictors are weak selective classifiers, limiting their real-world utility. We show that this problem is not resolved by re-weighting the empirical CDF near a test point (Guan, 2022), nor by applying separate per-class hypothesis tests, nor by APS conformal score functions (Romano et al., 2020), nor by the adaptive regularization of RAPS (Angelopoulos et al., 2021), and that it occurs even on in-distribution data.

Solution. In the present work, we demonstrate, with a focus on Transformer networks (Vaswani et al., 2017), first that a closer notion of approximate conditional coverage, obtained via the stronger validity guarantees of Mondrian Conformal Predictors, is not in itself sufficient to achieve our desiderata. Instead, we treat such Conformal Predictors as weak selective classifiers, which serve as the underlying learner to construct a taxonomy for a Venn Predictor (Vovk et al., 2003), a valid multi-probability calibrator. This is enabled by data-driven partitions determined by KNN (Devroye et al., 1996) approximations, which themselves encode strong signals for prediction reliability. The result is a principled, well-calibrated selective classifier, with a sharpness suitable even for highly imbalanced, low-accuracy settings, and with at least modest robustness to covariate shifts.

2 Mondrian Conformal Predictors and Venn Predictors

A stronger than marginal coverage guarantee can be obtained by Mondrian Conformal Predictors (Vovk et al., 2005), which guarantee coverage within partitions of the data, including conditioned on the labels. Such Predictors are not sufficient for obtaining our desiderata, but serve as a principled approach for constructing a Venn taxonomy with a desirable balance between specificity and generality (i.e., over-fitting vs. under-fitting), the classic problem of the reference class.

Both Mondrian Conformal Predictors and Venn Predictors are defined by the choice of a particular taxonomy. A taxonomy is a measurable function $E:\mathcal{Z}^{I}\times\mathcal{Z}\rightarrow\mathcal{E}$, where $\mathcal{E}$ is a measurable space. A value $E((z_1,\ldots,z_I),z)$ is referred to as a category, and corresponds to a classification of $z$, as via an attribute or label. The p-value of a Mondrian Conformal Predictor is then determined similarly to that of a Conformal Predictor, but with both counts conditioned on the category: $v^{\hat{y}}\coloneqq\frac{|\{j=I+1,\ldots,N~|~e_j=e_{N+1}\wedge\tau_j\leq\tau_{N+1}\}|+1}{|\{j=I+1,\ldots,N~|~e_j=e_{N+1}\}|+1}$, where $e_j\coloneqq E((z_1,\ldots,z_I),z_j)$ for all $z_j\in\mathcal{D}_{\rm ca}$, and $e_{N+1}\coloneqq E((z_1,\ldots,z_I),(x_{N+1},\hat{y}_{N+1}))$. We will refer to the resulting coverage as approximate conditional coverage, a middle ground between marginal coverage and conditional coverage, the latter of which is not possible in the distribution-free setting with a finite sample (Lei & Wasserman, 2014):

Theorem 2 (Approximate Conditional Coverage of Mondrian Conformal Predictors (Vovk et al., 2005; Vovk, 2012)).

Provided the points of $\mathcal{D}_{\rm ca}$ and $\mathcal{D}_{\rm te}$ are exchangeable within their categories defined by taxonomy $E$ (Mondrian-exchangeability), the following coverage guarantee holds for a given $\alpha$: $\mathbb{P}\left\{Y_{N+1}\in\hat{\mathcal{C}}(X_{N+1})~|~E(\cdot,z_{N+1})\right\}\geq 1-\alpha$.
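The category conditioning amounts to restricting the p-value computation to calibration instances in the test point's category. A minimal sketch of our own (toy scores and names are illustrative only):

```python
import numpy as np

# Mondrian (category-conditional) p-value: both the comparison and the
# counts are restricted to calibration instances sharing the test point's
# category e_{N+1}.
def mondrian_pvalue(cal_scores, cal_categories, tau_test, e_test):
    cal_scores = np.asarray(cal_scores)
    in_cat = np.asarray(cal_categories) == e_test
    return (np.sum(cal_scores[in_cat] <= tau_test) + 1) / (np.sum(in_cat) + 1)

# Category "b" has systematically lower conformity scores; conditioning
# keeps the well-conforming "a" instances from diluting the comparison.
scores = [0.9, 0.95, 0.92, 0.3, 0.4, 0.35]
cats   = ["a", "a", "a", "b", "b", "b"]
print(mondrian_pvalue(scores, cats, tau_test=0.5, e_test="b"))  # -> 1.0
```

An unconditioned p-value over the same calibration scores would be diluted by the three high-scoring "a" instances.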

Venn Predictors dispense with p-values (and coverage) and instead seek validity via calibration. They are multi-probability calibrators, in that they produce not one probability, but multiple probabilities for a single class, a compromise which yields an otherwise quite strong theoretical guarantee. Venn Predictors have a simple, intuitive appeal: They amount to calculating the empirical probability among similar points to the test instance. The quirk to enable the theoretical guarantee is that this is done by including the test point itself, assigning each possible label; hence, the generation of multiple empirical probability distributions. Specifically, for $x_{N+1}$ we first determine its category, typically some classification from the underlying model. (Existing taxonomies for Venn Predictors include using the predictions from a classifier (Lambrou et al., 2015), variations on nearest neighbors (Johansson et al., 2018), and isotonic regression, the Venn-ABERS Predictor (Vovk & Petej, 2014; Vovk et al., 2015).) We will use $\mathcal{T}$ to indicate all instances in the category, where we have added $(x_{N+1},c)$ assuming the true label is $c$. We then calculate the empirical probability:

$p_c(c^{\prime})\coloneqq\frac{|\{(x^{*},y^{*})\in\mathcal{T}:~y^{*}=c^{\prime}\}|}{|\mathcal{T}|},~\forall~c^{\prime}\in\mathcal{Y}$ (1)

We repeat this assuming each label is the true label, in turn, equivariant with respect to the taxonomy (that is, without respect to the ordering of points in the category). Remarkably, one of the probabilities from a multi-probability Venn Predictor is guaranteed to be perfectly calibrated. For our purposes, it will be sufficient to show this for the binary case. For a random variable $O\in[0,1]$, such as the probabilistic output of a classifier, and a binary random variable $Y\in\{0,1\}$, we will follow previous work (Vovk & Petej, 2014) in saying $O$ is perfectly calibrated if $\mathbb{E}(Y~|~O)=O$ a.s. The validity of the Venn Predictor is then:

Theorem 3 (Venn Predictor Calibration Validity; Theorem 1 of Vovk & Petej (2014)).

Provided the points of $\mathcal{D}_{\rm ca}$ and $\mathcal{D}_{\rm te}$ are IID, among the two probabilities output by a Venn Predictor for $Y=1$, $p_0(1)$ and $p_1(1)$, one is perfectly calibrated.
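To make the multi-probability computation of Eq. 1 concrete, the following is a minimal sketch of our own (the helper name is ours): for each hypothesized test label $c$, the test point is added to its category with label $c$ and the empirical label distribution is recomputed.

```python
from collections import Counter

def venn_probabilities(category_labels, classes):
    """category_labels: true labels of the calibration points in the test
    point's category. Returns {c: p_c}, where p_c is the empirical label
    distribution (Eq. 1) computed with the test point assumed to be labeled c."""
    dists = {}
    for c in classes:
        counts = Counter(category_labels)
        counts[c] += 1  # include the test point with hypothesized label c
        n = len(category_labels) + 1
        dists[c] = {cp: counts[cp] / n for cp in classes}
    return dists

# Binary example: 3 of the 4 calibration points in the category are labeled 1.
dists = venn_probabilities([1, 1, 1, 0], classes=[0, 1])
p0, p1 = dists[0][1], dists[1][1]   # the two Venn probabilities for Y = 1
print(p0, p1)  # -> 0.6 0.8
```

By Theorem 3, one of the two printed probabilities for $Y=1$ is perfectly calibrated; reporting the interval $[0.6, 0.8]$ (or its lower bound, conservatively) reflects this multi-probability output.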

3 Tasks: Classification with Transformers for NLP

The taxonomies will be chosen based on the need to partition the high-dimensional input of NLP tasks without having explicit attributes known in advance. We first introduce general notation for sequence labeling and document classification tasks. Each instance consists of a document, $\bm{x}=x_1,\ldots,x_t,\ldots,x_T$, of $T$ tokens: here, either words or amino acids. In the case of supervised sequence labeling ($\mathbf{SSL}$), we seek to predict $\hat{\bm{y}}=\hat{y}_1,\ldots,\hat{y}_t,\ldots,\hat{y}_T$, the token-level labels for each token in the document, and we have the ground-truth token labels, $y_t$, for training. For document classification ($\mathbf{DC}$), we seek to predict the document-level label, $\hat{y}$, and we have $y$ at training.

Table 1: Overview of experiments.

| Label | Task | $|\mathcal{Y}|$ | $|\mathcal{D}_{\rm valid}|$ | $|\mathcal{D}_{\rm te}|$ | $\bm{r}$ | Base network | Acc. | Characteristics |
| Protein | $\mathbf{SSL}$ | 3 | 560k | {30k, 7k} | $\mathbb{R}^{1000}$ | $\sim$BERT-base | Mid | In-domain (2 test sets) |
| GrammarOOD | $\mathbf{SSL}$ | 2 | 35k | 93k | $\mathbb{R}^{1000}$ | BERT-large | Low | Domain-shifted + imbalanced |
| Sentiment | $\mathbf{DC}$ | 2 | 16k | 488 | $\mathbb{R}^{2000}$ | BERT-large | High | In-domain (acc. $>1-\alpha$) |
| SentimentOOD | $\mathbf{DC}$ | 2 | 16k | 5k | $\mathbb{R}^{2000}$ | BERT-large | Mid-Low | Domain-shifted/OOD |

For each task, our base model is a Transformer network. After training and/or fine-tuning, we fine-tune a kernel-width-1 CNN (the memory layer) over the output representations of the Transformer, producing predictions and representative dense vectors at a resolution (e.g., word-level or document-level) suitable for each task. Following past work, we refer to these representations as "exemplar vectors" primarily to contrast with "prototype", which is sometimes taken to refer to class centroids, whereas the "exemplars" are unique to each instance. We will subsequently use $f(x_t)\in\mathbb{R}^{C}$ for the prediction logits produced by the memory layer corresponding to the token at index $t$; $\pi^{c}$ as the corresponding softmax-normalized output for class $c$; and $\bm{r}_t$ as the associated exemplar vector. For $\mathbf{SSL}$ there are $T$ such logits and vectors. For $\mathbf{DC}$, $t$ corresponds to a single representation of the document, with $f(x_t)$ formed by a combination of local and global predictions (as described further in the Appendix). In the present work, we will primarily be concerned with Transformers at the level of abstraction of the exemplar vectors, $\bm{r}_t$; we refer the reader to the original works describing Transformers (Vaswani et al., 2017) and the particular choice of memory layer (Schmaltz, 2021) for additional details.

Splits. Our baselines of comparison use $\mathcal{D}_{\rm tr}$ and $\mathcal{D}_{\rm ca}$ as the training and calibration sets, respectively. For our methods, we will assume the existence of an additional disjoint split of the data, $\mathcal{D}_{\rm knn}$, for setting the parameters of the KNNs.
We will also require two calibration sets: $\mathcal{D}^{\rm mc}_{\rm ca}$, which serves as the calibration set for the Mondrian Conformal Predictor, and $\mathcal{D}^{\rm vp}_{\rm ca}$, which serves as the calibration set for the Venn Predictor.

4 Methods

Figure 1: (a) Overview. (b) Taxonomy and prediction signals. On the left is a high-level overview. The right illustrates a category assignment for the Venn-ADMIT Predictor, and the key prediction signals enabled by $f(x)_{\rm tr}^{\rm KNN}$: predictions become more reliable with increased label and prediction matches into training ($q$) and lower distances to training ($d$). The resulting well-calibrated selective classifiers are robust to changes in the proportion of these categories.

We first define the taxonomy for our Mondrian Conformal Predictor, ADMIT. The resulting sets will serve as baselines of comparison, but will primarily be used as weak selective classifiers for defining the taxonomy of our Venn Predictor, the Venn-ADMIT Predictor. In both cases, we make use of non-parametric approximations of Transformers, which encode strong signals for prediction reliability, including over distribution shifts, and are at least as effective as the model being approximated (Schmaltz, 2021). Predictions become less reliable at $L^2$ distances farther from the training set and with increased label and prediction mismatches among the nearest matches. We further introduce an additional KNN approximation that serves as a localizer, relating a test instance to the distribution of the conformal calibration set. We use this localizer to construct a conservative heuristic for category assignments. Figure 1 provides an overview of the components and a visualization of the prediction signals and Venn taxonomy.

4.1 KNN approximation of a Transformer network

In order to partition the feature space, we first approximate the Transformer as a weighted combination of predictions and labels over $\mathcal{D}_{\rm tr}$ (Section 4.1.1). This approximation, $f(x)_{\rm tr}^{\rm KNN}$, then becomes the model we use in practice, rather than the logits from the Transformer itself. We use this approximation to partition the data via a feature that separates more reliable points from less reliable points (Section 4.1.2) and via a distance-to-training band (Section 4.1.3).

4.1.1 Recasting a Transformer prediction as a weighting over the training set

We adapt the distance-weighted KNN approximation of Schmaltz (2021) for the multi-class setting. As in the original work, the KNN is trained to minimize prediction mismatches against the output of the memory layer (not the ground-truth labels). Training is performed on a 50/50 split of $\mathcal{D}_{\rm knn}$, as described further in Appendix C:

$f^{c}(x_t)\approx f^{c}(x_t)_{\rm tr}^{\rm KNN}=\beta^{c}+\sum\limits_{k\in\operatorname*{arg\,K\,min}\limits_{i\in\{1,\ldots,|\mathcal{D}_{\rm tr}|\}}\|\bm{r}_t-\bm{r}_i\|_2}w_k\cdot\left(\tanh(f^{c}(x_k))+\gamma^{c}\cdot\tilde{y}^{c}\right),$ (2)

$\text{where}~w_k=\frac{\exp\left(-\|\bm{r}_t-\bm{r}_k\|_2/\eta\right)}{\sum\limits_{k^{\prime}\in\operatorname*{arg\,K\,min}\limits_{i\in\{1,\ldots,|\mathcal{D}_{\rm tr}|\}}\|\bm{r}_t-\bm{r}_i\|_2}\exp\left(-\|\bm{r}_t-\bm{r}_{k^{\prime}}\|_2/\eta\right)}$ (3)

$\tilde{y}^{c}$ is the ground-truth label ($y$ for $\mathbf{DC}$, $y_t$ for $\mathbf{SSL}$) for class $c$, transformed to be in $\{-1,1\}$. $K$ is small in practice; $K=25$ in all experiments here, and in general it can be chosen using $\mathcal{D}_{\rm knn}$. This approximation has $2\cdot C+1$ learnable parameters, corresponding to $\beta^{c}$ and $\gamma^{c}$ for each class, and the temperature parameter $\eta$. We indicate the softmax-normalized output for each class with $\pi^{c}(x_t)^{\rm KNN}_{\rm tr}$. This model is used to produce approximations over all calibration and test instances. The choice of this particular functional form is discussed in Schmaltz (2021) and is designed to be as simple as possible while obtaining a sufficiently close approximation to the deep network.
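The core of Eq. 2-3 can be sketched for a single class and test point as follows. This is a simplified illustration of our own: $\beta^c$, $\gamma^c$, and $\eta$ are fixed to arbitrary values rather than fit on $\mathcal{D}_{\rm knn}$ as in the paper, and the toy exemplar vectors are invented.

```python
import numpy as np

def knn_logit(r_t, R_tr, f_tr, y_tr, K=25, beta=0.0, gamma=1.0, eta=1.0):
    """Distance-weighted KNN approximation (Eq. 2-3) of one class logit.
    R_tr: training exemplar vectors; f_tr: memory-layer logits for class c;
    y_tr: class-c labels transformed to {-1, 1}."""
    dists = np.linalg.norm(R_tr - r_t, axis=1)   # L2 to every training exemplar
    nn = np.argsort(dists)[:K]                    # indices of the K nearest
    w = np.exp(-dists[nn] / eta)
    w /= w.sum()                                  # Eq. 3: normalized distance weights
    # Eq. 2: weighted combination of squashed logits and {-1, 1} labels
    return beta + np.sum(w * (np.tanh(f_tr[nn]) + gamma * y_tr[nn]))

r_t  = np.array([0.0, 0.0])
R_tr = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(round(knn_logit(r_t, R_tr, f_tr=np.array([2.0, -2.0, 0.0]),
                      y_tr=np.array([1.0, -1.0, 1.0]), K=2), 2))  # -> 0.91
```

The nearest neighbor (a confidently predicted class-$c$ training point) dominates the weighting, so the approximated logit is positive.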

4.1.2 Data-driven feature-space partitioning: True positive matching constraint

For each calibration and test point, we define the feature $q_t\in[0,K]$ as the count of consecutive sign matches of the prediction of the KNN, $\hat{y}^{\rm KNN}_t$, with the true label and memory layer prediction of the (up to $K$) nearest matches from the training set, $\mathcal{D}_{\rm tr}$:

$q_t(K)=\sum\limits_{k\in\operatorname*{arg\,K\,min}\limits_{i\in\{1,\ldots,|\mathcal{D}_{\rm tr}|\}}\|\bm{r}_t-\bm{r}_i\|_2}\left[\hat{y}^{\rm KNN}_t=\hat{y}^{\rm tr}_k\right]\wedge\left[\hat{y}^{\rm tr}_k=y^{\rm tr}_k\right]\wedge\left[q_t(k-1)=k-1\right],$ (4)

with $q(0)\coloneqq 0$. We also use the $L^2$ distance to the nearest training set match as a basis for subsetting the distribution into distance bands, as discussed in the next section:

$d_t=\min\limits_{i\in\{1,\ldots,|\mathcal{D}_{\rm tr}|\}}\|\bm{r}_t-\bm{r}_i\|_2$ (5)
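The two reliability features of Eq. 4-5 can be sketched together; this is a toy illustration of our own (names and values are invented), highlighting the consecutive-match constraint, which stops counting at the first mismatch among the ordered neighbors:

```python
import numpy as np

def q_and_d(y_knn_t, dists, y_hat_tr, y_tr, K):
    """q_t (Eq. 4): consecutive nearest training matches (closest outward)
    whose memory-layer prediction agrees with both the test point's KNN
    prediction and their own true label. d_t (Eq. 5): distance to the
    single nearest training match."""
    order = np.argsort(dists)[:K]
    q = 0
    for k in order:
        if y_hat_tr[k] == y_knn_t and y_hat_tr[k] == y_tr[k]:
            q += 1
        else:               # consecutive-match constraint: stop at first mismatch
            break
    return q, float(dists[order[0]])

dists    = np.array([0.2, 0.5, 0.9, 1.4])   # distances to the 4 nearest training points
y_hat_tr = np.array([1, 1, 0, 1])           # memory-layer predictions of those neighbors
y_tr     = np.array([1, 1, 0, 1])           # their ground-truth labels
print(q_and_d(y_knn_t=1, dists=dists, y_hat_tr=y_hat_tr, y_tr=y_tr, K=4))  # -> (2, 0.2)
```

The third-nearest neighbor disagrees with the KNN prediction, so $q_t=2$ even though the fourth neighbor matches again.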

4.1.3 Data-driven feature-space partitioning: Distance-to-training band

We define the partition, $\mathcal{B}$, around each $x_t\in\mathcal{D}_{\rm te}$, constrained to $q$, as the $L^2$-distance-to-training band with a radius of $\omega=\delta\cdot\hat{s}$, with $\delta\in\mathbb{R}^{+}$ as a user-specified parameter and $\hat{s}$ as the estimated standard deviation of constrained true-positive calibration set distances, $\hat{s}={\rm std}([d_j:~j\in\{I+1,\ldots,I+|\mathcal{D}_{\rm ca}|\},~q_j>0,~\hat{y}^{\rm KNN}_j=y_j])$:

$\mathcal{B}(x_t,\omega,q_t,d_t;\mathcal{D}_{\rm ca})=\{x_j:~x_j\in\mathcal{D}_{\rm ca},~d_j\in[d_t-\omega,d_t+\omega],~q_t=q_j\}$ (6)
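Selecting the band of Eq. 6 reduces to two elementwise conditions over the calibration set; a minimal sketch of our own (the toy values are invented, and $\omega$ is fixed here rather than computed as $\delta\cdot\hat{s}$):

```python
import numpy as np

def band_indices(d_t, q_t, d_cal, q_cal, omega):
    """Indices of calibration points in B (Eq. 6): same q value as the test
    point, and distance-to-training within the omega-radius band around d_t."""
    d_cal, q_cal = np.asarray(d_cal), np.asarray(q_cal)
    mask = (np.abs(d_cal - d_t) <= omega) & (q_cal == q_t)
    return np.flatnonzero(mask)

d_cal = [0.1, 0.3, 0.9, 0.35, 2.0]   # calibration distances-to-training
q_cal = [3, 3, 3, 1, 3]              # calibration q values
print(band_indices(d_t=0.3, q_t=3, d_cal=d_cal, q_cal=q_cal, omega=0.25))  # -> [0 1]
```

Point 3 is inside the distance band but excluded by the $q$ constraint; points 2 and 4 share $q$ but lie outside the band.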

4.2 Prediction sets with approximate conditional coverage: ADMIT

We then define a taxonomy for our Mondrian Conformal Predictor by the partitions defined by $\mathcal{B}$ and the true labels: $E(\cdot,(z_t))=\mathcal{B}(x_t,\omega,q_t,d_t;\mathcal{D}_{\rm ca})\wedge y$. This conditioning on the labels means we apply split-conformal prediction separately for each label, i.e., "label-conditional" conformal prediction (Vovk et al., 2005; Vovk, 2012; Sadinle et al., 2018), which provides built-in robustness to label proportion shifts (cf. Podkopaev & Ramdas, 2021). We will refer to this method and the resulting sets with the label ADMIT. We always include the predicted label in the set. Pseudo-code appears in Appendix E.

4.3 Inductive Venn-ADMIT Predictors & Selective Classifiers

An ADMIT Predictor maps to a weak selective classifier. We instead seek a well-calibrated selective classifier, which we define as follows, as a straightforward coarsening of the probability, calculated only over the admitted subset:

Definition 2 (Well-calibrated selective classifiers).

We take as $S\in[0,1]$ the random variable indicating the probability that a non-rejected prediction from a selective classifier, $g$, should be admitted. We will say a selective classifier is conservatively well-calibrated (or just "well-calibrated") if $\mathbb{E}(Y~|~S\geq 1-\alpha)\geq 1-\alpha$ for a given $\alpha\in(0,1)$.

We construct such a selective classifier, $g$, as follows. Construct ADMIT sets for $\mathcal{D}^{\rm vp}_{\rm ca}$ and $\mathcal{D}_{\rm te}$, in both cases using $\mathcal{D}^{\rm mc}_{\rm ca}$ as the calibration set. Next, convert the ADMIT sets into selective classifiers, $g_{\rm weak}$, as in Remark 1. Calibrate the non-rejected predictions of $\mathcal{D}_{\rm te}$ (i.e., the $\hat{y}^{\rm KNN}$ predictions that were singleton sets and are now admitted predictions of $g_{\rm weak}$) using a Venn Predictor with a taxonomy defined by $\mathcal{B}$ and the prediction of the KNN, $\hat{y}^{\rm KNN}$, now using $\mathcal{D}^{\rm vp}_{\rm ca}$ as the calibration set. The Venn-ADMIT selective classifier is then the following decision rule, where $p_0(1),p_1(1)$ are the two Venn probabilities associated with $g_{\rm weak}$:

$g(x_t)=\begin{cases}\bot&\text{if }\min(p_0(1),p_1(1))<1-\alpha\\ \hat{y}^{\rm KNN}_t&\text{otherwise}\end{cases}$ (7)

We can then take $S\coloneqq 1$ if $g(x_t)\neq\bot$ (as used in Def. 2); i.e., a coarsening of the probability of the points admitted by $g$.
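The decision rule of Eq. 7 itself is a one-line comparison once the two Venn probabilities are in hand; a minimal sketch of our own (the function name is ours; `None` stands in for the falsum symbol):

```python
def venn_admit_decision(y_hat_knn, p0, p1, alpha):
    """Eq. 7: admit the KNN prediction only if *both* Venn probabilities
    clear the 1 - alpha threshold; otherwise reject."""
    if min(p0, p1) < 1 - alpha:
        return None          # reject: the lower Venn probability is too small
    return y_hat_knn

print(venn_admit_decision(y_hat_knn=1, p0=0.93, p1=0.95, alpha=0.1))  # -> 1
print(venn_admit_decision(y_hat_knn=1, p0=0.85, p1=0.95, alpha=0.1))  # -> None
```

Gating on the minimum of the multi-probability output is what makes the resulting selective classifier conservative: whichever of the two probabilities is the perfectly calibrated one, it is at least $1-\alpha$ for every admitted point.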

Proposition 1 (Venn-ADMIT selective classifiers are well-calibrated).

Provided the points of $\mathcal{D}^{\rm vp}_{\rm ca}$ and $\mathcal{D}_{\rm te}$, restricted to $\mathcal{B}$, are IID, the selective classifier $g$ defined by Eq. 7 is well-calibrated in the sense of Definition 2.

This follows directly from Theorem 3. \Box

4.4 Robustness

Venn-ADMIT selective classifiers are robust to covariate shifts that correspond to changes in the proportions of the partitions. We propose two simple heuristics that provide additional robustness, and first state two useful propositions that justify them.

Proposition 2 (Venn-ADMIT calibration invariance to partition censoring).

Venn-ADMIT selective classifiers remain well-calibrated in the sense of Definition 2 under censoring of one or more partitions $\mathcal{B}$.

This follows directly from the fact that both the weak selective classifier (ADMIT) and the well-calibrated selective classifier (Venn-ADMIT) treat each partition independently. We can thus construct a new selective classifier, $g^{\prime}$, that maps any input in the censored partition(s) to $\bot$. $\Box$

Proposition 3 (Venn-ADMIT calibration invariance to test point up-weighting).

Venn-ADMIT selective classifiers remain well-calibrated in the sense of Definition 2 using any test point weight in $[1,\infty)$ when calculating the empirical probabilities of the Venn-ADMIT Predictor.

Increasing the test point weight above 1 can only decrease the lower probability produced by the Venn-ADMIT Predictor (since the denominator in Eq. 1 can only increase). Since Def. 2 is only calculated for admitted points, this notion of conservative well-calibration is retained. $\Box$

4.4.1 Censoring Less Reliable Data Partitions

The feature $q$ can be viewed as an ensemble across multiple similar instances from training. Greater agreement suggests greater confidence in the prediction. We can restrict to partitions with the maximum value, $q=K=25$ here. By Prop. 2, calibration of the selective classifier is maintained.

4.4.2 Localized up-weighting based on category similarity

We can up-weight the test point when calculating the Venn-ADMIT probabilities using an additional KNN localizer, $f(x)_{\rm ca}^{\widehat{\rm KNN}}$, in this case with $\mathcal{D}_{\rm ca}$ as the support set of the KNN and a single parameter, a temperature weight. Weights increase above 1 with greater dissimilarity between a test point and its assigned category. Additional details are in Appendix B. By Prop. 3, calibration of the selective classifier is maintained.

4.4.3 Robust Venn-ADMIT Selective Classifications

For a given test point, $x_t$, we first construct a weak selective classifier with the ADMIT procedure of Section 4.2, followed by calibration via the Venn-ADMIT Predictor (Section 4.3). Optionally, we apply the heuristics described in Sections 4.4.1 and 4.4.2. Pseudo-code appears in Appendix E.

5 Experiments

We have established that the Venn-ADMIT selective classifications are conservatively well-calibrated; however, we have not said anything about the proportion of points that will be admitted. If the procedure is unnecessarily strict, we may nonetheless prefer the output from alternative approaches, such as Conformal Predictors. Additionally, the Venn-ADMIT Predictor is inherently robust to changes in the proportions of the data partitions, but whether that corresponds to real-world distribution shifts is task- and data-dependent. To address these concerns, we turn to empirical evaluations.

We evaluate on a wide range of representative NLP tasks, including challenging domain-shifted and class-imbalanced settings, and in settings in which the point prediction accuracies are quite high (marginally $>1-\alpha$) and in which they are relatively low. We follow past work in setting $\alpha=0.1$ in our main experiments. We set $\delta=1$. We summarize and label our benchmark tasks, the underlying parametric networks, and data in Table 1. A disjoint set of size 144k, the cb513 set from Protein, was used for initial methods development. In the Table, $\mathcal{D}_{\rm valid}$ is the original held-out validation set associated with each task. For the ADMIT and Venn-ADMIT approaches, a random $10\%$ sample of $\mathcal{D}_{\rm valid}$ serves as the disjoint $\mathcal{D}_{\rm knn}$ set for training the KNNs, with the remaining data split evenly between $\mathcal{D}^{\rm mc}_{\rm ca}$ and $\mathcal{D}^{\rm vp}_{\rm ca}$. The baseline and comparison methods are given the full $\mathcal{D}_{\rm valid}$ as $\mathcal{D}_{\rm ca}$. The Appendix provides implementation details on constructing the exemplar vectors, $\bm{r}$, for each of the tasks from the memory layer.

5.1 Comparison Models

As distribution-free baselines of comparison we consider the size- and adaptiveness-optimized variants of the RAPS algorithm of Angelopoulos et al. (2021), RAPSsize\operatorname{\textsc{RAPS\textsubscript{size}}} and RAPSadapt\operatorname{\textsc{RAPS\textsubscript{adapt}}}, which combine regularization and post-hoc Platt-scaling calibration (Platt, 1999; Guo et al., 2017) on the output of the memory layer\operatorname{\textsc{memory layer}}. Using stratification of coverage by cardinality as a metric, RAPSadapt\operatorname{\textsc{RAPS\textsubscript{adapt}}}, in particular, was reported to more closely approximate conditional coverage than the alternative APS\operatorname{\textsc{APS}} (Romano et al., 2020), with smaller sets. Confbase\operatorname{\textsc{Conf\textsubscript{base}}} is a split-conformal point of reference that simply uses the output of f(x)trKNNf(x)_{\rm{tr}}^{\rm{KNN}} without further conditioning or post-hoc calibration. Localconf\operatorname{\textsc{Local\textsubscript{conf}}} is a localized conformal (Guan, 2022) baseline using the KNN localizer f(x)caKNN^f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}}. Across methods, the point prediction is included in the set, which ensures conservative (but not necessarily exact/upper-bounded) coverage by eliminating null sets.

We use the label ADMIT\operatorname{\textsc{ADMIT}} to indicate the Mondrian Conformal sets. We use the label Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} to indicate Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} selective classifications with test-point up-weighting with the KNN localizer (Sec. 4.4.2), and Venn-ADMITqK\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}}} as those with the further restriction of q=K\operatorname{\text{$q$}}=K (Sec. 4.4.1). The Venn-ADMIT-w\operatorname{\textsc{Venn-ADMIT-w}} results exclude test-point up-weighting; Venn-ADMITqK-w\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}-w}} also excludes test-point up-weighting, but restricts to q=K\operatorname{\text{$q$}}=K.

Calibration in general is difficult to evaluate, with conflicting definitions and metrics (Kull et al., 2019; Gupta & Ramdas, 2022, inter alia). In the hypothesis testing framework, approaches have been proposed to make marginal Conformal Predictors more adaptive (i.e., to achieve closer approximations to conditional coverage), but evaluations tend to omit class-wise singleton set coverage, arguably the baseline quantity required in practice for classification. In contrast, our desiderata are easily evaluated and resolve these concerns: Of the admitted points, we calculate the proportion of points matching the true label, y𝒞¯\overline{y\in{\mathcal{C}}}, for each class. That is, given an admitted prediction (or similarly, a singleton set), an end-user should have confidence that the per-class accuracy is at least 1α1-\alpha. Additionally, other things being equal, the proportion of admitted points (nN\frac{n}{N}) should be maximized.
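These desiderata are directly computable. A minimal sketch, in which we assume (matching the per-class y=c columns of the evaluation tables) that the per-class slices are conditioned on the true label:

```python
import numpy as np

def selective_metrics(y_true, y_pred, admitted):
    """For the admitted (singleton-set) points, compute the per-class
    proportion whose prediction matches the true label (y-in-C), along
    with each class's share of admitted points and the overall admitted
    fraction n/N.  An illustrative sketch, not the authors' code."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    admitted = np.asarray(admitted, dtype=bool)
    N = len(y_true)
    results = {}
    for c in np.unique(y_true):
        mask = admitted & (y_true == c)  # admitted points with true label c
        cov = float((y_pred[mask] == c).mean()) if mask.any() else float("nan")
        results[int(c)] = {"coverage": cov, "n_over_N": float(mask.sum()) / N}
    return results, float(admitted.mean())
```

An end-user's check is then simply that each class's `coverage` is at least 1-alpha, with `n_over_N` as large as possible.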

Table 2: Model approximation vs. memory layer\operatorname{\textsc{memory layer}} accuracy/F0.5F_{0.5}.
Protein\operatorname{\textsc{Protein}} (Acc.) GrammarOOD\operatorname{\textsc{GrammarOOD}} (F0.5F_{0.5}) Sentiment\operatorname{\textsc{Sentiment}} (Acc.) SentimentOOD\operatorname{\textsc{SentimentOOD}} (Acc.)
Model/Approx. 𝒟ca{\mathcal{D}}_{\rm{ca}} ts115 casp12 𝒟ca{\mathcal{D}}_{\rm{ca}} 𝒟te{\mathcal{D}}_{\rm{te}} 𝒟ca{\mathcal{D}}_{\rm{ca}} 𝒟te{\mathcal{D}}_{\rm{te}} 𝒟ca{\mathcal{D}}_{\rm{ca}} 𝒟te{\mathcal{D}}_{\rm{te}}
memory layer\operatorname{\textsc{memory layer}} 0.75 0.77 0.70 0.59 0.40 0.92 0.93 0.92 0.78
f(x)trKNN\operatorname{\text{$f(x)_{\rm{tr}}^{\rm{KNN}}$}} 0.76 0.77 0.71 0.58 0.43 0.92 0.93 0.92 0.79
f(x)caKNN^\operatorname{\text{$f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}}$}} - 0.77 0.70 - 0.42 - 0.93 - 0.78

6 Results

Table 3: The empirical behavior of the calibration points differs significantly with q=0q=0 vs. q=Kq=K, and as the distance to training (dtd_{t}) varies, in terms of f(x)trKNNf(x)_{\rm{tr}}^{\rm{KNN}} point accuracy (Acc.), and the distribution of over-confidence and under-confidence (reflected in τ^0.1\hat{\tau}^{0.1}, 0.1 quantile threshold). (Validation set of Protein\operatorname{\textsc{Protein}}.)
Protein\operatorname{\textsc{Protein}}: Class Label (Amino-Acid/Token-Level Sequence Labeling)
y=Helixy=\rm{\textsc{{H}elix}} y=Strandy=\rm{\textsc{{S}trand}} y=Othery=\rm{\textsc{{O}ther}} y{HSO}y\in\{\rm{\textsc{{H}, {S}, {O}}}\}
Subset τ^c0.1\hat{\tau}_{c}^{0.1} Acc. nN\frac{n}{N} τ^c0.1\hat{\tau}_{c}^{0.1} Acc. nN\frac{n}{N} τ^c0.1\hat{\tau}_{c}^{0.1} Acc. nN\frac{n}{N} τ^0.1\hat{\tau}^{0.1} Acc. nN\frac{n}{N}
q=0q=0 0.07 0.59 0.07 0.07 0.56 0.05 0.18 0.56 0.10 0.11 0.57 0.22
q=Kq=K 0.96 0.98 0.12 0.95 0.94 0.04 0.92 0.92 0.08 0.94 0.96 0.24
q[0,K]q\in[0,K] 0.12 0.81 0.37 0.06 0.70 0.21 0.13 0.74 0.42 0.11 0.76 1.
dt<mediand_{t}<\text{median}
      q=0q=0 0.09 0.64 0.02 0.07 0.65 0.01 0.16 0.55 0.03 0.12 0.60 0.07
      q=Kq=K 0.96 0.98 0.07 0.95 0.96 0.03 0.93 0.94 0.06 0.94 0.96 0.15
      q[0,K]q\in[0,K] 0.27 0.87 0.16 0.09 0.81 0.08 0.14 0.79 0.18 0.16 0.82 0.43
dtmediand_{t}\geq\text{median}
      q=0q=0 0.07 0.57 0.05 0.07 0.53 0.04 0.19 0.57 0.07 0.11 0.56 0.16
      q=Kq=K 0.96 0.98 0.05 0.04 0.90 0.01 0.03 0.87 0.02 0.93 0.94 0.09
      q[0,K]q\in[0,K] 0.09 0.76 0.21 0.05 0.62 0.12 0.13 0.70 0.24 0.09 0.70 0.57

Across tasks, the KNNs consistently achieve point accuracies similar to those of the base networks (Table 2). This justifies their use in replacing the output logit of the underlying Transformers. Table 3 then highlights our core motivations for leveraging the signals from the KNNs: There are stark differences across instances as q\operatorname{\text{$q$}} increases and as the distance to training increases (shown here for Protein\operatorname{\textsc{Protein}}, but observed across tasks). In order to obtain calibration, coverage, or even similar point accuracies on datasets with proportionally more points with q<K\operatorname{\text{$q$}}<K, and/or far from training, we must control for changes in the proportions of these partitions.

Table 4 and Table 5 contain the main results. The Conformal Predictors RAPS and APS behave as advertised, obtaining marginal coverage (not shown) for in-distribution data. However, the additional adaptiveness of these approaches does not translate into reliable singleton set coverage. More specifically: The Conformal Predictors tend to only obtain singleton set coverage when the point accuracy of the model is 1α\geq 1-\alpha, including over in-distribution data. Only for the high-accuracy, in-distribution Sentiment\operatorname{\textsc{Sentiment}} task (Table 5) is adequate singleton set coverage obtained. For the in-distribution Protein\operatorname{\textsc{Protein}} task, coverage falls to the 70s70s for casp12 and the low 80s80s for ts115 for the y=Othery=\rm{\textsc{{O}ther}} class (Table 4). On the low-accuracy, class-imbalanced GrammarOOD\operatorname{\textsc{GrammarOOD}} task (Table 5), in which the minority class occurs with a proportion less than α\alpha, singleton set coverage for the minority class is very poor. Re-weighting the empirical CDF near a test point is not an adequate solution to obtain singleton set coverage. The Localconf\operatorname{\textsc{Local\textsubscript{conf}}} approach obtains coverage on the distribution-shifted SentimentOOD\operatorname{\textsc{SentimentOOD}} task (Table 5), but coverage is inadequate, as with simpler Conformal Predictors, for the GrammarOOD\operatorname{\textsc{GrammarOOD}} task. The stronger per-class Mondrian Conformal guarantee is also not sufficient to obtain singleton set coverage in practice. The ADMIT\operatorname{\textsc{ADMIT}} sets obtain less severe under-coverage on the GrammarOOD\operatorname{\textsc{GrammarOOD}} task compared to the marginal Conformal Predictors, but singleton set coverage is not clearly better on the in-distribution Protein\operatorname{\textsc{Protein}} task. The under-coverage of these approaches could come as a surprise to end-users. 
In this way, such split-conformal approaches are not ideal for instance-level decision making.

Fortunately, we can nonetheless satisfy our desiderata with distribution-free methods in a straightforward manner via Venn Predictors, recasting our goal in terms of calibration rather than coverage. The base approach with test-point up-weighting, Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}}, is well-calibrated across tasks. We further note that there is no cost to be paid on these datasets by rejecting points in all partitions other than those with q=K\operatorname{\text{$q$}}=K, as seen with Venn-ADMITqK\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}}}. That is, the admitted points are almost exclusively in the q=K\operatorname{\text{$q$}}=K partition. Comparing with the Venn-ADMIT-w\operatorname{\textsc{Venn-ADMIT-w}} and Venn-ADMITqK-w\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}-w}} ablations, in which calibration without up-weighting is only obtained by Venn-ADMITqK-w\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}-w}}, we recommend this restriction (i.e., using Venn-ADMITqK\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}}}) as the default approach in higher-risk settings, as an additional safeguard.

Table 4: Selective classification evaluation on Protein\operatorname{\textsc{Protein}} test sets.
|𝒞|=1|{\mathcal{C}}|=1 by Class Label (Amino-Acid/Token-Level Sequence Labeling)
y=Helixy=\rm{\textsc{{H}elix}} y=Strandy=\rm{\textsc{{S}trand}} y=Othery=\rm{\textsc{{O}ther}} y{HSO}y\in\{\rm{\textsc{{H}, {S}, {O}}}\}
Set Method y𝒞¯\overline{y\in{\mathcal{C}}} nN\frac{n}{N} y𝒞¯\overline{y\in{\mathcal{C}}} nN\frac{n}{N} y𝒞¯\overline{y\in{\mathcal{C}}} nN\frac{n}{N} y𝒞¯\overline{y\in{\mathcal{C}}} nN\frac{n}{N}
ts115 (N=29,704N=29,704)
RAPSsize\operatorname{\textsc{RAPS\textsubscript{size}}} 0.96 0.22 0.88 0.06 0.82 0.14 0.90 0.43
RAPSadapt\operatorname{\textsc{RAPS\textsubscript{adapt}}} 0.96 0.24 0.86 0.07 0.82 0.17 0.90 0.48
APS\operatorname{\textsc{APS}} 0.96 0.24 0.86 0.07 0.83 0.17 0.90 0.48
Localconf\operatorname{\textsc{Local\textsubscript{conf}}} 0.96 0.23 0.85 0.07 0.86 0.18 0.91 0.49
ADMIT\operatorname{\textsc{ADMIT}} 0.96 0.23 0.88 0.07 0.76 0.14 0.88 0.43
Venn-ADMIT-w\operatorname{\textsc{Venn-ADMIT-w}} 0.98 0.14 0.91 0.03 0.92 0.08 0.95 0.24
Venn-ADMITqK-w\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}-w}} 0.98 0.14 0.92 0.03 0.92 0.08 0.96 0.24
Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} 0.99 0.14 0.92 0.03 0.91 0.07 0.96 0.24
Venn-ADMITqK\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}}} 0.99 0.14 0.92 0.03 0.91 0.07 0.96 0.24
casp12 (N=7,256N=7,256)
RAPSsize\operatorname{\textsc{RAPS\textsubscript{size}}} 0.96 0.14 0.85 0.05 0.77 0.13 0.87 0.31
RAPSadapt\operatorname{\textsc{RAPS\textsubscript{adapt}}} 0.95 0.16 0.85 0.06 0.78 0.15 0.86 0.36
APS\operatorname{\textsc{APS}} 0.95 0.15 0.86 0.05 0.74 0.15 0.85 0.36
Localconf\operatorname{\textsc{Local\textsubscript{conf}}} 0.97 0.14 0.82 0.03 0.84 0.12 0.90 0.30
ADMIT\operatorname{\textsc{ADMIT}} 0.94 0.16 0.87 0.06 0.67 0.12 0.83 0.34
Venn-ADMIT-w\operatorname{\textsc{Venn-ADMIT-w}} 0.95 0.09 0.85 0.01 0.90 0.06 0.92 0.16
Venn-ADMITqK-w\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}-w}} 0.96 0.09 0.87 0.01 0.89 0.06 0.93 0.16
Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} 0.96 0.09 0.87 0.01 0.89 0.06 0.93 0.16
Venn-ADMITqK\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}}} 0.96 0.09 0.87 0.01 0.89 0.06 0.93 0.16
Table 5: Selective classification evaluation on distribution-shifted data. Sentiment\operatorname{\textsc{Sentiment}} (in-dist.) for contrast.
|𝒞|=1|{\mathcal{C}}|=1 by Class Label (Binary 𝐒𝐒𝐋\operatorname{\mathbf{SSL}} and 𝐃𝐂\operatorname{\mathbf{DC}})
y=0y=0 y=1y=1 y{0,1}y\in\{0,1\}
Set Method y𝒞¯\overline{y\in{\mathcal{C}}} nN\frac{n}{N} y𝒞¯\overline{y\in{\mathcal{C}}} nN\frac{n}{N} y𝒞¯\overline{y\in{\mathcal{C}}} nN\frac{n}{N}
SentimentOOD\operatorname{\textsc{SentimentOOD}} (N=4750N=4750)
f(x)trKNN\operatorname{\text{$f(x)_{\rm{tr}}^{\rm{KNN}}$}} (Acc.) 0.86 0.50 0.72 0.50 0.79 1.0
Confbase\operatorname{\textsc{Conf\textsubscript{base}}} 0.86 0.50 0.72 0.50 0.79 1.0
RAPSadapt\operatorname{\textsc{RAPS\textsubscript{adapt}}} 0.79 0.27 0.91 0.33 0.86 0.61
RAPSsize\operatorname{\textsc{RAPS\textsubscript{size}}} 0.75 0.50 0.80 0.50 0.78 1.00
APS\operatorname{\textsc{APS}} 0.80 0.28 0.91 0.33 0.86 0.61
Localconf\operatorname{\textsc{Local\textsubscript{conf}}} 0.90 0.10 0.96 0.18 0.94 0.28
ADMIT\operatorname{\textsc{ADMIT}} 0.96 0.16 0.93 0.18 0.94 0.33
Venn-ADMIT-w\operatorname{\textsc{Venn-ADMIT-w}} 0.96 0.14 0.94 0.17 0.94 0.31
Venn-ADMITqK-w\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}-w}} 0.96 0.14 0.94 0.17 0.94 0.31
Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} 0.96 0.13 0.94 0.17 0.95 0.29
Venn-ADMITqK\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}}} 0.96 0.13 0.94 0.17 0.95 0.29
Sentiment\operatorname{\textsc{Sentiment}} (N=488N=488)
f(x)trKNN\operatorname{\text{$f(x)_{\rm{tr}}^{\rm{KNN}}$}} (Acc.) 0.94 0.50 0.91 0.50 0.93 1.0
Confbase\operatorname{\textsc{Conf\textsubscript{base}}} 0.94 0.50 0.91 0.50 0.93 1.0
RAPSadapt\operatorname{\textsc{RAPS\textsubscript{adapt}}} 0.97 0.47 0.96 0.46 0.96 0.93
RAPSsize\operatorname{\textsc{RAPS\textsubscript{size}}} 0.94 0.50 0.91 0.50 0.93 1.00
APS\operatorname{\textsc{APS}} 0.96 0.47 0.95 0.46 0.96 0.92
Localconf\operatorname{\textsc{Local\textsubscript{conf}}} 0.95 0.43 0.94 0.44 0.95 0.88
ADMIT\operatorname{\textsc{ADMIT}} 0.95 0.42 0.93 0.40 0.94 0.82
Venn-ADMIT-w\operatorname{\textsc{Venn-ADMIT-w}} 0.94 0.38 0.93 0.40 0.94 0.78
Venn-ADMITqK-w\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}-w}} 0.94 0.38 0.93 0.40 0.94 0.78
Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} 0.94 0.37 0.94 0.40 0.94 0.76
Venn-ADMITqK\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}}} 0.94 0.37 0.94 0.40 0.94 0.76
GrammarOOD\operatorname{\textsc{GrammarOOD}} (N=92597N=92597)
f(x)trKNN\operatorname{\text{$f(x)_{\rm{tr}}^{\rm{KNN}}$}} (Acc.) 0.98 0.93 0.27 0.07 0.93 1.0
Confbase\operatorname{\textsc{Conf\textsubscript{base}}} 0.98 0.92 0.26 0.06 0.94 0.99
RAPSadapt\operatorname{\textsc{RAPS\textsubscript{adapt}}} 0.97 0.78 0.34 0.05 0.94 0.83
RAPSsize\operatorname{\textsc{RAPS\textsubscript{size}}} 0.97 0.79 0.34 0.05 0.94 0.84
APS\operatorname{\textsc{APS}} 0.97 0.79 0.34 0.05 0.94 0.83
Localconf\operatorname{\textsc{Local\textsubscript{conf}}} 1.00 0.85 0.19 0.05 0.95 0.91
ADMIT\operatorname{\textsc{ADMIT}} 0.93 0.20 0.77 0.02 0.92 0.22
Venn-ADMIT-w\operatorname{\textsc{Venn-ADMIT-w}} 1.00 0.11 0.75 0.01 0.98 0.12
Venn-ADMITqK-w\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}-w}} 1.00 0.08 0.89 <<0.01 0.99 0.09
Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} 1.00 0.05 0.92 <<0.01 0.99 0.05
Venn-ADMITqK\operatorname{\textsc{Venn-ADMIT\textsubscript{$qK$}}} 1.00 0.05 0.92 <<0.01 0.99 0.05

7 Conclusion

The finite-sample, distribution-free guarantees of Conformal Predictors are appealing; however, the coverage guarantee is too weak for typical classification use-cases. We have instead demonstrated that the key characteristics desired for prediction sets are achievable by calibrating weak selective classifiers with Venn Predictors, enabled by KNN approximations of the deep networks.

Reproducibility Statement

We will provide a link to our PyTorch code and replication scripts with the camera-ready version of the paper. The data and pre-trained weights of the underlying Transformers are publicly available and are further described for each experiment in the Appendix.

Ethics Statement

Uncertainty quantification is a cornerstone for trustworthy AI. We have demonstrated a principled approach for selective classification that satisfies our desiderata in challenging settings (low accuracy, class-imbalanced, distribution-shifted) under a stringent class-wise evaluation scenario. We have also shown that alternative existing distribution-free approaches do not produce the quantities needed in typical classification settings.

Whereas the use-cases for prediction sets with marginal coverage are relatively limited, the use-cases for reliable selective classification are numerous. For example, reliable class-conditional selective classification directly applies to routing to reduce overall computation (e.g., using small, fast models and only deferring to larger models for rejected predictions), and to higher-risk settings in which less confident predictions must be sent to humans for further adjudication.

An unusual and advantageous aspect of a Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} Predictor, and which further distinguishes it from post-hoc Platt-scaling-style calibration (Platt, 1999; Guo et al., 2017), is a degree of inherent example-based interpretability: The calibrated distribution for a point is a simple transformation of the empirical probability among similar points, with partitions determined by a KNN that can be readily inspected. This matching component yields a direct avenue for addressing group-wise fairness: Known group attributes can be incorporated as categories to ensure group-wise calibration.

Acknowledgments

We thank the organizers, reviewers, and participants of the Workshop on Distribution-Free Uncertainty Quantification at the Thirty-ninth International Conference on Machine Learning (ICML 2022) for feedback and suggestions on an earlier version of this work.

References

  • Angelopoulos & Bates (2021) Anastasios N. Angelopoulos and Stephen Bates. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. CoRR, abs/2107.07511, 2021. URL https://arxiv.org/abs/2107.07511.
  • Angelopoulos et al. (2021) Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty Sets for Image Classifiers using Conformal Prediction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=eNdiU_DbM9.
  • Berman et al. (2000) Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
  • Chelba et al. (2014) Ciprian Chelba, Tomás Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In Haizhou Li, Helen M. Meng, Bin Ma, Engsiong Chng, and Lei Xie (eds.), INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pp. 2635–2639. ISCA, 2014. URL http://www.isca-speech.org/archive/interspeech_2014/i14_2635.html.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Devroye et al. (1996) Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. In Stochastic Modelling and Applied Probability, 1996.
  • El-Gebali et al. (2019) Sara El-Gebali, Jaina Mistry, Alex Bateman, Sean R Eddy, Aurélien Luciani, Simon C Potter, Matloob Qureshi, Lorna J Richardson, Gustavo A Salazar, Alfredo Smart, Erik L L Sonnhammer, Layla Hirsh, Lisanna Paladin, Damiano Piovesan, Silvio C E Tosatto, and Robert D Finn. The Pfam protein families database in 2019. Nucleic Acids Research, 47(D1):D427–D432, 2019. ISSN 0305-1048. doi: 10.1093/nar/gky995. URL https://academic.oup.com/nar/article/47/D1/D427/5144153.
  • Guan (2022) Leying Guan. Localized Conformal Prediction: A Generalized Inference Framework for Conformal Prediction. Biometrika, 07 2022. ISSN 1464-3510. doi: 10.1093/biomet/asac040. URL https://doi.org/10.1093/biomet/asac040. asac040.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp.  1321–1330. JMLR.org, 2017.
  • Gupta & Ramdas (2022) Chirag Gupta and Aaditya Ramdas. Top-label calibration and multiclass-to-binary reductions. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=WqoBaaPHS-.
  • Johansson et al. (2018) Ulf Johansson, Tuve Löfström, and Håkan Sundell. Venn predictors using lazy learners. In The 2018 World Congress in Computer Science, Computer Engineering & Applied Computing, July 30-August 02, Las Vegas, Nevada, USA, pp.  220–226. CSREA Press, 2018.
  • Kaushik et al. (2020) Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. Learning The Difference That Makes A Difference With Counterfactually-Augmented Data. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Sklgs0NFvr.
  • Klausen et al. (2019) Michael Schantz Klausen, Martin Closter Jespersen, Henrik Nielsen, Kamilla Kjaergaard Jensen, Vanessa Isabell Jurtz, Casper Kaae Soenderby, Morten Otto Alexander Sommer, Ole Winther, Morten Nielsen, Bent Petersen, et al. Netsurfp-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins: Structure, Function, and Bioinformatics, 2019.
  • Kull et al. (2019) Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond Temperature Scaling: Obtaining Well-Calibrated Multiclass Probabilities with Dirichlet Calibration. Curran Associates Inc., Red Hook, NY, USA, 2019.
  • Lambrou et al. (2015) Antonis Lambrou, Ilia Nouretdinov, and Harris Papadopoulos. Inductive Venn Prediction. Annals of Mathematics and Artificial Intelligence, 74(1):181–201, 2015. doi: 10.1007/s10472-014-9420-z. URL https://doi.org/10.1007/s10472-014-9420-z.
  • Lei & Wasserman (2014) Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96, 2014. doi: https://doi.org/10.1111/rssb.12021. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12021.
  • Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
  • Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.  142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL https://aclanthology.org/P11-1015.
  • Moult et al. (2018) John Moult, Krzysztof Fidelis, Andriy Kryshtafovych, Torsten Schwede, and Anna Tramontano. Critical assessment of methods of protein structure prediction (CASP)-Round XII. Proteins: Structure, Function, and Bioinformatics, 86:7–15, 2018. ISSN 08873585. doi: 10.1002/prot.25415. URL http://doi.wiley.com/10.1002/prot.25415.
  • Papadopoulos et al. (2002) Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Proceedings of the 13th European Conference on Machine Learning, ECML’02, pp.  345–356, Berlin, Heidelberg, 2002. Springer-Verlag. ISBN 3540440364. doi: 10.1007/3-540-36755-1_29. URL https://doi.org/10.1007/3-540-36755-1_29.
  • Platt (1999) John C. Platt. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, pp.  61–74. MIT Press, 1999.
  • Podkopaev & Ramdas (2021) Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Cassio de Campos and Marloes H. Maathuis (eds.), Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pp. 844–853. PMLR, 27–30 Jul 2021. URL https://proceedings.mlr.press/v161/podkopaev21a.html.
  • Rao et al. (2019) Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, and Yun S Song. Evaluating Protein Transfer Learning with TAPE. In Advances in Neural Information Processing Systems, 2019.
  • Rei & Yannakoudakis (2016) Marek Rei and Helen Yannakoudakis. Compositional Sequence Labeling Models for Error Detection in Learner Writing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1181–1191, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1112. URL https://www.aclweb.org/anthology/P16-1112.
  • Romano et al. (2020) Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Classification with valid and adaptive coverage. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  • Rosenthal et al. (2017) Sara Rosenthal, Noura Farra, and Preslav Nakov. SemEval-2017 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp.  502–518, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2088. URL https://www.aclweb.org/anthology/S17-2088.
  • Sadinle et al. (2018) Mauricio Sadinle, Jing Lei, and Larry A. Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114:223 – 234, 2018.
  • Schmaltz (2021) Allen Schmaltz. Detecting Local Insights from Global Labels: Supervised and Zero-Shot Sequence Labeling via a Convolutional Decomposition. Computational Linguistics, 47(4):729–773, December 2021. doi: 10.1162/coli_a_00416. URL https://aclanthology.org/2021.cl-4.25.
  • Schmaltz & Beam (2020) Allen Schmaltz and Andrew Beam. Exemplar Auditing for Multi-Label Biomedical Text Classification. CoRR, abs/2004.03093, 2020. URL https://arxiv.org/abs/2004.03093.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30, pp.  6000–6010. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Vovk (2012) Vladimir Vovk. Conditional validity of inductive conformal predictors. In Steven C. H. Hoi and Wray Buntine (eds.), Proceedings of the Asian Conference on Machine Learning, volume 25 of Proceedings of Machine Learning Research, pp.  475–490, Singapore Management University, Singapore, 04–06 Nov 2012. PMLR. URL https://proceedings.mlr.press/v25/vovk12.html.
  • Vovk & Petej (2014) Vladimir Vovk and Ivan Petej. Venn-Abers Predictors. In UAI, 2014.
  • Vovk et al. (2003) Vladimir Vovk, Glenn Shafer, and Ilia Nouretdinov. Self-calibrating Probability Forecasting. In S. Thrun, L. Saul, and B. Schölkopf (eds.), Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003. URL https://proceedings.neurips.cc/paper/2003/file/10c66082c124f8afe3df4886f5e516e0-Paper.pdf.
  • Vovk et al. (2005) Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0387001522.
  • Vovk et al. (2015) Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/a9a1d5317a33ae8cef33961c34144f84-Paper.pdf.
  • Yannakoudakis et al. (2011) Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. A New Dataset and Method for Automatically Grading ESOL Texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.  180–189, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P11-1019.
  • Zeiler (2012) Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701, 2012. URL http://arxiv.org/abs/1212.5701.

Appendix A Appendix: Contents

Appendix B describes the KNN localizer, and Appendix C provides additional details for training the KNNs. Appendix D provides guidelines on controlling for—and conveying the variance of—the sample size. We provide pseudo-code in Appendix E. In Appendices F, G, and H, we provide additional details for each of the tasks.

Appendix B KNN localizer

We use a KNN localizer, against the calibration set, as a localized conformal (Guan, 2022) baseline of comparison, and to re-weight category assignments for the Venn-ADMIT\operatorname{\textsc{Venn-ADMIT}} Predictor. This KNN localizer recasts the test approximation output as a weighted linear combination over the calibration set approximations:

fc(xt)trKNNfc(xt)caKNN^=kargKminj{I+1,,I+|𝒟ca|}𝒓t𝒓j2ψkfc(xk)trKNN,whereK=|𝒟ca|\text{$f^{c}(x_{t})_{\rm{tr}}^{\rm{KNN}}$}\approx\text{$f^{c}(x_{t})_{\rm{ca}}^{\widehat{\rm{KNN}}}$}=\sum_{k\in\operatornamewithlimits{arg\,K\,min}\limits_{j\in\{I+1,\ldots,I+|{\mathcal{D}}_{\rm{ca}}|\}}||{\bm{r}}_{t}-{\bm{r}}_{j}||_{2}}\psi_{k}\cdot\text{$f^{c}(x_{k})_{\rm{tr}}^{\rm{KNN}}$},\text{where}~{}K=|{\mathcal{D}}_{\rm{ca}}| (8)

The single parameter, the temperature of ψk[0,1]\psi_{k}\in[0,1], which is calculated in a manner analogous to wkw_{k} in Equation 3, is trained via gradient descent to minimize prediction discrepancies between f(xt)trKNNf(x_{t})_{\rm{tr}}^{\rm{KNN}} and f(xt)caKNN^f(x_{t})_{\rm{ca}}^{\widehat{\rm{KNN}}}. As with f(x)trKNN\operatorname{\text{$f(x)_{\rm{tr}}^{\rm{KNN}}$}}, training is performed using 𝒟knn{\mathcal{D}}_{\rm{knn}}.
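The combination in Eq. 8 can be sketched as follows, under the assumption that the weights ψk take a temperature-scaled softmax form over the negative exemplar distances (Equation 3 is not reproduced here, so this particular form is our assumption):

```python
import numpy as np

def knn_localizer(r_test, r_ca, f_ca, temperature):
    """Eq. 8 sketch: approximate the test point's KNN output as a
    weighted linear combination of the calibration-set approximations
    f_ca (rows are per-point class outputs), weighted by distance in
    the exemplar space r.  Since K = |D_ca|, every calibration point
    participates, with weights decaying in distance."""
    d = np.linalg.norm(r_ca - r_test, axis=1)  # ||r_t - r_j||_2 for each j
    psi = np.exp(-d / temperature)
    psi = psi / psi.sum()                      # psi_k in [0, 1], summing to 1
    return psi @ f_ca, psi                     # localized output, localizer weights
```

The returned weights are reused below for test-point up-weighting, so the localizer serves double duty as a baseline and as a guard.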

As noted in the main text, we can use the weights from this approximation as a guard against distribution shifts within the data partitions. For a given test point, we calculate f(xt)caKNN^f(x_{t})_{\rm{ca}}^{\widehat{\rm{KNN}}} (Eq. 8), and then determine the augmented distribution (i.e., pc()p_{c}(\cdot), the empirical probability for a point when including the point itself, assuming a given label cc) for the Venn Predictor by adding the test point up-weighted according to the weights of this KNN localizer. Specifically, the new weight for the test point is as follows:

1ψ=1{j:xj𝒯}ψj,\frac{1}{\psi^{\prime}}=\frac{1}{\sum_{\{j:~{}x_{j}\in{\mathcal{T}}\}}\psi_{j}}~{}, (9)

where 𝒯{\mathcal{T}} is the set of calibration points belonging to the same category as xtx_{t}. When this weight is 1, we have the standard Venn Predictor; when this weight is greater than 1, it is a sign of a mismatch (due to a distribution shift, or an otherwise aberrant category assignment) and the minimum probability estimated by the Venn Predictor becomes smaller. 1ψ[1,)\frac{1}{\psi^{\prime}}\in[1,\infty) satisfies Prop. 3, so calibration of the selective classifier is maintained when using this weight to up-weight the test point.

Appendix C KNN training

We train f(x)trKNNf(x)_{\rm{tr}}^{\rm{KNN}} and f(x)caKNN^f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}} with the same learning procedure, the only difference being the underlying model that is approximated. Here, we take as oco^{c} the unnormalized output logit for class cc of the model to be approximated (either the memory layer\operatorname{\textsc{memory layer}} or f(x)trKNNf(x)_{\rm{tr}}^{\rm{KNN}}) and aca^{c} the unnormalized output logit for class cc of the approximation (either f(x)trKNNf(x)_{\rm{tr}}^{\rm{KNN}} or f(x)caKNN^f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}}). The binary cross-entropy loss for a token, tt, is then calculated as follows:

\mathcal{L}_{t} = \frac{1}{|{\mathcal{Y}}|}\sum_{c\in{\mathcal{Y}}} \left[ -\sigma(o^{c})\cdot\log\sigma\left(a^{c}\right) - (1-\sigma(o^{c}))\cdot\log\left(1-\sigma\left(a^{c}\right)\right) \right] \qquad (10)

That is, we seek to minimize the difference between the original model's output and the KNN's output for each class, holding the parameters of the original model fixed. $\mathcal{L}_{t}$ is averaged over the tokens in mini-batches constructed from shuffled documents. We train with Adadelta (Zeiler, 2012) with a learning rate of 1.0, choosing the epoch that minimizes

\delta^{\rm{KNN}} = \sum_{\text{dev}} \left[\, \operatorname*{arg\,max}_{c\in{\mathcal{Y}}} o^{c} \neq \operatorname*{arg\,max}_{c\in{\mathcal{Y}}} a^{c} \,\right], \qquad (11)

the total number of prediction discrepancies between the original model and the KNN approximation over the KNN dev set. During training, if the immediately preceding epoch did not yield a new minimum $\delta^{\rm{KNN}}$, we subsequently calculate $\mathcal{L}_{t}$ only for the tokens with prediction discrepancies, either until a new minimum $\delta^{\rm{KNN}}$ is found (after which we return to calculating the loss over all points) or until the maximum number of epochs is reached. This has a regularizing effect: there is signal in the magnitude of the KNN output, so we aim to optimize in the direction of minimizing the residuals, while avoiding over-fitting to the magnitude of the outliers.
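The three quantities above can be sketched together in plain Python. This is a minimal sketch for exposition (the actual training uses a deep-learning framework and gradient descent); the function names are our own.

```python
import math

# Plain-Python sketch of: the per-token distillation loss (Eq. 10), the
# dev-set discrepancy count (Eq. 11), and the iterative masking rule.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def token_loss(o, a, eps=1e-12):
    """Eq. 10: binary cross-entropy between the sigmoided teacher logits o
    and approximation logits a, averaged over the |Y| classes."""
    total = 0.0
    for o_c, a_c in zip(o, a):
        t, p = sigmoid(o_c), sigmoid(a_c)
        total += -t * math.log(p + eps) - (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(o)

def argmax(logits):
    return max(range(len(logits)), key=lambda c: logits[c])

def count_discrepancies(teacher, student):
    """Eq. 11: number of dev-set tokens whose predicted class differs
    between the original model and the KNN approximation."""
    return sum(1 for o, a in zip(teacher, student) if argmax(o) != argmax(a))

def loss_mask(teacher, student, found_new_min_last_epoch):
    """Masking rule: train on all tokens while delta^KNN is improving;
    otherwise, restrict the loss to the discrepant tokens."""
    if found_new_min_last_epoch:
        return [True] * len(teacher)
    return [argmax(o) != argmax(a) for o, a in zip(teacher, student)]
```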

A key insight is that we can readily approximate the vast majority of the predictions from the Transformer networks (and possibly other networks as well) using such KNN approximations; critically, when the approximations diverge from the model, those points tend to be from the subsets over which the underlying model is itself unreliable. This implies a non-homogeneous error distribution, and we find that the aforementioned procedure of iterative masking effectively learns the KNN parameters without the need for other regularization approaches. In practice, we find that a relatively small amount of data (e.g., only 10% of the original validation sets for the tasks in the main-text experiments) is sufficient to learn the small number of parameters of the KNNs.

Appendix D Controlling for sample size

Given a single sample from $P_{XY}$ (i.e., our single ${\mathcal{D}}_{\rm{ca}}$ of some fixed size), we need to convey the variance due to the observed sample size. We opt for a simple hard threshold, $\kappa$, given that the distribution of split-conformal coverage is Beta distributed (Vovk, 2012). With, for example, $\kappa=1000$ and assuming exchangeability, the finite-sample guarantee implies approximately $\pm 0.02$ coverage variation within a conditioning band of size $\geq 1000$ with $\alpha=0.1$ and $|{\mathcal{D}}_{\rm{te}}|=\infty$. See the comprehensive tutorial of Angelopoulos & Bates (2021) for additional details. In our experiments, if the size of at least one label-specific band for a given point falls below $\kappa$, we revert to a set of full cardinality. For the Protein, Sentiment, and SentimentOOD tasks, we set $\kappa=1000$. Given the low accuracy and low frequency of the minority class in the GrammarOOD task, the $q=K$ partition is comparatively small; as such, for the GrammarOOD task we set $\kappa=100$ to avoid heavily censoring the $q=K$ partition, at the expense of potentially higher variability.
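The $\pm 0.02$ figure can be checked numerically from the Beta form of split-conformal coverage. The following back-of-the-envelope sketch (ours, under exchangeability) uses the mean and variance of $\mathrm{Beta}(n+1-l,\,l)$ with $l=\lfloor(n+1)\alpha\rfloor$, taking two standard deviations as the rough coverage variation.

```python
import math

# Split-conformal coverage with n calibration points is distributed as
# Beta(n + 1 - l, l), l = floor((n + 1) * alpha) (Vovk, 2012). We report
# the mean and a two-standard-deviation spread of that Beta distribution.

def coverage_spread(n, alpha):
    l = math.floor((n + 1) * alpha)
    a, b = n + 1 - l, l
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, 2.0 * math.sqrt(var)
```

For $n=1000$ and $\alpha=0.1$, the spread is roughly $0.019$, consistent with the $\pm 0.02$ figure quoted above.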

Appendix E Pseudo-code

Algorithm 1 provides pseudo-code for constructing a well-calibrated selective classifier via the two-stage approach described in the text. First, Algorithm 2 constructs an ADMIT prediction set for a test point, $x_{t}$. If the set contains a single class (i.e., the weak selective classifier admitted the class), the output is calibrated via the Venn-ADMIT Predictor (Algorithm 3). The class prediction is then returned if the calibrated probability matches or exceeds the provided threshold, $1-\alpha$. As an additional safeguard, we can further restrict the partitions based on $q$, as described in Section 4.4.1.

In the main text, we also compare to a variation without the test-point weighting. This unweighted variation appears in Algorithm 4 and replaces the corresponding line in Algorithm 1.

Algorithm 1 Venn-ADMIT Selective Classification
1: Input: ${\mathcal{D}}_{\rm{ca}}$; ($x_{t}\in{\mathcal{D}}_{\rm{te}}$, $q_{t}$, $d_{t}$); band radius $\omega$; $f(x)_{\rm{tr}}^{\rm{KNN}}$; $\alpha$; localizer $f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}}$
2: procedure selective-classification(${\mathcal{D}}_{\rm{ca}}, x_{t}, q_{t}, d_{t}, \omega, f(x)_{\rm{tr}}^{\rm{KNN}}, \alpha$, localizer $f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}}$)
3:     $s\leftarrow\bot$ $\triangleright$ Reject option
4:     $\hat{{\mathcal{C}}}(x_{t})\leftarrow$ admit(${\mathcal{D}}_{\rm{ca}}, x_{t}, q_{t}, d_{t}, \omega, f(x)_{\rm{tr}}^{\rm{KNN}}, \alpha$) $\triangleright$ Mondrian Conformal prediction set (Alg. 2)
5:     if $|\hat{{\mathcal{C}}}(x_{t})|=1$ then
6:         $c^{\prime}\leftarrow c\in\hat{{\mathcal{C}}}(x_{t})$ $\triangleright$ Output of the weak selective classifier
7:         $\underline{p(c^{\prime})}\leftarrow$ weighted-venn-admit(${\mathcal{D}}_{\rm{ca}}, x_{t}, q_{t}, d_{t}, \omega, f(x)_{\rm{tr}}^{\rm{KNN}}, f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}}$)
8:         if $\underline{p(c^{\prime})}\geq 1-\alpha$ then
9:             $s\leftarrow c^{\prime}$
10: Output: $s$, selective classification (class prediction or reject option)
Algorithm 2 ADMIT Prediction Sets via Neural Model Approximations
1: Input: ${\mathcal{D}}_{\rm{ca}}$; ($x_{t}\in{\mathcal{D}}_{\rm{te}}$, $q_{t}$, $d_{t}$); band radius $\omega$; $f(x)_{\rm{tr}}^{\rm{KNN}}$; $\alpha$
2: procedure _threshold(${\mathcal{I}}^{\prime}, \alpha$) $\triangleright$ Standard split-conformal if ${\mathcal{I}}^{\prime}={\mathcal{D}}_{\rm{ca}}$
3:     ${\mathcal{S}}_{j}\leftarrow s(x_{j})=1-\hat{\pi}^{y}(x_{j})_{\rm{tr}}^{\rm{KNN}},\ \forall\, x_{j}\in{\mathcal{I}}^{\prime}$ $\triangleright$ Conformity scores over calibration subset
4:     $\hat{l}^{\alpha}\leftarrow\lceil(|{\mathcal{I}}^{\prime}|+1)(1-\alpha)\rceil/|{\mathcal{I}}^{\prime}|$ quantile of ${\mathcal{S}}$
5:     return $\hat{\tau}^{\alpha}\leftarrow 1-\hat{l}^{\alpha}$
6: procedure admit(${\mathcal{D}}_{\rm{ca}}, x_{t}, q_{t}, d_{t}, \omega, f(x)_{\rm{tr}}^{\rm{KNN}}, \alpha$)
7:     $\hat{{\mathcal{C}}}(x_{t})\leftarrow\{\hat{y}^{\rm{KNN}}_{t}\}$
8:     ${\mathcal{I}}\leftarrow{\mathcal{B}}(x_{t},\omega,q_{t},d_{t};{\mathcal{D}}_{\rm{ca}})$ $\triangleright$ Calibration points in band centered at $x_{t}$ (Eq. 6)
9:     for $c\in\{1,\ldots,C\}$ do
10:         ${\mathcal{I}}^{c}\leftarrow\{x_{j}: x_{j}\in{\mathcal{I}}, y_{j}=c\}$ $\triangleright$ Subset of band for which true class is $c$
11:         $\hat{\tau}^{\alpha}_{c}\leftarrow$ _threshold(${\mathcal{I}}^{c}, \alpha$)
12:         $\hat{{\mathcal{C}}}(x_{t})\leftarrow\hat{{\mathcal{C}}}(x_{t})\cup\{c:\hat{\pi}^{c}(x_{t})_{\rm{tr}}^{\rm{KNN}}\geq\hat{\tau}^{\alpha}_{c}\}$
13:     return $\hat{{\mathcal{C}}}(x_{t})$
14: Output: $\hat{{\mathcal{C}}}(x_{t})$, prediction set
Algorithm 3 Conservative Calibration via Inductive Venn-ADMIT Predictor (weighted)
1: Input: ${\mathcal{D}}_{\rm{ca}}$; ($x_{t}\in{\mathcal{D}}_{\rm{te}}$, $q_{t}$, $d_{t}$); band radius $\omega$; $f(x)_{\rm{tr}}^{\rm{KNN}}$; localizer $f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}}$
2: procedure _category(${\mathcal{I}}, x_{t}, y^{\prime}$) $\triangleright$ Same as in Alg. 4
3:     ${\mathcal{T}}\leftarrow\{(x_{t}, y^{\prime})\}$ $\triangleright$ $y^{\prime}$ is the assumed true label for $x_{t}$
4:     for $x_{j}\in{\mathcal{I}}$ do
5:         if $\hat{y}^{\rm{KNN}}_{t}=\hat{y}^{\rm{KNN}}_{j}\wedge\hat{{\mathcal{C}}}(x_{t})=\hat{{\mathcal{C}}}(x_{j})$ then $\triangleright$ $\hat{{\mathcal{C}}}$ calculated as in Alg. 2
6:             ${\mathcal{T}}\leftarrow{\mathcal{T}}\cup\{(x_{j}, y_{j})\}$ $\triangleright$ $y_{j}$ is the true label for $x_{j}$
7:     return ${\mathcal{T}}$
8: procedure weighted-venn-admit(${\mathcal{D}}_{\rm{ca}}, x_{t}, q_{t}, d_{t}, \omega, f(x)_{\rm{tr}}^{\rm{KNN}}, f(x)_{\rm{ca}}^{\widehat{\rm{KNN}}}$)
9:     ${\mathcal{I}}\leftarrow{\mathcal{B}}(x_{t},\omega,q_{t},d_{t};{\mathcal{D}}_{\rm{ca}})$ $\triangleright$ Calibration points in band centered at $x_{t}$ (Eq. 6)
10:     for $c\in\{1,\ldots,C\}$ do
11:         ${\mathcal{T}}\leftarrow$ _category(${\mathcal{I}}, x_{t}, c$) $\setminus\{(x_{t}, c)\}$ $\triangleright$ Exclude $x_{t}$ from the category
12:         $\psi^{\prime}\leftarrow\sum_{\{j:\, x_{j}\in{\mathcal{T}}\}}\psi_{j}$ $\triangleright$ Sum of weights from KNN localizer Eq. 8; $\psi^{\prime}\in(0,1]$
13:         for $c^{\prime}\in\{1,\ldots,C\}$ do
14:             $p_{c}(c^{\prime})\leftarrow\frac{|\{(x^{*},\,y^{*})\in{\mathcal{T}}:\, y^{*}=c^{\prime}\}|+[c=c^{\prime}]\cdot\frac{1}{\psi^{\prime}}}{|{\mathcal{T}}|+\frac{1}{\psi^{\prime}}}$ $\triangleright$ Test-point weighted empirical probability
15:     for $c^{\prime}\in\{1,\ldots,C\}$ do
16:         $\underline{p(c^{\prime})}\leftarrow\min_{c\in C} p_{c}(c^{\prime})$ $\triangleright$ Lower Venn probability for each class (across augmented distributions)
17:     return $\underline{p(\cdot)}$
18: Output: $\underline{p(\cdot)}$, lower Venn calibrated distribution for $x_{t}$.
Algorithm 4 Conservative Calibration via Inductive Venn-ADMIT Predictor (unweighted)
1: Input: ${\mathcal{D}}_{\rm{ca}}$; ($x_{t}\in{\mathcal{D}}_{\rm{te}}$, $q_{t}$, $d_{t}$); band radius $\omega$; $f(x)_{\rm{tr}}^{\rm{KNN}}$
2: procedure _category(${\mathcal{I}}, x_{t}, y^{\prime}$)
3:     ${\mathcal{T}}\leftarrow\{(x_{t}, y^{\prime})\}$ $\triangleright$ $y^{\prime}$ is the assumed true label for $x_{t}$
4:     for $x_{j}\in{\mathcal{I}}$ do
5:         if $\hat{y}^{\rm{KNN}}_{t}=\hat{y}^{\rm{KNN}}_{j}\wedge\hat{{\mathcal{C}}}(x_{t})=\hat{{\mathcal{C}}}(x_{j})$ then $\triangleright$ $\hat{{\mathcal{C}}}$ calculated as in Alg. 2
6:             ${\mathcal{T}}\leftarrow{\mathcal{T}}\cup\{(x_{j}, y_{j})\}$ $\triangleright$ $y_{j}$ is the true label for $x_{j}$
7:     return ${\mathcal{T}}$
8: procedure venn-admit(${\mathcal{D}}_{\rm{ca}}, x_{t}, q_{t}, d_{t}, \omega, f(x)_{\rm{tr}}^{\rm{KNN}}$)
9:     ${\mathcal{I}}\leftarrow{\mathcal{B}}(x_{t},\omega,q_{t},d_{t};{\mathcal{D}}_{\rm{ca}})$ $\triangleright$ Calibration points in band centered at $x_{t}$ (Eq. 6)
10:     for $c\in\{1,\ldots,C\}$ do
11:         ${\mathcal{T}}\leftarrow$ _category(${\mathcal{I}}, x_{t}, c$)
12:         for $c^{\prime}\in\{1,\ldots,C\}$ do
13:             $p_{c}(c^{\prime})\leftarrow\frac{|\{(x^{*},\,y^{*})\in{\mathcal{T}}:\, y^{*}=c^{\prime}\}|}{|{\mathcal{T}}|}$ $\triangleright$ Empirical probability
14:     for $c^{\prime}\in\{1,\ldots,C\}$ do
15:         $\underline{p(c^{\prime})}\leftarrow\min_{c\in C} p_{c}(c^{\prime})$ $\triangleright$ Lower Venn probability for each class (across augmented distributions)
16:     return $\underline{p(\cdot)}$
17: Output: $\underline{p(\cdot)}$, lower Venn calibrated distribution for $x_{t}$.
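As a concrete companion to the pseudo-code, the unweighted calibration of Algorithm 4 can be sketched in a few lines of plain Python. This is an illustrative sketch under our own data layout and names; the band construction (Eq. 6) and the ADMIT prediction sets (Alg. 2) are assumed given.

```python
# Sketch of Algorithm 4: each calibration point in the band is a tuple of
# (KNN-predicted label, ADMIT prediction set, true label). The Venn category
# is the subset of the band sharing the test point's predicted label AND
# prediction set; the lower Venn probability for each class is the minimum
# over the augmented distributions (one per assumed test-point label).

def venn_admit_unweighted(band, test_pred, test_set, num_classes):
    """band: list of (pred_label, prediction_set, true_label) tuples.
    Returns the lower Venn probability for each class."""
    category = [y for (pred, pset, y) in band
                if pred == test_pred and pset == test_set]
    lower = []
    for c_prime in range(num_classes):
        probs = []
        for c in range(num_classes):
            labels = category + [c]  # add the test point with assumed label c
            probs.append(sum(1 for y in labels if y == c_prime) / len(labels))
        lower.append(min(probs))
    return lower
```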

Appendix F Task: Protein secondary structure prediction (Protein\operatorname{\textsc{Protein}})

In the supervised sequence labeling Protein task, we seek to predict the secondary structure of proteins: for each amino acid, we predict one of three classes, $y\in\{\text{Helix},\text{Strand},\text{Other}\}$.

For training and evaluation, we use the TAPE datasets of Rao et al. (2019). (TAPE provides a standardized benchmark from existing models and data; El-Gebali et al., 2019; Berman et al., 2000; Moult et al., 2018; Klausen et al., 2019.) We approximate the Transformer of Rao et al. (2019), which is not SOTA on the task; while not degenerate, this fine-tuned self-supervised model was outperformed by models with HMM alignment-based input features in the original work. Of interest in the present work is whether coverage can be obtained with a neural model with otherwise relatively modest overall point accuracy. We use the publicly available model and pre-trained weights (https://github.com/songlab-cal/tape).

F.1 Memory layer

The base network consists of a pre-trained Transformer similar to BERT-base with a final convolutional classification layer comprising two 1-dimensional CNNs: the first, with 512 filters of width 5, is applied over the final hidden layer of the Transformer corresponding to each amino acid (each hidden layer is of size 768) and is followed by a ReLU; the second uses 3 filters of width 3. Batch normalization is applied before the first CNN, and weight normalization is applied to the output of each CNN. The application of the 3 filters of the final CNN produces the logits, $\mathbb{R}^{3}$, for each amino acid.

The memory layer consists of an additional 1-dimensional CNN with 1000 filters of width 1. The input to the memory layer corresponding to each amino acid is the concatenation of the final hidden layer of the Transformer, the output of the final CNN of the base network, and a randomly initialized 10-dimensional word embedding. The output of the CNN is passed to a LinearLayer of dimension 1000 by 3. (Unlike the sparse supervised sequence labeling task of GrammarOOD, we use neither a ReLU nor a max-pool operation. The sequences are very long in this setting (up to 1000 positions in training and 1632 at inference, to avoid truncation), so removing the max-pool bottleneck enables keeping the number of filters of the CNN lower than the total number of amino acids. Accordingly, we also do not use the decomposition of the CNN with the LinearLayer, as in the GrammarOOD task, since sparsity over the input is not needed for this task.) The exemplar vectors for the KNNs are then the ${\bm{r}}\in\mathbb{R}^{1000}$ filter applications of the CNN corresponding to each amino acid.
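Because the memory-layer filters have width 1, each filter application at an amino-acid position is just a dot product with that position's concatenated input vector, so the 1000 filter outputs at a position directly form its exemplar vector $\bm{r}\in\mathbb{R}^{1000}$. A toy sketch (our own names; the dimensions below are illustrative, not the model's):

```python
# Width-1 1-D convolution as a per-position linear map: for each position's
# input vector x, every filter w contributes one scalar <w, x> + b, and the
# stacked filter outputs form that position's exemplar vector.

def conv1d_width1(inputs, filters, biases):
    """inputs:  list of per-position vectors (each of length d)
    filters: weight vectors of length d, one per filter
    biases:  one bias per filter
    returns: one exemplar vector (len(filters) values) per position."""
    return [
        [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
         for w, b in zip(filters, biases)]
        for x in inputs
    ]
```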

We fine-tune the base network and train the memory layer in an iterative fashion: in each epoch we update either the base network or the memory layer, freezing the counterpart. We start by updating the base network (freezing the memory layer), and we use separate optimizers for each: Adadelta (Zeiler, 2012) with a learning rate of 1.0 for the memory layer, and Adam with weight decay (Loshchilov & Hutter, 2019) with a learning rate of 0.0001 and a warmup proportion of 0.01 for the base network. For the latter, we use the BertAdam code from the HuggingFace re-implementation of Devlin et al. (2019). We fine-tune for up to 16 epochs with a standard cross-entropy loss.

Appendix G Supervised grammatical error detection (GrammarOOD)

The GrammarOOD task is a binary sequence labeling task in which we aim to predict whether each word in the input does ($y_{t}=1$) or does not ($y_{t}=0$) have a grammatical error. ${\mathcal{D}}_{\rm{tr}}$ and ${\mathcal{D}}_{\rm{ca}}$ consist of essays written by second-language learners (Yannakoudakis et al., 2011; Rei & Yannakoudakis, 2016), and ${\mathcal{D}}_{\rm{te}}$ consists of student-written essays and newswire text (Chelba et al., 2014). The test set is the FCE+news2k set of Schmaltz (2021).

The test set is challenging for two reasons. First, the $y=1$ class comprises only a 0.07 proportion of all words. This is less than our default value for $\alpha$, with the implication that marginal coverage can potentially be obtained by ignoring that class altogether. Second, the in-domain task is itself relatively challenging, and it is made harder still by adding newswire text, as evident in the large $F_{0.5}$ score differences between ${\mathcal{D}}_{\rm{ca}}$ and ${\mathcal{D}}_{\rm{te}}$ in Table 2.
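The first point is worth making concrete with a line of arithmetic (ours): with the minority class at a 0.07 proportion and $\alpha=0.1$, even a degenerate predictor that always emits the singleton set $\{0\}$ meets the marginal coverage target while never covering the $y=1$ class.

```python
# With minority proportion p1 < alpha, always predicting the majority
# singleton set attains marginal coverage 1 - p1 >= 1 - alpha, despite
# zero class-conditional coverage of the minority class.

def marginal_coverage_of_always_majority(minority_proportion):
    """Marginal coverage of always emitting the singleton majority set."""
    return 1.0 - minority_proportion

alpha = 0.10
assert marginal_coverage_of_always_majority(0.07) >= 1.0 - alpha  # 0.93 >= 0.90
```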

The exemplar vectors, ${\bm{r}}\in\mathbb{R}^{1000}$, used in the KNNs are extracted from the filter applications of a penultimate CNN layer over a frozen BERT-large model, as in Schmaltz (2021).

Appendix H Tasks: Sentiment classification (Sentiment) and out-of-domain sentiment classification (SentimentOOD)

Sentiment and SentimentOOD are document-level binary classification tasks in which we aim to predict whether a document is of negative ($y=0$) or positive ($y=1$) sentiment. The training and calibration sets, as well as the base networks, are the same for both tasks; the two tasks differ only in their test sets. The training set is the 3.4k IMDb movie review set used in Kaushik et al. (2020) from the data of Maas et al. (2011). For calibration, we use a disjoint 16k set of reviews from the original training set of Maas et al. (2011). The test set of Sentiment is the 488-review in-domain test set of original reviews used in Kaushik et al. (2020), and the test set of SentimentOOD consists of 5k Twitter messages from SemEval-2017 Task 4a (Rosenthal et al., 2017).

Similar to the GrammarOOD task, the exemplar vectors, ${\bm{r}}\in\mathbb{R}^{2000}$, are derived from the filter applications of a penultimate CNN layer over a frozen BERT-large model. However, in this case, the vectors are the concatenation of the document-level max-pooled vector, ${\bm{r}}\in\mathbb{R}^{1000}$, and the vector associated with a single representative token in the document, ${\bm{r}}\in\mathbb{R}^{1000}$. To achieve this, we model the task as multi-label classification and fine-tune the penultimate-layer CNN and a final layer consisting of two linear layers with the combined min-max and global normalization loss of Schmaltz & Beam (2020). In this way, we can associate each word with one of (or in principle, both of) positive and negative sentiment, or a neutral class, while nonetheless having a single exclusive global prediction. This provides sparsity over the detected features and captures the notion that a document may, in totality, represent one class (e.g., indicate a positively rated movie overall) while at the same time including sentences or phrases of the opposite class (e.g., aspects that the reviewer rated negatively). This behavior is illustrated with examples from the calibration set in Table 6. We use the max-scoring word from the “convolutional decomposition”, a hard-attention-style approach, for the document-level predicted class as the single representative word for the document. For the document-level prediction, we take the max over the multi-label logits, which combine the global and max local scores.
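The final decision rule can be sketched roughly as follows. This is our simplification for exposition only: the exact score combination follows Schmaltz & Beam (2020), and the function name and additive combination below are our own assumptions.

```python
# Sketch of the document-level decision: combine global logits with the max
# over per-word (local) logits, predict the argmax class, and take the
# max-scoring word for that class as the representative word.

def document_prediction(global_logits, word_logits):
    """global_logits: per-class document logits, e.g. [neg, pos]
    word_logits: per-word lists of per-class logits
    returns (predicted_class, representative_word_index)"""
    num_classes = len(global_logits)
    max_local = [max(w[c] for w in word_logits) for c in range(num_classes)]
    combined = [g + m for g, m in zip(global_logits, max_local)]
    pred = max(range(num_classes), key=lambda c: combined[c])
    rep_word = max(range(len(word_logits)), key=lambda i: word_logits[i][pred])
    return pred, rep_word
```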

Table 6: Model feature detections from snippets from ${\mathcal{D}}_{\rm{ca}}$ for Sentiment and SentimentOOD, for which prediction sets are constructed for the binary document-level predictions. Most documents have features of only a single class detected (as in the example in the final row), but our modeling choice (Section H) does enable multi-label detection, as in the first example, for which the true document label is positive sentiment, and the second example, for which the true document label is negative sentiment. The max-scoring word for each document is underlined.
Model predictions over ${\mathcal{D}}_{\rm{ca}}$
What an amazing film. [...] My only gripe is that it has not been released on video in Australia and is therefore only available on TV. What a waste.
[...] But the story that then develops lacks any of the stuff that these opening fables display. [...] I will say that the music by Aimee Mann was great and I’ll be looking for the Soundtrack CD. [...]
Kenneth Branagh shows off his excellent skill in both acting and writing in this deep and thought provoking interpretation of Shakespeare’s most classic and well-written tragedy. [...]