
Distribution-Free Conditional Median Inference

Dhruv Medarametla Department of Statistics, Stanford University    Emmanuel Candès Department of Statistics and Mathematics, Stanford University
(September 1, 2021)
Abstract

We consider the problem of constructing confidence intervals for the median of a response $Y\in\mathbb{R}$ conditional on features $X\in\mathbb{R}^{d}$ in a situation where we are not willing to make any assumption whatsoever on the underlying distribution of the data $(X,Y)$. We propose a method based upon ideas from conformal prediction and establish a theoretical guarantee of coverage while also going over particular distributions where its performance is sharp. Additionally, we prove an equivalence between confidence intervals for the conditional median and confidence intervals for the response variable, resulting in a lower bound on the length of any possible conditional median confidence interval. This lower bound is independent of sample size and holds for all distributions with no point masses.

1 Introduction

Consider a dataset $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\subseteq\mathbb{R}^{d}\times\mathbb{R}$ and a test point $(X_{n+1},Y_{n+1})$, with all datapoints being drawn i.i.d. from the same distribution $P$. Given our training data, can we provide a confidence interval for the expected value $\mu(X_{n+1})=\mathbb{E}[Y_{n+1}|X_{n+1}]$?

Methods for inferring the conditional mean are certainly not in short supply. In fact, most existing methods predict not just the conditional mean at a datapoint, $\mathbb{E}[Y_{n+1}|X_{n+1}]$, but the full conditional mean function $\mathbb{E}[Y|X=x]$ for all $x\in\mathbb{R}^{d}$. To the best of our knowledge, however, each approach relies on some assumptions in order to guarantee coverage. For instance, the classical linear regression model is often used but is only accurate if $Y|X$ is normal with mean $\mu(x)=\mathbb{E}[Y|X=x]$ affine in $x$ and standard deviation independent of $x$. Nonparametric regressions cannot estimate the conditional mean without imposing smoothness conditions and assuming that the conditional distribution is sufficiently light tailed. Since reliable conditional mean inference is a very common problem, these methods are nevertheless used all the time, e.g. in predicting disease survival times, classifying spam emails, pricing financial assets, and more. The issue is that the assumptions these methods make rarely hold in practice. Thus, the question remains: is it possible to estimate the conditional mean at a datapoint in a distribution-free setting, with no assumptions on $P$?

It turns out that it is not only impossible to get a nontrivial confidence interval for the conditional mean $\mathbb{E}[Y|X=X_{n+1}]$, but it is actually impossible to get a confidence interval for $\mathbb{E}[Y]$ itself. This result originates in Bahadur and Savage (1956), where the authors show that any parameter sensitive to the tails of a distribution cannot be estimated when no restrictions exist on the distribution class; an example of a distribution with a non-estimable mean is given in Appendix B.3.

Thus, within the distribution-free setting, making progress on inferring the conditional mean requires a modification to the problem statement. One strategy is to restrict the range of $Y$. An example of this is in Barber (2020), which introduces an algorithm that calculates a confidence interval for the conditional mean in the case where $Y\in\{0,1\}$. However, even with this restriction, Barber shows that there exists a fundamental bound limiting how small any such confidence interval can be.

The other strategy is to modify the measure of central tendency that we study. Bahadur and Savage’s result suggests that the best parameters to study are robust to distribution outliers; this observation motivates our investigation of the conditional median.

In a nutshell, the conditional median is possible to infer because of the strong and quantifiable relationship between any particular sampled datapoint $(X_{i},Y_{i})$ and $\textrm{Median}(Y|X=X_{i})$. Its robustness to outliers means that even within the distribution-free setting, there is no need to worry about ‘hidden’ parts of the distribution. Additionally, there already exists a well-known algorithm for estimating $\textrm{Median}(Y)$ given a finite number $n$ of i.i.d. samples. Explored in Noether (1972) and covered in Appendix B.4, this algorithm produces intervals with guaranteed rates of coverage and widths going to zero as the sample size goes to infinity, suggesting that an algorithm for the conditional median might also perform well. We note that there already exists literature dealing with estimating the conditional quantile through quantile regression; Koenker (2005) provides an introduction to different methods. The difference between quantile regression and our work here is that quantile regression requires continuity conditions and only provides theoretical guarantees at the asymptotic level, as seen in Belloni et al. (2019), Takeuchi et al. (2006), and Oberhofer and Haupt (2016); the methods described here provide coverage guarantees on finite sample sizes and require no assumptions.
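To make the classical, unconditional construction concrete, here is a minimal Python sketch of an order-statistic interval for $\textrm{Median}(Y)$ in the spirit of Noether (1972); the function name and implementation details are ours, and the exact construction in Appendix B.4 may differ.

```python
import numpy as np
from scipy.stats import binom

def median_ci(y, alpha=0.1):
    """Distribution-free interval for Median(Y) from i.i.d. samples y_1, ..., y_n.

    The number of samples at or below the median is stochastically at least
    Binomial(n, 1/2), so choosing symmetric order statistics whose binomial tail
    mass is at most alpha/2 on each side gives coverage >= 1 - alpha.
    """
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    # largest k with P(Bin(n, 1/2) <= k) <= alpha/2  (0-based lower index)
    k = int(binom.ppf(alpha / 2, n, 0.5))
    if binom.cdf(k, n, 0.5) > alpha / 2:
        k -= 1
    if k < 0:
        return -np.inf, np.inf  # too few samples for a finite interval at this level
    return y[k], y[n - 1 - k]

# Example: the interval tightens around the true median as n grows.
rng = np.random.default_rng(0)
print(median_ci(rng.standard_normal(100)))
print(median_ci(rng.standard_normal(10_000)))
```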

Our goal in this paper is to combine ideas from regular median inference with procedures from distribution-free inference in order to understand how well an algorithm can cover the conditional median and, more generally, conditional quantiles. In particular, we want to see if the properties of the median and quantiles lead to a valid inference method while also examining the limits of this inference.

It is important to note that the quantity we are attempting to study is $\textrm{Median}(Y|X=X_{n+1})$, not $\textrm{Median}(Y|X=x)$. The first term is a random variable dependent on $X_{n+1}$, whereas the second is fixed and exists for all $x$ in the support of $P$. We focus on the first quantity because we are in the distribution-free setting; as we know nothing about the class of distributions that $P$ belongs to, it is more tractable to make inferences about datapoints as opposed to pointwise across the full distribution.

Another way to frame this is to think of our goal as covering the value of the conditional median function when weighting by the marginal distribution $P_{X}$, as opposed to covering the full conditional median function over all $x\in\mathbb{R}^{d}$. Because we are predicting the conditional median at an unknown value $X_{n+1}$, our success is not measured by whether or not we predict the conditional median correctly at all possible $x\in\mathbb{R}^{d}$, but by how often we predict the conditional median correctly across $P_{X}$.

The methods used in this paper are similar to those from distribution-free predictive inference, which focuses on predicting $Y_{n+1}$ from a finite training set. The field of conformal predictive inference began with Vovk et al. (2005) and was built up by works such as Shafer and Vovk (2008) and Vovk et al. (2009); it has been generating interest recently due to its versatility and lack of assumptions. Applications of conformal predictive inference range from criminal justice to drug discovery, as seen in Romano et al. (2020) and Cortés-Ciriano and Bender (2019) respectively.

While this paper relies on techniques from predictive inference, our focus is on parameter inference, which is quite different from prediction because it focuses on predicting a function of the conditional distribution $Y_{n+1}|X_{n+1}$ as opposed to the datapoint $Y_{n+1}$. For example, using sample datapoints from an unknown normal distribution to estimate the true mean of the distribution would constitute parameter inference; using the estimated quantiles of the datapoints to create a predictive confidence interval for the next datapoint would constitute predictive inference. Whereas predictive inference exploits the fact that $(X_{n+1},Y_{n+1})$ is exchangeable with the sample datapoints, parameter inference requires another layer of analysis, as $(X_{n+1},\textrm{Median}(Y_{n+1}|X_{n+1}))$ is not exchangeable with $(X_{i},Y_{i})$. This additional complexity demands modifying approaches from predictive inference to produce valid parameter inference.

1.1 Terminology

We begin by setting up definitions to formalize the concepts above. Throughout this paper, we assume that any distribution $(X,Y)\sim P$ is over $\mathbb{R}^{d}\times\mathbb{R}$ unless explicitly stated otherwise. Each result in this paper holds true for all values of $d$ and $n$.

Given a feature vector $X_{n+1}$, we let $\hat{C}_{n}(X_{n+1})\subseteq\mathbb{R}$ denote a confidence interval for some functional of the conditional distribution $Y_{n+1}|X_{n+1}$. Note that we use the phrase confidence interval for convenience; in its most general form, $\hat{C}_{n}(X_{n+1})$ is a subset of $\mathbb{R}$. This interval is a function of the point $X_{n+1}$ at which we seek inference as well as of our training data $\mathcal{D}=\{(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\}$. We write $\hat{C}_{n}$ to refer to the general algorithm that maps $\mathcal{D}$ to the resulting confidence intervals $\hat{C}_{n}(x)$ for each $x\in\mathbb{R}^{d}$.

In order for $\hat{C}_{n}$ to be useful, we want it to capture, or contain, the parameter we care about with high probability. We formalize this as follows:

Definition 1.

We say that $\hat{C}_{n}$ satisfies distribution-free median coverage at level $1-\alpha$, denoted by $(1-\alpha)$-Median, if

$$\mathbb{P}\big\{\textrm{Median}(Y_{n+1}|X_{n+1})\in\hat{C}_{n}(X_{n+1})\big\}\geq 1-\alpha \textnormal{ for all distributions } P \textnormal{ on } (X,Y)\in\mathbb{R}^{d}\times\mathbb{R}.$$
Definition 2.

For $0<q<1$, let $\textrm{Quantile}_{q}(Y_{n+1}|X_{n+1})$ refer to the $q$th quantile of the conditional distribution $Y_{n+1}|X_{n+1}$. We say that $\hat{C}_{n}$ satisfies distribution-free quantile coverage for the $q$th quantile at level $1-\alpha$, denoted by $(1-\alpha,q)$-Quantile, if

$$\mathbb{P}\{\textrm{Quantile}_{q}(Y_{n+1}|X_{n+1})\in\hat{C}_{n}(X_{n+1})\}\geq 1-\alpha \textnormal{ for all distributions } P \textnormal{ on } (X,Y)\in\mathbb{R}^{d}\times\mathbb{R}.$$

The probabilities in Definitions 1 and 2 are both taken over the training data $\mathcal{D}=\{(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\}$ and test point $X_{n+1}$. Thus, satisfying $(1-\alpha)$-Median is equivalent to satisfying $(1-\alpha,0.5)$-Quantile.

These concepts are similar to predictive coverage, with the key difference being that our goal is now to predict a function of $X_{n+1}$ rather than a new datapoint $Y_{n+1}$.

Definition 3.

We say that $\hat{C}_{n}$ satisfies distribution-free predictive coverage at level $1-\alpha$, denoted by $(1-\alpha)$-Predictive, if

$$\mathbb{P}\big\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\big\}\geq 1-\alpha \textnormal{ for all distributions } P \textnormal{ on } (X,Y)\in\mathbb{R}^{d}\times\mathbb{R},$$

where this probability is taken over the training data $\mathcal{D}=\{(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\}$ and test point $(X_{n+1},Y_{n+1})$.

Finally, we define the type of conformity scores that we will use in our general quantile inference algorithm.

Definition 4.

We say that a function $f:(\mathbb{R}^{d},\mathbb{R})\to\mathbb{R}$ is a locally nondecreasing conformity score if, for all $x\in\mathbb{R}^{d}$ and $y,y^{\prime}\in\mathbb{R}$ with $y\leq y^{\prime}$, we have $f(x,y)\leq f(x,y^{\prime})$.

1.2 Summary of Results

We find that there exists a distribution-free predictive inference algorithm that satisfies both $(1-\alpha/2)$-Predictive and $(1-\alpha)$-Median. Moreover, an improved version of this algorithm also satisfies $(1-\alpha,q)$-Quantile. Together, these prove that there exist nontrivial algorithms $\hat{C}_{n}$ that satisfy $(1-\alpha)$-Median and $(1-\alpha,q)$-Quantile for all $0<q<1$.

We go on to show that conditional median inference and predictive inference are nearly equivalent problems. Specifically, we show that any algorithm that contains $\textrm{Median}(Y_{n+1}|X_{n+1})$ with probability $1-\alpha$ must also contain $Y_{n+1}$ with probability at least $1-\alpha$, and that any algorithm that contains $Y_{n+1}$ with probability at least $1-\alpha/2$ must contain $\textrm{Median}(Y_{n+1}|X_{n+1})$ with probability $1-\alpha$.

Taken together, these results give us somewhat conflicting perspectives. On the one hand, there exist distribution-free algorithms that capture the conditional median and conditional quantile with high likelihood; on the other hand, any such algorithm will also capture a large proportion of the distribution itself, putting a hard limit on how well such algorithms can ever perform.

2 Confidence Intervals for the Conditional Median

This section proves the existence of algorithms obeying distribution-free median and quantile coverage. We then focus on situations where these algorithms are sharp.

2.1 Basic Conditional Median Inference

Algorithm 1 below operates by taking the training dataset and separating it into two halves of sizes $n_{1}+n_{2}=n$. Next, a regression algorithm $\hat{\mu}$ is trained on $\mathcal{D}_{1}=\{(X_{1},Y_{1}),\ldots,(X_{n_{1}},Y_{n_{1}})\}$. The residuals $Y_{i}-\hat{\mu}(X_{i})$ are calculated for $n_{1}<i\leq n$, and the $1-\alpha/2$ quantile of the absolute value of these residuals is then used to create a confidence band around the prediction $\hat{\mu}(X_{n+1})$. The expert will recognize that this is identical to a well-known algorithm from predictive inference, as explained later.

Input:
  Number of i.i.d. datapoints $n\in\mathbb{N}$.
  Split sizes $n_{1}+n_{2}=n$.
  Datapoints $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\sim P\subseteq(\mathbb{R}^{d},\mathbb{R})$.
  Test point $X_{n+1}\sim P$.
  Regression algorithm $\hat{\mu}$.
  Coverage level $1-\alpha\in(0,1)$.
Process:
  Randomly split $\{1,\ldots,n\}$ into disjoint $\mathcal{I}_{1}$ and $\mathcal{I}_{2}$ with $|\mathcal{I}_{1}|=n_{1}$ and $|\mathcal{I}_{2}|=n_{2}$.
  Fit regression function $\hat{\mu}:\mathbb{R}^{d}\to\mathbb{R}$ on $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$.
  For $i\in\mathcal{I}_{2}$ set $E_{i}=|Y_{i}-\hat{\mu}(X_{i})|$.
  Compute $Q_{1-\alpha/2}(E)$, the $(1-\alpha/2)(1+1/n_{2})$-th empirical quantile of $\{E_{i}:i\in\mathcal{I}_{2}\}$.
Output:
  Confidence interval $\hat{C}_{n}(X_{n+1})=[\hat{\mu}(X_{n+1})-Q_{1-\alpha/2}(E),\hat{\mu}(X_{n+1})+Q_{1-\alpha/2}(E)]$ for $\textrm{Median}(Y_{n+1}|X_{n+1})$.
Algorithm 1: Confidence Interval for $\textrm{Median}(Y_{n+1}|X_{n+1})$ with Coverage $1-\alpha$
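For concreteness, a minimal Python sketch of Algorithm 1 might look as follows; it assumes a scikit-learn-style regressor with fit/predict methods, and the function name fit_median_interval is ours.

```python
import numpy as np

def fit_median_interval(X, Y, X_test, regressor, alpha=0.1, n1=None, rng=None):
    """Minimal sketch of Algorithm 1: split conformal interval for Median(Y_{n+1} | X_{n+1})."""
    rng = np.random.default_rng(rng)
    X, Y = np.asarray(X), np.asarray(Y)
    n = len(Y)
    n1 = n // 2 if n1 is None else n1
    perm = rng.permutation(n)
    idx1, idx2 = perm[:n1], perm[n1:]                     # I_1 for fitting, I_2 for calibration

    regressor.fit(X[idx1], Y[idx1])                       # fit mu_hat on the first split
    resid = np.abs(Y[idx2] - regressor.predict(X[idx2]))  # E_i = |Y_i - mu_hat(X_i)|

    n2 = len(idx2)
    level = (1 - alpha / 2) * (1 + 1 / n2)                # conformal quantile level
    if level >= 1:
        q = np.inf                                        # too little calibration data: infinite interval
    else:
        q = np.quantile(resid, level, method="higher")

    mu = regressor.predict(np.asarray(X_test))
    return np.column_stack([mu - q, mu + q])              # one [lower, upper] row per test point
```

Any estimator can be plugged in for `regressor`, e.g. `sklearn.linear_model.LinearRegression()`; Theorem 1's guarantee does not depend on how well it fits.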
Theorem 1.

For all distributions $P$, all regression algorithms $\hat{\mu}$, and all split sizes $n_{1}+n_{2}=n$, the output of Algorithm 1 contains $\textrm{Median}(Y_{n+1}|X_{n+1})$ with probability at least $1-\alpha$. That is, the algorithm satisfies $(1-\alpha)$-Median.

The proof of Theorem 1 is covered in Appendix A.1.

Remark 2.1.

Algorithm 1 works independently of how $\hat{\mu}$ is trained; this means that any regression function may be used, from simple linear regression to more complicated machine learning algorithms. It is important to note that $\hat{\mu}(x)$ does not need to be an estimate of the true conditional mean $\mu(x)$; while one option is to train it to predict the conditional mean, it can be fit to predict the conditional median, conditional quantile, or any other measure of central tendency. The best choice for what to fit $\hat{\mu}$ to may depend on one’s underlying belief about the distribution.

We return at last to the connection with predictive inference. Introduced in Papadopoulos et al. (2002) and Vovk et al. (2005) and studied in Lei et al. (2013), Barber et al. (2021b), Romano et al. (2020), and several other papers, the split conformal method was initially created to achieve distribution-free predictive coverage guarantees. In particular, Vovk et al. (2005) shows that Algorithm 1 satisfies $(1-\alpha/2)$-Predictive, implying that in order to capture $\textrm{Median}(Y_{n+1}|X_{n+1})$ with probability $1-\alpha$, our algorithm produces a wider confidence interval than an algorithm trying to capture $Y_{n+1}$ with the same probability.

2.2 General Conditional Quantile Inference

Algorithm 1 is a good first step towards a usable method for conditional median inference; however, it may be too rudimentary to be used in practice. Algorithm 2 is a more general version of Algorithm 1 that results in conditional quantile coverage and better empirical performance. This provides a better understanding of how diverse parameter inference algorithms can be.

Input:
  Number of i.i.d. datapoints $n\in\mathbb{N}$.
  Split sizes $n_{1}+n_{2}=n$.
  Datapoints $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\sim P\subseteq(\mathbb{R}^{d},\mathbb{R})$.
  Test point $X_{n+1}\sim P$.
  Locally nondecreasing conformity score algorithms $f^{\textnormal{lo}}$ and $f^{\textnormal{hi}}$.
  Quantile level $q\in(0,1)$.
  Coverage level $1-\alpha\in(0,1)$.
  Split probabilities $r+s=\alpha$.
Process:
  Randomly split $\{1,\ldots,n\}$ into disjoint $\mathcal{I}_{1}$ and $\mathcal{I}_{2}$ with $|\mathcal{I}_{1}|=n_{1}$ and $|\mathcal{I}_{2}|=n_{2}$.
  Fit conformity scores $f^{\textnormal{lo}},f^{\textnormal{hi}}:(\mathbb{R}^{d},\mathbb{R})\to\mathbb{R}$ on $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$.
  For $i\in\mathcal{I}_{2}$ set $E_{i}^{\textnormal{lo}}=f^{\textnormal{lo}}(X_{i},Y_{i})$ and $E_{i}^{\textnormal{hi}}=f^{\textnormal{hi}}(X_{i},Y_{i})$.
  Compute $Q_{rq}^{\textrm{lo}}(E)$, the $rq(1+1/n_{2})-1/n_{2}$ empirical quantile of $\{E_{i}^{\textnormal{lo}}:i\in\mathcal{I}_{2}\}$, and $Q_{1-s(1-q)}^{\textrm{hi}}(E)$, the $(1-s(1-q))(1+1/n_{2})$ empirical quantile of $\{E_{i}^{\textnormal{hi}}:i\in\mathcal{I}_{2}\}$.
Output:
  Confidence interval $\hat{C}_{n}(X_{n+1})=\{y:Q_{rq}^{\textrm{lo}}(E)\leq f^{\textnormal{lo}}(X_{n+1},y),\ f^{\textnormal{hi}}(X_{n+1},y)\leq Q_{1-s(1-q)}^{\textrm{hi}}(E)\}$ for $\textrm{Quantile}_{q}(Y_{n+1}|X_{n+1})$.
Algorithm 2: Confidence Interval for $\textrm{Quantile}_{q}(Y_{n+1}|X_{n+1})$ with Coverage $1-\alpha$

Algorithm 2 differs from Algorithm 1 in two ways. First, we use the $rq$ quantile of the lower scores to create the confidence interval’s lower bound, and the $1-s(1-q)$ quantile of the upper scores (corresponding to the top $s(1-q)$ of the score distribution) for the upper bound. Second, the functions we fit are no longer regression functions, but instead locally nondecreasing conformity scores. These scores are described in Definition 4; see Remark 2.3 for examples.
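A minimal Python sketch of Algorithm 2 for a single test point could look as follows; the names quantile_interval and fit_scores are ours, and tracing the output set on a grid of candidate $y$ values is an illustrative shortcut (for the residual-type scores of Remark 2.3, the interval can instead be inverted in closed form).

```python
import numpy as np

def quantile_interval(X, Y, x_test, fit_scores, q=0.5, alpha=0.1, r=None, s=None,
                      y_grid=None, rng=None):
    """Sketch of Algorithm 2: confidence interval for Quantile_q(Y | X = x_test).

    `fit_scores` takes the fitting split and returns a pair of locally nondecreasing
    score functions (f_lo(x, y), f_hi(x, y)).
    """
    rng = np.random.default_rng(rng)
    r = alpha / 2 if r is None else r
    s = alpha - r if s is None else s

    n = len(Y)
    perm = rng.permutation(n)
    idx1, idx2 = perm[: n // 2], perm[n // 2:]            # even split for simplicity
    f_lo, f_hi = fit_scores(X[idx1], Y[idx1])

    e_lo = np.array([f_lo(X[i], Y[i]) for i in idx2])     # lower calibration scores
    e_hi = np.array([f_hi(X[i], Y[i]) for i in idx2])     # upper calibration scores
    n2 = len(idx2)

    lo_level = np.clip(r * q * (1 + 1 / n2) - 1 / n2, 0.0, 1.0)
    hi_level = np.clip((1 - s * (1 - q)) * (1 + 1 / n2), 0.0, 1.0)
    q_lo = np.quantile(e_lo, lo_level, method="lower")
    q_hi = np.quantile(e_hi, hi_level, method="higher")

    if y_grid is None:
        y_grid = np.linspace(Y.min(), Y.max(), 2000)
    keep = [y for y in y_grid if q_lo <= f_lo(x_test, y) and f_hi(x_test, y) <= q_hi]
    return (min(keep), max(keep)) if keep else None       # an interval, since scores are monotone in y
```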

Theorem 2.

For all distributions $P$, all locally nondecreasing conformity scores $f^{\textnormal{lo}}$ and $f^{\textnormal{hi}}$, all split sizes $n_{1}+n_{2}=n$, and all $0<q<1$, the output of Algorithm 2 contains $\textrm{Quantile}_{q}(Y_{n+1}|X_{n+1})$ with probability at least $1-\alpha$. That is, Algorithm 2 satisfies $(1-\alpha,q)$-Quantile.

The proof of Theorem 2 is covered in Appendix A.2 and is similar to that of Theorem 1; the main modifications come from the changes described above. Regarding the first change, the asymmetrical quantiles on the lower and upper end of the $E_{i}$’s balance the fact that datapoints have asymmetrical probabilities of being on either side of the conditional quantile. Regarding the second change, because the conformity scores $E_{i}$ still preserve relative ordering, they do not affect the relationship between datapoints and the conditional quantile.

Remark 2.2.

One possible choice for $r$ and $s$ is $r=s=\alpha/2$. This is motivated by the logic that $r$ and $s$ decide the probabilities of failure on the lower bound and the upper bound, respectively; if we want the bound to be equally accurate on both ends, it makes sense to set $r$ and $s$ equal. Another choice is $r=(1-q)\alpha$ and $s=q\alpha$; this results in the quantile levels for $Q_{rq}^{\textrm{lo}}$ and $Q_{1-s(1-q)}^{\textrm{hi}}$ being approximately equal, with the algorithm trimming the $q(1-q)\alpha$ quantile of the scores on both the lower and upper ends.
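For instance, with $q=0.9$ and $\alpha=0.1$, the choice $r=s=0.05$ uses the $rq=0.045$ quantile of the lower scores and the $1-s(1-q)=0.995$ quantile of the upper scores, whereas $r=(1-q)\alpha=0.01$ and $s=q\alpha=0.09$ trims $q(1-q)\alpha=0.009$ of the score mass from each end, i.e., uses the $0.009$ and $0.991$ quantiles.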

Remark 2.3.

The versatility of the conformity scores $f^{\textnormal{lo}}$ and $f^{\textnormal{hi}}$ is what differentiates Algorithm 2 from Algorithm 1 and makes it a viable option for conditional quantile inference. Below are a few examples of possible scores and the style of intervals they produce.

  • $f^{\textnormal{lo}}(X_{i},Y_{i})=f^{\textnormal{hi}}(X_{i},Y_{i})=Y_{i}-\hat{\mu}(X_{i})$, where $\hat{\mu}:\mathbb{R}^{d}\to\mathbb{R}$ is trained on $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$ as a central tendency estimator. This is the conformity score used in Algorithm 1 and Vovk et al. (2005), resulting in a confidence interval of the form $[\hat{\mu}(X_{n+1})+c_{\textnormal{lo}},\hat{\mu}(X_{n+1})+c_{\textnormal{hi}}]$ for some $c_{\textnormal{lo}},c_{\textnormal{hi}}\in\mathbb{R}$. This score is best when the conditional distribution $Y|X=x$ is similar for all $x$ and either the mean or median can be estimated with reasonable accuracy. Note that if the conditional distribution of $Y-\mathbb{E}[Y|X]$ is independent of $X$, Algorithm 2 will output the same confidence interval for $\hat{\mu}(x)=\mathbb{E}[Y|X=x]$ and $\hat{\mu}(x)=\textrm{Median}(Y|X=x)$.

  • $f^{\textnormal{lo}}(X_{i},Y_{i})=f^{\textnormal{hi}}(X_{i},Y_{i})=\frac{Y_{i}-\hat{\mu}(X_{i})}{\hat{\sigma}(X_{i})}$, where $\hat{\mu}:\mathbb{R}^{d}\to\mathbb{R}$ and $\hat{\sigma}:\mathbb{R}^{d}\to\mathbb{R}^{+}$ are trained on $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$ as a central tendency estimator and conditional absolute deviation estimator, respectively. This score results in a confidence interval of the form $[\hat{\mu}(X_{n+1})+c_{\textnormal{lo}}\hat{\sigma}(X_{n+1}),\hat{\mu}(X_{n+1})+c_{\textnormal{hi}}\hat{\sigma}(X_{n+1})]$ for some $c_{\textnormal{lo}},c_{\textnormal{hi}}\in\mathbb{R}$. Unlike the previous example, this score no longer results in a fixed-length confidence interval; it is best used when there is high heteroskedasticity in the underlying distribution. This is the conformity score used to create adaptive predictive intervals in Lei et al. (2018). Note that a normalization constant $\gamma>0$ can be added to the denominator $\hat{\sigma}(X_{i})$ to create stable confidence intervals.

  • $f^{\textnormal{lo}}(X_{i},Y_{i})=Y_{i}-\hat{Q}^{\textnormal{lo}}(X_{i})$ and $f^{\textnormal{hi}}(X_{i},Y_{i})=Y_{i}-\hat{Q}^{\textnormal{hi}}(X_{i})$, where $\hat{Q}^{\textnormal{lo}},\hat{Q}^{\textnormal{hi}}:\mathbb{R}^{d}\to\mathbb{R}$ are trained on $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$ to estimate the $rq$ quantile and the $1-s(1-q)$ quantile of the conditional distribution, respectively. This choice results in a confidence interval of the form $[\hat{Q}^{\textnormal{lo}}(X_{n+1})+c_{\textnormal{lo}},\hat{Q}^{\textnormal{hi}}(X_{n+1})+c_{\textnormal{hi}}]$ for some $c_{\textnormal{lo}},c_{\textnormal{hi}}\in\mathbb{R}$. These scores are best when one can estimate the conditional quantiles reasonably well and the conditional distribution $Y|X=x$ is heteroskedastic. Note that if $\hat{Q}^{\textnormal{lo}}$ and $\hat{Q}^{\textnormal{hi}}$ are trained well, then the resulting confidence interval will be approximately $[\hat{Q}^{\textnormal{lo}}(X_{n+1}),\hat{Q}^{\textnormal{hi}}(X_{n+1})]$. These are the scores used to create the predictive intervals seen in Romano et al. (2019) and Sesia and Candès (2020).

  • $f^{\textnormal{lo}}(X_{i},Y_{i})=f^{\textnormal{hi}}(X_{i},Y_{i})=\hat{F}_{Y|X=X_{i}}(Y_{i})$, where $\hat{F}_{Y|X=x}:\mathbb{R}^{d}\times\mathbb{R}\to[0,1]$ is trained on $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$ to be the estimated cumulative distribution function of the conditional distribution $Y|X$. Using this score will result in a confidence interval $[\hat{F}_{Y|X=X_{n+1}}^{-1}(c_{\textnormal{lo}}),\hat{F}_{Y|X=X_{n+1}}^{-1}(c_{\textnormal{hi}})]$ for some $c_{\textnormal{lo}},c_{\textnormal{hi}}\in[0,1]$, similar to the predictive intervals in Chernozhukov et al. (2019) and Kivaranovic et al. (2020). This can be a good approach when the conditional distribution $Y|X$ is particularly complex.

  • $f^{\textnormal{lo}}(X_{i},Y_{i})=f^{\textnormal{hi}}(X_{i},Y_{i})=\log Y_{i}-\hat{\mu}(X_{i})$, where $\hat{\mu}:\mathbb{R}^{d}\to\mathbb{R}$ is trained on $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$ as a log central tendency estimator. This results in a confidence interval of the form $[c_{\textnormal{lo}}\exp(\hat{\mu}(X_{n+1})),\ c_{\textnormal{hi}}\exp(\hat{\mu}(X_{n+1}))]$ for some $c_{\textnormal{lo}},c_{\textnormal{hi}}\in\mathbb{R}^{+}$. This score works well when $Y$ is known to be positive and one wants to minimize the approximation ratio; it is equivalent to taking a $\log$ transformation of the data.

In general, a good choice for $f^{\textnormal{lo}}$ and $f^{\textnormal{hi}}$ depends on one’s underlying belief about the distribution as well as on the sample size $n$, though some scores perform better in practice. The choice of $n_{1}$ and $n_{2}$ represents a balance between model training and interval tightness; increasing $n_{1}$ increases the amount of data for fitting $f^{\textnormal{lo}}$ and $f^{\textnormal{hi}}$, and increasing $n_{2}$ results in a better quantile for the predictive interval. Sesia and Candès (2020) contains more information on the effect of the conformity score on the size of predictive intervals as well as the impact of the ratio $n_{1}/n$ on interval width and coverage. We also simulate the impact of different scores on conditional quantile intervals in Section 4.
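As an illustration of how such scores might be constructed in code, the sketch below builds two of the examples above (the plain residual score and the quantile-based score) in a form compatible with the Algorithm 2 sketch shown earlier; gradient boosting with quantile loss is our stand-in for a generic conditional quantile estimator, not the paper's choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

def residual_scores(X1, Y1):
    """First example above: f_lo = f_hi = y - mu_hat(x), with mu_hat any regression fit."""
    mu = LinearRegression().fit(X1, Y1)
    score = lambda x, y: y - mu.predict(np.atleast_2d(x))[0]
    return score, score

def quantile_scores(X1, Y1, lo_level=0.05, hi_level=0.95):
    """Third example above: f_lo = y - Q_lo(x), f_hi = y - Q_hi(x)."""
    q_lo = GradientBoostingRegressor(loss="quantile", alpha=lo_level).fit(X1, Y1)
    q_hi = GradientBoostingRegressor(loss="quantile", alpha=hi_level).fit(X1, Y1)
    f_lo = lambda x, y: y - q_lo.predict(np.atleast_2d(x))[0]
    f_hi = lambda x, y: y - q_hi.predict(np.atleast_2d(x))[0]
    return f_lo, f_hi
```

Both constructions return scores that are nondecreasing in $y$ for each fixed $x$, as Definition 4 requires.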

Remark 2.4.

Algorithm 2 can be generalized to use more than one split, resulting in multiple confidence intervals that can then be combined into one output. For more information, Vovk et al. (2018) describes how to apply $k$-fold cross-validation to split conformal inference, and Barber et al. (2021a) describes the special case of having $n$ different splits using the jackknife+ procedure.

2.3 Algorithm Sharpness

Now that we have seen that Algorithms 1 and 2 achieve coverage, an important question to ask is whether or not the terms for the error quantile can be improved. Do our methods consistently overcover the conditional median, and if so, is it possible to take a lower quantile of the error terms and still have Theorems 1 and 2 hold? In this section, we prove that this is impossible by going over a particular distribution $P^{\delta}$ for which the $1-\alpha/2$ term in Algorithm 1 is necessary. Additionally, we go over a choice $\hat{\mu}_{c}$ with the property that Algorithm 1 always results in $1-\alpha/2$ coverage when run with input $\hat{\mu}_{c}$; this implies that there does not exist a distribution with the property that Algorithm 1 will always provide a sharp confidence interval for the conditional median regardless of the regression algorithm.

For each $\delta>0$, consider $(X,Y)\sim P^{\delta}$ over $\mathbb{R}\times\mathbb{R}$, where $P^{\delta}_{X}=\textrm{Unif}[-0.5,0.5]$ and $Y|X\stackrel{\textnormal{d}}{=}X\,B$; here $B\in\{0,1\}$ is an independent Bernoulli variable with $\mathbb{P}(B=1)=0.5+\delta$. That is, $0.5+\delta$ of the distribution is on the line segment $Y=X$ from $(-0.5,-0.5)$ to $(0.5,0.5)$, and $0.5-\delta$ of the distribution is on the line segment $Y=0$ from $(-0.5,0)$ to $(0.5,0)$. Thus, $\textrm{Median}(Y|X=x)=x$. A visualization of $P^{\delta}$ is shown in Figure 1.
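A short sampler for $P^{\delta}$, included as an unofficial sketch to make the construction explicit:

```python
import numpy as np

def sample_P_delta(n, delta, rng=None):
    """Draw n points from P^delta: X ~ Unif[-0.5, 0.5] and Y = X * B,
    where B ~ Bernoulli(0.5 + delta) independently of X."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(-0.5, 0.5, size=n)
    B = rng.binomial(1, 0.5 + delta, size=n)
    return X, X * B   # since P(B = 1) > 1/2, Median(Y | X = x) = x
```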

Figure 1: A distribution for which it is difficult to estimate the conditional median.

We know that Algorithm 1 is accurate for all distributions $P$ and all algorithms $\hat{\mu}$. Consider the regression algorithm $\hat{\mu}:\mathbb{R}\to\mathbb{R}$ such that $\hat{\mu}(x)=0$ for all $x\in\mathbb{R}$; in other words, $\hat{\mu}$ predicts $Y_{i}=0$ for all $X_{i}$. We show that Algorithm 1 returns a coverage almost exactly equal to $1-\alpha$.

Theorem 3.

For all $\epsilon>0$, there exist $N$ and $\delta>0$ such that if we sample $n>N$ datapoints from the distribution $P^{\delta}$ and use Algorithm 1 with $\hat{\mu}=0$ as defined above and $n_{1}=n_{2}=n/2$ to get a confidence interval for the conditional median,

$$\mathbb{P}\big\{\textnormal{Median}(Y_{n+1}|X_{n+1})\in\hat{C}_{n}(X_{n+1})\big\}\leq 1-\alpha+\epsilon.$$

The proof is in Appendix A.3. Theorem 3 does not directly prove that the $1-\alpha/2$ term in Algorithm 1 is sharp. However, we can see that if Algorithm 1 used the $1-\alpha^{\prime}/2$ quantile of the residuals with $\alpha^{\prime}>\alpha$, then by Theorem 3 there would exist a choice of $\delta$ and $n$ for which the probability of conditional median coverage would be less than $1-\alpha$. Therefore, the $1-\alpha/2$ term is required for the probability of coverage to always be at least $1-\alpha$.

Remark 2.5.

It is possible to generalize Theorem 3 to Algorithm 2 as well; we can change $P^{\delta}$ to have $Y|X\sim X\,B$ with $B\sim\textrm{Bernoulli}(q+\textbf{1}[X\geq 0](1-2q)+\delta)$. This results in $\textrm{Quantile}_{q}(Y|X=x)=x$ for all $x\in[-0.5,0.5]$. Then, if we consider the conformity scores $f^{\textnormal{lo}}(X_{i},Y_{i})=f^{\textnormal{hi}}(X_{i},Y_{i})=Y_{i}$, it can be shown for large $n$ and small $\delta$ that Algorithm 2 returns a confidence interval that has a conditional quantile coverage of at most $1-\alpha+\epsilon$, meaning that the $rq$ and $1-s(1-q)$ terms in the quantiles for the error scores are sharp.

These results may seem somewhat pedantic because we are restricting $\hat{\mu}(x)$ to be the zero function and $f^{\textnormal{lo}}(X_{i},Y_{i})$ and $f^{\textnormal{hi}}(X_{i},Y_{i})$ to be $Y_{i}$; this simplification is done to better illustrate our point. Even when $f^{\textnormal{lo}}$ and $f^{\textnormal{hi}}$ are trained using more complicated approaches, there still exist distributions that result in only $1-\alpha$ coverage for Algorithm 2. For an example of a distribution where Algorithm 2 only achieves $1-\alpha$ coverage for standard conformity scores, refer to $P_{3}$ in Section 4. The existence of $P^{\delta}$ and similarly ‘confusing’ distributions helps to show why capturing the conditional median can be tricky in a distribution-free setting.

At the same time, there exist conformity scores for which Algorithms 1 and 2 have rates of coverage that are always near $1-\alpha/2$. For $c>0$, define the randomized regression function $\hat{\mu}_{c}$ as follows: set $M=\max_{i\in\mathcal{I}_{1}}|Y_{i}|$ and $\hat{\mu}_{c}(x)=A_{x}$ for all $x\in\mathbb{R}^{d}$, where $A_{x}\overset{i.i.d.}{\sim}\mathcal{N}(0,(cM)^{2})$. We prove the following:
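A sketch of this randomized regression function in Python (the memoization is our implementation detail, ensuring repeated queries at the same point return the same draw):

```python
import numpy as np

def make_mu_c(Y1, c, rng=None):
    """Build mu_hat_c from the fitting-split responses Y1: set M = max_i |Y_i| and
    return an independent N(0, (c*M)^2) prediction for each distinct query point x."""
    rng = np.random.default_rng(rng)
    M = np.max(np.abs(Y1))
    cache = {}
    def mu_c(x):
        key = tuple(np.atleast_1d(np.asarray(x, dtype=float)))
        if key not in cache:
            cache[key] = rng.normal(0.0, c * M)
        return cache[key]
    return mu_c
```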

Theorem 4.

For all $\epsilon>0$, there exist $c$ and $N$ such that for all $n>N$, there is a split $n_{1}+n_{2}=n$ such that when Algorithm 1 is run using the regression function $\hat{\mu}_{c}$ on $n$ datapoints with $\mathcal{I}_{1}$ of size $n_{1}$ and $\mathcal{I}_{2}$ of size $n_{2}$, the resulting interval will be finite and will satisfy

$$\mathbb{P}\big\{\textnormal{Median}(Y_{n+1}|X_{n+1})\in\hat{C}_{n}(X_{n+1})\big\}\geq 1-\alpha/2-\epsilon$$

for any distribution $P$.

The proof for this theorem is in Appendix A.4.

Remark 2.6.

Theorem 4 can be extended to Algorithm 2 by taking $f^{\textnormal{lo}}(X_{i},Y_{i})=f^{\textnormal{hi}}(X_{i},Y_{i})=Y_{i}-\hat{\mu}_{c}(X_{i})$. The corresponding result shows that given a large enough number of datapoints and a particular data split, there exist conformity scores that result in Algorithm 2 capturing the conditional quantile nontrivially with probability at least $1-rq-s(1-q)$ for all distributions $P$.

Due to the definition of $\hat{\mu}_{c}$, the resulting confidence intervals will be near-useless; the predictions will be so far off that the intervals will have width several times the range of the (slightly clipped) marginal distribution $P_{Y}$. However, they will still be finite, and will still achieve predictive inference at a rate roughly equal to $1-\alpha/2$. The existence of scores that always result in higher-than-needed rates of coverage means that any result like Theorem 3 that provides a nontrivial upper bound on either Algorithm 1 or Algorithm 2's coverage of the conditional median on a specific distribution will have to restrict the class of regression functions and/or conformity scores.

3 Median Intervals and Predictive Intervals are Equivalent

Up until this point, we have looked at the existence and accuracy of algorithms for estimating the conditional median. This section shows that any algorithm for the conditional median is also a predictive algorithm, and vice versa. As a consequence, this means there exists a strong lower bound on the size of any conditional median confidence interval.

Theorem 5.

Let $\hat{C}_{n}$ be any algorithm that satisfies $(1-\alpha)$-Median. Then, for any nonatomic distribution $P$ on $\mathbb{R}^{d}\times\mathbb{R}$, we have that

$$\mathbb{P}\big\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\big\}\geq 1-\alpha.$$

That is, $\hat{C}_{n}$ satisfies $(1-\alpha)$-Predictive for all nonatomic distributions $P$.

Proof.

This proof uses the same approach as the proof of Theorem 1 in Barber (2020). Consider an arbitrary $\hat{C}_{n}$ that satisfies $(1-\alpha)$-Median, and let $P$ be any distribution over $\mathbb{R}^{d}\times\mathbb{R}$ for which $P_{X}$ is nonatomic. Pick some $M\geq n+1$, and sample $\mathcal{L}=\{(X^{j},Y^{j}):1\leq j\leq M\}\stackrel{\text{i.i.d.}}{\sim}P$. We define two different ways of sampling our data from $\mathcal{L}$.

Fix $\mathcal{L}$ and pick $(X_{1},Y_{1}),\ldots,(X_{n+1},Y_{n+1})$ without replacement from $\mathcal{L}$. Call this method of sampling $Q_{1}$. It is clear that after marginalizing over $\mathcal{L}$, the $(X_{i},Y_{i})$'s are effectively drawn i.i.d. from $P$; thus, we have that

$$\mathbb{P}_{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\}=\mathbb{E}_{\mathcal{L}}\big[\mathbb{P}_{Q_{1}}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})|\mathcal{L}\}\big].$$

Now, pick $(X_{1},Y_{1}),\ldots,(X_{n+1},Y_{n+1})$ with replacement from $\mathcal{L}$, and call this method of sampling $Q_{2}$. Note that because $P_{X}$ is nonatomic, the $X^{j}$'s are distinct with probability $1$, which means that $\textrm{Median}_{Q_{2}}(Y|X=X^{j})=Y^{j}$. Then, as $\hat{C}_{n}$ applies to all distributions, it applies to our point distribution over $\mathcal{L}$; thus, we have that for all $\mathcal{L}$,

$$\mathbb{P}_{Q_{2}}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})|\mathcal{L}\}=\mathbb{P}_{Q_{2}}\{\textrm{Median}_{Q_{2}}(Y|X=X_{n+1})\in\hat{C}_{n}(X_{n+1})|\mathcal{L}\}\geq 1-\alpha.$$

Let $R$ be the event that any two of the $(X_{i},Y_{i})$ are equal to each other; that is, $R=\{(X_{a},Y_{a})=(X_{b},Y_{b})\textrm{ for some }a<b\}$. Note that under $Q_{2}$, for $1\leq a<b\leq n+1$, the probability of $(X_{a},Y_{a})$ and $(X_{b},Y_{b})$ being equal is $1/M$. Then, by the union bound,

$$\mathbb{P}_{Q_{2}}\{R\}\leq\sum_{1\leq a<b\leq n+1}\mathbb{P}_{Q_{2}}\{(X_{a},Y_{a})=(X_{b},Y_{b})\}\leq\frac{n^{2}}{M},$$

where the last step is from the fact that the number of possible pairs $(a,b)$ is bounded above by $n^{2}$. Meanwhile, we know that $\mathbb{P}_{Q_{1}}\{R\}=0$ by the definition of $Q_{1}$.

We can use this to bound the total variation distance between $Q_{1}$ and $Q_{2}$. For any fixed $\mathcal{L}$ and any event $E$, note that $\mathbb{P}_{Q_{1}}\{E\}=\mathbb{P}_{Q_{1}}\{E|R^{C}\}=\mathbb{P}_{Q_{2}}\{E|R^{C}\}$. Then, we can calculate

$$\begin{aligned}
|\mathbb{P}_{Q_{1}}\{E\}-\mathbb{P}_{Q_{2}}\{E\}| &= |\mathbb{P}_{Q_{1}}\{E|R^{C}\}-(\mathbb{P}_{Q_{2}}\{E|R^{C}\}\mathbb{P}_{Q_{2}}\{R^{C}\}+\mathbb{P}_{Q_{2}}\{E|R\}\mathbb{P}_{Q_{2}}\{R\})| \\
&= |\mathbb{P}_{Q_{2}}\{E|R^{C}\}\mathbb{P}_{Q_{2}}\{R\}-\mathbb{P}_{Q_{2}}\{E|R\}\mathbb{P}_{Q_{2}}\{R\}| \\
&= \mathbb{P}_{Q_{2}}\{R\}\,|\mathbb{P}_{Q_{2}}\{E|R^{C}\}-\mathbb{P}_{Q_{2}}\{E|R\}| \\
&\leq n^{2}/M.
\end{aligned}$$

Therefore, for any fixed $\mathcal{L}$, the total variation distance between the distributions $Q_{1}$ and $Q_{2}$ is at most $n^{2}/M$, implying that

$$\mathbb{P}_{Q_{1}}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})|\mathcal{L}\}\geq\mathbb{P}_{Q_{2}}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})|\mathcal{L}\}-n^{2}/M\geq 1-\alpha-n^{2}/M,$$

which means that

$$\mathbb{P}_{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\}=\mathbb{E}_{\mathcal{L}}\big[\mathbb{P}_{Q_{1}}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})|\mathcal{L}\}\big]\geq 1-\alpha-n^{2}/M.$$

Taking the limit as $M$ goes to infinity gives the result. ∎

Remark 3.1.

Theorem 5 also applies to all algorithms that satisfy $(1-\alpha,q)$-Quantile; for our uniform distribution over $\mathcal{L}$, $\textrm{Quantile}_{q}(Y|X=X^{j})=Y^{j}$, so the proof translates exactly. As a result, this means that all algorithms that satisfy $(1-\alpha,q)$-Quantile also satisfy $(1-\alpha)$-Predictive for all nonatomic distributions $P$.

Remark 3.2.

The approach taken in the proof of Theorem 5 is similar to those used to show the limits of distribution-free inference in other settings. As mentioned earlier, Barber (2020) shows that in the setting of distributions $P$ over $\mathbb{R}^{d}\times\{0,1\}$, any confidence interval $\hat{C}_{n}(X_{n+1})$ for $\mathbb{E}[Y|X=X_{n+1}]$ with coverage $1-\alpha$ must contain $Y_{n+1}$ with probability $1-\alpha$ for all nonatomic distributions $P$, and goes on to provide a lower bound for the length of the confidence interval. Additionally, Barber et al. (2021b) proves a similar theorem about predictive algorithms $\hat{C}_{n}(X_{n+1})$ for $Y_{n+1}$ that are required to have a weak form of conditional coverage. The proof of the result from Barber (2020) involves the same idea of marginalizing over a large finite sampled subset $\mathcal{L}$ in order to apply $\hat{C}_{n}$ to the distribution over $\mathcal{L}$; the proof of the result from Barber et al. (2021b) focuses on sampling a large number of datapoints conditioned on whether or not they belong to a specific subset $\mathcal{B}\subseteq\mathbb{R}^{d}\times\mathbb{R}$. In both cases, studying two sampling distributions and measuring the total variation distance between them was crucial. Thus, it seems that this strategy may have further use in the future when studying confidence intervals for other parameters or data in a distribution-free setting.

We now prove a similar result in the opposite direction.

Theorem 6.

Let $\hat{C}_{n}$ be any algorithm that only outputs confidence intervals and satisfies $(1-\alpha/2)$-Predictive. Then, $\hat{C}_{n}$ satisfies $(1-\alpha)$-Median.

Proof.

Consider any distribution $P$. For all $x\in\mathbb{R}^{d}$ in the support of $P$, set $m(x)=\textrm{Median}(Y|X=x)$.

We know that under the distribution $P$, $\mathbb{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\}\geq 1-\alpha/2$. Then, we can condition on whether or not the conditional median is contained in $\hat{C}_{n}$ as follows:

$$\begin{aligned}
1-\alpha/2 \leq\ & \mathbb{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\} \\
=\ & \mathbb{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\,|\,m(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}\,\mathbb{P}\{m(X_{n+1})\in\hat{C}_{n}(X_{n+1})\} \\
& +\mathbb{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\,|\,m(X_{n+1})\not\in\hat{C}_{n}(X_{n+1})\}\,\mathbb{P}\{m(X_{n+1})\not\in\hat{C}_{n}(X_{n+1})\} \\
\leq\ & \mathbb{P}\{m(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}+\frac{1}{2}\,\mathbb{P}\{m(X_{n+1})\not\in\hat{C}_{n}(X_{n+1})\} \\
=\ & \frac{1}{2}+\frac{1}{2}\,\mathbb{P}\{m(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}.
\end{aligned}$$

The key insight here is that because $\hat{C}_{n}$ only outputs confidence intervals, $\hat{C}_{n}(X_{n+1})$ can cover at most half of the conditional distribution of $Y|X=X_{n+1}$ if $m(X_{n+1})$ is not contained in the confidence interval.

Shifting the constant over and multiplying by $2$ tells us that $\mathbb{P}\{m(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}\geq 1-\alpha$, as desired. ∎

Remark 3.3.

Theorem 6 can be modified to replace $(1-\alpha)$-Median with $(1-\alpha,q)$-Quantile. In particular, let $\hat{C}_{n}(x)=[\hat{L}_{n}(x),\hat{H}_{n}(x)]$ be any algorithm that only outputs confidence intervals and satisfies $\mathbb{P}\{Y_{n+1}\geq\hat{L}_{n}(X_{n+1})\}\geq 1-rq$ and $\mathbb{P}\{Y_{n+1}\leq\hat{H}_{n}(X_{n+1})\}\geq 1-s(1-q)$ for some $r+s=\alpha$. Then, $\hat{C}_{n}$ satisfies $(1-\alpha,q)$-Quantile.

Alternatively, we can also conclude that if $\hat{C}_{n}$ only outputs confidence intervals and satisfies $(1-\min\{q,1-q\}\alpha)$-Predictive, then it satisfies $(1-\alpha,q)$-Quantile. The proofs of both of these statements are in Appendix A.5.

Theorem 5 tells us that conditional median inference is at least as imprecise as predictive inference. As a result, because all predictive intervals have nonvanishing widths (assuming nonzero conditional variance) no matter the sample size $n$, it is not possible to write down conditional median algorithms with widths converging to $0$. Thus, it may be better to study other distribution parameters similar to the conditional median if we are looking for better empirical performance. For discussion on related distribution parameters that are worth studying and may result in stronger inference, refer to Section 5.2.

Theorem 6 tells us that one way to approach conditional median inference is to apply strong predictive algorithms. It also suggests that improvements in predictive inference may translate to conditional median inference.

Lastly, we know that Algorithm 1 captures $Y_{n+1}$ with probability $1-\alpha/2$. Because there is space between the bounds of Theorems 5 and 6, there may exist a better conditional median algorithm that only captures $Y_{n+1}$ with probability $1-\alpha$. Based on our result from Section 2.3, any such algorithm will likely follow a format different from the split conformal approach. Studying this problem in more detail, particularly on difficult distributions $P$, might lead to more accurate conditional median algorithms.

4 Simulations

In this section, we analyze the impact of different conformity scores on the outcome of Algorithm 2. Specifically, we look at the following four conformity scores:

  1. Score 1: $f_{1}^{\textnormal{lo}}(X_{i},Y_{i})=f_{1}^{\textnormal{hi}}(X_{i},Y_{i})=Y_{i}-\hat{\mu}(X_{i})$. We train $\hat{\mu}$ to predict the conditional mean using quantile regression forests on the dataset $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$.

  2. Score 2: $f_{2}^{\textnormal{lo}}(X_{i},Y_{i})=f_{2}^{\textnormal{hi}}(X_{i},Y_{i})=\frac{Y_{i}-\hat{\mu}(X_{i})}{\hat{\sigma}(X_{i})}$. We train $\hat{\mu}$ and $\hat{\sigma}$ jointly using random forests on the dataset $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$.

  3. Score 3: $f_{3}^{\textnormal{lo}}(X_{i},Y_{i})=Y_{i}-\hat{Q}^{\textnormal{lo}}(X_{i})$ and $f_{3}^{\textnormal{hi}}(X_{i},Y_{i})=Y_{i}-\hat{Q}^{\textnormal{hi}}(X_{i})$. We train $\hat{Q}^{\textnormal{lo}}$ and $\hat{Q}^{\textnormal{hi}}$ to predict the conditional $\alpha/2$ quantile and $1-\alpha/2$ quantile, respectively, using quantile regression forests on the dataset $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$.

  4. Score 4: $f_{4}^{\textnormal{lo}}(X_{i},Y_{i})=f_{4}^{\textnormal{hi}}(X_{i},Y_{i})=\hat{F}_{Y|X=X_{i}}(Y_{i})$. We create $\hat{F}_{Y|X=X_{i}}$ by using $101$ quantile regression forests trained on $\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}$ to estimate the conditional $q^{\textnormal{th}}$ quantile for $q\in\{0,0.01,\ldots,0.99,1\}$, and use linear interpolation between quantiles to estimate the conditional CDF. Our training method ensures that quantile predictions will never cross.

To compare the performance of Algorithm 2 against a method that does not have a distribution-free guarantee, we also consider a nonconformalized algorithm that outputs $\hat{C}_{n}(X_{n+1})=[\hat{Q}^{\textnormal{lo}}(X_{n+1}),\hat{Q}^{\textnormal{hi}}(X_{n+1})]$, where $\hat{Q}^{\textnormal{lo}}$ and $\hat{Q}^{\textnormal{hi}}$ are trained on $\mathcal{D}$ to predict the conditional $\alpha/2$ quantile and $1-\alpha/2$ quantile, respectively, using quantile regression forests. We refer to this algorithm as QRF.
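A sketch of this baseline, using gradient boosting with quantile loss as our stand-in for quantile regression forests (which is not the estimator used in the experiments):

```python
from sklearn.ensemble import GradientBoostingRegressor

def qrf_baseline(X, Y, X_test, alpha=0.1):
    """Nonconformalized baseline: fit conditional alpha/2 and 1 - alpha/2 quantile
    estimators on the full dataset and output [Q_lo(x), Q_hi(x)] with no calibration."""
    q_lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X, Y)
    q_hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X, Y)
    return q_lo.predict(X_test), q_hi.predict(X_test)
```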

In order to test the conditional median coverage rate, we must look at distributions for which the conditional median is known and, therefore, focus on simulated datasets. We consider the performance of Algorithm 2 and QRF on the following three distributions (a sampling sketch appears after the list):

  1. Distribution 1: We draw $(X,Y)\sim P_{1}$ from $\mathbb{R}^{d}\times\mathbb{R}$, where $d=10$. Here, $X=(X^{1},\ldots,X^{d})$ is an equicorrelated multivariate Gaussian vector with mean zero and $\textrm{Var}(X^{i})=1$, $\textnormal{Cov}(X^{i},X^{j})=0.25$ for $i\neq j$. We set $Y=(X^{1}+X^{2})^{2}-X^{3}+\sigma(X)\epsilon$, where $\epsilon\sim\mathcal{N}(0,1)$ is independent of $X$ and $\sigma(x)=0.1+0.25\|x\|_{2}^{2}$ for all $x\in\mathbb{R}^{d}$.

  2. Distribution 2: We draw $(X,Y)\sim P_{2}$ from $\mathbb{R}\times\mathbb{R}$. Draw $X\sim\textrm{Unif}[-4\pi,4\pi]$ and $Y=U^{1/4}f(X)$, where $U\sim\textrm{Unif}[0,1]$ is independent of $X$ and $f(x)=1+|x|\sin^{2}(x)$ for all $x\in\mathbb{R}$.

  3. Distribution 3: We draw $(X,Y)\sim P_{3}$ from $\mathbb{R}\times\mathbb{R}$. Draw $X\sim\textrm{Unif}[-1,1]$ and set $Y=B\cdot f(X)$, where $B\sim\textrm{Bernoulli}(0.5+2\delta)$ is independent of $X$ and $f(x)=\gamma\{Mx\}-\frac{\gamma}{2}-(-1)^{\lfloor Mx\rfloor}\left(1-\frac{\gamma}{2}\right)$ for all $x\in\mathbb{R}$. Note that $\{r\}$ is the fractional part of $r$. We set $\delta=0.0001$ and $M=1/\gamma=25$.
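For reference, unofficial samplers for the three simulated distributions (implementation details such as the random number generator are our own):

```python
import numpy as np

def sample_P1(n, d=10, rng=None):
    rng = np.random.default_rng(rng)
    cov = np.full((d, d), 0.25) + 0.75 * np.eye(d)          # equicorrelated Gaussian features
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    sigma = 0.1 + 0.25 * np.sum(X ** 2, axis=1)             # sigma(x) = 0.1 + 0.25 * ||x||^2
    Y = (X[:, 0] + X[:, 1]) ** 2 - X[:, 2] + sigma * rng.standard_normal(n)
    return X, Y

def sample_P2(n, rng=None):
    rng = np.random.default_rng(rng)
    X = rng.uniform(-4 * np.pi, 4 * np.pi, size=n)
    U = rng.uniform(size=n)
    Y = U ** 0.25 * (1 + np.abs(X) * np.sin(X) ** 2)        # Y = U^{1/4} f(X)
    return X, Y

def sample_P3(n, delta=1e-4, M=25, rng=None):
    rng = np.random.default_rng(rng)
    gamma = 1.0 / M
    X = rng.uniform(-1, 1, size=n)
    floor = np.floor(M * X)
    frac = M * X - floor                                    # fractional part {Mx}
    sign = np.where(floor % 2 == 0, 1.0, -1.0)              # (-1)^{floor(Mx)}
    f = gamma * frac - gamma / 2 - sign * (1 - gamma / 2)
    B = rng.binomial(1, 0.5 + 2 * delta, size=n)
    return X, B * f
```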

Distributions 2 and 3 are shown in Figure 2.

Figure 2: Plots of $n=10{,}000$ datapoints from the two distributions $P_{2}$ and $P_{3}$ overlaid with the conditional medians. The left panel is a case where the conditional distribution $Y|X$ has high heteroskedasticity. The right is a case where it is nearly impossible to tell the location of the conditional median.

For each distribution, we run Algorithm 2 using each conformity score, as well as QRF, to get a confidence interval for the conditional median. We run 500 trials; in each trial, we set $n=5{,}000$ with $n_{1}=n_{2}=n/2$ and $\alpha=0.1$ with $r=s=\alpha/2$. For a separate study on the impact of $n$ on the confidence interval width and coverage, we refer the reader to Sesia and Candès (2020). We test coverage on $5{,}000$ datapoints for each trial. The average coverage rate, average interval width, and other statistics for each distribution and conformity score are shown in Figure 3. An example of the resulting confidence intervals for a single trial on Distribution 2 is displayed in Figure 4.
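Below is a hypothetical end-to-end driver reusing sample_P2 and fit_median_interval from the earlier sketches; it is not the experiments' code (which uses quantile regression forests and Algorithm 2), but it shows how a coverage estimate of the kind reported in Figure 3 could be computed, using the fact that $\textrm{Median}(Y|X=x)=0.5^{1/4}f(x)$ for $P_{2}$.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Reuses sample_P2 and fit_median_interval defined in the sketches above.
rng = np.random.default_rng(0)
X, Y = sample_P2(5000, rng=rng)                  # training data
X_test, _ = sample_P2(5000, rng=rng)             # test features

intervals = fit_median_interval(X.reshape(-1, 1), Y, X_test.reshape(-1, 1),
                                GradientBoostingRegressor(), alpha=0.1, rng=rng)
true_med = 0.5 ** 0.25 * (1 + np.abs(X_test) * np.sin(X_test) ** 2)   # Median(Y | X) under P_2
coverage = np.mean((intervals[:, 0] <= true_med) & (true_med <= intervals[:, 1]))
print(f"conditional median coverage: {coverage:.3f}")
```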

Distribution   Score   AC       SDAC    MCC     AW      SDAW
1              1       99.08%   0.20%   0.0%    14.08   0.573
1              2       99.78%   0.09%   5.0%    13.03   0.405
1              3       99.72%   0.10%   8.4%    12.86   0.358
1              4       99.77%   0.09%   12.0%   13.02   0.398
1              QRF     99.73%   0.09%   1.6%    11.10   0.191
2              1       99.85%   0.20%   92.4%   4.537   0.174
2              2       99.48%   0.29%   78.4%   3.604   0.093
2              3       99.89%   0.14%   95.6%   3.619   0.060
2              4       99.87%   0.14%   93.6%   3.700   0.087
2              QRF     99.91%   0.12%   93.4%   3.48    0.051
3              1       90.05%   0.86%   15.4%   2.122   0.044
3              2       89.97%   0.87%   19.8%   2.084   0.051
3              3       90.14%   0.87%   4.4%    1.989   0.003
3              4       90.00%   0.88%   0.0%    1.990   0.002
3              QRF     83.98%   0.92%   0.0%    1.962   0.017
Figure 3: For each distribution and conformity score, we calculate: average coverage (AC), an estimate of $\mathbb{P}\{\textrm{Median}(Y_{n+1}|X_{n+1})\in\hat{C}_{n}(X_{n+1})\}$; standard deviation of average coverage (SDAC), an estimate of $\textrm{Var}(\mathbb{P}\{\textrm{Median}(Y_{n+1}|X_{n+1})\in\hat{C}_{n}(X_{n+1})|\mathcal{D}\})^{1/2}$, where $\mathcal{D}=\{(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\}$; minimum conditional coverage (MCC), an estimate of $\min_{x}\mathbb{P}\{\textrm{Median}(Y|X=x)\in\hat{C}_{n}(x)\}$; average width (AW), an estimate of $\mathbb{E}[\textrm{len}(\hat{C}_{n}(X_{n+1}))]$; and standard deviation of average width (SDAW), an estimate of $\textrm{Var}(\mathbb{E}[\textrm{len}(\hat{C}_{n}(X_{n+1}))|\mathcal{D}])^{1/2}$. Estimates are averaged over 500 trials. $1-\alpha=0.9$ for all trials.
(a) $f_{1}^{\textnormal{lo}}(X_{i},Y_{i})=f_{1}^{\textnormal{hi}}(X_{i},Y_{i})=Y_{i}-\hat{\mu}(X_{i})$: $99.50\%$ coverage; $4.52$ average width.
(b) $f_{2}^{\textnormal{lo}}(X_{i},Y_{i})=f_{2}^{\textnormal{hi}}(X_{i},Y_{i})=\frac{Y_{i}-\hat{\mu}(X_{i})}{\hat{\sigma}(X_{i})}$: $99.76\%$ coverage; $3.61$ average width.
(c) $f_{3}^{\textnormal{lo}}(X_{i},Y_{i})=Y_{i}-\hat{Q}^{\textnormal{lo}}(X_{i})$, $f_{3}^{\textnormal{hi}}(X_{i},Y_{i})=Y_{i}-\hat{Q}^{\textnormal{hi}}(X_{i})$: $99.72\%$ coverage; $3.61$ average width.
(d) $f_{4}^{\textnormal{lo}}(X_{i},Y_{i})=f_{4}^{\textnormal{hi}}(X_{i},Y_{i})=\hat{F}_{Y|X=X_{i}}(Y_{i})$: $100.0\%$ coverage; $3.66$ average width.
Figure 4: Confidence intervals (pink regions) from one trial for each conformity score on Distribution 2. Note that all scores result in a coverage well over $1-\alpha=0.9$.

Comparing the QRF algorithm with Algorithm 2, we see that while QRF's widths are significantly lower for all three distributions, its coverage for Distribution 3 is much less than $1-\alpha$. In particular, comparing against Conformity Score 3, which has the same framework but includes a constant buffer on both ends due to the calibration set, we see how much extra width Algorithm 2 adds in the calibration step.

Looking first at rates of coverage, we see that all scores have coverage much greater than $1-\alpha$ on Distributions 1 and 2. However, Distribution 3 is a case where all scores have near-identical rates of coverage at just about $1-\alpha$. Further investigation into the confidence intervals produced for Distribution 3 suggests that the algorithms are often failing to capture the conditional median when its absolute value is almost exactly $1$.

The minimum conditional coverage on Distributions 1 and 3 is near $0$ for each conformity score. Interestingly, the scores with the worst minimum conditional coverage on Distribution 1 have the best relative minimum conditional coverage on Distribution 3, and vice versa. Scores 1, 3, and 4 have a minimum conditional coverage greater than $1-\alpha$ on Distribution 2, implying that these scores achieve point-wise conditional coverage.

Regarding interval width, Score 1 performs significantly worse on Distributions 1 and 2 than all other scores; meanwhile, the other three scores have roughly equal average widths. On Distribution 3, Scores 3 and 4 produce intervals with significantly less width than Scores 1 and 2.

Overall, we see that Score 1 is significantly worse than the other scores on distributions with a wide range in conditional variance; Scores 3 and 4 behave very similarly on all distributions and perform slightly better than Score 2 on Distribution 3.

5 Discussion

This paper introduced two algorithms for capturing the conditional median of a datapoint within the distribution-free setting, as well as a particular distribution where the performance of these algorithms was sharp. Our lower bounds prove that in the distribution-free setting, conditional median inference is fundamentally as difficult as prediction itself, thereby setting a concrete limit to how well any median inference algorithm can ever perform. We also showed that any predictive algorithm can be used as a median algorithm at a different coverage level, suggesting that the two problems are near-equivalent.

5.1 Takeaways

A few observations may prove useful. For one, distributions such as $P^{\delta}$ from Section 2.3 and $P_{3}$ from Section 4 will likely show up again. Because each distribution is a mixture of two disjoint distributions with roughly equal weights, it is hard to identify which half contains the median. It is likely that similar distributions will show up as the performance-limiting distribution for distribution-free parameter inference. Further, the proof technique of sampling a large finite number of datapoints and then marginalizing (Section 3) is similar to those in Barber (2020) and Barber et al. (2021b), pointing to possible future use. Lastly, our results and those of Barber (2020) indicate that the value of conditional parameters cannot be known with higher accuracy than the values of future samples.

5.2 Further Work

We hope that this paper motivates further work on conditional parameter inference. We see three immediate potential avenues:

  • One direction is to extend our methods to study other conditional parameters similar to the conditional median. For example, the smoothed conditional median, equal to the conditional median convolved with a kernel, may be easier to infer than the conditional median. This parameter would allow for smarter inference in the case of smooth distributions without making smoothness a requirement for inference. Similarly, the truncated mean and other measures of central tendency may be amenable to model-free inference and analysis, as may the conditional interquartile range and other robust measures of scale.

  • Another direction is to get tighter bands by imposing mild shape constraints on the conditional median function. For instance, if we know that Median(Y|X=x)\textrm{Median}(Y|X=x) is convex, then the results from Section 3 no longer apply. Similarly, assuming that Median(Y|X=x)\textrm{Median}(Y|X=x) is decreasing in xx or Lipschitz would yield intervals with vanishing widths in the limit of large samples. For instance, when predicting economic damages caused by tornadoes using wind speed as a covariate, one may assume that the median damage is nondecreasing as wind speed increases.

  • A third subject of study is creating full conformal inference methods based off of our split conformal algorithms. Unlike split conformal inference, the full conformal method does not rely on splitting the dataset into a fitting half and a ranking half; instead, it calculates the conformity of a potential datapoint (Xn+1,y)(X_{n+1},y) to the full dataset 𝒟\mathcal{D} and includes yy in its confidence region only if (Xn+1,y)(X_{n+1},y) is similar enough to the observed datapoints. The study of full conformal inference has grown alongside that of split conformal inference; the method can be seen in Vovk et al. (2005), Shafer and Vovk (2008), and Lei et al. (2018). Standard full conformal algorithms do not guarantee coverage of the conditional median; however, there may exist modifications similar to locally nondecreasing conformity scores that result in a full conformal algorithm that captures the conditional median.
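As a point of reference for this last direction, the following is a minimal sketch of the standard full conformal construction just described, using a least-squares fit and absolute residuals as the conformity score; the grid, the model, and the variable names are illustrative choices of ours, and the resulting set covers Y_{n+1} rather than the conditional median.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 50, 0.1
X = rng.uniform(-1, 1, size=(n, 1))
Y = 2 * X[:, 0] + rng.normal(scale=0.3, size=n)
x_new = np.array([[0.5]])

def fit_and_score(X_aug, Y_aug):
    # refit least squares (with intercept) on the augmented data and
    # return the absolute residuals as conformity scores
    A = np.hstack([np.ones((len(Y_aug), 1)), X_aug])
    beta, *_ = np.linalg.lstsq(A, Y_aug, rcond=None)
    return np.abs(Y_aug - A @ beta)

kept = []
for y in np.linspace(Y.min() - 2, Y.max() + 2, 400):
    scores = fit_and_score(np.vstack([X, x_new]), np.append(Y, y))
    rank = np.sum(scores <= scores[-1])          # rank of the candidate point
    if rank <= np.ceil((1 - alpha) * (n + 1)):   # keep y if it conforms
        kept.append(y)

# report the range of accepted candidates (the conformal set need not be an interval)
print((min(kept), max(kept)) if kept else None)
```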

Acknowledgements

D. M. thanks Stanford University for supporting this research as part of the Masters program in Statistics. E. C. was supported by Office of Naval Research grant N00014-20-12157, by the National Science Foundation grant OAC 1934578, and by the Army Research Office (ARO) under grant W911NF-17-1-0304. We thank Lihua Lei for advice on different approaches and recommended resources on this topic, as well as the Stanford Statistics department for listening to a preliminary version of this work. We also thank a referee for encouraging us to deepen the connection between predictive and conditional median confidence intervals.

References

  • Bahadur and Savage [1956] Raghu R Bahadur and Leonard J Savage. The nonexistence of certain statistical procedures in nonparametric problems. The Annals of Mathematical Statistics, 27(4):1115–1122, 1956.
  • Barber [2020] Rina Foygel Barber. Is distribution-free inference possible for binary regression? Electronic Journal of Statistics, 14(2):3487–3524, 2020.
  • Barber et al. [2021a] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507, 2021a.
  • Barber et al. [2021b] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482, 2021b.
  • Belloni et al. [2019] Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, and Iván Fernández-Val. Conditional quantile processes based on series or many regressors. Journal of Econometrics, 213(1):4–29, 2019.
  • Chen et al. [2018] Wenyu Chen, Kelli-Jean Chun, and Rina Foygel Barber. Discretized conformal prediction for efficient distribution-free inference. Stat, 7(1):e173, 2018.
  • Chernozhukov et al. [2019] Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Distributional conformal prediction. arXiv preprint arXiv:1909.07889, 2019.
  • Claeskens et al. [2003] Gerda Claeskens, Ingrid Van Keilegom, et al. Bootstrap confidence bands for regression curves and their derivatives. The Annals of Statistics, 31(6):1852–1884, 2003.
  • Cortés-Ciriano and Bender [2019] Isidro Cortés-Ciriano and Andreas Bender. Concepts and applications of conformal prediction in computational drug discovery. arXiv preprint arXiv:1908.03569, 2019.
  • Cribari-Neto and Lima [2009] Francisco Cribari-Neto and Maria da Glória A Lima. Heteroskedasticity-consistent interval estimators. Journal of Statistical Computation and Simulation, 79(6):787–803, 2009.
  • Guan [2019] Leying Guan. Conformal prediction with localization. arXiv preprint arXiv:1908.08558, 2019.
  • Johansson et al. [2013] Ulf Johansson, Henrik Boström, and Tuve Löfström. Conformal prediction using decision trees. In 2013 IEEE 13th international conference on data mining, pages 330–339. IEEE, 2013.
  • Kivaranovic et al. [2020] Danijel Kivaranovic, Kory D Johnson, and Hannes Leeb. Adaptive, distribution-free prediction intervals for deep networks. In International Conference on Artificial Intelligence and Statistics, pages 4346–4356. PMLR, 2020.
  • Koenker [2005] Roger Koenker. Quantile Regression. Econometric Society Monographs. Cambridge University Press, 2005. doi: 10.1017/CBO9780511754098.
  • Lei et al. [2013] Jing Lei, James Robins, and Larry Wasserman. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278–287, 2013.
  • Lei et al. [2018] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
  • Lei and Candès [2020] Lihua Lei and Emmanuel J Candès. Conformal inference of counterfactuals and individual treatment effects. arXiv preprint arXiv:2006.06138, 2020.
  • McCarthy [1965] Philip J McCarthy. Stratified sampling and distribution-free confidence intervals for a median. Journal of the American Statistical Association, 60(311):772–783, 1965.
  • Noether [1972] Gottfried E Noether. Distribution-free confidence intervals. The American Statistician, 26(1):39–41, 1972.
  • Oberhofer and Haupt [2016] Walter Oberhofer and Harry Haupt. Asymptotic theory for nonlinear quantile regression under weak dependence. Econometric Theory, 32(3):686–713, 2016.
  • Papadopoulos et al. [2002] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pages 345–356. Springer, 2002.
  • Papadopoulos et al. [2011] Harris Papadopoulos, Vladimir Vovk, and Alexander Gammerman. Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research, 40:815–840, 2011.
  • Romano et al. [2019] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. In Advances in Neural Information Processing Systems, pages 3543–3553, 2019.
  • Romano et al. [2020] Yaniv Romano, Rina Barber, Chiara Sabatti, and Emmanuel Candès. With malice toward none: Assessing uncertainty via equalized coverage. Harvard Data Science Review, 04 2020. doi: 10.1162/99608f92.03f00592.
  • Sesia and Candès [2020] Matteo Sesia and Emmanuel J Candès. A comparison of some conformal quantile regression methods. Stat, 9(1):e261, 2020.
  • Shafer and Vovk [2008] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421, 2008.
  • Takeuchi et al. [2006] Ichiro Takeuchi, Quoc V. Le, Timothy D. Sears, and Alexander J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7(45):1231–1264, 2006.
  • Tibshirani et al. [2019] Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, pages 2530–2540, 2019.
  • Vovk [2015] Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1):9–28, 2015.
  • Vovk et al. [2005] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
  • Vovk et al. [2009] Vladimir Vovk, Ilia Nouretdinov, Alex Gammerman, et al. On-line predictive linear regression. The Annals of Statistics, 37(3):1566–1590, 2009.
  • Vovk et al. [2018] Vladimir Vovk, Ilia Nouretdinov, Valery Manokhin, and Alexander Gammerman. Cross-conformal predictive distributions. In Conformal and Probabilistic Prediction and Applications, pages 37–51. PMLR, 2018.

Appendix A Theorem Proofs

A.1 Proof of Theorem 1

Theorem 1 follows directly from Theorem 6. In particular, Vovk et al. [2005] shows that Algorithm 1 satisfies (1α/2)(1-\alpha/2)-Predictive. Because Algorithm 1 always outputs a confidence interval, we have that Algorithm 1 satisfies (1α)(1-\alpha)-Median by Theorem 6.
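For concreteness, here is a minimal sketch of a split-conformal construction of this type: split the data, fit any regression function on the first half, and use the (1-\alpha/2)(1+1/n_{2}) empirical quantile of the absolute residuals on the second half, as described in Appendices A.3 and B.1 below. The least-squares fit, the equal split, and the function names are illustrative choices of ours.

```python
import numpy as np

def split_conformal_median_interval(X, Y, x_new, alpha=0.1, seed=0):
    """Sketch of a split conformal interval built at level 1 - alpha/2;
    per Theorem 1, such an interval also covers the conditional median
    with probability at least 1 - alpha."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    perm = rng.permutation(n)
    I1, I2 = perm[: n // 2], perm[n // 2:]

    # fit any regression function mu_hat on the first half (least squares here)
    A1 = np.hstack([np.ones((len(I1), 1)), X[I1]])
    beta, *_ = np.linalg.lstsq(A1, Y[I1], rcond=None)
    mu_hat = lambda x: np.hstack([np.ones((len(x), 1)), x]) @ beta

    # conformity scores and their (1 - alpha/2)(1 + 1/n2) empirical quantile
    E = np.abs(Y[I2] - mu_hat(X[I2]))
    n2 = len(I2)
    k = int(np.ceil((1 - alpha / 2) * (n2 + 1)))
    q = np.sort(E)[k - 1] if k <= n2 else np.inf

    center = mu_hat(np.atleast_2d(x_new))[0]
    return center - q, center + q
```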

A.2 Proof of Theorem 2

We can prove Theorem 2 from the extension of Theorem 6 described in Remark 3.3. We know that Algorithm 2 always outputs a confidence interval because it is the intersection of two confidence intervals. Define E_{n+1}^{\textnormal{lo}}=f^{\textnormal{lo}}(X_{n+1},Y_{n+1}) and E_{n+1}^{\textnormal{hi}}=f^{\textnormal{hi}}(X_{n+1},Y_{n+1}). We can bound the probability of Y_{n+1} being at least the lower bound by \mathbb{P}\{E_{n+1}^{\textnormal{lo}}\geq Q_{rq}^{\textrm{lo}}(E)\}, and the probability of Y_{n+1} being at most the upper bound by \mathbb{P}\{E_{n+1}^{\textnormal{hi}}\leq Q_{1-s(1-q)}^{\textrm{hi}}(E)\}.

Q_{rq}^{\textrm{lo}}(E) is defined as the rq(1+1/n_{2})-1/n_{2}-th quantile of \{E_{i}^{\textnormal{lo}}:i\in\mathcal{I}_{2}\}, which is equal to the \lceil n_{2}(rq(1+1/n_{2})-1/n_{2})\rceil=\lceil rq(n_{2}+1)-1\rceil smallest value of \{E_{i}^{\textnormal{lo}}:i\in\mathcal{I}_{2}\}. Then, because the values in \{E_{i}^{\textnormal{lo}}:i\in\mathcal{I}_{2}\}\cup\{E_{n+1}^{\textnormal{lo}}\} are exchangeable, as f^{\textnormal{lo}} is only fit on \{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\}, the count |\{i\in\mathcal{I}_{2}:E_{i}^{\textnormal{lo}}\leq E_{n+1}^{\textnormal{lo}}\}| stochastically dominates the uniform distribution on \{0,1,\ldots,n_{2}\}; moreover, E_{n+1}^{\textnormal{lo}}\geq Q_{rq}^{\textrm{lo}}(E) exactly when this count is at least \lceil rq(n_{2}+1)-1\rceil. Therefore,

\mathbb{P}\{E_{n+1}^{\textnormal{lo}}\geq Q_{rq}^{\textrm{lo}}(E)\}\geq\displaystyle\sum_{j=\lceil rq(n_{2}+1)-1\rceil}^{n_{2}}\frac{1}{n_{2}+1}
=\displaystyle\frac{n_{2}+2-\lceil rq(n_{2}+1)\rceil}{n_{2}+1}
\geq\displaystyle\frac{(n_{2}+1)(1-rq)}{n_{2}+1}
=\displaystyle 1-rq.

Similarly, Q_{1-s(1-q)}^{\textrm{hi}}(E) is defined as the (1-s(1-q))(1+1/n_{2})-th quantile of \{E^{\textnormal{hi}}_{i}:i\in\mathcal{I}_{2}\}, which equals the \lceil n_{2}(1-s(1-q))(1+1/n_{2})\rceil=\lceil(1-s(1-q))(n_{2}+1)\rceil smallest value of \{E^{\textnormal{hi}}_{i}:i\in\mathcal{I}_{2}\}. Then, because the values in \{E_{i}^{\textnormal{hi}}:i\in\mathcal{I}_{2}\}\cup\{E_{n+1}^{\textnormal{hi}}\} are exchangeable, the count |\{i\in\mathcal{I}_{2}:E_{i}^{\textnormal{hi}}<E_{n+1}^{\textnormal{hi}}\}| is stochastically dominated by the uniform distribution on \{0,1,\ldots,n_{2}\}; moreover, E_{n+1}^{\textnormal{hi}}\leq Q_{1-s(1-q)}^{\textrm{hi}}(E) exactly when this count is at most \lceil(1-s(1-q))(n_{2}+1)\rceil-1. Therefore,

{En+1hiQ1s(1q)hi(E)}\displaystyle\mathbb{P}\{E_{n+1}^{\textnormal{hi}}\leq Q_{1-s(1-q)}^{\textrm{hi}}(E)\}\geq 0(1s(1q))(n2+1)11n2+1\displaystyle\sum_{0}^{\lceil(1-s(1-q))(n_{2}+1)\rceil-1}\frac{1}{n_{2}+1}
=\displaystyle= (1s(1q))(n2+1)n2+1\displaystyle\frac{\lceil(1-s(1-q))(n_{2}+1)\rceil}{n_{2}+1}
\displaystyle\geq 1s(1q).\displaystyle 1-s(1-q).

Combining these two results and applying the extension of Theorem 6 tells us that Algorithm 2 satisfies (1α,q)(1-\alpha,q)-Quantile as desired.
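As a quick sanity check on the rank argument above, the bound \mathbb{P}\{E_{n+1}^{\textnormal{lo}}\geq Q_{rq}^{\textrm{lo}}(E)\}\geq 1-rq can be verified by simulation in the special case of i.i.d. continuous scores (a special case of exchangeability); the parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n2, rq, trials = 40, 0.05, 100_000
k = int(np.ceil(rq * (n2 + 1) - 1))      # index of the rq(1+1/n2)-1/n2 quantile
hits = 0
for _ in range(trials):
    E = rng.normal(size=n2 + 1)          # E[:n2] calibration scores, E[n2] test score
    q_lo = np.sort(E[:n2])[k - 1] if k >= 1 else -np.inf
    hits += E[n2] >= q_lo
print(hits / trials, ">= 1 - rq =", 1 - rq)
```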

A.3 Proof of Theorem 3

We show that given \epsilon, there exist \delta and N such that for all n>N, running Algorithm 1 on P^{\delta} with our chosen \hat{\mu} results in a confidence interval that contains the conditional median with probability at most 1-\alpha+\epsilon. Our approach is similar to that in Appendix B.1; however, we apply the inequalities in the opposite directions and use some additional analysis in order to obtain an upper bound rather than a lower bound.

First, note that {(Xi,Yi):i1}\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\} is irrelevant to our algorithm, as μ^\hat{\mu} is set to be the zero function. Then, for each i2i\in\mathcal{I}_{2}, Ei=|Yi|E_{i}=|Y_{i}|. Thus, Q1α/2(E)Q_{1-\alpha/2}(E) is the (1α/2)(1+1/n2)(1-\alpha/2)(1+1/n_{2})-th empirical quantile of {|Yi|:i2}\{|Y_{i}|:i\in\mathcal{I}_{2}\}, and our confidence interval is C^n(Xn+1)=[Q1α/2(E),Q1α/2(E)]\hat{C}_{n}(X_{n+1})=[-Q_{1-\alpha/2}(E),Q_{1-\alpha/2}(E)]. Because the parameter we want to cover is Median(Y|X=Xn+1)=Xn+1\textrm{Median}(Y|X=X_{n+1})=X_{n+1},

Median(Yn+1|Xn+1)C^n(Xn+1) if and only if |Xn+1|Q1α/2({|Yi|:i2}).\textrm{Median}(Y_{n+1}|X_{n+1})\in\hat{C}_{n}(X_{n+1})\textrm{ if and only if }|X_{n+1}|\leq Q_{1-\alpha/2}(\{|Y_{i}|:i\in\mathcal{I}_{2}\}).

Define MR=#{i2:|Xi||Xn+1|}M_{R}=\#\{i\in\mathcal{I}_{2}:|X_{i}|\geq|X_{n+1}|\} and ME=#{i2:|Yi||Xn+1|}M_{E}=\#\{i\in\mathcal{I}_{2}:|Y_{i}|\geq|X_{n+1}|\}. Now, note that Q1α/2({|Yi|:i2})Q_{1-\alpha/2}(\{|Y_{i}|:i\in\mathcal{I}_{2}\}) is the n2(1α/2)(1+1/n2)=(1α/2)(n2+1)\lceil n_{2}(1-\alpha/2)(1+1/n_{2})\rceil=\lceil(1-\alpha/2)(n_{2}+1)\rceil smallest value of {|Yi|:i2}\{|Y_{i}|:i\in\mathcal{I}_{2}\}, which equals the n2+1(1α/2)(n2+1)n_{2}+1-\lceil(1-\alpha/2)(n_{2}+1)\rceil largest value. Letting m=n2+1(1α/2)(n2+1)m=n_{2}+1-\lceil(1-\alpha/2)(n_{2}+1)\rceil,

|Xn+1|Q1α/2({|Yi|:i2}) if and only if MEm.|X_{n+1}|\leq Q_{1-\alpha/2}(\{|Y_{i}|:i\in\mathcal{I}_{2}\})\textrm{ if and only if }M_{E}\geq m.

We now build up the following two lemmas.

Lemma A.1.

For all 0Mn20\leq M\leq n_{2},

{MR=M}=1n2+1.\mathbb{P}\big{\{}M_{R}=M\big{\}}=\frac{1}{n_{2}+1}.
Proof.

This holds from the fact that the values {|Xi|:i2}{|Xn+1|}\{|X_{i}|:i\in\mathcal{I}_{2}\}\cup\{|X_{n+1}|\} are i.i.d. and have a distribution over [0,0.5][0,0.5] with no point masses. As a result, MRM_{R} is uniformly distributed over {0,1,,n2}\{0,1,\ldots,n_{2}\}. ∎

Lemma A.2.

ME|MRBinom(MR,0.5+δ)M_{E}|M_{R}\sim\textnormal{Binom}(M_{R},0.5+\delta).

Proof.

First, note that for all i2i\in\mathcal{I}_{2}, if |Xi|<|Xn+1||X_{i}|<|X_{n+1}|, then {|Yi||Xn+1|}=0\mathbb{P}\{|Y_{i}|\geq|X_{n+1}|\}=0. This is because if |Xi|<|Xn+1||X_{i}|<|X_{n+1}|, then |Yi||Xi|<|Xn+1||Y_{i}|\leq|X_{i}|<|X_{n+1}|. Additionally, if |Xi||Xn+1||X_{i}|\geq|X_{n+1}|, then {|Yi||Xn+1|}=0.5+δ\mathbb{P}\{|Y_{i}|\geq|X_{n+1}|\}=0.5+\delta. This is due to the fact that if |Xi||Xn+1||X_{i}|\geq|X_{n+1}|, we have that |Yi|=0|Y_{i}|=0 with probability 0.5δ0.5-\delta and |Yi|=|Xi||Y_{i}|=|X_{i}| with probability 0.5+δ0.5+\delta. With probability 11, |Xn+1|>0|X_{n+1}|>0. Therefore, |Yi||Xn+1||Y_{i}|\geq|X_{n+1}| if and only if |Yi|=|Xi||Y_{i}|=|X_{i}|, which occurs with probability 0.5+δ0.5+\delta.

Furthermore, the events {|Yi|=|Xi|}\{|Y_{i}|=|X_{i}|\} are mutually independent for all i2i\in\mathcal{I}_{2} (the pairs (Xi,Yi)(X_{i},Y_{i}) are i.i.d.). Then, since

ME=i21[|Yi||Xn+1|]=i21[|Xi||Xn+1|]1[|Yi|=|Xi|]M_{E}=\sum_{i\in\mathcal{I}_{2}}\textbf{1}[|Y_{i}|\geq|X_{n+1}|]=\sum_{i\in\mathcal{I}_{2}}\textbf{1}[|X_{i}|\geq|X_{n+1}|]\textbf{1}[|Y_{i}|=|X_{i}|]

and each term 1[|Yi|=|Xi|]\textbf{1}[|Y_{i}|=|X_{i}|] is i.i.d. Bernoulli with probability 0.5+δ0.5+\delta, and MR=i21[|Xi||Xn+1|]M_{R}=\sum_{i\in\mathcal{I}_{2}}\textbf{1}[|X_{i}|\geq|X_{n+1}|], the result follows. ∎

We now apply our two lemmas:

{MEm}=\displaystyle\mathbb{P}\big{\{}M_{E}\geq m\big{\}}= j=0n2{MR=j}{MEm|MR=j}\displaystyle\sum_{j=0}^{n_{2}}\mathbb{P}\big{\{}M_{R}=j\big{\}}\mathbb{P}\big{\{}M_{E}\geq m|M_{R}=j\big{\}}
=\displaystyle\frac{1}{n_{2}+1}\sum_{j=0}^{n_{2}}\mathbb{P}\big{\{}M_{E}\geq m|M_{R}=j\big{\}}
=\displaystyle= 1n2+1j=0n2k=mj(jk)(0.5+δ)k(0.5δ)jk;\displaystyle\frac{1}{n_{2}+1}\sum_{j=0}^{n_{2}}\sum_{k=m}^{j}\dbinom{j}{k}(0.5+\delta)^{k}(0.5-\delta)^{j-k};

the second equality follows from Lemma A.1, and the last from Lemma A.2. Now, by applying the same train of logic as used in Appendices B.1 and B.2,

{MEm}=\displaystyle\mathbb{P}\big{\{}M_{E}\geq m\big{\}}= 1n2+1j=0n2(1k=0m1(jk)(0.5+δ)k(0.5δ)jk)\displaystyle\frac{1}{n_{2}+1}\sum_{j=0}^{n_{2}}\Big{(}1-\sum_{k=0}^{m-1}\dbinom{j}{k}(0.5+\delta)^{k}(0.5-\delta)^{j-k}\Big{)}
=\displaystyle= 11n2+1j=0n2k=0m1(jk)(0.5+δ)k(0.5δ)jk\displaystyle 1-\frac{1}{n_{2}+1}\sum_{j=0}^{n_{2}}\sum_{k=0}^{m-1}\dbinom{j}{k}(0.5+\delta)^{k}(0.5-\delta)^{j-k}
=\displaystyle= 11n2+1k=0m1(0.5+δ0.5δ)kj=0n2(jk)(0.5δ)j\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}\Big{(}\frac{0.5+\delta}{0.5-\delta}\Big{)}^{k}\sum_{j=0}^{n_{2}}\dbinom{j}{k}(0.5-\delta)^{j}
=\displaystyle= 11n2+1k=0m1(0.5+δ0.5δ)k((0.5δ)k(0.5+δ)k+1j=n2+1(jk)(0.5δ)j),\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}\Big{(}\frac{0.5+\delta}{0.5-\delta}\Big{)}^{k}\Big{(}\frac{(0.5-\delta)^{k}}{(0.5+\delta)^{k+1}}-\sum_{j=n_{2}+1}^{\infty}\dbinom{j}{k}(0.5-\delta)^{j}\Big{)},

where the last equality is from evaluating the generating function G((tk);z)=t=0(tk)ztG(\binom{t}{k};z)=\sum_{t=0}^{\infty}\binom{t}{k}z^{t} at z=0.5δz=0.5-\delta.
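For completeness, the identity used in this step can be checked directly from the negative binomial series: substituting t=s+k,

\sum_{t=0}^{\infty}\dbinom{t}{k}z^{t}=z^{k}\sum_{s=0}^{\infty}\dbinom{s+k}{k}z^{s}=\frac{z^{k}}{(1-z)^{k+1}},\qquad|z|<1,

so that evaluating at z=0.5-\delta gives \frac{(0.5-\delta)^{k}}{(0.5+\delta)^{k+1}}, as claimed.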

Finally, to turn this into a usable bound, we expand the expression to extract the 1-\alpha term and isolate the remainder, which we then show goes to 0:

{MEm}=\displaystyle\mathbb{P}\big{\{}M_{E}\geq m\big{\}}= 11n2+1k=0m1(10.5+δ(0.5+δ0.5δ)kj=n2+1(jk)(0.5δ)j)\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}\Big{(}\frac{1}{0.5+\delta}-\Big{(}\frac{0.5+\delta}{0.5-\delta}\Big{)}^{k}\sum_{j=n_{2}+1}^{\infty}\dbinom{j}{k}(0.5-\delta)^{j}\Big{)}
\leq\displaystyle 1-\frac{m}{n_{2}+1}\cdot\frac{1}{0.5+\delta}+\sum_{k=0}^{m-1}\Big{(}\frac{0.5+\delta}{0.5-\delta}\Big{)}^{k}\sum_{j=n_{2}+1}^{\infty}\dbinom{j}{k}(0.5-\delta)^{j}
\displaystyle\leq 1mn2+110.5+δ+k=0m1(0.5+δ0.5δ)k(n2+1k)(0.5δ)n2+111(0.5δ)n2+1n2+1k,\displaystyle 1-\frac{m}{n_{2}+1}\cdot\frac{1}{0.5+\delta}+\sum_{k=0}^{m-1}\Big{(}\frac{0.5+\delta}{0.5-\delta}\Big{)}^{k}\dbinom{n_{2}+1}{k}(0.5-\delta)^{n_{2}+1}\frac{1}{1-(0.5-\delta)\frac{n_{2}+1}{n_{2}+1-k}},

where the first inequality comes from dropping the factor \frac{1}{n_{2}+1} in front of the (nonnegative) remainder term, and the second arises from upper bounding the summation \sum_{j=n_{2}+1}^{\infty}\binom{j}{k}(0.5-\delta)^{j} by

(n2+1k)(0.5δ)n2+1j=0(n2+1n2+1k(0.5δ))j\binom{n_{2}+1}{k}(0.5-\delta)^{n_{2}+1}\sum_{j=0}^{\infty}\left(\frac{n_{2}+1}{n_{2}+1-k}(0.5-\delta)\right)^{j}

using the maximum ratio of consecutive terms. Applying m(n2+1)α/2m\leq(n_{2}+1)\alpha/2 twice gives

\displaystyle\mathbb{P}\big{\{}M_{E}\geq m\big{\}}\leq\displaystyle 1-\frac{m}{n_{2}+1}\cdot\frac{1}{0.5+\delta}+\Big{(}\frac{0.5+\delta}{0.5-\delta}\Big{)}^{m-1}\frac{1}{1-(0.5-\delta)\frac{n_{2}+1}{n_{2}+1-m}}\sum_{k=0}^{m-1}\dbinom{n_{2}+1}{k}(0.5-\delta)^{n_{2}+1}
\displaystyle\leq 1mn2+110.5+δ+(0.5+δ0.5δ)m1(2+α1α)k=0m1(n2+1k)(0.5δ)n2+1\displaystyle 1-\frac{m}{n_{2}+1}\cdot\frac{1}{0.5+\delta}+\Big{(}\frac{0.5+\delta}{0.5-\delta}\Big{)}^{m-1}\Big{(}2+\frac{\alpha}{1-\alpha}\Big{)}\sum_{k=0}^{m-1}\dbinom{n_{2}+1}{k}(0.5-\delta)^{n_{2}+1}
\displaystyle\leq 1α+2αδ1+2δ+(2+α1α)(0.5+δ0.5δ)(n2+1)α/2k=0(n2+1)α/2(n2+1k)2n2+1.\displaystyle 1-\alpha+\frac{2\alpha\delta}{1+2\delta}+\Big{(}2+\frac{\alpha}{1-\alpha}\Big{)}\Big{(}\frac{0.5+\delta}{0.5-\delta}\Big{)}^{(n_{2}+1)\alpha/2}\cdot\frac{\sum_{k=0}^{\lfloor(n_{2}+1)\alpha/2\rfloor}\dbinom{n_{2}+1}{k}}{2^{n_{2}+1}}.

Because α<1\alpha<1,

k=0(n2+1)α/2(n2+1k)2n2+10asn2;\frac{\sum_{k=0}^{\lfloor(n_{2}+1)\alpha/2\rfloor}\binom{n_{2}+1}{k}}{2^{n_{2}+1}}\to 0\quad\text{as}\quad n_{2}\to\infty;

this is due to the fact that

k=0n2+1(n2+1k)2n2+1=1\frac{\sum_{k=0}^{n_{2}+1}\binom{n_{2}+1}{k}}{2^{n_{2}+1}}=1

and that the standard deviation of Binom(n2,0.5)\textrm{Binom}(n_{2},0.5) is 𝒪(n2)\mathcal{O}(\sqrt{n_{2}}), meaning that

k=(n2+1)α/2+1n2+1(n2+1)α/2(n2+1k)2n2+11.\frac{\sum_{k=\lfloor(n_{2}+1)\alpha/2\rfloor+1}^{n_{2}+1-\lfloor(n_{2}+1)\alpha/2\rfloor}\binom{n_{2}+1}{k}}{2^{n_{2}+1}}\to 1.

Furthermore, as Binom(n2,0.5)\textrm{Binom}(n_{2},0.5) approaches a normal distribution as n2n_{2}\to\infty and Φ(cn2)\Phi(-c\sqrt{n_{2}}) is 𝒪(dn2)\mathcal{O}(d^{-n_{2}}) for some d>1d>1, for small enough δ\delta,

(0.5+δ0.5δ)(n2+1)α/2k=0(n2+1)α/2(n2+1k)2n2+10asn2.\left(\frac{0.5+\delta}{0.5-\delta}\right)^{(n_{2}+1)\alpha/2}\,\frac{\sum_{k=0}^{\lfloor(n_{2}+1)\alpha/2\rfloor}\binom{n_{2}+1}{k}}{2^{n_{2}+1}}\to 0\quad\text{as}\quad n_{2}\to\infty.

Thus, we can pick DD and NN such that for all δ<D\delta<D and nNn\geq N,

(0.5+δ0.5δ)(n2+1)α/2k=0(n2+1)α/2(n2+1k)2n2+1ϵ/2,\left(\frac{0.5+\delta}{0.5-\delta}\right)^{(n_{2}+1)\alpha/2}\,\frac{\sum_{k=0}^{\lfloor(n_{2}+1)\alpha/2\rfloor}\binom{n_{2}+1}{k}}{2^{n_{2}+1}}\leq\epsilon/2,

noting that n_{2}=n/2. Then, setting \delta=\min\{\frac{\epsilon}{4\alpha-2\epsilon},D\}, which guarantees \frac{2\alpha\delta}{1+2\delta}\leq\epsilon/2, yields

{MEm}1α+ϵ/2+ϵ/2=1α+ϵ.\mathbb{P}\big{\{}M_{E}\geq m\big{\}}\leq 1-\alpha+\epsilon/2+\epsilon/2=1-\alpha+\epsilon.

This says that the probability {MEm}\mathbb{P}\{M_{E}\geq m\} of the confidence interval containing the conditional median is at most 1α+ϵ1-\alpha+\epsilon.
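To illustrate this sharpness numerically, the following sketch simulates one instance consistent with the construction above, taking, purely for illustration, X uniform on [0,0.5] with Y=X with probability 0.5+\delta and Y=0 otherwise, so that \textrm{Median}(Y|X)=X and \hat{\mu}\equiv 0 gives the interval [-Q_{1-\alpha/2}(E),Q_{1-\alpha/2}(E)]. The estimated coverage of the conditional median should land near 1-\alpha rather than near 1-\alpha/2.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, delta, n2, trials = 0.1, 0.01, 2000, 5000
k = int(np.ceil((1 - alpha / 2) * (n2 + 1)))    # index of Q_{1-alpha/2}(E)
covered = 0
for _ in range(trials):
    X = rng.uniform(0, 0.5, size=n2 + 1)        # X[:n2] calibration, X[n2] test point
    Y = X[:n2] * (rng.uniform(size=n2) < 0.5 + delta)
    Q = np.sort(np.abs(Y))[k - 1] if k <= n2 else np.inf
    covered += X[n2] <= Q                       # Median(Y | X) = X is covered
print(covered / trials, "vs 1 - alpha =", 1 - alpha)
```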

A.4 Proof of Theorem 4

We show that given \epsilon, there exist c, N, and split sizes n_{1}+n_{2}=n such that for all n>N, running Algorithm 1 on n datapoints from an arbitrary distribution P with regression function \hat{\mu}_{c} and this choice of split results in a finite confidence interval that contains the conditional median with probability at least 1-\alpha/2-\epsilon.

For each xx in the support of PP, define m(x)=Median(Y|X=x)m(x)=\textrm{Median}(Y|X=x) and recall that M=maxi1|Yi|M=\displaystyle\max_{i\in\mathcal{I}_{1}}|Y_{i}|. We begin with two lemmas.

Lemma A.3.

For all i2i\in\mathcal{I}_{2}, {|Yi|M}11n1+1\mathbb{P}\{|Y_{i}|\leq M\}\geq 1-\frac{1}{n_{1}+1}.

Proof.

This results from the fact that |Y_{i}| is exchangeable with |Y_{j}| for all j\in\mathcal{I}_{1}; thus, the probability that |Y_{i}| is the unique maximum of the set \{|Y_{j}|:j\in\mathcal{I}_{1}\cup\{i\}\} is bounded above by \frac{1}{n_{1}+1}. Taking the complement yields the desired result. ∎

Lemma A.4.

For all i2{n+1}i\in\mathcal{I}_{2}\cup\{n+1\}, {|m(Xi)|M}12n1+1\mathbb{P}\{|m(X_{i})|\leq M\}\geq 1-\frac{2}{n_{1}+1}.

Proof.

Note that |m(X_{i})| is exchangeable with |m(X_{j})| for all j\in\mathcal{I}_{1}. Letting M_{R}=\#\{|m(X_{j})|\geq|m(X_{i})|:j\in\mathcal{I}_{1}\}, exchangeability gives that M_{R} stochastically dominates the uniform distribution over \{0,1,\ldots,n_{1}\}. For each j\in\mathcal{I}_{1}, the event \{|Y_{j}|\geq|m(X_{j})|\} occurs with probability at least 1/2 by definition of the median; moreover, these events are mutually independent. Therefore, if we condition on M_{R}, we have that

{|m(Xi)|>maxj1|Yj||MR=k}j1|m(Xj)||m(Xi)|{|Yj|<|m(Xj)|}2k.\mathbb{P}\{|m(X_{i})|>\displaystyle\max_{j\in\mathcal{I}_{1}}|Y_{j}|\big{|}M_{R}=k\}\leq\prod_{\begin{subarray}{c}j\in\mathcal{I}_{1}\\ |m(X_{j})|\geq|m(X_{i})|\end{subarray}}\mathbb{P}\{|Y_{j}|<|m(X_{j})|\}\leq 2^{-k}.

Putting this together, we see that

{|m(Xi)|>maxj1|Yj|}=\displaystyle\mathbb{P}\{|m(X_{i})|>\max_{j\in\mathcal{I}_{1}}|Y_{j}|\}= k=0n1{MR=k}{|m(Xi)|>maxj1|Yj||MR=k}\displaystyle\sum_{k=0}^{n_{1}}\mathbb{P}\{M_{R}=k\}\mathbb{P}\{|m(X_{i})|>\max_{j\in\mathcal{I}_{1}}|Y_{j}|\big{|}M_{R}=k\}
\displaystyle\leq k=0n1{MR=k}12k\displaystyle\sum_{k=0}^{n_{1}}\mathbb{P}\{M_{R}=k\}\frac{1}{2^{k}}
\displaystyle\leq 1n1+1k=0n112k\displaystyle\frac{1}{n_{1}+1}\sum_{k=0}^{n_{1}}\frac{1}{2^{k}}
\displaystyle\leq 2n1+1.\displaystyle\frac{2}{n_{1}+1}.

Taking the complement yields the desired result. ∎

Let AA be the event {|Yi|M for all i2 and |m(Xi)|M for all i2{n+1}}\{|Y_{i}|\leq M\textrm{ for all }i\in\mathcal{I}_{2}\textrm{ and }|m(X_{i})|\leq M\textrm{ for all }i\in\mathcal{I}_{2}\cup\{n+1\}\}. By Lemmas A.3 and A.4, {A}13n2+2n1+1\mathbb{P}\{A\}\geq 1-\displaystyle\frac{3n_{2}+2}{n_{1}+1}. Select N=12/α+10ϵ+2/α+1N=\displaystyle\Big{\lfloor}\frac{12/\alpha+10}{\epsilon}\Big{\rfloor}+\lfloor 2/\alpha\rfloor+1, and for all n>Nn>N, set n2=2/α+1n_{2}=\lfloor 2/\alpha\rfloor+1 and n1=nn2n_{1}=n-n_{2}. As a result, we have that 1/n2<α/21/n_{2}<\alpha/2 and 3n2+2n1+1<ϵ/2\displaystyle\frac{3n_{2}+2}{n_{1}+1}<\epsilon/2, so {A}1ϵ/2\mathbb{P}\{A\}\geq 1-\epsilon/2.
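For instance, with \alpha=0.1 and \epsilon=0.05, this prescription gives n_{2}=\lfloor 2/\alpha\rfloor+1=21 and N=\lfloor 130/0.05\rfloor+20+1=2621; indeed 1/21<0.05, and for all n>N we have (3n_{2}+2)/(n_{1}+1)=65/(n-20)<0.025=\epsilon/2.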

Next, let BB be the event {|μ^c(Xi1)μ^c(Xi2)|>2M for all i1i22{n+1}}\displaystyle\{|\hat{\mu}_{c}(X_{i_{1}})-\hat{\mu}_{c}(X_{i_{2}})|>2M\textrm{ for all }i_{1}\neq i_{2}\in\mathcal{I}_{2}\cup\{n+1\}\}. Note that limc{B}=1\displaystyle\lim_{c\to\infty}\mathbb{P}\{B\}=1 by definition of μ^c\hat{\mu}_{c}. Select cc such that {B}1ϵ/2\mathbb{P}\{B\}\geq 1-\epsilon/2. By the union bound, {AB}1ϵ\mathbb{P}\{A\cap B\}\geq 1-\epsilon.

Lemma A.5.

On the event ABA\cap B, for all i2i\in\mathcal{I}_{2},

|m(Xi)μ^c(Xi)||m(Xn+1)μ^c(Xn+1)|if and only if|Yiμ^c(Xi)||m(Xn+1)μ^c(Xn+1)|.|m(X_{i})-\hat{\mu}_{c}(X_{i})|\geq|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|\quad\text{if and only if}\quad|Y_{i}-\hat{\mu}_{c}(X_{i})|\geq|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|.
Proof.

Notice that on the event ABA\cap B, |m(Xi)μ^c(Xi)|,|Yiμ^c(Xi)|[|μ^c(Xi)|M,|μ^c(Xi)|+M]|m(X_{i})-\hat{\mu}_{c}(X_{i})|,|Y_{i}-\hat{\mu}_{c}(X_{i})|\in[|\hat{\mu}_{c}(X_{i})|-M,|\hat{\mu}_{c}(X_{i})|+M]. This holds because |m(Xi)|,|Yi|M|m(X_{i})|,|Y_{i}|\leq M on the event AA. Similarly, |m(Xn+1)μ^c(Xn+1)|[|μ^c(Xn+1)|M,|μ^c(Xn+1)|+M]|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|\in[|\hat{\mu}_{c}(X_{n+1})|-M,|\hat{\mu}_{c}(X_{n+1})|+M]. These two intervals both have length 2M2M, but their centers are at a distance greater than 2M2M on the event BB, meaning that the intervals are disjoint. Therefore, |m(Xi)μ^c(Xi)||m(Xn+1)μ^c(Xn+1)||m(X_{i})-\hat{\mu}_{c}(X_{i})|\geq|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})| implies that all elements of the first interval are greater than all elements of the second, so |Yiμ^c(Xi)||m(Xn+1)μ^c(Xn+1)||Y_{i}-\hat{\mu}_{c}(X_{i})|\geq|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|; similarly, |Yiμ^c(Xi)||m(Xn+1)μ^c(Xn+1)||Y_{i}-\hat{\mu}_{c}(X_{i})|\geq|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})| also implies that all elements of the first interval are greater than all elements of the second, so |m(Xi)μ^c(Xi)||m(Xn+1)μ^c(Xn+1)||m(X_{i})-\hat{\mu}_{c}(X_{i})|\geq|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|. ∎

Looking at Algorithm 1, we have that m(Xn+1)C^n(Xn+1)m(X_{n+1})\in\hat{C}_{n}(X_{n+1}) if |m(Xn+1)μ^c(Xn+1)|Q1α/2(E)|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|\leq Q_{1-\alpha/2}(E), where Ei=|Yiμ^c(Xi)|E_{i}=|Y_{i}-\hat{\mu}_{c}(X_{i})| for all i2i\in\mathcal{I}_{2}. Because 1/n2<α/21/n_{2}<\alpha/2, Q1α/2(E)Q_{1-\alpha/2}(E) is finite and thus the confidence interval is bounded. By Lemma A.5, on the event ABA\cap B, |m(Xn+1)μ^c(Xn+1)|Q1α/2(E)|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|\leq Q_{1-\alpha/2}(E) if and only if |m(Xn+1)μ^c(Xn+1)|Q1α/2(F)|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|\leq Q_{1-\alpha/2}(F), where Fi=|m(Xi)μ^c(Xi)|F_{i}=|m(X_{i})-\hat{\mu}_{c}(X_{i})| for all i2i\in\mathcal{I}_{2}.

Define CC to be the event {|m(Xn+1)μ^c(Xn+1)|Q1α/2(F)}\displaystyle\{|m(X_{n+1})-\hat{\mu}_{c}(X_{n+1})|\leq Q_{1-\alpha/2}(F)\}. We have just shown that on the event ABCA\cap B\cap C, we have m(Xn+1)C^n(Xn+1)m(X_{n+1})\in\hat{C}_{n}(X_{n+1}). Additionally, because the elements of {|m(Xi)μ^c(Xi)|:i2{n+1}}\{|m(X_{i})-\hat{\mu}_{c}(X_{i})|:i\in\mathcal{I}_{2}\cup\{n+1\}\} are exchangeable, we have that {C}1α/2\mathbb{P}\{C\}\geq 1-\alpha/2. Then, by the union bound,

{m(Xn+1)C^n(Xn+1)}{ABC}1α/2ϵ,\mathbb{P}\{m(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}\geq\mathbb{P}\{A\cap B\cap C\}\geq 1-\alpha/2-\epsilon,

proving the desired result.

A.5 Proving Extensions of Theorem 6

We show the following two results:

Let C^n(x)=[L^n(x),H^n(x)]\hat{C}_{n}(x)=[\hat{L}_{n}(x),\hat{H}_{n}(x)] be any algorithm that only outputs confidence intervals and satisfies {Yn+1L^n(Xn+1)}1rq\mathbb{P}\{Y_{n+1}\geq\hat{L}_{n}(X_{n+1})\}\geq 1-rq and {Yn+1H^n(Xn+1)}1s(1q)\mathbb{P}\{Y_{n+1}\leq\hat{H}_{n}(X_{n+1})\}\geq 1-s(1-q) for some r+s=αr+s=\alpha. Then, C^n\hat{C}_{n} satisfies (1α,q)(1-\alpha,q)-Quantile. Secondly, if C^n\hat{C}_{n} only outputs confidence intervals and satisfies (1min{q,1q}α)(1-\min\{q,1-q\}\alpha)-Predictive, then it satisfies (1α,q)(1-\alpha,q)-Quantile.

Consider a distribution PP, and for all xdx\in\mathbb{R}^{d} in the support of PP, let q(x)q(x) be Quantileq(Y|X=x)\textrm{Quantile}_{q}(Y|X=x). We prove the first result in two parts.

Conditioning on whether or not q(Xn+1)q(X_{n+1}) is greater than or equal to L^(Xn+1)\hat{L}(X_{n+1}), note that

1rq\displaystyle 1-rq\leq {Yn+1L^n(Xn+1)}\displaystyle\mathbb{P}\{Y_{n+1}\geq\hat{L}_{n}(X_{n+1})\}
=\displaystyle= {Yn+1L^n(Xn+1)|q(Xn+1)L^n(Xn+1)}{q(Xn+1)L^n(Xn+1)}\displaystyle\mathbb{P}\{Y_{n+1}\geq\hat{L}_{n}(X_{n+1})|q(X_{n+1})\geq\hat{L}_{n}(X_{n+1})\}\mathbb{P}\{q(X_{n+1})\geq\hat{L}_{n}(X_{n+1})\}
+{Yn+1L^n(Xn+1)|q(Xn+1)<L^n(Xn+1)}{q(Xn+1)<L^n(Xn+1)}\displaystyle+\mathbb{P}\{Y_{n+1}\geq\hat{L}_{n}(X_{n+1})|q(X_{n+1})<\hat{L}_{n}(X_{n+1})\}\mathbb{P}\{q(X_{n+1})<\hat{L}_{n}(X_{n+1})\}
\displaystyle\leq {q(Xn+1)L^n(Xn+1)}+(1q){q(Xn+1)<L^n(Xn+1)}\displaystyle\mathbb{P}\{q(X_{n+1})\geq\hat{L}_{n}(X_{n+1})\}+(1-q)\mathbb{P}\{q(X_{n+1})<\hat{L}_{n}(X_{n+1})\}
=\displaystyle= (1q)+q{q(Xn+1)L^n(Xn+1)}\displaystyle(1-q)+q\cdot\mathbb{P}\{q(X_{n+1})\geq\hat{L}_{n}(X_{n+1})\}

where we use the fact that if q(Xn+1)<L^n(Xn+1)q(X_{n+1})<\hat{L}_{n}(X_{n+1}), at most 1q1-q of the conditional distribution of Y|X=Xn+1Y|X=X_{n+1} can be at least L^n(Xn+1)\hat{L}_{n}(X_{n+1}). Subtracting 1q1-q from both sides and dividing by qq tells us that 1r{q(Xn+1)L^n(Xn+1)}1-r\leq\mathbb{P}\{q(X_{n+1})\geq\hat{L}_{n}(X_{n+1})\}.

Similarly,

1s(1q)\displaystyle 1-s(1-q)\leq {Yn+1H^n(Xn+1)}\displaystyle\mathbb{P}\{Y_{n+1}\leq\hat{H}_{n}(X_{n+1})\}
=\displaystyle= {Yn+1H^n(Xn+1)|q(Xn+1)H^n(Xn+1)}{q(Xn+1)H^n(Xn+1)}\displaystyle\mathbb{P}\{Y_{n+1}\leq\hat{H}_{n}(X_{n+1})|q(X_{n+1})\leq\hat{H}_{n}(X_{n+1})\}\mathbb{P}\{q(X_{n+1})\leq\hat{H}_{n}(X_{n+1})\}
+{Yn+1H^n(Xn+1)|q(Xn+1)>H^n(Xn+1)}{q(Xn+1)>H^n(Xn+1)}\displaystyle+\mathbb{P}\{Y_{n+1}\leq\hat{H}_{n}(X_{n+1})|q(X_{n+1})>\hat{H}_{n}(X_{n+1})\}\mathbb{P}\{q(X_{n+1})>\hat{H}_{n}(X_{n+1})\}
\displaystyle\leq {q(Xn+1)H^n(Xn+1)}+q{q(Xn+1)>H^n(Xn+1)}\displaystyle\mathbb{P}\{q(X_{n+1})\leq\hat{H}_{n}(X_{n+1})\}+q\cdot\mathbb{P}\{q(X_{n+1})>\hat{H}_{n}(X_{n+1})\}
=\displaystyle= q+(1q){q(Xn+1)H^n(Xn+1)}.\displaystyle q+(1-q)\mathbb{P}\{q(X_{n+1})\leq\hat{H}_{n}(X_{n+1})\}.

Subtracting qq from both sides and dividing by 1q1-q tells us that 1s{q(Xn+1)H^n(Xn+1)}1-s\leq\mathbb{P}\{q(X_{n+1})\leq\hat{H}_{n}(X_{n+1})\}.

Then, by the union bound, we have that 1-\alpha=1-(r+s)\leq\mathbb{P}\{q(X_{n+1})\geq\hat{L}_{n}(X_{n+1})\textrm{ and }q(X_{n+1})\leq\hat{H}_{n}(X_{n+1})\}=\mathbb{P}\{q(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}, proving the first result.

We can prove the second result in the exact same fashion as the proof of Theorem 6. We have that

1min{q,1q}α\displaystyle 1-\min\{q,1-q\}\alpha\leq {Yn+1C^n(Xn+1)}\displaystyle\mathbb{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})\}
=\displaystyle= {Yn+1C^n(Xn+1)|q(Xn+1)C^n(Xn+1)}{q(Xn+1)C^n(Xn+1)}\displaystyle\mathbb{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})|q(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}\mathbb{P}\{q(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}
+{Yn+1C^n(Xn+1)|q(Xn+1)C^n(Xn+1)}{q(Xn+1)C^n(Xn+1)}\displaystyle+\mathbb{P}\{Y_{n+1}\in\hat{C}_{n}(X_{n+1})|q(X_{n+1})\not\in\hat{C}_{n}(X_{n+1})\}\mathbb{P}\{q(X_{n+1})\not\in\hat{C}_{n}(X_{n+1})\}
\displaystyle\leq {q(Xn+1)C^n(Xn+1)}+max{q,1q}{q(Xn+1)C^n(Xn+1)}\displaystyle\mathbb{P}\{q(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}+\max\{q,1-q\}\mathbb{P}\{q(X_{n+1})\not\in\hat{C}_{n}(X_{n+1})\}
=\displaystyle= max{q,1q}+min{q,1q}{q(Xn+1)C^n(Xn+1)}.\displaystyle\max\{q,1-q\}+\min\{q,1-q\}\mathbb{P}\{q(X_{n+1})\in\hat{C}_{n}(X_{n+1})\}.

Subtracting max{q,1q}\max\{q,1-q\} and dividing by min{q,1q}\min\{q,1-q\} yields the desired result.

Appendix B Additional Results

B.1 Alternate Proof of Theorem 1

The proof of the theorem relies on two lemmas: the first establishes a connection between Median(Yn+1|Xn+1)\textrm{Median}(Y_{n+1}|X_{n+1}) and Median(Y|X=Xi)\textrm{Median}(Y|X=X_{i})’s using exchangeability, and the second gives us a relationship between Median(Y|X=Xi)\textrm{Median}(Y|X=X_{i})’s and the YiY_{i}’s.

We begin with some notation. For all xdx\in\mathbb{R}^{d} in the support of PP, set m(x)=Median(Y|X=x)m(x)=\textrm{Median}(Y|X=x). Also, for 1in+11\leq i\leq n+1, let R(Xi)=|m(Xi)μ^(Xi)|R(X_{i})=|m(X_{i})-\hat{\mu}(X_{i})|, and for 1in1\leq i\leq n let E(Xi)=|Yiμ^(Xi)|E(X_{i})=|Y_{i}-\hat{\mu}(X_{i})|. Finally, put MR=#{i2:R(Xi)R(Xn+1)}M_{R}=\#\{i\in\mathcal{I}_{2}:R(X_{i})\geq R(X_{n+1})\} as the number of i2i\in\mathcal{I}_{2} for which R(Xi)R(Xn+1)R(X_{i})\geq R(X_{n+1}), and ME=#{i2:E(Xi)R(Xn+1)}M_{E}=\#\{i\in\mathcal{I}_{2}:E(X_{i})\geq R(X_{n+1})\} as the number of i2i\in\mathcal{I}_{2} for which E(Xi)R(Xn+1)E(X_{i})\geq R(X_{n+1}).

Note that Median(Yn+1|Xn+1)C^n(Xn+1)=[μ^(Xn+1)Q1α/2(E),μ^(Xn+1)+Q1α/2(E)]\textrm{Median}(Y_{n+1}|X_{n+1})\in\hat{C}_{n}(X_{n+1})=[\hat{\mu}(X_{n+1})-Q_{1-\alpha/2}(E),\hat{\mu}(X_{n+1})+Q_{1-\alpha/2}(E)] if and only if |Median(Yn+1|Xn+1)μ^(Xn+1)|=R(Xn+1)|\textrm{Median}(Y_{n+1}|X_{n+1})-\hat{\mu}(X_{n+1})|=R(X_{n+1}) is at most Q1α/2(E)Q_{1-\alpha/2}(E). Thus, we must study the value of R(Xn+1)R(X_{n+1}) in relation to the elements of {E(Xi):i2}\{E(X_{i}):i\in\mathcal{I}_{2}\}.

The first lemma relates R(Xn+1)R(X_{n+1}) to the other R(Xi)R(X_{i})’s.

Lemma B.1.

For all 0mn20\leq m\leq n_{2},

{MRm}1mn2+1.\mathbb{P}\big{\{}M_{R}\geq m\big{\}}\geq 1-\frac{m}{n_{2}+1}.
Proof.

Our statement follows from the fact that our samples are i.i.d. Because μ^\hat{\mu} is independent of (Xi,Yi)(X_{i},Y_{i}) for i2i\in\mathcal{I}_{2} and independent of Xn+1X_{n+1}, the values {R(Xi):i2}{R(Xn+1)}\{R(X_{i}):i\in\mathcal{I}_{2}\}\cup\{R(X_{n+1})\} are i.i.d. as well. Then, the probability of less than mm values in {R(Xi):i2}\{R(X_{i}):i\in\mathcal{I}_{2}\} being at least R(Xn+1)R(X_{n+1}) is bounded above by m|2|+1\frac{m}{|\mathcal{I}_{2}|+1}, as the ordering of these values is uniformly random. Taking the complement of both sides gives the result. ∎

The second lemma gives a direct relationship between E(Xi)E(X_{i}) and R(Xi)R(X_{i}). (Note that the events {E(Xi)R(Xi)}\{E(X_{i})\geq R(X_{i})\} below are mutually independent.)

Lemma B.2.

For all i2i\in\mathcal{I}_{2},

{E(Xi)R(Xi)}1/2.\mathbb{P}\big{\{}E(X_{i})\geq R(X_{i})\big{\}}\geq 1/2.
Proof.

We see that {Yim(Xi)}1/2\mathbb{P}\{Y_{i}\geq m(X_{i})\}\geq 1/2 and {Yim(Xi)}1/2\mathbb{P}\{Y_{i}\leq m(X_{i})\}\geq 1/2 by the definition of the conditional median. Furthermore, the events {Yim(Xi)}\{Y_{i}\geq m(X_{i})\} and {Yim(Xi)}\{Y_{i}\leq m(X_{i})\} are independent of the events {m(Xi)μ^(Xi)}\{m(X_{i})\geq\hat{\mu}(X_{i})\} and {m(Xi)μ^(Xi)}\{m(X_{i})\leq\hat{\mu}(X_{i})\} given XiX_{i}, as μ^\hat{\mu} is a function of {(Xi,Yi):i1}\{(X_{i},Y_{i}):i\in\mathcal{I}_{1}\} and the datapoints are i.i.d. Then, conditioned on XiX_{i}, if m(Xi)μ^(Xi)m(X_{i})\geq\hat{\mu}(X_{i}), with probability 1/21/2 we have that m(Xi)Yim(X_{i})\leq Y_{i}, in which case |m(Xi)μ^(Xi)|=m(Xi)μ^(Xi)Yiμ^(Xi)=|Yiμ^(Xi)||m(X_{i})-\hat{\mu}(X_{i})|=m(X_{i})-\hat{\mu}(X_{i})\leq Y_{i}-\hat{\mu}(X_{i})=|Y_{i}-\hat{\mu}(X_{i})|. Similarly, conditioned on XiX_{i} again, if m(Xi)<μ^(Xi)m(X_{i})<\hat{\mu}(X_{i}), with probability 1/21/2 we have that m(Xi)Yim(X_{i})\geq Y_{i}, in which case |m(Xi)μ^(Xi)|=μ^(Xi)m(Xi)μ^(Xi)Yi=|Yiμ^(Xi)||m(X_{i})-\hat{\mu}(X_{i})|=\hat{\mu}(X_{i})-m(X_{i})\leq\hat{\mu}(X_{i})-Y_{i}=|Y_{i}-\hat{\mu}(X_{i})|. The conclusion holds conditionally in both cases; marginalizing out XiX_{i} yields the desired result. ∎
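As a quick numeric illustration of Lemma B.2, consider one concrete conditional distribution, say Y|X_{i} exponential with rate 1 (so that m(X_{i})=\log 2) and an arbitrary fitted value \hat{\mu}(X_{i})=0.2; the probability below comes out at 0.5 up to Monte Carlo error, consistent with the bound. The specific distribution and fitted value are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.exponential(size=1_000_000)   # Y | X_i ~ Exp(1), conditional median log 2
m, mu_hat = np.log(2), 0.2
print(np.mean(np.abs(Y - mu_hat) >= abs(m - mu_hat)))   # should be >= 0.5
```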

We now study the number of datapoints obeying E(Xi)R(Xn+1)E(X_{i})\geq R(X_{n+1}) by combining these lemmas together. Consider any 0mn20\leq m\leq n_{2}. Note that by conditioning on MRM_{R},

{MEm}=\displaystyle\mathbb{P}\big{\{}M_{E}\geq m\big{\}}= j=0n2{MR=j}{MEm|MR=j}\displaystyle\sum_{j=0}^{n_{2}}\mathbb{P}\big{\{}M_{R}=j\big{\}}\mathbb{P}\big{\{}M_{E}\geq m|M_{R}=j\big{\}}
\displaystyle\geq j=0n2{MR=j}k=mj(jk)2j\displaystyle\sum_{j=0}^{n_{2}}\mathbb{P}\big{\{}M_{R}=j\big{\}}\sum_{k=m}^{j}\dbinom{j}{k}2^{-j}
\displaystyle\geq j=0n21n2+1k=mj(jk)2j.\displaystyle\sum_{j=0}^{n_{2}}\frac{1}{n_{2}+1}\sum_{k=m}^{j}\dbinom{j}{k}2^{-j}.

The first inequality holds true by Lemma B.2, which implies that \mathbb{P}\{M_{E}\geq m|M_{R}=j\} can be bounded below by the probability that M\geq m for M\sim\textrm{Binom}(j,0.5). The second inequality is due to Lemma B.1. We know that \sum_{k=m}^{j}\binom{j}{k}2^{-j} is a nondecreasing function of j; by Lemma B.1, M_{R} stochastically dominates the uniform distribution over \{0,1,\ldots,n_{2}\}, meaning that

𝔼MR[k=mMR(MRk)2MR]1n2+1j=0n2k=mj(jk)2j.\mathbb{E}_{M_{R}}\left[\sum_{k=m}^{M_{R}}\binom{M_{R}}{k}2^{-M_{R}}\right]\geq\frac{1}{n_{2}+1}\sum_{j=0}^{n_{2}}\sum_{k=m}^{j}\binom{j}{k}2^{-j}.

This gives

j=0n21n2+1k=mj(jk)2j=\displaystyle\sum_{j=0}^{n_{2}}\frac{1}{n_{2}+1}\sum_{k=m}^{j}\dbinom{j}{k}2^{-j}= j=0n21n2+1(1k=0m1(jk)2j)\displaystyle\sum_{j=0}^{n_{2}}\frac{1}{n_{2}+1}\Big{(}1-\sum_{k=0}^{m-1}\dbinom{j}{k}2^{-j}\Big{)}
=\displaystyle= 11n2+1j=0n2k=0m1(jk)2j\displaystyle 1-\frac{1}{n_{2}+1}\sum_{j=0}^{n_{2}}\sum_{k=0}^{m-1}\dbinom{j}{k}2^{-j}
=\displaystyle= 11n2+1k=0m1j=0n2(jk)2j\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}\sum_{j=0}^{n_{2}}\dbinom{j}{k}2^{-j}
\displaystyle\geq 11n2+1k=0m1j=0(jk)2j\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}\sum_{j=0}^{\infty}\dbinom{j}{k}2^{-j}
=\displaystyle= 11n2+1k=0m12\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}2
=\displaystyle= 12mn2+1.\displaystyle 1-\frac{2m}{n_{2}+1}.

The second-to-last equality comes from evaluating the generating function G((tk);z)=t=0(tk)ztG(\binom{t}{k};z)=\sum_{t=0}^{\infty}\binom{t}{k}z^{t} at z=0.5z=0.5.

Putting it together, in Algorithm 1, Q1α/2(E)Q_{1-\alpha/2}(E) is set to be the (1α/2)(1+1/n2)(1-\alpha/2)(1+1/n_{2})-th quantile of {Ei:i2}\{E_{i}:i\in\mathcal{I}_{2}\}. This is equal to the n2(1α/2)(1+1/n2)=(1α/2)(n2+1)\lceil n_{2}(1-\alpha/2)(1+1/n_{2})\rceil=\lceil(1-\alpha/2)(n_{2}+1)\rceil smallest value of {Ei:i2}\{E_{i}:i\in\mathcal{I}_{2}\}. This means that if MEn2(1α/2)(n2+1)+1M_{E}\geq n_{2}-\lceil(1-\alpha/2)(n_{2}+1)\rceil+1, then R(Xn+1)R(X_{n+1}) will be at most the n2(1α/2)(n2+1)+1n_{2}-\lceil(1-\alpha/2)(n_{2}+1)\rceil+1 largest value of {Ei:i2}\{E_{i}:i\in\mathcal{I}_{2}\}, which is equal to the (1α/2)(n2+1)\lceil(1-\alpha/2)(n_{2}+1)\rceil smallest value, or Q1α/2(E)Q_{1-\alpha/2}(E).

However, we have calculated a lower bound for the inverse CDF of MEM_{E} earlier. Substituting this in, we get that

{R(Xn+1)Q1α/2(E)}\displaystyle\mathbb{P}\{R(X_{n+1})\leq Q_{1-\alpha/2}(E)\}\geq {MEn2(1α/2)(n2+1)+1}\displaystyle\mathbb{P}\{M_{E}\geq n_{2}-\lceil(1-\alpha/2)(n_{2}+1)\rceil+1\}
\displaystyle\geq {MEn2+1(1α/2)(n2+1)}\displaystyle\mathbb{P}\{M_{E}\geq n_{2}+1-(1-\alpha/2)(n_{2}+1)\}
=\displaystyle= {MEα/2(n2+1)}\displaystyle\mathbb{P}\{M_{E}\geq\alpha/2(n_{2}+1)\}
\displaystyle\geq 1α\displaystyle 1-\alpha

by our previous calculation, completing our proof.

B.2 Alternate Proof of Theorem 2

Our approach is similar to that in Appendix B.1. The main difference in the proof arises from the fact that Algorithm 2 no longer uses the absolute value and instead uses two separately fitted functions, meaning that it is important to bound the probability of the confidence interval covering the desired value from both sides.

We begin with some definitions. For all xdx\in\mathbb{R}^{d} in the support of PP, let q(x)q(x) be Quantileq(Y|X=x)\textrm{Quantile}_{q}(Y|X=x). For all 1in+11\leq i\leq n+1, define:

  • Rlo(Xi)=flo(Xi,q(Xi))R^{\textnormal{lo}}(X_{i})=f^{\textnormal{lo}}(X_{i},q(X_{i})) and Rhi(Xi)=fhi(Xi,q(Xi))R^{\textnormal{hi}}(X_{i})=f^{\textnormal{hi}}(X_{i},q(X_{i})).

  • Elo(Xi)=flo(Xi,Yi)E^{\textnormal{lo}}(X_{i})=f^{\textnormal{lo}}(X_{i},Y_{i}) and Ehi(Xi)=fhi(Xi,Yi)E^{\textnormal{hi}}(X_{i})=f^{\textnormal{hi}}(X_{i},Y_{i}).

  • MRlo=#{i2:Rlo(Xi)Rlo(Xn+1)}M_{R}^{\textrm{lo}}=\#\{i\in\mathcal{I}_{2}:R^{\textnormal{lo}}(X_{i})\leq R^{\textnormal{lo}}(X_{n+1})\} and MRhi=#{i2:Rhi(Xi)Rhi(Xn+1)}M_{R}^{\textnormal{hi}}=\#\{i\in\mathcal{I}_{2}:R^{\textnormal{hi}}(X_{i})\geq R^{\textnormal{hi}}(X_{n+1})\} as the number of i2i\in\mathcal{I}_{2} for which Rlo(Xi)Rlo(Xn+1)R^{\textnormal{lo}}(X_{i})\leq R^{\textnormal{lo}}(X_{n+1}) and Rhi(Xi)Rhi(Xn+1)R^{\textnormal{hi}}(X_{i})\geq R^{\textnormal{hi}}(X_{n+1}) respectively.

  • MElo=#{i2:Elo(Xi)Rlo(Xn+1)}M_{E}^{\textrm{lo}}=\#\{i\in\mathcal{I}_{2}:E^{\textnormal{lo}}(X_{i})\leq R^{\textnormal{lo}}(X_{n+1})\} and MEhi=#{i2:Ehi(Xi)Rhi(Xn+1)}M_{E}^{\textnormal{hi}}=\#\{i\in\mathcal{I}_{2}:E^{\textnormal{hi}}(X_{i})\geq R^{\textnormal{hi}}(X_{n+1})\} as the number of i2i\in\mathcal{I}_{2} for which Elo(Xi)Rlo(Xn+1)E^{\textnormal{lo}}(X_{i})\leq R^{\textnormal{lo}}(X_{n+1}) and Ehi(Xi)Rhi(Xn+1)E^{\textnormal{hi}}(X_{i})\geq R^{\textnormal{hi}}(X_{n+1}) respectively.

Now, note that \textrm{Quantile}_{q}(Y|X=X_{n+1})\in\hat{C}_{n}(X_{n+1}) precisely when f^{\textnormal{lo}}(X_{n+1},\textrm{Quantile}_{q}(Y|X=X_{n+1}))=R^{\textnormal{lo}}(X_{n+1})\geq Q^{\textnormal{lo}}_{rq} and f^{\textnormal{hi}}(X_{n+1},\textrm{Quantile}_{q}(Y|X=X_{n+1}))=R^{\textnormal{hi}}(X_{n+1})\leq Q^{\textnormal{hi}}_{1-s(1-q)}. As such, we proceed to develop two lemmas which extend those from Appendix B.1: the first helps us to understand the distributions of M_{R}^{\textnormal{hi}} and M_{R}^{\textrm{lo}}, and the second studies the individual relationships between E^{\textnormal{lo}}(X_{i}) and R^{\textnormal{lo}}(X_{i}) and between E^{\textnormal{hi}}(X_{i}) and R^{\textnormal{hi}}(X_{i}). With both of these lemmas, we are able to bound the probability of each event \{R^{\textnormal{lo}}(X_{n+1})\geq Q^{\textnormal{lo}}_{rq}\} and \{R^{\textnormal{hi}}(X_{n+1})\leq Q^{\textnormal{hi}}_{1-s(1-q)}\}.

Lemma B.3.

For all 0mn20\leq m\leq n_{2},

{MRlom}1mn2+1\mathbb{P}\big{\{}M_{R}^{\textnormal{lo}}\geq m\big{\}}\geq 1-\frac{m}{n_{2}+1}

and

{MRhim}1mn2+1.\mathbb{P}\big{\{}M_{R}^{\textnormal{hi}}\geq m\big{\}}\geq 1-\frac{m}{n_{2}+1}.
Proof.

We prove the result for MRhiM_{R}^{\textnormal{hi}}; the same approach holds for MRloM_{R}^{\textrm{lo}}. Because fhif^{\textnormal{hi}} is independent of (Xi,Yi)(X_{i},Y_{i}) for i2i\in\mathcal{I}_{2} and independent of Xn+1X_{n+1}, the values in {Rhi(Xi):i2}{Rhi(Xn+1)}\{R^{\textnormal{hi}}(X_{i}):i\in\mathcal{I}_{2}\}\cup\{R^{\textnormal{hi}}(X_{n+1})\} are i.i.d.. Thus, the probability of less than mm values in {Rhi(Xi):i2}\{R^{\textnormal{hi}}(X_{i}):i\in\mathcal{I}_{2}\} being at least Rhi(Xn+1)R^{\textnormal{hi}}(X_{n+1}) is bounded above by m|2|+1\frac{m}{|\mathcal{I}_{2}|+1}, as the ordering of these values is uniformly random. Taking the complement of both sides establishes the claim. ∎

Lemma B.4.

For all i2i\in\mathcal{I}_{2},

{Elo(Xi)Rlo(Xi)}q\mathbb{P}\big{\{}E^{\textnormal{lo}}(X_{i})\leq R^{\textnormal{lo}}(X_{i})\big{\}}\geq q

and

{Ehi(Xi)Rhi(Xi)}1q.\mathbb{P}\big{\{}E^{\textnormal{hi}}(X_{i})\geq R^{\textnormal{hi}}(X_{i})\big{\}}\geq 1-q.
Proof.

We see that {Yiq(Xi)}q\mathbb{P}\{Y_{i}\leq q(X_{i})\}\geq q and {Yiq(Xi)}1q\mathbb{P}\{Y_{i}\geq q(X_{i})\}\geq 1-q by the definition of the conditional quantile. Then, with probability at least qq we have that Elo(Xi)=flo(Xi,Yi)flo(Xi,q(Xi))=Rlo(Xi)E^{\textnormal{lo}}(X_{i})=f^{\textnormal{lo}}(X_{i},Y_{i})\leq f^{\textnormal{lo}}(X_{i},q(X_{i}))=R^{\textnormal{lo}}(X_{i}) by the definition of a locally nondecreasing conformity score. Similarly, with probability at least 1q1-q we have that Ehi(Xi)=fhi(Xi,Yi)fhi(Xi,q(Xi))=Rhi(Xi)E^{\textnormal{hi}}(X_{i})=f^{\textnormal{hi}}(X_{i},Y_{i})\geq f^{\textnormal{hi}}(X_{i},q(X_{i}))=R^{\textnormal{hi}}(X_{i}), thereby concluding the proof. ∎

We now study the number of i2i\in\mathcal{I}_{2} such that Ehi(Xi)Rhi(Xn+1)E^{\textnormal{hi}}(X_{i})\geq R^{\textnormal{hi}}(X_{n+1}) by combining these two lemmas together. Consider any 0mn20\leq m\leq n_{2}, and note that by conditioning on MRhiM_{R}^{\textnormal{hi}}, we have

{MEhim}=\displaystyle\mathbb{P}\big{\{}M_{E}^{\textnormal{hi}}\geq m\big{\}}= j=0n2{MRhi=j}{MEhim|MRhi=j}\displaystyle\sum_{j=0}^{n_{2}}\mathbb{P}\big{\{}M_{R}^{\textnormal{hi}}=j\big{\}}\mathbb{P}\big{\{}M_{E}^{\textnormal{hi}}\geq m|M_{R}^{\textnormal{hi}}=j\big{\}}
\displaystyle\geq j=0n2{MRhi=j}k=mj(jk)(1q)kqjk\displaystyle\sum_{j=0}^{n_{2}}\mathbb{P}\big{\{}M_{R}^{\textnormal{hi}}=j\big{\}}\sum_{k=m}^{j}\dbinom{j}{k}(1-q)^{k}q^{j-k}
\displaystyle\geq j=0n21n2+1k=mj(jk)(1q)kqjk.\displaystyle\sum_{j=0}^{n_{2}}\frac{1}{n_{2}+1}\sum_{k=m}^{j}\dbinom{j}{k}(1-q)^{k}q^{j-k}.

The first inequality holds because \mathbb{P}\{M_{E}^{\textnormal{hi}}\geq m|M_{R}^{\textnormal{hi}}=j\} can be bounded below by the probability that M\geq m for M\sim\textrm{Binom}(j,1-q) by Lemma B.4. The second inequality holds since M_{R}^{\textnormal{hi}} stochastically dominates the uniform distribution over \{0,1,\ldots,n_{2}\} by Lemma B.3. Then, as \sum_{k=m}^{j}\binom{j}{k}(1-q)^{k}q^{j-k} is a nondecreasing function of j, we have

𝔼MRhi[k=mMRhi(MRhik)(1q)kqMRhik]1n2+1j=0n2k=mj(jk)(1q)kqjk.\mathbb{E}_{M_{R}^{\textnormal{hi}}}\left[\sum_{k=m}^{M_{R}^{\textnormal{hi}}}\binom{M_{R}^{\textnormal{hi}}}{k}(1-q)^{k}q^{M_{R}^{\textnormal{hi}}-k}\right]\geq\frac{1}{n_{2}+1}\sum_{j=0}^{n_{2}}\sum_{k=m}^{j}\binom{j}{k}(1-q)^{k}q^{j-k}.

We now solve the summation, which gives

j=0n21n2+1k=mj(jk)(1q)kqjk=\displaystyle\sum_{j=0}^{n_{2}}\frac{1}{n_{2}+1}\sum_{k=m}^{j}\dbinom{j}{k}(1-q)^{k}q^{j-k}= j=0n21n2+1(1k=0m1(jk)(1q)kqjk)\displaystyle\sum_{j=0}^{n_{2}}\frac{1}{n_{2}+1}\Big{(}1-\sum_{k=0}^{m-1}\dbinom{j}{k}(1-q)^{k}q^{j-k}\Big{)}
=\displaystyle= 11n2+1j=0n2k=0m1(jk)(1q)kqjk\displaystyle 1-\frac{1}{n_{2}+1}\sum_{j=0}^{n_{2}}\sum_{k=0}^{m-1}\dbinom{j}{k}(1-q)^{k}q^{j-k}
=\displaystyle= 11n2+1k=0m1j=0n2(jk)(1q)kqjk\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}\sum_{j=0}^{n_{2}}\dbinom{j}{k}(1-q)^{k}q^{j-k}
\displaystyle\geq 11n2+1k=0m1(1qq)kj=0(jk)qj\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}\Big{(}\frac{1-q}{q}\Big{)}^{k}\sum_{j=0}^{\infty}\dbinom{j}{k}q^{j}
=\displaystyle= 11n2+1k=0m111q\displaystyle 1-\frac{1}{n_{2}+1}\sum_{k=0}^{m-1}\frac{1}{1-q}
=\displaystyle= 111qmn2+1,\displaystyle 1-\frac{1}{1-q}\cdot\frac{m}{n_{2}+1},

where the second-to-last equality is from evaluating the generating function G((tk);z)=t=0(tk)ztG(\binom{t}{k};z)=\sum_{t=0}^{\infty}\binom{t}{k}z^{t} at z=qz=q.

Note that this same calculation works for counting the i2i\in\mathcal{I}_{2} with Elo(Xi)Rlo(Xn+1)E^{\textnormal{lo}}(X_{i})\leq R^{\textnormal{lo}}(X_{n+1}) using the same lemmas, substituting MEloM_{E}^{\textrm{lo}} for MEhiM_{E}^{\textnormal{hi}}, MRloM_{R}^{\textrm{lo}} for MRhiM_{R}^{\textnormal{hi}}, and qq for 1q1-q within the calculation. This gives

{MElom}11qmn2+1.\mathbb{P}\big{\{}M_{E}^{\textnormal{lo}}\geq m\big{\}}\geq 1-\frac{1}{q}\cdot\frac{m}{n_{2}+1}.

Now, in Algorithm 2, Q1s(1q)hi(E)Q_{1-s(1-q)}^{\textnormal{hi}}(E) is defined as the (1s(1q))(1+1/n2)(1-s(1-q))(1+1/n_{2})-th quantile of {Eihi:i2}\{E^{\textnormal{hi}}_{i}:i\in\mathcal{I}_{2}\}. This is equal to the n2(1s(1q))(1+1/n2)=(1s(1q))(n2+1)\lceil n_{2}(1-s(1-q))(1+1/n_{2})\rceil=\lceil(1-s(1-q))(n_{2}+1)\rceil smallest value of {Eihi:i2}\{E^{\textnormal{hi}}_{i}:i\in\mathcal{I}_{2}\}. Thus, if MEhin2(1s(1q))(n2+1)+1M_{E}^{\textnormal{hi}}\geq n_{2}-\lceil(1-s(1-q))(n_{2}+1)\rceil+1, then Rhi(Xn+1)R^{\textnormal{hi}}(X_{n+1}) will be at most the n2(1s(1q))(n2+1)+1n_{2}-\lceil(1-s(1-q))(n_{2}+1)\rceil+1 largest value of {Eihi:i2}\{E^{\textnormal{hi}}_{i}:i\in\mathcal{I}_{2}\}, which is equal to the (1s(1q))(n2+1)\lceil(1-s(1-q))(n_{2}+1)\rceil smallest value, or Q1s(1q)hi(E)Q_{1-s(1-q)}^{\textnormal{hi}}(E). Then, using our earlier lower bound for the inverse CDF of MEhiM_{E}^{\textnormal{hi}}, we get that

{Rhi(Xn+1)Q1s(1q)hi(E)}\displaystyle\mathbb{P}\{R^{\textnormal{hi}}(X_{n+1})\leq Q_{1-s(1-q)}^{\textnormal{hi}}(E)\}\geq {MEhin2(1s(1q))(n2+1)+1}\displaystyle\mathbb{P}\{M_{E}^{\textnormal{hi}}\geq n_{2}-\lceil(1-s(1-q))(n_{2}+1)\rceil+1\}
\displaystyle\geq {MEhin2+1(1s(1q))(n2+1)}\displaystyle\mathbb{P}\{M_{E}^{\textnormal{hi}}\geq n_{2}+1-(1-s(1-q))(n_{2}+1)\}
=\displaystyle= {MEhis(1q)(n2+1)}\displaystyle\mathbb{P}\{M_{E}^{\textnormal{hi}}\geq s(1-q)(n_{2}+1)\}
\displaystyle\geq 1s.\displaystyle 1-s.

Similarly, Qrqlo(E)Q_{rq}^{\textrm{lo}}(E) is defined as the rq(1rq)/n2rq-(1-rq)/n_{2}-th quantile of {Eilo:i2}\{E^{\textnormal{lo}}_{i}:i\in\mathcal{I}_{2}\}. This is equal to the n2(rq(1rq)/n2)=(n2+1)rq1\lceil n_{2}(rq-(1-rq)/n_{2})\rceil=\lceil(n_{2}+1)rq-1\rceil smallest value of {Eilo:i2}\{E^{\textnormal{lo}}_{i}:i\in\mathcal{I}_{2}\}. Thus, if MElo(n2+1)rq1M_{E}^{\textrm{lo}}\geq\lceil(n_{2}+1)rq-1\rceil, then Rlo(Xn+1)R^{\textnormal{lo}}(X_{n+1}) will be at least Qrqlo(E)Q_{rq}^{\textrm{lo}}(E). Then, our lower bound for the MEloM_{E}^{\textrm{lo}} inverse CDF tells us that

{Rlo(Xn+1)Qrqlo(E)}\displaystyle\mathbb{P}\{R^{\textnormal{lo}}(X_{n+1})\geq Q_{rq}^{\textnormal{lo}}(E)\}\geq {MElo(n2+1)rq1}\displaystyle\mathbb{P}\{M_{E}^{\textnormal{lo}}\geq\lceil(n_{2}+1)rq-1\rceil\}
\displaystyle\geq {MElo(n2+1)rq}\displaystyle\mathbb{P}\{M_{E}^{\textnormal{lo}}\geq(n_{2}+1)rq\}
\displaystyle\geq 1r.\displaystyle 1-r.

Finally, by the union bound, we have that

{Qrqlo(E)Rlo(Xn+1) and Rhi(Xn+1)Q1s(1q)hi(E)}1rs=1α\mathbb{P}\{Q_{rq}^{\textnormal{lo}}(E)\leq R^{\textnormal{lo}}(X_{n+1})\textnormal{ and }R^{\textnormal{hi}}(X_{n+1})\leq Q_{1-s(1-q)}^{\textnormal{hi}}(E)\}\geq 1-r-s=1-\alpha

completing our proof.

B.3 Impossibility of Capturing the Distribution Mean

Instead of proving the impossibility of capturing the conditional mean of a distribution, we prove a more general result: we show that there does not exist an algorithm to capture the mean of a distribution YPY\sim P given no assumptions about PP. This is a more general form of our result because if we set XYX\perp\!\!\!\perp Y in (X,Y)P(X,Y)\sim P, then 𝔼[Y|X]=𝔼[Y]\mathbb{E}[Y|X]=\mathbb{E}[Y], meaning that the impossibility of capturing the mean results in the conditional mean being impossible to capture as well.

Consider an algorithm C^n\hat{C}_{n} that, given i.i.d. samples Y1,,YnPY_{1},\ldots,Y_{n}\sim P, returns a (possibly randomized) confidence interval C^n(𝒟)\hat{C}_{n}(\mathcal{D}), 𝒟={Yi,1in}\mathcal{D}=\{Y_{i},1\leq i\leq n\}, with length bounded by some function of PP that captures 𝔼[Y]\mathbb{E}[Y] with probability at least 1α1-\alpha, i.e. {𝔼[Y]C^n(𝒟)}1α\mathbb{P}\{\mathbb{E}[Y]\in\hat{C}_{n}(\mathcal{D})\}\geq 1-\alpha. Pick a>αa>\alpha, with a<1a<1. Consider a distribution PP where for YPY\sim P, {Y=0}=a1/n\mathbb{P}\{Y=0\}=a^{1/n} and {Y=u}=1a1/n\mathbb{P}\{Y=u\}=1-a^{1/n} for some uu. Then for Y1,,YnPY_{1},\ldots,Y_{n}\sim P, {Y1==Yn=0}a\mathbb{P}\{Y_{1}=\cdots=Y_{n}=0\}\geq a. Consider C^n({0,,0})\hat{C}_{n}(\{0,\ldots,0\}); by our assumption on C^n\hat{C}_{n}, there must exist some mm\in\mathbb{R} for which {mC^n({0,,0})}<1α/a\mathbb{P}\{m\in\hat{C}_{n}(\{0,\ldots,0\})\}<1-\alpha/a. Then, setting u=m1a1/nu=\frac{m}{1-a^{1/n}} yields 𝔼[Y]=m\mathbb{E}[Y]=m. With probability aa, C^n(𝒟)=C^n({0,,0})\hat{C}_{n}(\mathcal{D})=\hat{C}_{n}(\{0,\ldots,0\}), so

{𝔼[Y]=mC^n(𝒟)}>aαa=α.\mathbb{P}\{\mathbb{E}[Y]=m\not\in\hat{C}_{n}(\mathcal{D})\}>a\cdot\frac{\alpha}{a}=\alpha.

This implies that {𝔼[Y]C^n(𝒟)}<1α\mathbb{P}\{\mathbb{E}[Y]\in\hat{C}_{n}(\mathcal{D})\}<1-\alpha as desired, completing the proof.
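For a concrete instance of this construction, take n=100, \alpha=0.1, and a=0.5: then a^{1/n}\approx 0.9931, the all-zeros sample occurs with probability 0.5, and choosing u=m/(1-a^{1/n})\approx 145\,m makes \mathbb{E}[Y]=m even though, half of the time, the observed data give no indication of where the mean lies.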

B.4 Capturing the Distribution Median

Input:
  Number of i.i.d. datapoints nn\in\mathbb{N}.
  Datapoints Y_{1},\ldots,Y_{n}\sim P, a distribution on \mathbb{R}.
  Coverage level 1α(0,1)1-\alpha\in(0,1).
Process : 
  Order the YiY_{i} as Y(1)Y(n)Y_{(1)}\leq\cdots\leq Y_{(n)}.
  Calculate the largest k0k\geq 0 such that for XBinom(n,0.5)X\sim\textrm{Binom}(n,0.5), we have {X<k}α/2\mathbb{P}\{X<k\}\leq\alpha/2.
Output:
  Confidence interval C^n=[Y(k),Y(n+1k)]\hat{C}_{n}=[Y_{(k)},Y_{(n+1-k)}] for Median(Y)\textrm{Median}(Y).
  (Note that Y(0)=Y_{(0)}=-\infty and Y(n+1)=Y_{(n+1)}=\infty)
Algorithm 3 Confidence Interval for Median(Y)\textrm{Median}(Y) with Coverage 1α1-\alpha
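A minimal Python sketch of this procedure, using scipy for the binomial CDF, might look as follows (the function name is ours).

```python
import numpy as np
from scipy.stats import binom

def median_confidence_interval(Y, alpha=0.1):
    """Sketch of Algorithm 3: an order-statistic interval for Median(Y)."""
    Y = np.sort(np.asarray(Y, dtype=float))
    n = len(Y)
    # largest k >= 0 with P{Binom(n, 0.5) < k} = binom.cdf(k - 1, n, 0.5) <= alpha/2
    k = 0
    while binom.cdf(k, n, 0.5) <= alpha / 2:
        k += 1
    lo = Y[k - 1] if k >= 1 else -np.inf      # Y_(k), with Y_(0) = -infinity
    hi = Y[n - k] if k >= 1 else np.inf       # Y_(n+1-k), with Y_(n+1) = +infinity
    return lo, hi
```

For example, with 200 i.i.d. standard normal draws and \alpha=0.1, this returns a short interval around 0 built from two central order statistics.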

We now show that Algorithm 3 captures the median of P with probability at least 1-\alpha. Let m=\textrm{Median}(P), let M^{\textnormal{lo}}=\#\{Y_{i}\leq m:1\leq i\leq n\} be the number of Y_{i} at most m, and let M^{\textnormal{hi}}=\#\{Y_{i}\geq m:1\leq i\leq n\} be the number of Y_{i} at least m. Note that by the definition of m, we have that for all i, \mathbb{P}\{Y_{i}\leq m\}\geq 0.5, and \mathbb{P}\{Y_{i}\geq m\}\geq 0.5 as well. Additionally, the events \{Y_{i}\leq m\} are mutually independent for all i, as are the events \{Y_{i}\geq m\}. This implies that both M^{\textnormal{lo}} and M^{\textnormal{hi}} stochastically dominate a \textrm{Binom}(n,0.5) random variable, so that \mathbb{P}\{M^{\textnormal{lo}}<k\} and \mathbb{P}\{M^{\textnormal{hi}}<k\} are each at most \mathbb{P}\{X<k\} for X\sim\textrm{Binom}(n,0.5).

Since C^n=[Y(k),Y(n+1k)]\hat{C}_{n}=[Y_{(k)},Y_{(n+1-k)}], we have that mC^nm\in\hat{C}_{n} if and only if MlokM^{\textnormal{lo}}\geq k and MhikM^{\textnormal{hi}}\geq k. Then,

{mC^n}=\displaystyle\mathbb{P}\{m\in\hat{C}_{n}\}= {Mlok and Mhik}\displaystyle\mathbb{P}\{M^{\textnormal{lo}}\geq k\textrm{ and }M^{\textnormal{hi}}\geq k\}
=\displaystyle= 1{Mlo<k or Mhi<k}\displaystyle 1-\mathbb{P}\{M^{\textnormal{lo}}<k\textrm{ or }M^{\textnormal{hi}}<k\}
\displaystyle\geq 1({Mlo<k}+{Mhi<k})\displaystyle 1-(\mathbb{P}\{M^{\textnormal{lo}}<k\}+\mathbb{P}\{M^{\textnormal{hi}}<k\})
\displaystyle\geq 1(α/2+α/2)\displaystyle 1-(\alpha/2+\alpha/2)
=\displaystyle= 1α.\displaystyle 1-\alpha.