
Kernel-based Optimally Weighted Conformal Prediction Intervals

Jonghyeok Lee H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. Chen Xu H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. Yao Xie Correspondence: yao.xie@isye.gatech.edu H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology.
Abstract

Conformal prediction has been a popular distribution-free framework for uncertainty quantification. In this work, we present a novel conformal prediction method for time-series, which we call Kernel-based Optimally Weighted Conformal Prediction Intervals (KOWCPI). Specifically, KOWCPI adapts the classic Reweighted Nadaraya-Watson (RNW) estimator for quantile regression on dependent data and learns optimal data-adaptive weights. Theoretically, we tackle the challenge of establishing a conditional coverage guarantee for non-exchangeable data under strong mixing conditions on the non-conformity scores. We demonstrate the superior performance of KOWCPI on real time-series against state-of-the-art methods, where KOWCPI achieves narrower confidence intervals without losing coverage.

1 Introduction

Conformal prediction, which originated in Vovk et al. (1999, 2005), offers a robust framework explicitly designed for reliable and distribution-free uncertainty quantification. Conformal prediction has become increasingly recognized and adopted within the domains of machine learning and statistics (Lei et al., 2013; Lei and Wasserman, 2014; Kim et al., 2020; Angelopoulos and Bates, 2023). Assuming nothing beyond the exchangeability of the data, conformal prediction generates valid prediction sets at any given significance level, irrespective of the underlying data distribution and model assumptions. This capability makes it particularly valuable for uncertainty quantification in settings characterized by diverse and complex models.

Going beyond the exchangeability assumption has been a research challenge, particularly as many real-world datasets (such as time-series data) are inherently non-exchangeable. Tibshirani et al. (2019) addresses situations where the feature distribution shifts between training and test data and restores valid coverage through weighted quantiles based on the likelihood ratio of the two distributions. More recently, Barber et al. (2023) bounds the coverage gap using the total variation distance between the training and test distributions and minimizes this gap using pre-specified, data-independent weights. However, how to appropriately optimize the weights remains an open question.

To advance conformal prediction for time series, we extend the prior sequential predictive approach (Xu and Xie, 2023a, b) by incorporating nonparametric kernel regression into the quantile regression method on non-conformity scores. A key challenge of adapting this method to time-series data lies in selecting optimal weights to accommodate the dependent structure of the data. To ensure valid coverage of prediction sets, it is crucial to select weights inside the quantile estimator so that it closely approximates the true quantile of non-conformity scores.

In this paper, we introduce KOWCPI, which utilizes the Reweighted Nadaraya-Watson estimator (Hall et al., 1999) to select data-dependent optimal weights. We anticipate that adaptive weights enhance the robustness of uncertainty quantification, particularly when the assumption of exchangeability is violated. Our method also addresses the weight-selection issue in the weighted quantile method presented by Barber et al. (2023), as KOWCPI allows for the calculation of weights in a data-driven manner without prior knowledge about the data.

In summary, our main contributions are:

  • We propose KOWCPI, a sequential time-series conformal prediction method that performs nonparametric kernel quantile regression on non-conformity scores. In particular, KOWCPI learns optimal data-driven weights used in the conditional quantiles.

  • We prove the asymptotic conditional coverage guarantee of KOWCPI based on the classical theory of nonparametric regression. We further bound the marginal coverage gap of KOWCPI using a general result on weighted quantiles for non-exchangeable data.

  • We demonstrate the effectiveness of KOWCPI on real time-series data against state-of-the-art baselines. Specifically, KOWCPI can achieve the narrowest width of prediction intervals without losing marginal and approximate conditional (i.e., rolling) coverage empirically.

1.1 Literature

RNW quantile regression

In Hall et al. (1999), the Reweighted Nadaraya-Watson (RNW, often referred to as Weighted or Adjusted Nadaraya-Watson) estimator was suggested as a method to estimate the conditional distribution function from time-series data. This estimator extends the renowned Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964) by introducing an additional adjustment to the weights, thus combining the favorable bias properties of the local linear estimator with the benefit of being a distribution function by itself like the original Nadaraya-Watson estimator (Hall et al., 1999; Yu and Jones, 1998). The theory of the regression quantile with the RNW estimator has been further developed by Cai (2002). Furthermore, Cai (2002) and Salha (2006) demonstrated that the RNW estimator is consistent under strongly mixing conditions, which are commonly observed in time-series data. In this work, we adaptively utilize the RNW estimator within the conformal prediction framework to construct sequential prediction intervals for time-series data, leveraging its data-driven weights for quantile estimation and the weighted conformal approach.

Conformal prediction with weighted quantiles

Approaches using quantile regression instead of empirical quantiles in conformal prediction have been widespread (Romano et al., 2019; Kivaranovic et al., 2020; Gibbs et al., 2023). These methods utilize various quantile regression techniques to construct conformal prediction intervals, and the convergence to the oracle prediction width can be shown under the consistency of the quantile regression function (Sesia and Candès, 2020). Another recent work by Guan (2023) uses kernel weighting based on the distance between the test point and data to perform localized conformal prediction, which further discusses the selection of kernels and bandwidths. Recent work in this direction of utilizing the weighted quantiles, including Lee et al. (2023); Angelopoulos et al. (2023), continues to be vibrant. As we will discuss later, our approach leverages techniques in classical non-parametric statistics when constructing the weights.

Time-series conformal prediction

There is a growing body of research on time-series conformal prediction (Xu and Xie, 2021b; Gibbs and Candès, 2021). Various applications include financial markets (Gibbs and Candès, 2021), anomaly detection (Xu and Xie, 2021a), and geological classification (Xu and Xie, 2022). In particular, Gibbs and Candès (2021, 2022) sequentially construct prediction intervals by updating the significance level $\alpha$ based on the mis-coverage rate. This approach has become a major methodology for handling non-exchangeable data, leading to several subsequent developments (Feldman et al., 2022; Auer et al., 2023; Bhatnagar et al., 2023; Zaffran et al., 2022). On the other hand, Xu and Xie (2023b); Xu et al. (2024) take a slightly different approach by conducting sequential quantile regression using non-conformity scores. Our study aims to integrate non-parametric kernel estimation for sequential quantile regression, addressing the weight selection issues identified by Barber et al. (2023). Additionally, our research aligns with Guan (2023), particularly in utilizing a dissimilarity measure between the test point and the past data.

2 Problem Setup

We begin by assuming that the observations of the random sequence $(X_t, Y_t) \in \mathbb{R}^d \times \mathbb{R}$, $t = 1, 2, \ldots$ are obtained sequentially. Notably, $X_t$ may represent exogenous variables that aid in predicting $Y_t$, the historical values of $Y_t$ itself, or a combination of both. (In Appendix A, we expand our discussion to include cases where the response $Y_t$ is multivariate.) A key aspect of our setup is that the data are non-exchangeable and exhibit dependencies, which are typical in time-series data where temporal or sequential dependencies influence predictive dynamics.

Suppose we are given a pre-specified point predictor $\hat{f} : \mathbb{R}^d \to \mathbb{R}$ trained on a separate dataset or on past data. This predictor $\hat{f}$ maps a feature variable in $\mathbb{R}^d$ to a scalar point prediction for $Y_t$. Given a user-specified significance level $\alpha \in (0,1)$, we use the initial $T$ observations to construct prediction intervals $\widehat{C}_{t-1}^{\alpha}(X_t)$ for $Y_t$ in a sequential manner from $t = T+1$ onwards.

Two key types of coverage targeted by prediction intervals are marginal coverage and conditional coverage. Marginal coverage is defined as

\mathbb{P}(Y_{t}\in\widehat{C}_{t-1}^{\alpha}(X_{t}))\geq 1-\alpha, \quad (1)

which ensures that the true value $Y_t$ falls within the interval $\widehat{C}_{t-1}^{\alpha}(X_t)$ at least $100(1-\alpha)\%$ of the time, averaged over all instances. On the other hand, conditional coverage is defined as

\mathbb{P}(Y_{t}\in\widehat{C}_{t-1}^{\alpha}(X_{t})\mid X_{t})\geq 1-\alpha, \quad (2)

which is a stronger guarantee ensuring that, given each value of the predictor $X_t$, the true value $Y_t$ falls within the interval $\widehat{C}_{t-1}^{\alpha}(X_t)$ at least $100(1-\alpha)\%$ of the time.

3 Method

In this section, we introduce our proposed method, KOWCPI (Kernel-based Optimally Weighted Conformal Prediction Intervals), which embodies our approach to enhancing prediction accuracy and robustness in the face of time-series data. We delve into the methodology and algorithm of KOWCPI in-depth, highlighting how the Reweighted Nadaraya-Watson (RNW) estimator integrates with our predictive framework.

Consider prediction for a univariate time series $Y_1, Y_2, \ldots$. We have predictors $X_t$ given to us at time $t$, $t = 1, 2, \ldots$, which can depend on the past observations $(Y_{t-1}, Y_{t-2}, \ldots)$ and possibly other exogenous time series $Z_t$. Given a pre-trained algorithm $\hat{f}$, we also have a sequence of non-conformity scores indicating the accuracy of the prediction:

\hat{\varepsilon}_{t}=Y_{t}-\hat{f}(X_{t}),\quad t=1,2,\ldots.

We denote the collection of the past $T$ non-conformity scores at time $t > T$ as

\mathcal{E}_{t}=(\hat{\varepsilon}_{t-1},\ldots,\hat{\varepsilon}_{t-T}).

We construct the prediction interval $\widehat{C}_{t-1}^{\alpha}(X_t)$ with significance level $\alpha$ at time $t > T$ as follows:

\widehat{C}_{t-1}^{\alpha}(X_{t})=[\hat{f}(X_{t})+\hat{q}_{\beta^{*}}(\mathcal{E}_{t}),\ \hat{f}(X_{t})+\hat{q}_{1-\alpha+\beta^{*}}(\mathcal{E}_{t})],\quad \beta^{*}=\mathop{\rm argmin}_{\beta\in[0,\alpha]}\left(\hat{q}_{1-\alpha+\beta}(\mathcal{E}_{t})-\hat{q}_{\beta}(\mathcal{E}_{t})\right). \quad (3)

Here, $\hat{q}_{\beta}$ is a quantile regression algorithm that returns an estimate of the $\beta$-quantile of the residuals, which we describe throughout this section. We allow asymmetric intervals to obtain the tightest possible interval width.
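The following minimal sketch (in Python, not the authors' implementation) illustrates the width-minimizing search over $\beta$ in (3); the quantile routine q_hat and the grid resolution are illustrative assumptions.

import numpy as np

def asymmetric_interval(q_hat, point_pred, alpha, grid_size=100):
    """Search beta in [0, alpha] for the narrowest interval, as in Eq. (3)."""
    betas = np.linspace(0.0, alpha, grid_size)
    lower = np.array([q_hat(b) for b in betas])                # q_beta
    upper = np.array([q_hat(1.0 - alpha + b) for b in betas])  # q_{1-alpha+beta}
    best = int(np.argmin(upper - lower))                       # index of beta*
    return point_pred + lower[best], point_pred + upper[best]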

3.1 Reweighted Nadaraya-Watson estimator

The Reweighted Nadaraya-Watson (RNW) estimator is a general and popular method for quantile regression. We observe $(\tilde{X}_i, \tilde{Y}_i)$, $i = 1, \ldots, n$, where $\tilde{Y}_i \in \mathbb{R}$ and the predictors $\tilde{X}_i$ can be $p$-dimensional. The goal is to estimate the conditional distribution function $\mathbb{P}(\tilde{Y} \leq b \mid \tilde{X})$, $b \in \mathbb{R}$, at a test point $\tilde{X}$ using the training samples. The RNW estimator introduces adjustment weights on the predictors to ensure consistent estimation. We define the probability-like adjustment weights $p_i(\tilde{X})$, $i = 1, \ldots, n$, by maximizing the empirical log-likelihood $\sum_{i=1}^{n} \log p_i(\tilde{X})$, subject to $p_i(\tilde{X}) \geq 0$ and

\sum_{i=1}^{n}p_{i}(\tilde{X})=1, \quad (4)
\sum_{i=1}^{n}p_{i}(\tilde{X})(\tilde{X}_{i}-\tilde{X})K_{h}(\tilde{X}_{i}-\tilde{X})=0. \quad (5)

The RNW estimate of the conditional CDF $\mathbb{P}(\tilde{Y} \leq b \mid \tilde{X})$ is defined as follows:

\widehat{F}(b|\tilde{X})=\sum_{i=1}^{n}\widehat{W}_{i}(\tilde{X})\mathds{1}(\tilde{Y}_{i}\leq b), \quad (6)

where the weights are given by

\widehat{W}_{i}(\tilde{X})=\frac{p_{i}(\tilde{X})K_{h}(\tilde{X}_{i}-\tilde{X})}{\sum_{j=1}^{n}p_{j}(\tilde{X})K_{h}(\tilde{X}_{j}-\tilde{X})},\quad i=1,\ldots,n. \quad (7)

Here, $K \colon \mathbb{R}^p \to \mathbb{R}$ is a kernel function, and $K_h(\mathbf{u}) = h^{-p} K(h^{-1}\mathbf{u})$ for $\mathbf{u} \in \mathbb{R}^p$. Any reasonable choice of kernel function is possible; however, to ensure the validity of our theoretical results discussed in Section 4, the kernel should be nonnegative, bounded, continuous, and compactly supported. An example is $K(u) = k(\|u\|)$, where $k : \mathbb{R} \to \mathbb{R}$ is the Epanechnikov kernel.

The computation of $p_i(\tilde{X})$ reduces to a simple one-dimensional convex minimization problem:

Lemma 3.1 (Hall et al., 1999; Cai, 2001).

The adjustment weights $p_i(\tilde{X})$, $i = 1, \ldots, n$, for the RNW estimator are given by

p_{i}(\tilde{X})=\frac{1}{n}\left[1+\lambda([\tilde{X}_{i}]_{1}-[\tilde{X}]_{1})K_{h}(\tilde{X}_{i}-\tilde{X})\right]^{-1}, \quad (8)

where $[X]_1$ denotes the first element of a vector $X$, and $\lambda \in \mathbb{R}$ is the minimizer of

L(\lambda;\tilde{X})=-\sum_{i=1}^{n}\log\{1+\lambda([\tilde{X}_{i}]_{1}-[\tilde{X}]_{1})K_{h}(\tilde{X}_{i}-\tilde{X})\}. \quad (9)

Lemma 3.1 is a starting point for the proof of the asymptotic conditional coverage property of our algorithm later on.
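As a minimal sketch of Lemma 3.1 and Eqs. (7)-(9), the following Python snippet computes the adjustment weights $p_i$ by minimizing the convex objective $L(\lambda; \tilde{X})$ and then forms the final weights $\widehat{W}_i$. The Epanechnikov-type radial kernel and the use of scipy's bounded scalar minimizer are illustrative assumptions, not part of the paper.

import numpy as np
from scipy.optimize import minimize_scalar

def epanechnikov(u):
    # K(u) = k(||u||): nonnegative, bounded, compactly supported (normalization cancels in (7))
    r = np.linalg.norm(u, axis=-1)
    return 0.75 * np.maximum(1.0 - r ** 2, 0.0)

def rnw_weights(X_train, x_test, h):
    """X_train: (n, w) residual segments; x_test: (w,) test segment; h: bandwidth."""
    n, w = X_train.shape
    Kh = epanechnikov((X_train - x_test) / h) / h ** w        # K_h(X_i - x)
    d1 = (X_train[:, 0] - x_test[0]) * Kh                     # ([X_i]_1 - [x]_1) K_h(X_i - x)
    def L(lam):                                               # objective (9), convex in lambda
        return -np.sum(np.log(1.0 + lam * d1))
    pos, neg = d1[d1 > 0], d1[d1 < 0]
    lo = -1.0 / pos.max() if pos.size else -1e6               # keep 1 + lambda*d1 > 0 for all i
    hi = -1.0 / neg.min() if neg.size else 1e6
    eps = 1e-6 * (hi - lo)
    lam = minimize_scalar(L, bounds=(lo + eps, hi - eps), method='bounded').x
    p = 1.0 / (n * (1.0 + lam * d1))                          # adjustment weights (8)
    W = p * Kh                                                # unnormalized weights
    return W / W.sum()                                        # final weights (7); assumes some K_h > 0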

3.2 RNW for conformal prediction

Figure 1: Illustration of Kernel-based Optimally Weighted Conformal Prediction Intervals (KOWCPI).

To perform the quantile regression for prediction interval construction at time $t = T+1$, we use a sliding window approach, breaking the past $T$ residuals $\mathcal{E}_{T+1} = (\hat{\varepsilon}_T, \ldots, \hat{\varepsilon}_1)$ into $n \coloneqq T - w$ overlapping segments of length $w$. We construct the predictors and responses to fit the RNW estimator as follows:

\tilde{Y}_{i}=\hat{\varepsilon}_{i+w},\quad\tilde{X}_{i}=(\hat{\varepsilon}_{i+w-1},\ldots,\hat{\varepsilon}_{i}),\quad i=1,\ldots,n.

With the RNW estimator $\widehat{F}$ fitted on $((\tilde{X}_i, \tilde{Y}_i))_{i=1}^n$, the conditional $\beta$-quantile estimator $\widehat{Q}_{\beta}$ is defined as

\widehat{Q}_{\beta}(\tilde{X})\coloneqq\inf\{\tilde{y}\in\mathbb{R}\colon\widehat{F}(\tilde{y}|\tilde{X})\geq\beta\}. \quad (10)

After time $t = T+1$, we update $\mathcal{E}_t$ by removing the oldest residual and adding the newest one, then repeat the process (see Algorithm 1). In Section 4, we prove that, owing to the consistency of $\widehat{Q}_{\beta}$, KOWCPI achieves asymptotic conditional coverage despite the significant temporal dependence introduced by using overlapping segments of residuals.
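A minimal sketch of the quantile step (10): with the RNW weights from the sketch in Section 3.1 (rnw_weights), the conditional $\beta$-quantile is the generalized inverse of the weighted empirical CDF (6).

import numpy as np

def rnw_quantile(Y_train, weights, beta):
    order = np.argsort(Y_train)
    cdf = np.cumsum(weights[order])          # weighted empirical CDF evaluated at the sorted Y values
    idx = int(np.searchsorted(cdf, beta))    # first index with F_hat >= beta
    return Y_train[order][min(idx, len(Y_train) - 1)]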

Algorithm 1 Kernel-based Optimally Weighted Conformal Prediction Intervals (KOWCPI)
0:  Sequentially obtained data $(X_t, Y_t)$, $t = 1, 2, \ldots$; pre-trained point predictor $\hat{f}$; target significance level $\alpha \in (0,1)$.
1:  Obtain prediction residuals $\hat{\varepsilon}_t = Y_t - \hat{f}(X_t)$, $t = 1, \ldots, T$.
2:  for $t = T+1, \ldots$ do
3:     Update $\mathcal{E}_t = (\hat{\varepsilon}_{t-1}, \ldots, \hat{\varepsilon}_{t-T})$.
4:     Break $\mathcal{E}_t$ into overlapping segments $\tilde{X}_i = (\hat{\varepsilon}_{t-T+i+w-2}, \ldots, \hat{\varepsilon}_{t-T+i-1})$, $i = 1, \ldots, n+1$.
5:     Obtain $\lambda$ that minimizes $L(\cdot; \tilde{X}_{n+1})$ in (9).
6:     Obtain the adjustment weights $p_i(\tilde{X}_{n+1})$, $i = 1, \ldots, n$, and thus the final weights $\widehat{W}_i(\tilde{X}_{n+1})$.
7:     With $\tilde{Y}_i = \hat{\varepsilon}_{t-T+i+w-1}$, $i = 1, \ldots, n$, derive the quantile estimator $\widehat{Q}_{\beta}(\tilde{X}_{n+1})$ for $\beta \in [0, \alpha]$.
8:     Obtain $\beta^{*} = \mathop{\rm argmin}_{\beta \in [0,\alpha]} \widehat{Q}_{1-\alpha+\beta}(\tilde{X}_{n+1}) - \widehat{Q}_{\beta}(\tilde{X}_{n+1})$.
9:     Return $\widehat{C}_{t-1}^{\alpha}(X_t) = [\hat{f}(X_t) + \widehat{Q}_{\beta^{*}}(\tilde{X}_{n+1}),\ \hat{f}(X_t) + \widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{X}_{n+1})]$.
10:    Obtain a new residual $\hat{\varepsilon}_t$.
11:  end for
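The following minimal sketch puts the pieces of Algorithm 1 together for a single time step, reusing the rnw_weights and rnw_quantile sketches above; the grid over $\beta$ is an illustrative assumption.

import numpy as np

def kowcpi_interval(residuals, f_hat_x, alpha, w, h, grid_size=50):
    """residuals: the T most recent residuals (oldest first); f_hat_x: point prediction for Y_t."""
    T = len(residuals)
    segs = np.array([residuals[i:i + w] for i in range(T - w + 1)])
    X_train, x_test = segs[:-1], segs[-1]                          # n = T - w training segments + newest (test) segment
    Y_train = np.array([residuals[i + w] for i in range(T - w)])   # residual following each training segment
    W = rnw_weights(X_train, x_test, h)
    q = lambda b: rnw_quantile(Y_train, W, b)
    betas = np.linspace(0.0, alpha, grid_size)
    b_star = betas[int(np.argmin([q(1 - alpha + b) - q(b) for b in betas]))]
    return f_hat_x + q(b_star), f_hat_x + q(1 - alpha + b_star)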

Bandwidth selection

While it is theoretically possible to calculate the optimal bandwidth that minimizes the asymptotic mean-squared error, this requires additional derivative estimation, which significantly complicates the problem. Consequently, as with general non-parametric models, one can use cross-validation to select the bandwidth. However, cross-validation can be computationally burdensome and may deteriorate under dependent data (Fan et al., 1995). Therefore, we adapt the non-parametric AIC (Cai and Tiwari, 2000), used for bandwidth selection in local linear estimators. This method is applicable because the RNW estimator belongs to the class of linear smoothers (Cai, 2002). We choose the bandwidth $h$ that minimizes

{\rm AIC}_{C}(h)\coloneqq\log({\rm RSS})+\frac{n+\mbox{tr}(SS^{\top})}{n-(\mbox{tr}(SS^{\top})+2)}, \quad (11)

where ${\rm RSS}$ is the residual sum of squares, and $S$ is a linear smoothing operator (Hastie, 1990) with $(i,j)$-th element $S_{ij} = \widehat{W}_j(\tilde{X}_i)$.
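A minimal sketch of the criterion (11): build the smoother matrix $S$ with $S_{ij} = \widehat{W}_j(\tilde{X}_i)$ from the rnw_weights sketch above and evaluate ${\rm AIC}_C$ for a candidate bandwidth. The candidate grid in the usage comment is an illustrative assumption.

import numpy as np

def aic_c(h, X_train, Y_train):
    n = len(X_train)
    S = np.vstack([rnw_weights(X_train, X_train[i], h) for i in range(n)])  # S_ij = W_hat_j(X_i)
    rss = np.sum((Y_train - S @ Y_train) ** 2)                              # residual sum of squares
    tr = np.trace(S @ S.T)
    return np.log(rss) + (n + tr) / (n - (tr + 2))

# e.g., h_star = min(candidate_bandwidths, key=lambda h: aic_c(h, X_train, Y_train))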

Window length selection

To select the window length $w$, cross-validation can be employed. One approach is to use, as the criterion, a weighted sum of the average under-coverage rate and the average width obtained for a given $w$; a minimal sketch of this criterion follows this paragraph. Another approach is to choose the $w$ with the smallest average width that achieves the target coverage on the validation set. In experiments, we observed that performance is less sensitive to the choice of $w$ than to the bandwidth $h$ over a broad range, although the two interact in their effect on performance.
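A minimal sketch of the first criterion above: given per-candidate validation coverages and average widths (computed, for instance, with the one-step sketch after Algorithm 1), pick $w$ by a weighted sum of average under-coverage and average width. The trade-off weight lam and the dict-based interface are illustrative assumptions.

def select_window(cov_by_w, width_by_w, alpha, lam=10.0):
    """cov_by_w, width_by_w: dicts mapping candidate w -> validation coverage / mean width."""
    def criterion(w):
        under = max(0.0, (1 - alpha) - cov_by_w[w])   # average under-coverage rate
        return lam * under + width_by_w[w]
    return min(cov_by_w, key=criterion)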

4 Theory

In this section, we present the theoretical properties of the RNW estimator, the quantile regression method we use, and demonstrate in Theorem 4.9 that KOWCPI asymptotically attains conditional coverage under strong mixing of the non-conformity scores. It turns out that the asymptotic conditional coverage gap can be derived from known results in the context of kernel quantile regression.

4.1 Marginal coverage

We begin by bounding the marginal coverage gap of the KOWCPI method. The following result shows the coverage gap using our weights, compared with the oracle weights; the results are established using a similar strategy as in (Tibshirani et al., 2019, Lemma 3):

Proposition 4.1 (Non-asymptotic marginal coverage gap).

Denote by $\mathcal{P}$ the joint density of $(\tilde{Y}_1, \ldots, \tilde{Y}_{n+1})$. Then, we have

\left|\mathbb{P}\left(Y_{T+1}\in\widehat{C}_{T}^{\alpha}(X_{T+1})\right)-(1-\alpha)\right| \leq\frac{1}{2}\left(\mathbb{E}\left\|(W^{*})_{1:n}-\widehat{W}\right\|_{1}+\mathbb{E}W^{*}_{n+1}\right)+2\,\mathbb{E}\Delta(\tilde{X}_{n+1}), \quad (12)

where $\Delta(\tilde{X}_{n+1})$ is the discrete gap defined in (17), and $W^{*}$ is the vector of oracle weights with each entry defined as

W_{i}^{*}:=\frac{\sum_{\sigma:\sigma(n+1)=i}\mathcal{P}(\tilde{Y}_{\sigma(1)},\ldots,\tilde{Y}_{\sigma(n+1)})}{\sum_{\sigma}\mathcal{P}(\tilde{Y}_{\sigma(1)},\ldots,\tilde{Y}_{\sigma(n+1)})},\quad i=1,\ldots,n+1,\quad\mbox{(oracle weights)} \quad (13)

and $\sigma$ is a permutation on $\{1, \ldots, n+1\}$.

The implications of Proposition 4.1 are the following:

  • The “under-coverage” depends on the $\ell_1$-distance between the learned optimal weights and the oracle-optimal weights (which depend on the true joint distribution of the data).

  • Note that the oracle weights $W_i^{*}$ cannot be evaluated because, in principle, doing so requires considering the $(n+1)!$ possible shuffles of the observed residuals and their joint distributions.

  • The form of the oracle weights $W_i^{*}$ in (13) offers an intuitive basis for algorithm development: we can practically estimate the weights through quantile regression, utilizing previously observed non-conformity scores.

4.2 Conditional coverage

In this section, we derive the asymptotic conditional coverage property of KOWCPI. For this, we introduce the assumptions necessary for the consistency of the RNW estimator. To account for the dependency in the data, we assume the strong mixing of the residual process.

A stationary stochastic process $\{V_t\}_{t=-\infty}^{\infty}$ on a probability space with probability measure $\mathbb{P}$ is said to be strongly mixing ($\alpha$-mixing) if the mixing coefficient $\alpha(\tau)$ defined as

\alpha(\tau)=\sup_{A\in\mathfrak{M}_{-\infty}^{0},\,B\in\mathfrak{M}_{\tau}^{\infty}}\left|\mathbb{P}(A\cap B)-\mathbb{P}(A)\mathbb{P}(B)\right|

satisfies $\alpha(\tau) \to 0$ as $\tau \to \infty$, where $\mathfrak{M}_{s}^{t}$, $-\infty \leq s \leq t \leq \infty$, denotes the $\sigma$-algebra generated by $\{V_s, V_{s+1}, \ldots, V_t\}$. The mixing coefficient $\alpha(\tau)$ quantifies the asymptotic independence between the past and future of the sequence $\{V_t\}_{t=-\infty}^{\infty}$.

Assumption 4.2 (Mixing of the process).

The stationary process $(V_i = (\tilde{X}_i, \tilde{Y}_i))_{i=1}^{\infty}$ is strongly mixing with mixing coefficient $\alpha(\tau) = \mathcal{O}(\tau^{-(2+\delta)})$ for some $\delta > 0$.

Due to stationarity, the conditional CDF of the realized residual does not depend on the index $i$; thus, denote

F(b|\mathbf{z})=\mathbb{P}(\tilde{Y}_{i}\leq b|\tilde{X}_{i}=\mathbf{z}),

as the conditional CDF of the random variable $\tilde{Y}_i$ given $\tilde{X}_i = \mathbf{z}$. In addition, we introduce the following notations:

  • Let $g(\mathbf{z})$ be the marginal density of $\tilde{X}_i$ at $\mathbf{z}$. (Note that due to stationarity, we can have a common marginal density.)

  • Let $g_{1,i}$, $i \geq 2$, denote the joint density of $\tilde{X}_1$ and $\tilde{X}_i$.

The following assumptions (4.3-4.5) are common in nonparametric statistics, essential for attaining desirable properties such as the consistency of an estimator (Tsybakov, 2009).

Assumption 4.3 (Smoothness of the conditional CDF and densities).

For fixed $\tilde{y} \in \mathbb{R}$ and $\mathbf{z} \in \mathbb{R}^w$,

  (i) $0 < F(\tilde{y}|\mathbf{z}) < 1$.

  (ii) $F(\tilde{y}|\mathbf{z})$ is twice continuously partially differentiable with respect to $\mathbf{z}$.

  (iii) $g(\mathbf{z}) > 0$ and $g(\cdot)$ is continuous at $\mathbf{z}$.

  (iv) There exists $M > 0$ such that $|g_{1,i}(u,v) - g(u)g(v)| \leq M$ for all $u, v$ and $i \geq 2$.

Regarding Assumption 4.3, we would like to remark that there is a negative result: without additional assumptions about the distribution, it is impossible to construct finite-length prediction intervals that satisfy conditional coverage (Lei and Wasserman, 2014; Vovk, 2012).

Assumption 4.4 (Regularity of the kernel function).

The kernel $K : \mathbb{R}^w \to \mathbb{R}$ is a nonnegative, bounded, continuous, and compactly supported density function satisfying

  (i) $\int_{\mathbb{R}^{w}}\mathbf{u}K(\mathbf{u})\,d\mathbf{u}=0$,

  (ii) $\int_{\mathbb{R}^{w}}\mathbf{u}\mathbf{u}^{\top}K(\mathbf{u})\,d\mathbf{u}=\mu_{2}I$ for some $\mu_{2}\in(0,\infty)$,

  (iii) $\int_{\mathbb{R}^{w}}K^{2}(\mathbf{u})\,d\mathbf{u}=\nu_{0}$ and $\int_{\mathbb{R}^{w}}\mathbf{u}\mathbf{u}^{\top}K^{2}(\mathbf{u})\,d\mathbf{u}=\nu_{2}I$ for some $\nu_{0},\nu_{2}\in(0,\infty)$.

Assumptions 4.4-(i), (ii), and (iii) are standard conditions (Wand and Jones, 1994) that require $K$ to be “symmetric” in the sense that the weighting scheme relies solely on the distance between the observation and the test point. For example, if $K$ is isotropic, i.e., $K(u) = k(\|u\|)$ for some univariate kernel function $k : \mathbb{R} \to \mathbb{R}$, these conditions are satisfied by widely adopted kernels such as the Epanechnikov kernel.

Assumption 4.5 (Bandwidth selection).

As $n \to \infty$, the bandwidth $h$ satisfies

h\to 0 \quad\mbox{and}\quad nh^{w(1+2/\delta)}\to\infty.

We note that Assumption 4.5 is met when selecting the (theoretically) optimal bandwidth $h^{*} \sim n^{-1/(w+4)}$, which minimizes the asymptotic mean squared error (AMSE) of the RNW estimator, provided that $\delta > 1/2$.

We prove the following proposition using a strategy similar to that of Salha (2006), while fixing several technical details:

Proposition 4.6 (Consistency of the RNW estimator).

Under Assumptions 4.2-4.5, for arbitrary $\mathbf{z}$ and $\tilde{y}$, as $n \to \infty$,

\widehat{F}(\tilde{y}|\mathbf{z})-F(\tilde{y}|\mathbf{z})=\frac{1}{2}h^{2}\,\mbox{tr}(D_{\mathbf{z}}^{2}(F(\tilde{y}|\mathbf{z})))\,\mu_{2}+o_{p}(h^{2})+\mathcal{O}_{p}((nh^{w})^{-1/2}). \quad (14)

This proposition implies pointwise convergence in probability of the RNW estimator, and since it is the weighted empirical CDF, this pointwise convergence implies uniform convergence in probability (Tucker, 1967, p.127-128). Consequently, we obtain the consistency of the conditional quantile estimator in (10) to the true conditional quantile given as

Q_{\beta}(\tilde{x})=\inf\{\tilde{y}\in\mathbb{R}:F(\tilde{y}|\tilde{x})\geq\beta\}.
Corollary 4.7.

Under Assumptions 4.2-4.5, for every $\beta \in (0,1)$ and $\mathbf{z}$, as $n \to \infty$,

\widehat{Q}_{\beta}(\mathbf{z})\to Q_{\beta}(\mathbf{z})\ \mbox{in probability}. \quad (15)

As a direct consequence of Corollary 4.7, the asymptotic conditional coverage of KOWCPI is guaranteed by the consistency of the quantile estimator used in our sequential algorithm.

Corollary 4.8 (Asymptotic conditional coverage guarantee).

Under Assumptions 4.2-4.5, for any $\alpha \in (0,1)$, as $n \to \infty$,

\mathbb{P}(Y_{t}\in\widehat{C}_{t-1}^{\alpha}(X_{t})|X_{t})\to(1-\alpha)\ \mbox{in probability}. \quad (16)

Thus, by applying RNW quantile regression to the prediction residuals of a continuous-valued time series and assuming strong mixing of these residuals, KOWCPI achieves approximate conditional coverage once a sufficient number of residuals is used.

To further specify the rate of convergence, define the discrete gap

\Delta(\tilde{X})=\sup_{\beta\in[0,1]}\left|\widehat{F}(\widehat{Q}_{\beta}(\tilde{X})|\tilde{X})-\beta\right|=\max_{i=1,\ldots,n}\widehat{W}_{i}(\tilde{X}). \quad (17)
Theorem 4.9 (Conditional coverage gap).

Under Assumptions 4.2-4.5, for any $\alpha \in (0,1)$ and $x_t$, as $n \to \infty$,

\left|\mathbb{P}\left(Y_{t}\in\widehat{C}_{t-1}^{\alpha}(x_{t})\mid X_{t}=x_{t}\right)-(1-\alpha)\right|\leq\mathcal{O}_{p}(h^{2}+(nh^{w})^{-1/2})+2\Delta(\tilde{x}_{n+1}), \quad (18)

where $\tilde{x}_{n+1}$ is the realization of $\tilde{X}_{n+1}$ given $X_t = x_t$.

Given that the adjustment weights $p_i(\mathbf{z})$ uniformly concentrate around $1/n$ (Steikert, 2014), one can see that the conditional coverage gap tends to zero, although its precise rate remains an open question.

5 Experiments

Table 1: Empirical marginal coverage and width across three real time-series datasets by different methods. The target coverage is $1-\alpha = 0.9$.
Electric Wind Solar
Coverage Width Coverage Width Coverage Width
KOWCPI 0.90 0.22 0.91 2.41 0.90 48.8
Plain NW 0.89 0.31 0.95 3.58 0.41 20.1
SPCI 0.90 0.29 0.94 2.61 0.92 84.2
EnbPI 0.93 0.36 0.92 5.25 0.87 106.0
ACI 0.89 0.32 0.88 8.26 0.89 143.9
FACI 0.89 0.28 0.91 7.77 0.89 141.9
AgACI 0.91 0.30 0.88 7.54 0.90 144.6
SF-OGD 0.79 0.25 0.11 0.29 0 0.50
SAOCP 0.93 0.33 0.76 4.00 0.64 33.5
SCP 0.87 0.30 0.86 8.20 0.89 142.0

In this section, we compare the performance of KOWCPI against state-of-the-art conformal prediction baselines using real time-series data. We aim to show that KOWCPI can consistently reach valid coverage with the narrowest prediction intervals.

Dataset. We consider three real time series from different domains. The first, the ELEC2 dataset (electric) (Harries, 1999), tracks electricity usage and pricing in the states of New South Wales and Victoria in Australia every 30 minutes over a 2.5-year period in 1996–1999. The second, renewable energy data (solar) (Zhang et al., 2021), comes from the National Solar Radiation Database and contains hourly solar radiation data (measured in GHI) from Atlanta in 2018. The third, wind speed data (wind) (Zhu et al., 2021), was collected at wind farms operated by MISO in the US; the wind speed record was updated every 15 minutes over a one-week period in September 2020.

Baselines. We consider Sequential Predictive Conformal Inference (SPCI) (Xu and Xie, 2023b), Ensemble Prediction Interval (EnbPI) (Xu and Xie, 2023a), Adaptive Conformal Inference (ACI) (Gibbs and Candès, 2021), Aggregated ACI (AgACI) (Zaffran et al., 2022), Fully Adaptive Conformal Inference (FACI) (Gibbs and Candès, 2022), Scale-Free Online Gradient Descent (SF-OGD) (Orabona and Pál, 2018; Bhatnagar et al., 2023), Strongly Adaptive Online Conformal Prediction (SAOCP) (Bhatnagar et al., 2023), and vanilla Split Conformal Prediction (SCP) (Vovk et al., 2005). Additionally, we included a comparison where weights were derived from the original Nadaraya-Watson estimator (Plain NW). For the implementation of ACI-related methods, we utilized the R package AdaptiveConformal (https://github.com/herbps10/AdaptiveConformal). For SPCI and EnbPI, we used the Python code from https://github.com/hamrel-cxu/SPCI-code.

Setup and evaluation metrics. In all comparisons, we use a random forest with 10 trees as the base point predictor. Every dataset is split in a 7:1:2 ratio for training the point predictor, tuning the window length $w$ and bandwidth $h$, and constructing prediction intervals, respectively. The window length for each dataset is fixed and determined through cross-validation, while the bandwidth is selected by minimizing the nonparametric AIC, as detailed in (11).
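A minimal sketch of this setup, assuming scikit-learn for the random forest; the chronological 7:1:2 split indices and the interface are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def setup(X, Y):
    n = len(Y)
    i1, i2 = int(0.7 * n), int(0.8 * n)                                 # 7:1:2 chronological split
    model = RandomForestRegressor(n_estimators=10).fit(X[:i1], Y[:i1])  # base point predictor
    residuals = Y - model.predict(X)                                    # non-conformity scores
    return model, residuals, (i1, i2)                                   # tune (w, h) on [i1, i2); test on [i2, n)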

Besides examining marginal coverage and widths of prediction intervals on test data, we also focus on rolling coverage, which is helpful in showing approximate conditional coverage at specific time indices. Given a rolling window size $m > 0$, the rolling coverage $\widehat{\rm RC}_t$ at time $t$ is defined as $\widehat{\rm RC}_{t}=\frac{1}{m}\sum_{i=1}^{m}\mathds{1}\{Y_{t-i+1}\in\widehat{C}_{t-i}^{\alpha}(X_{t-i+1})\}$.
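A minimal sketch of this metric, computing $\widehat{\rm RC}_t$ as a moving average of coverage indicators over the last $m$ prediction times; array names are illustrative.

import numpy as np

def rolling_coverage(y_true, lower, upper, m):
    covered = ((y_true >= lower) & (y_true <= upper)).astype(float)
    return np.convolve(covered, np.ones(m) / m, mode='valid')  # RC_t for each window of m consecutive steps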

Figure 2: Comparison of empirical rolling coverage and width on the electric dataset by different methods: (a) rolling coverage (boxplot); (b) width of prediction intervals (boxplot); (c) rolling coverage over prediction time indices; (d) $\log(\widehat{W}_i)$ by the time lag, an instance of the computed adjusted weights, which shows that the weights do not necessarily decay over lag. The target coverage is 90%. In (a), the red dotted line is the target coverage, and in (b), the blue dotted line is the median width of KOWCPI.

Results. The empirical marginal coverage and width results for all methods are summarized in Table 1. The results indicate that KOWCPI consistently achieves the 90% target coverage and maintains the smallest average width compared to the alternative state-of-the-art methods. While all methods except SF-OGD, SAOCP, and Plain NW nearly achieve marginal coverage under the target $1-\alpha = 0.9$, KOWCPI produces the narrowest average width on all datasets. In terms of rolling results, Figures 2(a) and 2(c) show that the coverage of KOWCPI intervals consistently centers around 90% throughout the entire test phase. Additionally, Figure 2(b) shows that KOWCPI intervals are also significantly narrower, with a smaller variance, than the baselines. Lastly, Figure 2(d) shows the weights $\widehat{W}$ (in log scale) produced by the RNW estimator at the first time index of the test data. We note that the most recent set of non-conformity scores (in terms of time indices) receives the heaviest weighting, which is natural due to the greatest similarity between the first test datum and the training data closest to it in time. In Appendix C, we show additional comparisons of KOWCPI against the baselines on the other two datasets in terms of rolling results, where KOWCPI remains consistently better.

6 Conclusion

In this paper, we introduced KOWCPI, a method to sequentially construct prediction intervals for time-series data. By incorporating the classical Reweighted Nadaraya-Watson estimator into the weighted conformal prediction framework, KOWCPI effectively adapts to the dependent structure of time-series data by utilizing data-driven adaptive weights. Our theoretical contributions include guarantees for the asymptotic conditional coverage of KOWCPI under strong mixing conditions and bounds on the marginal and conditional coverage gaps. Empirical validation on real-world time-series datasets demonstrated the effectiveness of KOWCPI compared to state-of-the-art methods, achieving narrower prediction intervals without compromising empirical coverage.

Future work could explore adaptive window selection, where the size of the non-conformity score batch is adjusted dynamically to capture shifts in the underlying distribution. Additionally, the natural compatibility of kernel regression with multivariate data can be leveraged to extend KOWCPI to multivariate time series, as detailed in Appendix A. There is also potential for strengthening the theoretical guarantees and improving practical performance by designing alternative non-conformity scores.

Acknowledgments

This work is partially supported by an NSF CAREER CCF-1650913, NSF DMS-2134037, CMMI-2015787, CMMI-2112533, DMS-1938106, DMS-1830210, and the Coca-Cola Foundation.

References

  • Abdous and Theodorescu [1992] B. Abdous and R. Theodorescu. Note on the spatial quantile of a random vector. Statist. Probab. Lett., 13(4):333–336, 1992.
  • Angelopoulos et al. [2023] A. Angelopoulos, E. Candès, and R. J. Tibshirani. Conformal PID Control for Time Series Prediction. In Advances in Neural Information Processing Systems, 2023.
  • Angelopoulos and Bates [2023] A. N. Angelopoulos and S. Bates. Conformal Prediction: A Gentle Introduction. Foundations and Trends® in Machine Learning, 16(4):494–591, 2023.
  • Auer et al. [2023] A. Auer, M. Gauch, D. Klotz, and S. Hochreiter. Conformal Prediction for Time Series with Modern Hopfield Networks. In Advances in Neural Information Processing Systems, 2023.
  • Barber et al. [2023] R. F. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816 – 845, 2023.
  • Bhatnagar et al. [2023] A. Bhatnagar, H. Wang, C. Xiong, and Y. Bai. Improved Online Conformal Prediction via Strongly Adaptive Online Learning. In Proceedings of the 40th International Conference on Machine Learning, 2023.
  • Cai [2001] Z. Cai. Weighted Nadaraya-Watson regression estimation. Statistics & Probability Letters, 51(3):307–318, 2001.
  • Cai [2002] Z. Cai. Regression quantiles for time series. Econometric Theory, 18(1):169–192, 2002.
  • Cai and Tiwari [2000] Z. Cai and R. C. Tiwari. Application of a local linear autoregressive model to BOD time series. Environmetrics, 11(3):341–350, 2000.
  • Fan et al. [1995] J. Fan, N. E. Heckman, and M. P. Wand. Local Polynomial Kernel Regression for Generalized Linear Models and Quasi-Likelihood Functions. Journal of the American Statistical Association, 90(429):141–150, 1995.
  • Feldman et al. [2022] S. Feldman, L. Ringel, S. Bates, and Y. Romano. Achieving Risk Control in Online Learning Settings. arXiv preprint arXiv:2205.09095, 2022.
  • Gibbs and Candès [2021] I. Gibbs and E. Candès. Adaptive Conformal Inference Under Distribution Shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.
  • Gibbs and Candès [2022] I. Gibbs and E. Candès. Conformal Inference for Online Prediction with Arbitrary Distribution Shifts. arXiv preprint arXiv:2208.08401, 2022.
  • Gibbs et al. [2023] I. Gibbs, J. J. Cherian, and E. J. Candès. Conformal Prediction With Conditional Guarantees. arXiv preprint arXiv:2305.12616, 2023.
  • Guan [2023] L. Guan. Localized conformal prediction: a generalized inference framework for conformal prediction. Biometrika, 110(1):33–50, 2023.
  • Hall et al. [1999] P. Hall, R. C. Wolff, and Q. Yao. Methods for Estimating a Conditional Distribution Function. Journal of the American Statistical Association, 94(445):154–163, 1999.
  • Harries [1999] M. Harries. SPLICE-2 Comparative Evaluation: Electricity Pricing. Technical report, University of New South Wales, School of Computer Science and Engineering, 1999.
  • Hastie [1990] T. J. Hastie. Generalized additive models. CRC Press, 1990.
  • Ibragimov et al. [1971] I. A. Ibragimov, Y. V. Linnik, and J. F. C. Kingman. Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, 1971.
  • Kim et al. [2020] B. Kim, C. Xu, and R. Barber. Predictive inference is free with the jackknife+-after-bootstrap. Advances in Neural Information Processing Systems, 33:4138–4149, 2020.
  • Kivaranovic et al. [2020] D. Kivaranovic, K. D. Johnson, and H. Leeb. Adaptive, Distribution-Free Prediction Intervals for Deep Networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 2020.
  • Lee et al. [2023] Y. Lee, R. F. Barber, and R. Willett. Distribution-free inference with hierarchical data. arXiv preprint arXiv:2306.06342, 2023.
  • Lei and Wasserman [2014] J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):71–96, 2014.
  • Lei et al. [2013] J. Lei, J. Robins, and L. Wasserman. Distribution-Free Prediction Sets. Journal of the American Statistical Association, 108(501):278–287, 2013.
  • Masry [1986] E. Masry. Recursive probability density estimation for weakly dependent stationary processes. IEEE Transactions on Information Theory, 32(2):254–267, 1986.
  • Nadaraya [1964] E. A. Nadaraya. On Estimating Regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
  • Orabona and Pál [2018] F. Orabona and D. Pál. Scale-free online learning. Theoretical Computer Science, 716:50–69, 2018.
  • Romano et al. [2019] Y. Romano, E. Patterson, and E. Candès. Conformalized Quantile Regression. In Advances in Neural Information Processing Systems, 2019.
  • Salha [2006] R. Salha. Kernel Estimation for the Conditional Mode and Quantiles of Time Series. PhD thesis, University of Macedonia, 2006.
  • Sesia and Candès [2020] M. Sesia and E. J. Candès. A comparison of some conformal quantile regression methods. Stat, 9(1):e261, 2020.
  • Stankevičiūtė et al. [2021] K. Stankevičiūtė, A. Alaa, and M. van der Schaar. Conformal time-series forecasting. In Advances in Neural Information Processing Systems, 2021.
  • Steikert [2014] K. U. Steikert. The weighted Nadaraya-Watson Estimator: Strong consistency results, rates of convergence, and a local bootstrap procedure to select the bandwidth. PhD thesis, University of Zurich, 2014.
  • Sun and Yu [2024] S. H. Sun and R. Yu. Copula conformal prediction for multi-step time series prediction. In The Twelfth International Conference on Learning Representations, 2024.
  • Tibshirani et al. [2019] R. J. Tibshirani, R. Foygel Barber, E. Candès, and A. Ramdas. Conformal Prediction Under Covariate Shift. Advances in Neural Information Processing Systems, 32, 2019.
  • Tsybakov [2009] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
  • Tucker [1967] H. G. Tucker. A Graduate Course in Probability. Academic Press, 1967.
  • Vovk [2012] V. Vovk. Conditional validity of inductive conformal predictors. In Proceedings of the Asian Conference on Machine Learning, 2012.
  • Vovk et al. [1999] V. Vovk, A. Gammerman, and C. Saunders. Machine-Learning Applications of Algorithmic Randomness. In Proceedings of the Sixteenth International Conference on Machine Learning, 1999.
  • Vovk et al. [2005] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, 2005.
  • Wand and Jones [1994] M. P. Wand and M. C. Jones. Kernel Smoothing. CRC press, 1994.
  • Watson [1964] G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26:359–372, 1964.
  • Xu and Xie [2021a] C. Xu and Y. Xie. Conformal Anomaly Detection on Spatio-Temporal Observations with Missing Data. arXiv preprint arXiv:2105.11886, 2021a.
  • Xu and Xie [2021b] C. Xu and Y. Xie. Conformal prediction interval for dynamic time-series. In Proceedings of the 38th International Conference on Machine Learning, 2021b.
  • Xu and Xie [2022] C. Xu and Y. Xie. Conformal prediction set for time-series. arXiv preprint arXiv:2206.07851, 2022.
  • Xu and Xie [2023a] C. Xu and Y. Xie. Conformal prediction for time series. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):11575–11587, 2023a.
  • Xu and Xie [2023b] C. Xu and Y. Xie. Sequential Predictive Conformal Inference for Time Series. In Proceedings of the 40th International Conference on Machine Learning, 2023b.
  • Xu et al. [2024] C. Xu, H. Jiang, and Y. Xie. Conformal prediction for multi-dimensional time series by ellipsoidal sets. arXiv preprint arXiv:2403.03850, 2024.
  • Yu and Jones [1998] K. Yu and M. C. Jones. Local Linear Quantile Regression. Journal of the American Statistical Association, 93(441):228–237, 1998.
  • Zaffran et al. [2022] M. Zaffran, O. Féron, Y. Goude, J. Josse, and A. Dieuleveut. Adaptive Conformal Predictions for Time Series. In Proceedings of the 39th International Conference on Machine Learning, 2022.
  • Zhang et al. [2021] M. Zhang, C. Xu, A. Sun, F. Qiu, and Y. Xie. Solar Radiation Ramping Events Modeling Using Spatio-temporal Point Processes. arXiv preprint arXiv:2101.11179, 2021.
  • Zhu et al. [2021] S. Zhu, H. Zhang, Y. Xie, and P. Van Hentenryck. Multi-resolution spatio-temporal prediction with application to wind power generation. arXiv preprint arXiv:2108.13285, 2021.

Appendix A Multivariate time series

In the main text, our discussion has centered on cases where the response variables $Y_t$ are scalars. Here, we explore the natural extension of our methodology to handle scenarios with multivariate responses. This extension requires defining multivariate quantiles, introducing a multivariate version of the RNW estimator for estimating these quantiles [Salha, 2006], and adapting our KOWCPI method for multivariate responses.

Multivariate conditional quantiles

Consider a strongly mixing stationary process $((\tilde{X}_i, \tilde{Y}_i))_{i=1}^{\infty}$, which is a realization of the random variable $(\tilde{X}, \tilde{Y}) \in \mathbb{R}^p \times \mathbb{R}^s$. Following Abdous and Theodorescu [1992], we first define a pseudo-norm function $\|\cdot\|_{2,\alpha} : \mathbb{R}^s \to \mathbb{R}$ for $\alpha \in (0,1)$ as

\left\|v\right\|_{2,\alpha}=\left\|\left(\frac{\left|v_{1}\right|+(2\alpha-1)v_{1}}{2},\ldots,\frac{\left|v_{s}\right|+(2\alpha-1)v_{s}}{2}\right)\right\|_{2},

for $v \in \mathbb{R}^s$, where $\|\cdot\|_2$ is the Euclidean norm on $\mathbb{R}^s$. Let

H_{\alpha}(\theta,\tilde{x})\coloneqq\mathbb{E}[\|\tilde{Y}-\theta\|_{2,\alpha}-\|\tilde{Y}\|_{2,\alpha}\mid\tilde{X}=\tilde{x}].
Definition A.1 (Multivariate conditional quantile [Abdous and Theodorescu, 1992]).

Define the multivariate conditional $\alpha$-quantile $\theta_{\alpha}(\tilde{x})$ for $\alpha \in (0,1)$ as

\theta_{\alpha}(\tilde{x})=\mathop{\rm argmin}_{\theta\in\mathbb{R}^{s}}H_{\alpha}(\theta,\tilde{x}). \quad (A.1)
Remark A.2 (Compatibility with univariate quantile function).

For a scalar $\tilde{Y} \in \mathbb{R}$, its conditional quantile given $\tilde{X} = \tilde{x}$ is

\theta_{\alpha}(\tilde{x}) =\mathop{\rm argmin}_{\theta\in\mathbb{R}}\mathbb{E}\left[\|\tilde{Y}-\theta\|_{2,\alpha}\mid\tilde{X}=\tilde{x}\right]
=\mathop{\rm argmin}_{\theta\in\mathbb{R}}\mathbb{E}\left[|\tilde{Y}-\theta|+(2\alpha-1)(\tilde{Y}-\theta)\mid\tilde{X}=\tilde{x}\right]
=\mathop{\rm argmin}_{\theta\in\mathbb{R}}\mathbb{E}\left((\tilde{Y}-\theta)(\alpha-\mathds{1}(\tilde{Y}\leq\theta))\mid\tilde{X}=\tilde{x}\right)
=Q_{\alpha}(\tilde{x}),

for any $\alpha \in (0,1)$. Thus, Definition A.1 is consistent with the univariate case.

Multivariate RNW estimator

Following the definition in (6), we obtain the RNW estimator for multivariate responses as

\widehat{F}(\tilde{y}|\tilde{x})=\frac{\sum_{i=1}^{n}W\!\left(\frac{\tilde{X}_{i}-\tilde{x}}{h}\right)\mathds{1}(\tilde{Y}_{i}\leq\tilde{y})}{\sum_{i=1}^{n}W\!\left(\frac{\tilde{X}_{i}-\tilde{x}}{h}\right)}, \quad (A.2)

where, according to (7) and (8),

W(u)=\frac{K(u)}{1+\lambda u_{1}K(u)},

for $u = (u_1, \ldots, u_p)^{\top}$. Now, let $W_h(u) = h^{-p} W(h^{-1}u)$, and define an estimator for $H_{\alpha}(\theta, \tilde{x})$ as

\widehat{H}_{\alpha}(\theta,\tilde{x})\coloneqq\int_{\mathbb{R}^{s}}(\|\tilde{y}-\theta\|_{2,\alpha}-\|\tilde{y}\|_{2,\alpha})\,\widehat{F}(d\tilde{y}|\tilde{x})=\frac{\sum_{i=1}^{n}W_{h}(\tilde{X}_{i}-\tilde{x})\left(\|\tilde{Y}_{i}-\theta\|_{2,\alpha}-\|\tilde{Y}_{i}\|_{2,\alpha}\right)}{\sum_{i=1}^{n}W_{h}(\tilde{X}_{i}-\tilde{x})},

and consequently the RNW conditional quantile estimator $\widehat{\theta}_{\alpha}$ as

\widehat{\theta}_{\alpha}(\tilde{x})\coloneqq\mathop{\rm argmin}_{\theta\in\mathbb{R}^{s}}\widehat{H}_{\alpha}(\theta,\tilde{x})=\mathop{\rm argmin}_{\theta\in\mathbb{R}^{s}}\sum_{i=1}^{n}W_{h}(\tilde{X}_{i}-\tilde{x})(\|\tilde{Y}_{i}-\theta\|_{2,\alpha}-\|\tilde{Y}_{i}\|_{2,\alpha}). \quad (A.3)
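A minimal sketch of (A.3): evaluate the pseudo-norm of Definition A.1 and minimize the weighted objective over $\theta \in \mathbb{R}^s$ with a generic derivative-free optimizer. The RNW weights come from the univariate sketch in Section 3.1; the choice of Nelder-Mead and the initialization are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def pseudo_norm(v, alpha):
    # ||v||_{2,alpha}: componentwise (|v_j| + (2*alpha - 1) v_j)/2, followed by the Euclidean norm
    return np.linalg.norm((np.abs(v) + (2 * alpha - 1) * v) / 2.0)

def multivariate_rnw_quantile(X_train, Y_train, x_test, h, alpha):
    """Y_train: (n, s) multivariate residuals; returns an estimate of theta_alpha(x_test) in R^s."""
    W = rnw_weights(X_train, x_test, h)
    def objective(theta):
        return float(np.sum(W * np.array([pseudo_norm(y - theta, alpha) for y in Y_train])))
    return minimize(objective, x0=Y_train.mean(axis=0), method='Nelder-Mead').x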

Multivariate KOWCPI

Suppose we are sequentially observing $(X_t, Y_t) \in \mathbb{R}^d \times \mathbb{R}^s$, $t = 1, 2, \ldots$. Based on the construction of the multivariate version of the RNW estimator, we can extend our KOWCPI approach to multivariate responses in the same manner as described in Algorithm 1, with multivariate residuals

\hat{\varepsilon}_{t}=Y_{t}-\hat{f}(X_{t})\in\mathbb{R}^{s},

as non-conformity scores. This adaptation allows for the application of our methodology to a broader range of data scenarios involving dependent data with multivariate response variables, which were similarly studied in [Xu et al., 2024, Sun and Yu, 2024, Stankevičiūtė et al., 2021].

Appendix B Proofs

The following lemma is adapted from the proof of Lemma 1 of Tibshirani et al. [2019]; however, we do not assume exchangeability.

Lemma B.1 (Weights on quantile for non-exchangeable data).

Let $\{V_1, \ldots, V_{n+1}\}$ be a sequence of random variables with joint density $\mathcal{P}$, and let $\{v_1, \ldots, v_{n+1}\}$ be a sequence of observations. Define the event

E=\{\{V_{1},\ldots,V_{n+1}\}=\{v_{1},\ldots,v_{n+1}\}\}.

Then we have, for $i = 1, \ldots, n+1$,

\mathbb{P}\{V_{n+1}=v_{i}|E\}=\frac{\sum_{\sigma:\sigma(n+1)=i}\mathcal{P}(v_{\sigma(1)},\ldots,v_{\sigma(n+1)})}{\sum_{\sigma}\mathcal{P}(v_{\sigma(1)},\ldots,v_{\sigma(n+1)})}\in[0,1].

Note that when the residuals are exchangeable, $W_i^{*} = 1/(n+1)$, as also observed in Tibshirani et al. [2019]. Now we prove Proposition 4.1.

Proof of Proposition 4.1.

The proof assumes that $\tilde{Y}_i$, for $i = 1, \ldots, n+1$, are almost surely distinct. However, the proof remains valid, albeit with more complex notations involving multisets, if this is not the case. Denote by ${\rm Quantile}_{\beta}(\mathbb{Q})$ the $\beta$-quantile of a distribution $\mathbb{Q}$ on $\mathbb{R}$, and by $\delta_a$ the point mass distribution at $a \in \mathbb{R}$. Define the event $E = \{\{\tilde{Y}_1, \ldots, \tilde{Y}_{n+1}\} = \{v_1, \ldots, v_{n+1}\}\}$. Then, by the tower property, we have

\mathbb{P}(Y_{T+1}\in\widehat{C}_{T}^{\alpha}(X_{T+1}))
=\mathbb{E}(\mathbb{P}(Y_{T+1}\in\widehat{C}_{T}^{\alpha}(X_{T+1})\mid E))
=\mathbb{E}\left[\mathbb{P}\left({\rm Quantile}_{\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\leq\tilde{Y}_{n+1}\leq{\rm Quantile}_{1-\alpha+\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\,\Big|\,E\right)\right]
=\mathbb{E}\left[\mathbb{P}_{V\sim P^{W^{*}}}\left({\rm Quantile}_{\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\leq V\leq{\rm Quantile}_{1-\alpha+\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\right)\right],

where $P^{W^{*}} = \sum_{i=1}^{n+1} W_i^{*} \delta_{v_i}$, and in the last line, we have used the result from Lemma B.1,

\tilde{Y}_{n+1}\,|\,E\sim P^{W^{*}}.

Denote the weighted empirical distribution based on $\widehat{W} = \widehat{W}(\tilde{X}_{n+1})$ as

P^{\widehat{W}}=\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}.

This gives the marginal coverage gap as

\left|\mathbb{P}(Y_{n+1}\in\widehat{C}_{n}^{\alpha}(X_{n+1}))-(1-\alpha)\right|
\leq\mathbb{E}\left[\left|\mathbb{P}_{V\sim P^{W^{*}}}\left({\rm Quantile}_{\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\leq V\leq{\rm Quantile}_{1-\alpha+\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\right)\right.\right.
\quad\quad\left.\left.-\,\mathbb{P}_{V\sim P^{\widehat{W}}}\left({\rm Quantile}_{\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\leq V\leq{\rm Quantile}_{1-\alpha+\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\right)\right|\right]
\quad+\mathbb{E}\left|\mathbb{P}_{V\sim P^{\widehat{W}}}\left({\rm Quantile}_{\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\leq V\leq{\rm Quantile}_{1-\alpha+\beta^{*}}\left(\sum_{i=1}^{n}\widehat{W}_{i}\delta_{v_{i}}\right)\right)-(1-\alpha)\right|
\leq\mathbb{E}[d_{\rm TV}(P^{W^{*}},P^{\widehat{W}})]
\quad+\mathbb{E}\left|\widehat{F}\left(\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{X}_{n+1})\,\Big|\,\tilde{X}_{n+1}\right)-(1-\alpha+\beta^{*})\right|+\mathbb{E}\left|\widehat{F}\left(\widehat{Q}_{\beta^{*}}(\tilde{X}_{n+1})\,\Big|\,\tilde{X}_{n+1}\right)-\beta^{*}\right|
\leq\frac{1}{2}\mathbb{E}\|(W^{*})_{1:n}-\widehat{W}\|_{1}+\frac{1}{2}\mathbb{E}W^{*}_{n+1}+2\,\mathbb{E}\max_{i=1,\ldots,n}\widehat{W}_{i}(\tilde{X}_{n+1}),

where we denote by $d_{\rm TV}(\cdot,\cdot)$ the total variation distance between probability measures, and the second inequality is due to the definition of the total variation distance. ∎

B.1 Proof of asymptotic conditional coverage of KOWCPI (Corollary 4.8 and Theorem 4.9)

In deriving the asymptotic conditional coverage property of KOWCPI, the consistency of the RNW estimator plays a crucial role. Therefore, we first present the proof of Proposition 4.6, which establishes the consistency of the CDF estimator. Corollary 4.7, which states the consistency of the quantile estimator, is a natural consequence of Proposition 4.6 and leads to the proofs of our main results, Corollary 4.8 and Theorem 4.9. The proof of Proposition 4.6 adopts a strategy similar to Salha [2006] and Cai [2002].

To prove Proposition 4.6, it is essential to first understand the nature of the adjustment weights $p_i(\mathbf{z})$. Thus, Lemma 3.1 is not only crucial for the practical implementation of the RNW estimator but also indispensable in the proof of Proposition 4.6.

Proof of Lemma 3.1.

For display purposes, denote $[X]_1$ as $X_1$. By (5), we have that

\sum_{i=1}^{n}p_{i}(\mathbf{z})(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z})=0. \quad (A.4)

Let

\mathcal{L}(\lambda_{1},\lambda_{2},p_{1}(\mathbf{z}),\ldots,p_{n}(\mathbf{z}))
=\sum_{i=1}^{n}\log p_{i}(\mathbf{z})+\lambda_{1}\Bigg{(}1-\sum_{i=1}^{n}p_{i}(\mathbf{z})\Bigg{)}+\lambda_{2}\sum_{i=1}^{n}p_{i}(\mathbf{z})(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z}),

where $\lambda_1, \lambda_2 \in \mathbb{R}$ are the Lagrange multipliers. From $\partial\mathcal{L}/\partial p_i(\mathbf{z}) = 0$ for $i = 1, \ldots, n$, we get

p_{i}^{-1}(\mathbf{z})-\lambda_{1}+\lambda_{2}(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z})=0.

Since the $p_i(\mathbf{z})$'s sum to 1 as in (4), letting $\lambda = -\lambda_2/\lambda_1$, we have

p_{i}(\mathbf{z})=\frac{[1+\lambda(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z})]^{-1}}{\sum_{j=1}^{n}[1+\lambda(\tilde{X}_{j1}-\tilde{x}_{1})K_{h}(\tilde{X}_{j}-\mathbf{z})]^{-1}}.

Using (4) again with (A.4), this gives

\sum_{j=1}^{n}\left[1+\lambda(\tilde{X}_{j1}-\tilde{x}_{1})K_{h}(\tilde{X}_{j}-\mathbf{z})\right]^{-1}=n\Bigg{(}\sum_{i=1}^{n}p_{i}(\mathbf{z})[1+\lambda(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z})]\Bigg{)}^{-1}=n,

and therefore (8) holds. With (5), this gives

0=\sum_{i=1}^{n}\frac{(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z})}{1+\lambda(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z})}=-\left.\frac{\partial L(\gamma;\mathbf{z})}{\partial\gamma}\right|_{\gamma=\lambda}.

Note that $\frac{\partial^{2} L(\gamma;\mathbf{z})}{\partial\gamma^{2}} \geq 0$, implying that $L(\cdot;\mathbf{z})$ is indeed a convex function. ∎

Lemma B.2.

Under the assumptions of Proposition 4.6, define

c(\mathbf{z})=\frac{\frac{\partial g(\mathbf{z})}{\partial\tilde{x}_{1}}\mu_{2}}{g(\mathbf{z})\nu_{2}}.

Then,

\lambda=h^{w}\cdot c(\mathbf{z})(1+o_{p}(1))=\mathcal{O}_{p}(h^{w}). (A.5)
Proof.

Let

S_{i}=(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z}).

Then, by Assumption 4.4, S_{i} is bounded in absolute value by some constant C_{1}. Let

\overline{S^{k}}=\frac{1}{n}\sum_{i=1}^{n}(S_{i})^{k},

for k=1,2. Then, from (A.4), we have

\displaystyle 0=\frac{1}{n}\sum_{i=1}^{n}(1+\lambda S_{i})^{-1}S_{i}\geq\left|\lambda\right|\left|\frac{1}{n}\sum_{i=1}^{n}S_{i}^{2}(1+\lambda S_{i})^{-1}\right|-\left|\overline{S^{1}}\right|\geq\left|\lambda\right|(1+C_{1}\left|\lambda\right|)^{-1}\overline{S^{2}}-\left|\overline{S^{1}}\right|,

which gives

\left|\lambda\right|\leq\frac{\left|\overline{S^{1}}\right|}{\overline{S^{2}}-C_{1}\left|\overline{S^{1}}\right|}.

Using a Taylor expansion [Wand and Jones, 1994], we obtain that

\displaystyle\mathbb{E}\overline{S^{1}}=\int_{\mathbb{R}^{w}}(u_{1}-\tilde{x}_{1})K_{h}(\mathbf{u}-\mathbf{z})g(\mathbf{u})d\mathbf{u}
\displaystyle=h\int_{\mathbb{R}^{w}}u_{1}K(\mathbf{u})g(\mathbf{z}+h\mathbf{u})d\mathbf{u}
\displaystyle=h\int_{\mathbb{R}^{w}}u_{1}K(\mathbf{u})\Bigg{(}g(\mathbf{z})+h\sum_{j=1}^{w}u_{j}\frac{\partial g(\mathbf{z})}{\partial\tilde{x}_{j}}\Bigg{)}d\mathbf{u}+o(h^{2})
\displaystyle=h^{2}\left\{\frac{\partial g(\mathbf{z})}{\partial\tilde{x}_{1}}\mu_{2}+o_{p}(1)\right\},

where the last equality follows from Assumptions 4.4-(i) and (ii). With a similar argument, we can derive that

\displaystyle\mathbb{E}\overline{S^{2}}=\int_{\mathbb{R}^{w}}(u_{1}-\tilde{x}_{1})^{2}K_{h}^{2}(\mathbf{u}-\mathbf{z})g(\mathbf{u})d\mathbf{u}=h^{-w+2}\left\{g(\mathbf{z})\nu_{2}+o_{p}(h)\right\},

using Assumption 4.4-(iii). Since the first-order condition implies \lambda=\overline{S^{1}}/\overline{S^{2}}\,(1+o_{p}(1)), dividing the two expansions yields \lambda=h^{w}c(\mathbf{z})(1+o_{p}(1)), and therefore we obtain (A.5). ∎

Decomposing \widehat{F}(\tilde{y}|\mathbf{z})-F(\tilde{y}|\mathbf{z}) into bias and variance terms, we get

\displaystyle\widehat{F}(\tilde{y}|\mathbf{z})-F(\tilde{y}|\mathbf{z})
\displaystyle=\frac{\sum_{i=1}^{n}p_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})\{\mathds{1}(\tilde{Y}_{i}<\tilde{y})-F(\tilde{y}|\mathbf{z})\}}{\sum_{i=1}^{n}p_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})}
\displaystyle=\frac{\sum_{i=1}^{n}p_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})\delta_{i}+\sum_{i=1}^{n}p_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})\{F(\tilde{y}|\tilde{X}_{i})-F(\tilde{y}|\mathbf{z})\}}{\sum_{i=1}^{n}p_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})},

where \delta_{i}=\mathds{1}(\tilde{Y}_{i}\leq\tilde{y})-F(\tilde{y}|\tilde{X}_{i}). Note that \mathbb{E}[\delta_{i}]=0 due to the tower property. Now, let

b_{i}(\mathbf{z})=b_{i}(\tilde{X}_{i},\mathbf{z})\coloneqq\left[1+h^{w}\cdot c(\mathbf{z})(\tilde{X}_{i1}-\tilde{x}_{1})K_{h}(\tilde{X}_{i}-\mathbf{z})\right]^{-1}.

Then, by Lemma B.2, we have that

p_{i}(\mathbf{z})=n^{-1}b_{i}(\mathbf{z})(1+o_{p}(1)). (A.6)

Define

\displaystyle J_{1}=n^{-1/2}h^{w/2}\sum_{i=1}^{n}b_{i}(\mathbf{z})\delta_{i}K_{h}(\tilde{X}_{i}-\mathbf{z}),
\displaystyle J_{2}=n^{-1}\sum_{i=1}^{n}\left\{F(\tilde{y}|\tilde{X}_{i})-F(\tilde{y}|\mathbf{z})\right\}b_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z}),
\displaystyle J_{3}=n^{-1}\sum_{i=1}^{n}b_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z}),

so that

\widehat{F}(\tilde{y}|\mathbf{z})-F(\tilde{y}|\mathbf{z})=\{(nh^{w})^{-1/2}J_{1}+J_{2}\}J_{3}^{-1}\{1+o_{p}(1)\}. (A.7)

Therefore, we will derive Proposition 4.6 by controlling the terms J_{1}, J_{2} and J_{3}.
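For concreteness, here is a minimal Python sketch of the RNW conditional CDF estimator \widehat{F}(\tilde{y}|\mathbf{z}) whose estimation error is decomposed in (A.7). It reuses the illustrative gaussian_kernel and rnw_adjustment_weights helpers from the sketch following the proof of Lemma 3.1; those names and the Gaussian kernel choice are our own assumptions, not the paper's code.

import numpy as np

def rnw_conditional_cdf(X, Y, z, y_grid, h):
    # X: (n, w) predictors, Y: (n,) responses (e.g., non-conformity scores),
    # z: (w,) query point, y_grid: (m,) evaluation points, h: bandwidth
    n, w = X.shape
    p = rnw_adjustment_weights(X, z, h)  # adjustment weights p_i(z) from Lemma 3.1
    Kh = np.array([gaussian_kernel((x - z) / h) for x in X]) / h ** w
    W = p * Kh
    W = W / W.sum()  # normalized RNW weights
    # F_hat(y | z) = sum_i W_i(z) * 1{Y_i <= y}
    return np.array([float(np.sum(W * (Y <= y))) for y in y_grid])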

Lemma B.3.

Under the assumptions of Proposition 4.6,

J_{1}=\mathcal{O}_{p}(1). (A.8)
Proof.

Let

\xi_{i}=h^{w/2}b_{i}(\mathbf{z})\delta_{i}K_{h}(\tilde{X}_{i}-\mathbf{z}),

so that J_{1}=n^{-1/2}\sum_{i=1}^{n}\xi_{i}. Since \mathbb{E}(\delta_{i}|\tilde{X}_{i})=0, we have that \mathbb{E}(\xi_{i})=\mathbb{E}(\mathbb{E}(\xi_{i}|\tilde{X}_{i}))=0, and thus

\displaystyle\mathbb{E}J_{1}=0. (A.9)

Also, due to the stationarity of \tilde{X}_{i}, we have that

\mbox{Var}(J_{1})=\mathbb{E}\xi_{1}^{2}+2\sum_{i=2}^{n}\left(1-\frac{i-1}{n}\right)\mbox{Cov}(\xi_{1},\xi_{i}). (A.10)

By Assumption 4.4, we have that \lim_{n\to\infty}b_{i}(\mathbf{z})=1, which gives \mathbb{E}(b_{i})=1+o_{p}(1). Therefore, through expansion, we have

\displaystyle\mathbb{E}\xi_{i}^{2}=h^{w}\mathbb{E}\left[\mathbb{E}\left[b_{i}^{2}(\mathbf{z})K_{h}^{2}(\tilde{X}_{i}-\mathbf{z})\delta_{i}^{2}\mid\tilde{X}_{i}\right]\right]
\displaystyle=h^{w}\mathbb{E}\left[b_{i}^{2}(\mathbf{z})K_{h}^{2}(\tilde{X}_{i}-\mathbf{z})F(\tilde{y}|\tilde{X}_{i})(1-F(\tilde{y}|\tilde{X}_{i}))\right]
\displaystyle=\left[(K^{2})_{h}\ast\{b_{i}^{2}(\cdot,\mathbf{z})F(\tilde{y}|\cdot)(1-F(\tilde{y}|\cdot))g(\cdot)\}\right](\mathbf{z})
\displaystyle=\nu_{0}F(\tilde{y}|\mathbf{z})(1-F(\tilde{y}|\mathbf{z}))g(\mathbf{z})+o_{p}(1),

where \ast in the third line is the convolution operator. To control the second term on the right-hand side of (A.10), we borrow the idea of Masry [1986]. Choose d_{n}=\mathcal{O}(h^{-\frac{w}{1+\delta/2}}) and decompose

\displaystyle\sum_{i=2}^{n}\left(1-\frac{i-1}{n}\right)\mbox{Cov}(\xi_{1},\xi_{i})=\sum_{i=2}^{d_{n}}\left(1-\frac{i-1}{n}\right)\mbox{Cov}(\xi_{1},\xi_{i})+\sum_{i=d_{n}+1}^{n}\left(1-\frac{i-1}{n}\right)\mbox{Cov}(\xi_{1},\xi_{i}).

We have that \left|b_{i}(\mathbf{z})\delta_{i}\right|\leq C_{2} for some constant C_{2}. By Assumption 4.3-(iv), we obtain

\displaystyle\left|\mbox{Cov}(\xi_{1},\xi_{i})\right|=\left|\int_{\mathbb{R}^{w}}\int_{\mathbb{R}^{w}}\xi_{1}\xi_{i}g_{1,i}(\mathbf{u},v)d\mathbf{u}dv-\int_{\mathbb{R}^{w}}\xi_{1}g(\mathbf{u})d\mathbf{u}\int_{\mathbb{R}^{w}}\xi_{i}g(v)dv\right|
\displaystyle\leq C_{2}^{2}h^{w}\int_{\mathbb{R}^{w}}\int_{\mathbb{R}^{w}}K(\mathbf{u})K(v)\left|g_{1,i}(\mathbf{z}-h\mathbf{u},\mathbf{z}-hv)-g(\mathbf{z}-h\mathbf{u})g(\mathbf{z}-hv)\right|d\mathbf{u}dv
\displaystyle\leq C_{2}^{2}Mh^{w},

so that

\sum_{i=2}^{d_{n}}\left(1-\frac{i-1}{n}\right)\mbox{Cov}(\xi_{1},\xi_{i})=\mathcal{O}_{p}(d_{n}h^{w})=o_{p}(1).

By Assumption 4.4, we have \|(\tilde{X}_{i}-\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})\|\leq C_{3}, so that \left|\xi_{i}\right|\leq Ch^{-w/2}. Then, by Theorem 17.2.1 of Ibragimov et al. [1971], we have that

\left|\mbox{Cov}(\xi_{1},\xi_{i})\right|\leq Ch^{-w}\alpha(i-1).

Thus, we get

\displaystyle\sum_{i=d_{n}+1}^{n}\left(1-\frac{i-1}{n}\right)\mbox{Cov}(\xi_{1},\xi_{i})\leq Ch^{-w}\sum_{i\geq d_{n}}\alpha(i)\leq Ch^{-w}d_{n}^{-1-\delta}=o(1).

Therefore, we obtain

J_{1}=\mathbb{E}J_{1}+\mathcal{O}_{p}\left(\sqrt{\mbox{Var}(J_{1})}\right)=\mathcal{O}_{p}(1). (A.11) ∎

Lemma B.4.

Under the assumptions of Proposition 4.6,

\displaystyle J_{2}=\frac{1}{2}h^{2}\mu_{2}\mbox{tr}(D_{\mathbf{z}}^{2}F(\tilde{y}|\mathbf{z}))g(\mathbf{z})+o_{p}(h^{2}), (A.12)
\displaystyle J_{3}=g(\mathbf{z})+o_{p}(1). (A.13)
Proof.

By Assumption 1, (4) and (5), a Taylor expansion gives

\displaystyle J_{2}=(2n)^{-1}\sum_{i=1}^{n}(\tilde{X}_{i}-\mathbf{z})^{\top}(D_{\mathbf{z}}^{2}F(\tilde{y}|\mathbf{z}))(\tilde{X}_{i}-\mathbf{z})b_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})+o_{p}(h^{2}).

Since

\displaystyle\mathbb{E}[(\tilde{X}_{i}-\mathbf{z})^{\top}D_{\mathbf{z}}^{2}F(\tilde{y}|\mathbf{z})(\tilde{X}_{i}-\mathbf{z})b_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})]
\displaystyle=\mbox{tr}(D_{\mathbf{z}}^{2}F(\tilde{y}|\mathbf{z})\mathbb{E}[(\tilde{X}_{i}-\mathbf{z})(\tilde{X}_{i}-\mathbf{z})^{\top}b_{i}(\mathbf{z})K_{h}(\tilde{X}_{i}-\mathbf{z})])
\displaystyle=h^{2}\mu_{2}g(\mathbf{z})\mbox{tr}(D_{\mathbf{z}}^{2}F(\tilde{y}|\mathbf{z}))+o_{p}(h^{2}),

we have

J_{2}=\frac{1}{2}h^{2}\mu_{2}\mbox{tr}(D_{\mathbf{z}}^{2}F(\tilde{y}|\mathbf{z}))g(\mathbf{z})+o_{p}(h^{2}). (A.14)

Finally, applying the same expansion argument, we get

J_{3}=g(\mathbf{z})+o_{p}(1). (A.15) ∎

Proof of Proposition 4.6 [Cai, 2002, Salha, 2006].

Combining Lemmas B.3 and B.4 with (A.7) and applying Assumption 4.5 gives the result. ∎

Proof of Corollary 4.7 [Cai, 2002].

Given \mathbf{z}, Proposition 4.6 implies uniform convergence of \widehat{F}(\cdot|\mathbf{z}) to F(\cdot|\mathbf{z}) in probability [Tucker, 1967, pp. 127-128], since F(\cdot|\mathbf{z}) is a CDF. That is,

\sup_{\tilde{y}\in\mathbb{R}}\left|\widehat{F}(\tilde{y}|\mathbf{z})-F(\tilde{y}|\mathbf{z})\right|\to 0\mbox{ in probability}.

Given \varepsilon>0, let \delta=\delta(\varepsilon)\coloneqq\min\{\beta-F(Q_{\beta}(\mathbf{z})-\varepsilon|\mathbf{z}),F(Q_{\beta}(\mathbf{z})+\varepsilon|\mathbf{z})-\beta\}. Note that \delta>0 due to the uniqueness of the quantile. We have

\displaystyle\mathbb{P}\left(\left|\widehat{Q}_{\beta}(\mathbf{z})-Q_{\beta}(\mathbf{z})\right|>\varepsilon\right)\leq\mathbb{P}\left(\left|F(\widehat{Q}_{\beta}(\mathbf{z})|\mathbf{z})-\beta\right|>\delta\right)
\displaystyle\leq\mathbb{P}\left(\sup_{\tilde{y}\in\mathbb{R}}\left|\widehat{F}(\tilde{y}|\mathbf{z})-F(\tilde{y}|\mathbf{z})\right|>\delta\right).

The uniform convergence of \widehat{F}(\cdot|\mathbf{z}) in probability gives the result. ∎
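To connect Corollary 4.7 with the prediction interval in (3), the following hedged Python sketch extracts the plug-in quantile \widehat{Q}_{\beta}(\mathbf{z}) as a generalized inverse of the estimated CDF and forms the interval [\widehat{Q}_{\beta^{*}}(\mathbf{z}),\widehat{Q}_{1-\alpha+\beta^{*}}(\mathbf{z})] for the next score. It builds on the illustrative rnw_conditional_cdf helper above, and \beta^{*} is treated as given (its data-driven selection in KOWCPI is not reproduced here).

import numpy as np

def rnw_quantile(X, Y, z, beta, h):
    # generalized inverse: Q_beta(z) = inf{ y : F_hat(y | z) >= beta }
    ys = np.sort(Y)
    cdf = rnw_conditional_cdf(X, Y, z, ys, h)  # non-decreasing over the sorted Y
    idx = int(np.searchsorted(cdf, beta))
    return ys[min(idx, len(ys) - 1)]

def prediction_interval(X, Y, z, alpha, beta_star, h):
    # interval [Q_{beta*}(z), Q_{1 - alpha + beta*}(z)] for the next score, as in (3)
    lower = rnw_quantile(X, Y, z, beta_star, h)
    upper = rnw_quantile(X, Y, z, 1.0 - alpha + beta_star, h)
    return lower, upper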

Proof of Corollary 4.8.

From the definition of \widehat{C}_{t-1}^{\alpha} in (3), we have

\displaystyle\mathbb{P}\left(Y_{t}\in\widehat{C}_{t-1}^{\alpha}(X_{t})\Big{|}X_{t}\right)=\mathbb{P}\left(\tilde{Y}_{n+1}\in\left[\widehat{Q}_{\beta^{*}}(\tilde{X}_{n+1}),\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{X}_{n+1})\right]\Big{|}\,\tilde{X}_{n+1}\right)
\displaystyle=F\left(\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{X}_{n+1})\Big{|}\tilde{X}_{n+1}\right)-F\left(\widehat{Q}_{\beta^{*}}(\tilde{X}_{n+1})\Big{|}\tilde{X}_{n+1}\right).

By Corollary 4.7, we have the consistency of \widehat{Q}_{\beta} for all \beta\in(0,1). Combined with this, the continuous mapping theorem and Assumption 4.3 give

\displaystyle F\left(\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{X}_{n+1})\Big{|}\tilde{X}_{n+1}\right)-F\left(\widehat{Q}_{\beta^{*}}(\tilde{X}_{n+1})\Big{|}\tilde{X}_{n+1}\right)
\displaystyle\to F\left(Q_{1-\alpha+\beta^{*}}(\tilde{X}_{n+1})\Big{|}\tilde{X}_{n+1}\right)-F\left(Q_{\beta^{*}}(\tilde{X}_{n+1})\Big{|}\tilde{X}_{n+1}\right)
\displaystyle=1-\alpha,

where the convergence is in probability. ∎

Proof of Theorem 4.9.

From the definition of \widehat{C}_{t-1}^{\alpha} in (3), we have

\displaystyle\left|\mathbb{P}\left(Y_{t}\in\widehat{C}_{t-1}^{\alpha}(x)\Big{|}X_{t}=x\right)-(1-\alpha)\right|
\displaystyle=\left|\mathbb{P}\left(\tilde{Y}_{n+1}\in\left[\widehat{Q}_{\beta^{*}}(\tilde{x}),\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{x})\right]\Big{|}\ \tilde{X}_{n+1}=\tilde{x}\right)-(1-\alpha)\right|
\displaystyle=\left|F\left(\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{x})\Big{|}\tilde{X}_{n+1}=\tilde{x}\right)-F\left(\widehat{Q}_{\beta^{*}}(\tilde{x})\Big{|}\tilde{X}_{n+1}=\tilde{x}\right)-(1-\alpha)\right|
\displaystyle\leq\left|\widehat{F}\left(\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{x})\Big{|}\tilde{X}_{n+1}=\tilde{x}\right)-(1-\alpha+\beta^{*})\right|+\left|\widehat{F}\left(\widehat{Q}_{\beta^{*}}(\tilde{x})\Big{|}\tilde{X}_{n+1}=\tilde{x}\right)-\beta^{*}\right|
\displaystyle\quad+\left|F\left(\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{x})\Big{|}\tilde{X}_{n+1}=\tilde{x}\right)-\widehat{F}\left(\widehat{Q}_{1-\alpha+\beta^{*}}(\tilde{x})\Big{|}\tilde{X}_{n+1}=\tilde{x}\right)\right|
\displaystyle\quad+\left|F\left(\widehat{Q}_{\beta^{*}}(\tilde{x})\Big{|}\tilde{X}_{n+1}=\tilde{x}\right)-\widehat{F}\left(\widehat{Q}_{\beta^{*}}(\tilde{x})\Big{|}\tilde{X}_{n+1}=\tilde{x}\right)\right|
\displaystyle\leq 2\Delta(\tilde{x})+\mathcal{O}_{p}((nh^{w})^{-1/2}+h^{2}),

where the last inequality comes from (A.7) and the definition of the discrete gap \Delta. ∎

Appendix C Additional experimental results

We show additional rolling comparisons on the solar (Figure A.1) and wind (Figure A.2) datasets, where the setup and conclusions remain consistent with those on the electric dataset (Figure 2).

[Figure A.1: Rolling coverage and width comparison on the solar dataset by different methods. Panels: (a) rolling coverage (boxplot); (b) width of prediction intervals (boxplot); (c) rolling coverage over prediction time indices.]

[Figure A.2: Rolling coverage and width comparison on the wind dataset by different methods. Panels: (a) rolling coverage (boxplot); (b) width of prediction intervals (boxplot); (c) rolling coverage over prediction time indices.]