
Conformal inference for regression on Riemannian Manifolds

Alejandro Cholaquidis, Fabrice Gamboa, Leonardo Moreno
Abstract

Regression on manifolds, and, more broadly, statistics on manifolds, has garnered significant importance in recent years due to the vast number of applications for non-Euclidean data. Circular data is a classic example, but so is data in the space of covariance matrices, or data on the Grassmannian manifold obtained as a result of principal component analysis, among many others. In this work we investigate prediction sets for regression scenarios when the response variable, denoted by Y, resides in a manifold, and the covariable, denoted by X, lies in a Euclidean space. This extends the concepts delineated in [61] to this novel context. Aligning with traditional principles in conformal inference, these prediction sets are distribution-free, meaning that no specific assumptions are imposed on the joint distribution of (X,Y), and they maintain a non-parametric character. We prove the asymptotic almost sure convergence of the empirical version of these regions on the manifold to their population counterparts. The efficiency of this method is shown through a comprehensive simulation study and an analysis involving real-world data.

1 Introduction

Conformal prediction is a powerful set of statistical tools that operates under minimal assumptions about the underlying model. It is primarily used to construct confidence sets that are applicable to a diverse range of problems in fields such as machine learning and statistics. It is unique in its ability to construct prediction regions that guarantee a specified level of coverage for finite samples, irrespective of the distribution of the data. This is particularly crucial in high-stakes decision-making scenarios where maintaining a certain level of coverage is critical.

Unlike other methods, conformal prediction does not require strong assumptions about the sample distribution, such as normality. The aim is to construct prediction regions as small as possible to yield informative and precise predictions, enhancing the tool’s utility in various applications.

The approach was first proposed by Vovk, Gammerman, and Shafer in the late 1990s, as referenced in [57]. Since its inception, it has been the subject of intense research activity. Originally formulated for binary classification problems, the method has since been expanded to accommodate regression, multi-class classification, functional data, functional time series, and anomaly detection, among others. Several applications of this method can be found in the book [5].

In the context of regression, conformal prediction has proven to be efficient in constructing prediction sets, as evidenced by works such as [61], [62], and [34]. To enhance the performance of these prediction sets, particularly to decrease the length of prediction intervals (when the output is one-dimensional), a combination of conformal inference and quantile regression was proposed in [53]. In [20], computationally efficient conformal inference methods for Bayesian models were proposed.

The method has also been extended to functional regression, where the predictors and responses are functions, rather than vectors, see for instance [41], [21] and [18]. In this case, the prediction regions take the form of functional envelopes that have a high probability of containing the true function.

In the field of classification, conformal prediction has been employed to tackle a broad spectrum of problems. These include image classification, as in [40], and text classification, as in [58]. For multi-class classification, a prevalent approach is to create prediction sets that have a high likelihood of encompassing the correct class. This can be realized via the use of ‘Venn prediction sets’, as outlined in [43]. These sets partition the label space into overlapping regions, each one corresponding to a distinct class.

In summary, conformal prediction provides powerful statistical tools and has been successfully applied to a wide range of problems in machine learning, statistics, and related fields. Its main advantage is its ability to provide distribution-free prediction regions that can be used in the presence of any underlying distribution of the data. As this field of research continues to evolve, it is expected to find even more applications in the future.

2 Conformal inference on manifolds

Although, as mentioned, conformal prediction has been extended to various scenarios, there are no proposals or extensions to the case where the output is in a Riemannian manifold. This case is of considerable importance because there are certain types of data that, due to their inherent characteristics, must be treated as data on manifolds. A prominent example is given by covariance matrices (for instance, the volatility of a portfolio, see [8] and [28]), which are data in the manifold of positive definite matrices (see [14]). Another example is given by vectorcardiograms, which are condensed into data on the Stiefel manifold (see [13]). The outcome of performing a principal component analysis results in data on the Grassmannian manifold (see [27]), and wind measurements can be represented as data on the cylinder (see [12]).

Other applications in image analysis are developed in [49] and more generally in machine learning in [24].

The extension to this context is not trivial, since when the output Y belongs to a Riemannian manifold \mathcal{M}, several arguments used in Euclidean conformal inference are not straightforwardly extendable. For instance, the Fréchet mean on the manifold can be defined, but it does not always exist, nor is it necessarily unique. Moreover, formulating a regression model on the manifold is not easy because of the absence of any additive structure, although there have been recent advances in this area (see for instance [50]).

We extend the methodologies outlined in [61] to the broader context of pairs (X,Y) where X\in\mathbb{R}^{d} and the response Y has support included in a smooth enough \ell-dimensional manifold \mathcal{M}. The work [61] focuses on the scenario where Y is a real-valued variable, constructing confidence sets (there referred to as confidence bands) via density estimators. One of the key tools to get the consistency of the empirical conformal region to its population counterpart, as in [61], is Theorem 1, which states that the kernel density estimator is uniformly consistent on manifolds. Once this is established, to adapt the ideas in [61] it must also be proved that the conditional density is uniformly bounded from above; this is done in Lemma 4. Finally, a further distinction from [61] is that, because the manifold may possess a nonempty boundary, points located well inside the manifold are handled differently from those situated near its edge. Here, we obtain an upper bound for the discrepancy between the empirical confidence set \hat{\mathbf{C}}_{n}(X) and its theoretical counterpart \mathbf{C}_{P}(X), in terms of \nu, the volume measure on \mathcal{M}. More precisely, as detailed in Theorem 2 (see also Remark 1), the probability that this discrepancy is larger than n^{-1/((\ell+3)(d+2))} is bounded from above by A_{\lambda}n^{-\lambda} for any \lambda>0, A_{\lambda} being a constant that depends only on \lambda, where \ell denotes the dimensionality of \mathcal{M} and X\in\mathbb{R}^{d}. Our main and stronger theorem is for compact manifolds; the case of non-compact manifolds is considered in Section 5.

The kernel-based density estimator proposed in [12] uses the Euclidean distance instead of the geodesic distance on the manifold, which may be unknown in some cases; see Equation (4). Due to the curse of dimensionality, kernel-based density estimators are generally unsuitable for high-dimensional problems. In Section 6, we discuss an approach to overcome this limitation that builds on the ideas introduced in [30]. In essence, instead of estimating the conditional density p(y|x) locally (i.e., using only sample points close to x), the proposed method uses all sample points (X_{i},Y_{i}) whose estimated conditional densities \hat{p}(Y_{i}|X_{i}) are similar (in a sense to be defined).

Conformal inference in metric spaces is tackled in [44]; however, given that level of generality, convergence rates such as those obtained in the present work are not provided, only consistency in probability, and only for the case where the confidence bands are constructed through regression.

2.1 Conformal inference in a nutshell

In this section, we briefly provide the basic foundations of conformal inference to facilitate the reading of the following sections. For a more detailed account see [22]. A key hypothesis in conformal inference is the exchangeability assumption, which means that for any permutation \pi of \{1,\dots,n\} the distribution of the sample (Z_{1},\dots,Z_{n}), where Z_{i} is a random element in some measurable space \mathbf{Z}, is the same as the distribution of (Z_{\pi(1)},\dots,Z_{\pi(n)}). Quoting [22], “A nonconformity measure A(B,z):\mathbf{Z}^{n}\times\mathbf{Z}\to\mathbb{R} is a way of scoring how different an example z is from a bag B=\{Z_{1},\dots,Z_{n}\}. Let us define, p_{z}:=|\{i=1,\ldots,n+1:R_{i}\geq R_{n+1}\}|/(n+1) where R_{i}:=A\left(\{Z_{1},\ldots,Z_{i-1},Z_{i+1},\ldots,Z_{n},z\},Z_{i}\right) and R_{n+1}:=A\left(\{Z_{1},\ldots,Z_{n}\},z\right).” For \alpha\in[0,1] “we define the prediction set \gamma^{\alpha}(Z_{1},\ldots,Z_{n}):=\{z\in\mathbf{Z}:p_{z}>\alpha\}.” The following proposition is given in [58].
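
To fix ideas, the following is a minimal sketch, in Python, of the generic construction just described. The nonconformity measure (distance to the sample mean) and the candidate grid are illustrative choices, not prescribed by [22]; for real data one would plug in a problem-specific score.

```python
import numpy as np

def conformal_p_value(sample, z, nonconformity):
    """Full-conformal p-value p_z of a candidate point z.

    `sample` plays the role of the bag Z_1,...,Z_n and
    `nonconformity(bag, point)` the role of A(B, z).  For i <= n, R_i is the
    score of Z_i against the augmented bag with Z_i removed; R_{n+1} is the
    score of z against the original bag.
    """
    n = len(sample)
    augmented = sample + [z]
    r = np.array([nonconformity(augmented[:i] + augmented[i + 1:], augmented[i])
                  for i in range(n + 1)])
    return np.mean(r >= r[-1])   # |{i : R_i >= R_{n+1}}| / (n + 1)

def prediction_set(sample, candidates, nonconformity, alpha):
    """gamma^alpha(Z_1,...,Z_n) approximated over a finite candidate grid."""
    return [z for z in candidates
            if conformal_p_value(sample, z, nonconformity) > alpha]

# toy usage with a hypothetical score: distance to the sample mean
rng = np.random.default_rng(0)
data = list(rng.normal(size=30))
score = lambda bag, point: abs(point - np.mean(bag))
grid = np.linspace(-4, 4, 161)
region = prediction_set(data, grid, score, alpha=0.1)
print(min(region), max(region))
```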

Proposition 1 (Proposition 2.1).

Under the exchangeability assumption,

\mathbb{P}(Z_{n+1}\not\in\gamma^{\alpha}(Z_{1},\ldots,Z_{n}))\leq\alpha\quad\text{ for any }\alpha.

In the regression setting, that is, when Z_{i}=(X_{i},Y_{i}) has distribution P, the previous construction builds a level set of the joint distribution (see Equation 4 in [61]). Nevertheless, controlling \mathbb{P}(Y\in\mathbf{C}_{P}(x)|X=x) is much more convenient in regression (see [61]). In this case we speak of “conditional coverage”. When there exists a conditional density p(y|x) (see hypothesis H1 in Section 3), the “conditional oracle set” \mathbf{C}_{P}(x) (in [61] it is called the conditional oracle band) is defined as

\mathbf{C}_{P}(x)=\{y:p(y|x)\geq t^{\alpha}_{x}\}, (1)

where t^{\alpha}_{x} satisfies

\int\mathbbm{1}_{\{p(y|x)\geq t^{\alpha}_{x}\}}p(y|x)\,dy=1-\alpha. (2)
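
As an illustration of (1) and (2), the sketch below computes the cut-off t^{\alpha}_{x} and the oracle set on a grid, with the manifold replaced by an interval and a known Gaussian conditional density; this is only a numerical toy, not part of the method.

```python
import numpy as np

def oracle_level(density_vals, weights, alpha):
    """Smallest t such that the mass of {y : p(y|x) >= t} is at least 1 - alpha.

    `density_vals` are p(y|x) on grid points y_j and `weights` are quadrature
    weights, so that sum(density_vals * weights) is approximately 1.
    """
    order = np.argsort(density_vals)[::-1]            # highest density first
    mass = np.cumsum(density_vals[order] * weights[order])
    j = np.searchsorted(mass, 1 - alpha)              # first index reaching 1 - alpha
    return density_vals[order][min(j, len(order) - 1)]

# toy usage: conditional density N(x, 1) at x = 0.3, alpha = 0.1
x, alpha = 0.3, 0.1
y = np.linspace(-6, 6, 2001)
w = np.full_like(y, y[1] - y[0])
p = np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2 * np.pi)
t = oracle_level(p, w, alpha)
C_P = y[p >= t]                                       # oracle set, here an interval
print(round(t, 4), round(C_P.min(), 2), round(C_P.max(), 2))
```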

The empirical version of \mathbf{C}_{P}(x), obtained from \aleph_{n}=\{(X_{i},Y_{i}):i=1,\dots,n\}, an i.i.d. sample of (X,Y) with distribution P, is called a conditionally valid set \mathbf{C}_{n}(x) (see [61]).

Definition 1.

Given \alpha\in(0,1) and x in the support of P_{X}, a set \mathbf{C}_{n}(x)\subset\mathcal{M} is said to be conditionally valid if

\mathbb{P}\big(Y\in\mathbf{C}_{n}(x)|X=x\big)\geq 1-\alpha.

This definition captures the notion of a set of possible values of Y that provides a specified level of coverage for a given input x.

As is shown in Lemma 1 of [61], non-trivial finite sample conditional validity for all x in the support of P_{X} is impossible for a continuous distribution. To overcome this limitation, the following notion of local validity is introduced.

Definition 2.

Let \mathcal{A}=\{A_{j}:j\geq 1\} be a partition of \textrm{supp}(P_{X}). A prediction set \mathbf{C}_{n} is locally valid with respect to \mathcal{A} if

\mathbb{P}(Y_{n+1}\in\mathbf{C}_{n}(X_{n+1})|X_{n+1}\in A_{j})\geq 1-\alpha\text{ for all }j\text{ and all }P. (3)

Whenever A_{k}\in\mathcal{A}, we will write p(y|A_{k}) for the conditional density, w.r.t. \nu, of Y given X\in A_{k}. We aim to prove (see Theorem 2) that locally valid sets converge to \mathbf{C}_{P} when Y is supported on a manifold.

3 Assumptions

In the following, \aleph_{n}=\{(X_{i},Y_{i}):i=1,\dots,n\} denotes an i.i.d. sample of (X,Y) with distribution P, where X\in\mathbb{R}^{d} and Y\in\mathcal{M}. Here (\mathcal{M},\rho) is a compact \ell-dimensional submanifold of \mathbb{R}^{D}. The case of non-compact manifolds is discussed in Section 5. We further denote by P_{X} the marginal distribution of X and by \textrm{supp}(P_{X}) its support. We denote the volume measure on \mathcal{M} by \nu and use \|\cdot\| to denote the Euclidean norm on \mathbb{R}^{D}. We denote by \mu the d-dimensional Lebesgue measure in \mathbb{R}^{d}.

We will now give the set of assumptions that we will require. H1 to H4 are also imposed in [61]. Hypotheses H0 and H5 are imposed to guarantee the uniform convergence of the kernel-based density estimator of the conditional density.

  • H0

    \mathcal{M} is a \mathcal{C}^{2} submanifold, and if \partial\mathcal{M}\neq\emptyset, then \partial\mathcal{M} is also a \mathcal{C}^{2} submanifold.

  • H1

    The joint distribution P has a density p_{X,Y}(x,y) w.r.t. \mu\times\nu, and the marginal distribution P_{X} has a density p_{X} (w.r.t. \mu). We denote by p(y|x)=p_{X,Y}(x,y)/p_{X}(x) the conditional density, where p(y|x)=0 if p_{X}(x)=0.

  • H2

    p_{X} is such that there exist b_{1},b_{2} with 0<b_{1}\leq p_{X}(x)\leq b_{2}<\infty for all x in the support of X.

  • H3

    p(y|x) is Lipschitz continuous as a function of x, i.e., there exists a constant L>0 such that \|p(\cdot|x)-p(\cdot|x^{\prime})\|_{\infty}\leq L|x-x^{\prime}|.

  • H4

    There are positive constants \epsilon_{0}, \gamma, c_{1}, and c_{2} such that for all x in the support of X,

    c_{1}\epsilon^{\gamma}\leq P(\{y:|p(y|x)-t_{x}^{\alpha}|<\epsilon\}|X=x)\leq c_{2}\epsilon^{\gamma},

    for all \epsilon\leq\epsilon_{0}, where t_{x}^{\alpha} is given by (2). Moreover, \inf_{x}t_{x}^{\alpha}\geq t_{0}>0.

  • H5

    p(y|A_{k}) has continuous partial second derivatives; for short, p(y|A_{k})\in\mathcal{C}^{2}.

Hypothesis H0 is satisfied by a very wide range of manifolds, among them the Stiefel and Grassmannian manifolds and the cone of positive definite matrices, among many others. It is not a restrictive hypothesis in applications.

Assumptions H1 to H5 impose regularity on the joint and conditional densities of the data. In particular, the Lipschitz continuity assumption H3 ensures that small changes in the input x lead to small changes in the conditional density of the output. Assumption H2 implies that X is compactly supported. Regarding Assumption H4, quoting [61], it “is related to the notion of the ‘\gamma-exponent’ condition that was introduced by [51], and widely used in the density level set literature ([56, 52]). It ensures that the conditional density p(\cdot|x) is neither too flat nor too steep near the contour at level t_{x}^{\alpha}, so the cut-off value t_{x}^{\alpha} and the conditional density level set \mathbf{C}_{P}(x)=L_{x}(t_{x}^{\alpha}) can be approximated from a finite sample […]. Assumption H4 also requires that the optimal cut-off values t_{x}^{\alpha} be bounded away from zero.” Assumption H5 is required to get the uniform convergence of the kernel-based estimator of p(y|A_{k}) on manifolds, see [7]. These assumptions play a crucial role in the development and analysis of conformal prediction methods.

Let us introduce some important sequences of positive real numbers that will play a key role throughout the manuscript.

  1. w_{n}=(\log(n)/n)^{1/(d+2)},

  2. \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor, where b_{1} is the positive constant introduced in H2. Observe that \gamma_{n}\to\infty.

  3. Given h_{n},c_{n}\to 0 as n\to\infty, we will consider the subsequences c_{\gamma_{n}} and h_{\gamma_{n}}, with \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor as before.

4 Locally valid sets from a kernel density estimator

In this section we introduce a slightly modified version of the estimated local marginal density \hat{p}^{(x,y)}(v|A_{k}) and of the local conformity rank \pi_{n,k}(x,y) originally introduced in [61]. We make the assumption that \textrm{supp}(P_{X})=[0,1]^{d}, and that \mathcal{A}=\{A_{k},k=1,\dots,T\} is a finite partition of [0,1]^{d} consisting of cubes with sides of length w_{n}. This is a quite common technical assumption in conformal inference; we could instead assume, for instance, that X is such that \textrm{supp}(P_{X})\subset[-R,R]^{d} for some R>0. This only changes the constants appearing in Theorem 2, which remains true.

Let n_{k}=\sum_{i=1}^{n}\mathbbm{1}_{\{X_{i}\in A_{k}\}}. Given a sequence h_{n}\to 0, a kernel function K(\cdot):\mathbb{R}\to\mathbb{R}, and h_{n_{k}}, we define

\hat{p}(v|A_{k})=\frac{1}{n_{k}h_{n_{k}}^{\ell}}\sum_{i=1}^{n}\mathbbm{1}_{\{X_{i}\in A_{k}\}}K\Bigg(\frac{\|Y_{i}-v\|}{h_{n_{k}}}\Bigg). (4)

We aim to prove that \hat{p}(v|A_{k}) provides a uniform estimate of p(y|A_{k}) across v and k. To achieve this, we need to ensure that there are sufficiently many sample points in each A_{k}. This is guaranteed by Lemma 9 of [61], which states that if we choose w_{n}=(\log(n)/n)^{1/(d+2)}, then, with probability one, for all n large enough,

\forall k:\quad b_{1}nw_{n}^{d}/2\leq n_{k}\leq 3b_{2}nw_{n}^{d}/2. (5)

Recall that b_{1} and b_{2} were defined in Assumption H2. In what follows, we assume that n is sufficiently large to ensure (5).
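
A minimal sketch of (4), assuming the responses are given in ambient coordinates in \mathbb{R}^{D} and taking a Gaussian kernel (the normalizing constant is an illustrative choice):

```python
import numpy as np

def local_kde(v, X, Y, in_cell, h, ell):
    """Kernel estimate (4) of p(v | A_k).

    X : (n, d) covariates, Y : (n, D) responses embedded in R^D,
    in_cell : boolean mask of the points with X_i in A_k,
    h : bandwidth h_{n_k}, ell : intrinsic dimension of the manifold.
    The ambient Euclidean distance ||Y_i - v|| is used, as in (4).
    """
    nk = in_cell.sum()
    dist = np.linalg.norm(Y[in_cell] - v, axis=1)
    K = np.exp(-0.5 * (dist / h) ** 2) / (2 * np.pi) ** (ell / 2)
    return K.sum() / (nk * h ** ell)

# toy usage: responses on the unit circle (ell = 1), one partition cell
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
Y = np.c_[np.cos(theta), np.sin(theta)]
X = rng.uniform(0, 1, 200)
print(local_kde(np.array([1.0, 0.0]), X, Y, X < 0.5, h=0.3, ell=1))
```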

The corresponding augmented estimate, based on \aleph_{n}\cup\{(x,y)\}, is, for any (x,y)\in A_{k}\times\mathcal{M},

\hat{p}^{(x,y)}(v|A_{k})=\frac{n_{k}}{n_{k}+1}\hat{p}(v|A_{k})+\frac{1}{(n_{k}+1)h_{n_{k}}^{\ell}}K\Bigg(\frac{\|y-v\|}{h_{n_{k}}}\Bigg).

For any (X_{n+1},Y_{n+1})=(x,y)\in A_{k}\times\mathcal{M}, consider the following local conformity rank

\pi_{n,k}(x,y)=\frac{1}{n_{k}+1}\sum_{i=1}^{n+1}\mathbbm{1}_{\{X_{i}\in A_{k}\}}\mathbbm{1}_{\{\hat{p}^{(x,y)}(Y_{i}|A_{k})\leq\hat{p}^{(x,y)}(Y_{n+1}|A_{k})\}}. (6)

As proved in Proposition 2 of [61], the set

\hat{\mathbf{C}}_{n}(x)=\{y:\pi_{n,k}(x,y)\geq\alpha\}, (7)

has finite sample local validity, i.e., it satisfies (3). This finite sample local validity requires no assumptions on P beyond exchangeability of the sample.
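
Building on the local_kde sketch above (or any conditional density estimate), the augmented estimate, the rank (6), and a grid approximation of (7) can be written as follows; the finite candidate grid on \mathcal{M} is an assumption made for illustration only.

```python
import numpy as np

def conformity_rank(x_new, y_new, X, Y, in_cell, h, ell):
    """Local conformity rank pi_{n,k}(x, y) of (6), for x in the cell A_k."""
    nk = in_cell.sum()

    def p_aug(v):
        # augmented estimate based on the sample plus the candidate point (x, y)
        base = nk / (nk + 1) * local_kde(v, X, Y, in_cell, h, ell)
        extra = np.exp(-0.5 * (np.linalg.norm(y_new - v) / h) ** 2) \
                / ((2 * np.pi) ** (ell / 2) * (nk + 1) * h ** ell)
        return base + extra

    ref = p_aug(y_new)
    scores = np.array([p_aug(yi) for yi in Y[in_cell]])
    return (1 + np.sum(scores <= ref)) / (nk + 1)   # the "+1" counts (x, y) itself

def conformal_set(x_new, grid, X, Y, in_cell, h, ell, alpha):
    """Approximation of the set in (7) over a finite grid of points on M."""
    return [y for y in grid
            if conformity_rank(x_new, y, X, Y, in_cell, h, ell) >= alpha]
```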

The following result is the key theorem. Its proof is deferred to the Appendix. The proof follows the ideas used to prove Theorem 1 of [12]. It states that p(y|A_{k}) can be estimated uniformly by (4) for all y that are far enough from the boundary of \mathcal{M}. Additionally, this estimation can be made uniformly across all k. We assume, as in [12], that K is a Gaussian kernel. This restriction is, as in [12], purely technical. More recent references on kernel density estimation on manifolds are [9, 10, 15, 16, 6, 63].

This result is of fundamental importance in conformal prediction, as it provides a way to estimate the conditional density of Y given A_{k} for any k in a non-parametric way. It implies that the estimation error is uniform across all partitions A_{k}, which is a key requirement for conformal prediction methods. In particular, it allows us to construct conformal prediction regions that are valid with a given level of confidence.

Theorem 1.

Let \mathcal{M}\subset\mathbb{R}^{D} be a compact \ell-dimensional manifold satisfying H0. Let \mathcal{M}_{n}\subset\mathcal{M} be a sequence of closed sets, and let h_{n}\to 0 be a sequence of bandwidths such that nh_{n}^{\ell+3}/\log(n)\to\infty. We assume that c_{n}:=\inf_{y\in\mathcal{M}_{n}}\rho(y,\partial\mathcal{M})\to 0 is such that h_{n}/c_{n}\to 0 monotonically. Additionally, we assume that p(y|A_{k}) satisfies H5. Then, we have

\sup_{k}\sup_{y\in\mathcal{M}_{n_{k}}}|\hat{p}(y|A_{k})-p(y|A_{k})|=o(h_{\gamma_{n}}/c_{\gamma_{n}})\quad\text{a.s.}, (8)

where \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor.

The following theorem is the main result of our paper. It states that \hat{\mathbf{C}}_{n}(X_{i}) consistently estimates \mathbf{C}_{P}(X_{i}) uniformly for all X_{i}\in\aleph_{n}. The proof, which is deferred to the Appendix (see Subsection 9.1), is based on some ideas from the proof of Theorem 1 of [61]. Specifically, Lemma 1 and Lemma 3 are adapted from [61], while Lemma 2 is new and is required to adapt the proof of Lemma 3. Lemma 4 is the same as Lemma 8 of [61]. Finally, the proof of Theorem 2 uses the adapted lemmas and considers separately the sample points that are close to \partial\mathcal{M} and those that are far away from this boundary. The first set of points is shown to be negligible with respect to \nu using techniques from geometric measure theory.

Theorem 2.

Assume H0 to H5. Let \hat{\mathbf{C}}_{n}(x) be given by (7) and \mathbf{C}_{P}(x) be given by (1). Let w_{n}=(\log(n)/n)^{1/(d+2)} and \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor. Then, for any \lambda>0, there exists A_{\lambda}>0 such that, for n large enough,

\mathbb{P}\Big(\sup_{X:(X,Y)\in\aleph_{n}}\nu\big(\hat{\mathbf{C}}_{n}(X)\triangle\mathbf{C}_{P}(X)\big)>A_{\lambda}c_{\gamma_{n}}\Big)=\mathcal{O}(n^{-\lambda}), (9)

where c_{\gamma_{n}} is such that h_{\gamma_{n}}/c_{\gamma_{n}}^{2}\to 0 and h_{\gamma_{n}}\to 0 is such that \gamma_{n}h_{\gamma_{n}}^{\ell+3}/\log(\gamma_{n})\to\infty.

Remark 1.

It can be easily seen that, up to logarithmic factors, the rate of c_{\gamma_{n}} is n^{-1/((\ell+3)(d+2))}.
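
For completeness, the following heuristic computation (ignoring logarithmic factors) lies behind Remark 1: taking h_{\gamma_{n}} as small as the condition \gamma_{n}h_{\gamma_{n}}^{\ell+3}/\log(\gamma_{n})\to\infty allows, and c_{\gamma_{n}} as small as h_{\gamma_{n}}/c_{\gamma_{n}}^{2}\to 0 allows,

\gamma_{n}\asymp nw_{n}^{d}\asymp n^{\frac{2}{d+2}},\qquad h_{\gamma_{n}}\asymp\gamma_{n}^{-\frac{1}{\ell+3}}\asymp n^{-\frac{2}{(\ell+3)(d+2)}},\qquad c_{\gamma_{n}}\asymp h_{\gamma_{n}}^{1/2}\asymp n^{-\frac{1}{(\ell+3)(d+2)}}.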

5 Non-compact manifolds

In this section, we discuss how to remove the compactness assumption imposed in Theorem 2, while retaining all the other hypotheses. We will prove that the set \hat{\mathbf{C}}_{n}(X_{n+1}) satisfies asymptotic conditional validity (see [30]). In other words, there exist random sets \Lambda_{n} such that

\mathbb{P}\bigl(X_{n+1}\in\Lambda_{n}|\Lambda_{n}\bigr)=1-o(1),

and

\sup_{x_{n+1}\in\Lambda_{n}}\Bigl|\mathbb{P}\bigl(Y_{n+1}\in\hat{\mathbf{C}}_{n}(X_{n+1})\,\big|\,X_{n+1}=x_{n+1}\bigr)-(1-\alpha)\Bigr|=o(1).

According to Theorem 6 in [30], to obtain asymptotic conditional validity it suffices to prove that

\mathbb{P}\bigl(Y_{n+1}\in\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\bigr)=o(1). (10)
Theorem 3.

Assume H0 to H5. Let \hat{\mathbf{C}}_{n}(x) be given by (7) and \mathbf{C}_{P}(x) by (1). Then, \hat{\mathbf{C}}_{n}(X_{n+1}) is asymptotically conditionally valid.

It is worth mentioning that asymptotic conditional validity is considerably weaker than (9) when p(y|x) is bounded from above.

6 Computational aspects

A different method for estimating the conformal region is introduced in [30]. Instead of using a partition of \mathbb{R}^{d} into cubes to obtain a partition of X_{1},\dots,X_{n} on which to build kernel-based estimators, this approach employs a new partition constructed as follows. Consider, as in [30], the functions

H(z|x)=\int_{\{y:p(y|x)\leq z\}}p(y|x)\,d\nu(y),\qquad\hat{H}(z|x)=\int_{\{y:\hat{p}(y|x)\leq z\}}\hat{p}(y|x)\,d\nu(y),

where \hat{p}(y|x) is any estimator of p(y|x), not necessarily kernel-based. Define the conditional \alpha-quantile of Y|X by q_{\alpha}(x)=H^{-1}(\alpha|x). Similarly, let \hat{q}_{\alpha}(x)=\hat{H}^{-1}(\alpha|x) be the estimate of the conditional \alpha-quantile of Y|X.

Let \mathcal{F} be a partition of \mathbb{R}^{+}. Then we create a partition \mathcal{A}=\{A_{1},\dots,A_{T}\} of X_{1},\dots,X_{n} by assigning X_{i} and X_{j} to the same set A_{r}\in\mathcal{A} if and only if \hat{q}_{\alpha}(X_{i}) and \hat{q}_{\alpha}(X_{j}) lie in the same element of \mathcal{F}. We then build the kernel-based estimator (4) according to this new partition. This approach is particularly suitable for high-dimensional feature spaces.
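
A minimal sketch of this partitioning step, assuming the estimated conditional \alpha-quantiles \hat{q}_{\alpha}(X_{i}) have already been computed; the equal-width binning of \mathbb{R}^{+} is an illustrative choice of \mathcal{F}.

```python
import numpy as np

def quantile_partition(q_hat, n_bins):
    """Group the X_i according to which element of a partition F of R^+ the
    estimated conditional quantile q_hat(X_i) falls into (equal-width bins)."""
    edges = np.linspace(q_hat.min(), q_hat.max(), n_bins + 1)
    labels = np.clip(np.digitize(q_hat, edges[1:-1]), 0, n_bins - 1)
    return labels   # labels[i] == labels[j]  iff  X_i and X_j share a cell A_r

# toy usage with placeholder quantile estimates
rng = np.random.default_rng(2)
q_hat = rng.gamma(2.0, 1.0, size=50)
print(quantile_partition(q_hat, n_bins=4))
```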

We propose either the set \hat{\mathbf{C}}_{n}(x) (as defined in (7)) or one of the algorithms presented in [30]. According to Theorem 25 in [30], to establish the asymptotic conditional validity of this proposal (in the sense of (10)), it is necessary to assume some smoothness of H(z,x) (see Assumption 23 in [30]), as well as the compactness of \mathcal{M} and the existence of sequences \eta_{n}=o(1) and \rho_{n}=o(1) satisfying

\mathbb{P}\Biggl(\mathbb{E}\Bigg[\sup_{y\in\mathcal{M}}\bigl(\hat{p}(y|X)-p(y|X)\bigr)^{2}\;\Big|\;\hat{p}\Bigg]\;\geq\;\eta_{n}\Biggr)\;\leq\;\rho_{n}.

This condition indeed holds in our case; however, proving it in detail would require rewriting the proof of Theorem 1 in [12]. We leave this verification to the reader. As in the non-compact case, the consistency result obtained is considerably weaker than (9).

Since the computational burden of approximating \hat{\mathbf{C}}_{n}(x) is high, [61] proposes replacing that set with \mathbf{C}_{n}^{+}(x), see (11), which, by including \hat{\mathbf{C}}_{n}(x), has coverage of at least 1-\alpha. Let A_{x}\in\mathcal{A} be the element that contains x. The construction proceeds in the three steps below (a code sketch follows the list).

  1. Let \hat{p}(x,y) be the joint density estimator based on the subsample (of cardinality n_{x}) of points for which X_{i}\in A_{x}.

  2. Let Z_{i}=(X_{i},Y_{i}) for i=1,\dots,n_{x}, and let Z_{(1)},Z_{(2)},\dots,Z_{(n_{x})} denote the sample ordered increasingly by \hat{p}(X_{i},Y_{i}).

  3. Let j=\lfloor n_{x}\alpha\rfloor and define

     \mathbf{C}_{n}^{+}(x)\;=\;\Bigl\{\,y:\;\hat{p}(x,y)\;\geq\;\hat{p}\bigl(X_{(j)},Y_{(j)}\bigr)\;-\;\frac{K(0)^{2}}{n_{x}\,h^{d+1}}\Bigr\}. (11)
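
A sketch of the three steps above, assuming a user-supplied joint density estimator and a finite grid of candidate points on \mathcal{M}; the value K(0) for a univariate Gaussian kernel is used purely for illustration.

```python
import numpy as np

def c_n_plus(x_new, X, Y, in_cell, joint_density, h, d, alpha, grid):
    """Grid approximation of the relaxed set C_n^+(x) of (11).

    in_cell : boolean mask of the sample points with X_i in A_x,
    joint_density(x, y) : any estimator p_hat of the joint density,
    h : bandwidth, d : covariate dimension, grid : candidate y values on M.
    """
    nx = in_cell.sum()
    scores = np.sort([joint_density(xi, yi)
                      for xi, yi in zip(X[in_cell], Y[in_cell])])
    j = max(int(np.floor(nx * alpha)) - 1, 0)   # 0-based index of the floor(nx*alpha)-th order statistic
    K0 = 1.0 / np.sqrt(2 * np.pi)               # K(0) for a Gaussian kernel (illustrative)
    slack = K0 ** 2 / (nx * h ** (d + 1))
    return [y for y in grid if joint_density(x_new, y) >= scores[j] - slack]
```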

7 Simulation examples

7.1 Toy model: Output on the sphere

We consider a regression model with output variable denoted by Y_{i} defined on the unit sphere S^{2}. The input variable X_{i} takes values in the interval [-1,1]. The model is given by the following probability distribution:

Y_{i}\sim\mathbb{F}\left(\frac{\eta+\beta X_{i}}{\|\eta+\beta X_{i}\|},\kappa\right),\quad i=1,2,\ldots,400.

Here, \mathbb{F} denotes the von Mises–Fisher distribution (see [45]), and X_{i}\sim\textrm{U}(-1,1) is an i.i.d. sample. For the simulation study, we set \eta=(1,0,0), \beta=(0,0,1), and \kappa=200.

To estimate the kernel density, we use the estimator given by Equation (4) with a bandwidth parameter h=0.5. Additionally, we partition the data into intervals A_{k}=\left(-1+(k-1)/2,-1+k/2\right), where k=1,\ldots,4.
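
A minimal simulation sketch of this data-generating mechanism, assuming a SciPy version providing scipy.stats.vonmises_fisher; the resulting sample can then be fed to the grid approximation of (7) sketched in Section 4.

```python
import numpy as np
from scipy.stats import vonmises_fisher   # assumes SciPy >= 1.11

rng = np.random.default_rng(3)
n = 400
eta, beta, kappa = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), 200.0

X = rng.uniform(-1, 1, n)
mu = eta + np.outer(X, beta)
mu /= np.linalg.norm(mu, axis=1, keepdims=True)     # mean directions on S^2
Y = np.array([np.asarray(vonmises_fisher(m, kappa).rvs(1, random_state=rng)).reshape(-1)
              for m in mu])

# index k of the interval A_k = (-1 + (k-1)/2, -1 + k/2) containing each X_i
k = np.minimum((X + 1) // 0.5, 3).astype(int) + 1
```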

Using the proposed method, we obtain a 90\% confidence set \hat{\mathbf{C}}_{400}(0), shown in green in Figure 1. In this particular case, the unobserved output was Y=(1,0,0), which is illustrated in purple in Figure 1. Points belonging to the boundary of the theoretical confidence set (1) for \alpha=0.1 are displayed in blue.

Figure 1: Red: a sample Y_{1},\ldots,Y_{400} on S^{2}. Green: the estimated 90\%-confidence set for Y if x=0. Purple: the point prediction for x=0, which is y=(1,0,0). Blue dots belong to the boundary of the theoretical confidence set (1) for \alpha=0.1.

7.2 Output in a Stiefel manifold.

It is common to summarize information from a set of variables using Principal Component Analysis (PCA) or Factor Analysis. This approach is frequently employed in constructing indices within the social sciences, as demonstrated by Vyas (2006). Subsequently, researchers often attempt to explain these indices using other covariates. To illustrate this scenario, we provide a ‘toy example’ using simulations.

We will define a regression model whose output lies on the Stiefel Riemannian manifold SO(3,2), represented by pairs of orthonormal vectors in \mathbb{R}^{3}. The construction of the regression is as follows.

The input variable, denoted by X, is uniformly distributed on the interval [0,1]. For each value of X, we sample 400 points from a 3-dimensional Gaussian distribution centered at the origin (0,0,0). The covariance matrix of this Gaussian distribution has eigenvalues 2, 1, and 1/2, along with the corresponding eigenvectors (2+X,2+2X,2-X), (-2+X,2-3X,-2+X), and (1,0,0), respectively.

After generating these 400 points for each value of X, the output Y\in SO(3,2) is obtained by applying a PCA to these points and retaining only the first two principal directions. To conduct the simulation analysis, we sample 500 points, denoted by X_{1},\dots,X_{500}.

In Figure 2, we display a sample of 500 output values, represented as gray arcs. The specific output value y_{0.1}\in SO(3,2), corresponding to the input X=0.1, is depicted as a purple dotted arc.

To estimate the density, we use the estimator given by Equation (4), with a bandwidth parameter h=1. Additionally, we divide the data into intervals A_{k}=\left((k-1)/5,k/5\right), where k=1,\ldots,5. It is important to note that the SO(3,2) data are embedded in \mathbb{R}^{6} (D=6), and we consider the Euclidean distance in this space. The dimension of the submanifold in this case is \ell=3.

To visualize the confidence set, we draw 10000 points within SO(3,2). The points falling within the confidence set are highlighted in orange in Figure 2.

Figure 2: A uniform sample of 10000 points in the 95\% confidence set (orange arcs) for Y\in SO(3,2) if x=0.1. The sample value for x=0.1 is the 3\times 2 matrix y_{x} whose first column is (0.62,0.5,0.6) and second column is (-0.3,0.86,-0.44), which is depicted in purple. The observations in the output sample are represented by gray arcs.

8 Real-data examples

8.1 Example 1: Regression between cylindrical manifolds with application to wind modeling

In this study we address a problem of substantial practical relevance for the development of wind energy infrastructure in Uruguay. Specifically, the goal is to construct prediction sets for wind intensity R and direction \theta at a meteorological station located in Montevideo, based on simultaneous observations from a nearby station in Maldonado. These variables together form a point on a cylinder \mathcal{D}=S^{1}\times\mathbb{R}^{+}\subset\mathbb{R}^{3}, where S^{1} represents the directional component (wind angle) and \mathbb{R}^{+} the non-negative intensity. Thus, the regression problem consists of learning a mapping from one cylindrical manifold to another.

This task is not only of theoretical interest due to the non-Euclidean structure of the input and output spaces, but also of strategic importance for national energy planning. Accurate prediction of wind behavior at candidate sites is a critical step in the optimal placement of wind turbines, which are a major source of renewable energy in Uruguay’s energy matrix.

We focus on moderate and extreme wind regimes (within the range 0 to 20 m/s), using hourly meteorological data recorded during the month of July between 2008 and 2016. The data were collected at two locations: the Laguna del Sauce Weather Station in Maldonado (Station 1) and the Carrasco International Airport Weather Station in Montevideo (Station 2), which are approximately 84 kilometers apart. To ensure reliability, only time points for which both stations reported valid measurements were retained, resulting in a total of 5,362 matched observations.

Our objective is to construct conformal prediction sets for wind conditions at Station 2 given observations from Station 1. These sets aim to quantify predictive uncertainty while respecting the cylindrical geometry of the variables involved, offering robust tools for risk assessment and decision-making in wind farm siting.

Figures 3 and 4 show that there is a correlation in both wind direction and intensity between the two stations. The \xi-correlation, see [11], between the intensities is 0.45. In the case of the directions, the angular correlation is 0.63. The angular correlation is the linear correlation between the variables \sin\left(\theta_{1}-\bar{\theta}_{1}\right) and \sin\left(\theta_{2}-\bar{\theta}_{2}\right), see [31].
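
The angular correlation used above can be computed as follows; here \bar{\theta} is taken to be the circular mean direction, which is an assumption on the centering used in [31].

```python
import numpy as np

def angular_correlation(theta1, theta2):
    """Linear correlation between sin(theta1 - mean direction) and
    sin(theta2 - mean direction), cf. [31]."""
    def mean_direction(theta):
        return np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())
    s1 = np.sin(theta1 - mean_direction(theta1))
    s2 = np.sin(theta2 - mean_direction(theta2))
    return np.corrcoef(s1, s2)[0, 1]

# sanity check: a constant angular shift gives correlation equal to 1
t = np.random.default_rng(4).uniform(0, 2 * np.pi, 100)
print(angular_correlation(t, t + 0.3))
```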

Figure 3: Scatter plot between the angular directions (in radians) of the winds recorded at Station 1 and Station 2 in the months of July between 2008 and 2016.
Figure 4: Scatter plot between the intensities (in m/s) of the winds recorded at Station 1 and Station 2 in the months of July between 2008 and 2016.

In this example, the partition is

A_{i,j}=\Big\{(\theta,R)\in\mathcal{D}:\theta\in[i\pi/4,(i+1)\pi/4]\textrm{ and }R\in[4j,4(j+1)]\Big\}

with i=0,1,\ldots,7 and j=0,1,\ldots,4. For the kernel density estimator we choose h_{k}=0.4.
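
A small helper, assuming angles in [0,2\pi) and intensities below 20 m/s, that returns the indices (i,j) of the cell A_{i,j} containing an observation:

```python
import numpy as np

def cylinder_cell(theta, R):
    """Indices (i, j) of the partition element A_{i,j} containing (theta, R):
    angular sectors of width pi/4 and intensity bands of width 4 m/s."""
    i = int(theta // (np.pi / 4)) % 8
    j = min(int(R // 4), 4)
    return i, j

print(cylinder_cell(2.3, 5.1))   # cell containing the (2.3 rad, 5.1 m/s) observation from Station 1
```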

Figure 5 shows the confidence set (at 80\%) obtained by our method for (\theta_{1},R_{1})=(2.3\textrm{ radians},5.1\textrm{ m/s}), recorded at Station 1 on 2008-07-01 at 7 pm. On the same date and time, the data recorded at Station 2 were (\theta_{2},R_{2})=(2.4\textrm{ radians},6.6\textrm{ m/s}) (point depicted in purple in Figure 5).

Figure 5: Winds (not exceeding 20 m/s) at Station 2 in the months of July between 2008 and 2016 (red). The green area is the 80\% confidence set for the data (2.3\textrm{ radians},5.1\textrm{ m/s}) recorded at Station 1 on 2008-07-01 at 7 pm. The purple dot represents the wind recorded at the same time at Station 2, (\theta_{2},R_{2})=(2.4\textrm{ radians},6.6\textrm{ m/s}).

8.2 Example 2: Regression on the simplex with application to multiclass classification

Regression models whose outputs lie in the (d-1)-dimensional unit simplex \Delta^{d-1} play an essential role in many applied contexts (see, for example, [3, 48]). In supervised classification, a variety of machine-learning algorithms produce vectors of class-membership probabilities, i.e., elements of \Delta^{d-1}, which we denote by Y_{i}. Here, each Y_{i} is obtained by using Random Forest (RF). Because these vectors reside in \Delta^{d-1}, their components naturally estimate the probability of membership in each class. The final class prediction is then chosen as the one with the highest estimated probability, equivalent to identifying the Voronoi cell of the simplex in which Y_{i} falls (see Figure 6).

Our goal is to construct a 100(1-\alpha)\% confidence region for observations on the simplex \Delta^{d-1} by adapting Algorithm 1 from [30] (CD-split) to this setting. In particular, we follow [29] to estimate the conditional density, employing an appropriate Fourier-type basis on the simplex, such as the Bernstein polynomial basis [19, 23].

The intersection of this confidence region with the Voronoi tessellation of \Delta^{d-1} yields the conformal class set, i.e., the subset of classes that are statistically consistent with the observation at level \alpha. Moreover, this approach provides insight into how the predictive uncertainty is distributed among the classes, by analyzing the proportion of the confidence region that falls within each Voronoi cell.
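
A sketch of this last step: the Voronoi cell of a probability vector on the simplex (with the class vertices as sites) is determined by its largest entry, so the decomposition of a conformal region among classes reduces to counting which cell each point of the region falls in. The region points and class names below are illustrative.

```python
import numpy as np

def class_decomposition(region_points, class_names):
    """Fraction of a conformal region on the simplex lying in each Voronoi
    cell; the cell of a probability vector is given by its largest component
    (equivalently, its nearest simplex vertex)."""
    cells = np.argmax(region_points, axis=1)
    return {c: float(np.mean(cells == k)) for k, c in enumerate(class_names)}

# toy usage with a few hypothetical points of a conformal region on Delta^2
region = np.array([[0.85, 0.05, 0.10], [0.70, 0.05, 0.25], [0.45, 0.10, 0.45]])
print(class_decomposition(region, ["bus", "opel", "van"]))
```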

Figure 6: Test set predictions for vehicle classification. Colors denote the true class labels, and Voronoi regions indicate the predicted class according to Random Forest.

An application of the proposed methodology is carried out on a real-world dataset, where the covariate space is 5-dimensional and the response variable lies on \Delta^{2}, representing class probabilities obtained, as mentioned above, by means of RF. To this end, we consider the publicly available Vehicle dataset from the mlbench package in R [42].

The Vehicle dataset originates from a benchmark study conducted at the Turing Institute in the early 1990s. It was designed to evaluate the performance of classification algorithms in distinguishing between different types of road vehicles based on geometric features extracted from digitized images of their silhouettes. Each observation in the dataset corresponds to a vehicle image and is characterized by 5 numerical covariates, such as aspect ratio, edge count, and moment-based shape descriptors.

The categorical response variable indicates the type of vehicle. For the purposes of this study, we restrict attention to a subset of the data containing three of the four classes (bus, opel, and van), so that the associated output probabilities lie on \Delta^{2}. This structure is ideal for illustrating our simplex-based predictive modelling approach. The dataset contains a total of 628 observations, and its moderate size, combined with its multi-class nature, makes it well-suited for benchmarking methods involving compositional or probabilistic responses.

The dataset is randomly split into two equal parts: 50% of the observations are used for training the model, and the remaining 50% are reserved as a test set for evaluating predictive performance.

Figure 7 displays the test data projected onto \Delta^{2}. Each point corresponds to a predicted probability vector, and its color indicates the true class label. The predicted class for a point is given by the Voronoi cell that contains it.

We assess confidence levels of 90\% and 95\% for the covariate vector \mathbf{x}_{n+1}=(84,45,66,150,65). The corresponding class probabilities predicted by the Random Forest are \mathbf{y}_{n+1}=(0.845,0.01,0.145), highlighted in purple in Figure 7. The resulting class band, restricted to the “bus” and “van” categories, offers a detailed decomposition of predictive uncertainty by indicating the fraction of the confidence region attributable to each class. At the 95\% confidence level, approximately 93\% of the band resides within the “bus” region, while the remaining 7\% falls within the “van” region.

Figure 7: Left panel: 90\% conformal prediction region; right panel: 95\% conformal prediction region for y_{n+1}=(0.845,\ 0.01,\ 0.145) (shown in green). The background colors represent the Voronoi regions of the predicted classes over the 2-dimensional simplex \Delta^{2}. Classification was performed using a Random Forest model.

9 Conclusions and future work

We work within the conformal inference framework, where the response variable takes values on a Riemannian manifold while the covariates lie in \mathbb{R}^{d}. Under assumptions analogous to those in [61], we extend their results to show that the conditional oracle set \mathbf{C}_{P}(x) can be estimated consistently. A central tool in our analysis is the almost sure, uniform consistency of the classical kernel density estimator on manifolds, as proved in [12]. We also cover the cases of manifolds with boundary and non-compact manifolds, and we discuss computational strategies for handling high-dimensional input data.

Our theoretical contributions are demonstrated through two simulation studies and two real-world data examples.

Finally, one could explore alternative nonconformity scores beyond the KDE-based criterion we employ—for example, quantile-based or regression-based scores—each of which demands a nontrivial adaptation of the Euclidean theory to the manifold setting.

Appendix

Proof of Theorem 1.

Since n_{k}h_{n_{k}}^{\ell+1}/\log(n_{k})\to\infty, from Theorem 1 of [12] we have that, for all \epsilon>0, with probability one, for all n large enough,

\frac{c_{n_{k}}}{h_{n_{k}}}\sup_{y\in\mathcal{M}_{n_{k}}}|\hat{p}(y|A_{k})-p(y|A_{k})|\leq\epsilon. (12)

Note that the index n_{0} for which (12) holds depends on the choice of k. However, from (5), we know that with probability one, n_{k}\geq\gamma_{n} for all k when n is sufficiently large. Since h_{n}/c_{n}\to 0 monotonically, it follows that for all k, h_{n_{k}}/c_{n_{k}}\leq h_{\gamma_{n}}/c_{\gamma_{n}}. Hence, we can conclude that (8) holds for all k. ∎

9.1 Proof of Theorem 2

The proof of Theorem 2 is based on some technical lemmas. To state them, let us define L_{x}(t) as the set of y such that p(y|x)\geq t, and L_{x}^{l}(t) as the set of y such that p(y|x)\leq t. We also define \hat{L}_{x}(t) and \hat{L}_{k}^{l}(t) as the corresponding sets for \hat{p}(\cdot|A_{k}), given by (4).

Lemma 1.

Under the hypotheses of Theorem 1, assume also H0 to H3. Let R_{n}(x)=\sup_{y\in\mathcal{M}_{\gamma_{n}}}|\hat{p}(y|A_{k})-p(y|x)|. Then, with probability one, for n sufficiently large, we have

\sup_{x\in\textrm{supp}(P_{X})}R_{n}(x)\leq h_{\gamma_{n}}/c_{\gamma_{n}}+Lw_{n}\sqrt{d}, (13)

where \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor and \inf_{x\in\mathcal{M}_{\gamma_{n}}}\rho(x,\partial\mathcal{M})=c_{\gamma_{n}}.

Proof.

From (5), we can assume that n is large enough so that for all k, b_{1}nw_{n}^{d}\leq n_{k}\leq b_{2}nw_{n}^{d}. From (8), we obtain, with probability one, for n large enough,

\sup_{y\in\mathcal{M}_{\gamma_{n}}}|\hat{p}(y|A_{k})-p(y|A_{k})|\leq h_{n_{k}}/c_{n_{k}}. (14)

Using assumption H3 and (5), we have

\sup_{x\in\textrm{supp}(P_{X})}R_{n}(x)\leq\sup_{x\in\textrm{supp}(P_{X})}\sup_{y\in\mathcal{M}_{\gamma_{n}}}|\hat{p}(y|A_{k})-p(y|x)|
\leq\sup_{x\in\textrm{supp}(P_{X})}\sup_{y\in\mathcal{M}_{\gamma_{n}}}|\hat{p}(y|A_{k})-p(y|A_{k})|+\sup_{x\in\textrm{supp}(P_{X})}\sup_{y\in\mathcal{M}_{\gamma_{n}}}|p(y|A_{k})-p(y|x)|
\leq h_{n_{k}}/c_{n_{k}}+L\sup_{x^{\prime}\in A_{k}}|x-x^{\prime}|
\leq h_{\gamma_{n}}/c_{\gamma_{n}}+Lw_{n}\sqrt{d},

where the last inequality follows from (5) and the fact that

\inf_{x\in\mathcal{M}_{\gamma_{n}},\,y\in\mathcal{M}}|x-y|\geq\rho(\mathcal{M}_{\gamma_{n}},\partial\mathcal{M}).

Thus, we have shown (13). ∎

Lemma 2.

Assume H0 to H4. Let w_{n}=(\log(n)/n)^{1/(d+2)}, and assume that \gamma_{n} and h_{n} are as in Lemma 1. Then, there exists C>0 such that, with probability one, for n large enough,

\sup_{k}|\hat{p}(\cdot|A_{k})|_{\infty}\leq C.
Proof.

Recall that we are assuming K to be a Gaussian kernel. We assume that n is large enough so that (5) holds. Following the proof of Theorem 1 of [12], we define

W_{j}(y)=(1/(n_{k}h_{n_{k}}^{\ell}))\mathbbm{1}_{\{X_{j}\in A_{k}\}}K(|Y_{j}-y|/h_{n_{k}}),

V_{j}(y)=W_{j}(y)-\mathbb{E}(W_{j}(y)), and S_{n}(y)=\sum_{j=1}^{n}V_{j}(y). Let \eta>\ell+1 and consider a covering B(p_{1},h^{\eta}),\dots,B(p_{l},h^{\eta}) of \mathcal{M}. Then,

\sup_{y}|S_{n}(y)|\leq\max_{1\leq j\leq l}\sup_{y\in B(p_{j},h^{\eta})}\frac{1}{n_{k}h_{n_{k}}^{\ell}}|S_{n}(y)-S_{n}(p_{j})|+\max_{1\leq j\leq l}\frac{1}{n_{k}h_{n_{k}}^{\ell}}|S_{n}(p_{j})|=I_{1}+I_{2}. (15)

Since K is Lipschitz, we have I_{1}\leq ch^{\eta-\ell-1} for some constant c>0. Applying Bernstein’s inequality to S_{n}(p_{j}) for fixed p_{j} and A_{k}, we get

\mathbb{P}\Big(\frac{1}{n_{k}h_{n_{k}}^{\ell}}|S_{n}(p_{j})|>\epsilon\Big)\leq c_{1}\exp(-c_{2}n_{k}h_{n_{k}}^{\ell}\epsilon^{2}),

where c_{1} and c_{2} are positive constants.

Then, if \kappa_{n}=\lfloor 3b_{2}nw_{n}^{d}\rfloor,

\mathbb{P}\Big(\sup_{k}\sup_{y}|S_{n}(y)|>\epsilon\Big)\leq c_{3}w_{n}^{-d}\exp(-c_{3}nw_{n}^{d}h_{\kappa_{n}}^{\ell}).

By the Borel–Cantelli lemma, it follows that \sup_{k}\sup_{y}|S_{n}(y)|\to 0 almost surely.

Next, we will bound \sup_{y}\mathbb{E}(V_{j}(y)). First, we bound 1/h_{n_{k}}^{\ell}\leq 1/h_{\gamma_{n}}^{\ell},

\mathbb{E}(V_{j}(y))=\mathbb{E}[\mathbb{E}(V_{j}(y)|X)]\leq\int_{[0,1]^{d}}\int_{\mathcal{M}}\frac{1}{h_{\gamma_{n}}^{\ell}}K(|z-y|/h_{\kappa_{n}})p(z|x)\nu(dz)p_{X}(x)dx.

Since p_{X}(x) and p(z|x) are bounded for all z and x, it is enough to bound from above

\sup_{y}\int_{\mathcal{M}}\frac{1}{h_{\gamma_{n}}^{\ell}}K(|z-y|/h_{\gamma_{n}})\nu(dz),

which is bounded because we assumed that K is a Gaussian kernel. ∎

The following lemma is a slightly modified version of Lemma 6 of [61].

Lemma 3.

Assume H0 to H5. Let \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor, h_{\gamma_{n}}\to 0, and let \mathcal{M}_{\gamma_{n}}\subset\mathcal{M} be a sequence of closed sets such that \inf_{x\in\mathcal{M}_{\gamma_{n}}}\rho(x,\partial\mathcal{M})/h_{\gamma_{n}}\to\infty. Then, for any \lambda>0, there exists \xi_{2,\lambda} such that, for n large enough,

\mathbb{P}\Bigg(\sup_{x\in\aleph_{n}}V_{n}(x)>\xi_{2,\lambda}w_{n}\Bigg)=\mathcal{O}(n^{-\lambda}), (16)

where V_{n}(x)=\sup_{t\geq t_{0}}|\hat{P}(L_{x}^{l}(t)|A_{k})-P(L_{x}^{l}(t)|x)|. Here, \hat{P}(\cdot|A_{k}) is the empirical distribution of Y|X\in A_{k}, t_{0} is given in H4, and w_{n}=(\log(n)/n)^{1/(d+2)}.

Proof.

We will provide a sketch of the proof of this lemma, since it follows essentially the same idea used to prove Lemma 6 of [61].

Let x\in A_{k} be fixed. Note that \{L_{x}^{l}(t):t\geq t_{0}\} is a nested class of sets with Vapnik–Chervonenkis dimension 2. Then, for all B>0 and \lambda>0,

\mathbb{P}\Bigg(\sup_{t}\Big|\hat{P}(L_{x}^{l}(t)|A_{k})-P(L_{x}^{l}(t)|A_{k})\Big|>B\sqrt{w_{n}}\Bigg)=\mathcal{O}(n^{-\lambda}),

where we used that w_{n}=(\log(n)/n)^{1/(d+2)}. On the other hand,

|P(L_{x}^{l}(t)|A_{k})-P(L_{x}^{l}(t)|x)|\leq Lw_{n}\nu(L_{x}(t))\sqrt{d}\leq Lw_{n}\nu(L_{x}(t_{0}))\sqrt{d}\leq L\nu(\mathcal{M})\sqrt{d}\,w_{n}.

Let x^{\prime}\in A_{k}. Then

|\hat{P}(L^{l}_{x^{\prime}}(t)|A_{k})-P(L^{l}_{x^{\prime}}(t)|x^{\prime})|\leq|\hat{p}(\cdot|A_{k})|_{\infty}\nu(L_{x}^{l}(t)\triangle L^{l}_{x^{\prime}}(t))+V_{n}(x)+|G_{x}(t)-G_{x^{\prime}}(t)|,

where G_{x}(t)=P(L_{x}^{l}(t)|x). From (30) in [61],

|G_{x}(t)-G_{x^{\prime}}(t)|\leq c_{3}w_{n}^{1\wedge\gamma}.

Here, c_{3} is a positive constant and \gamma is as in H4. From (29) in [61], \nu(L_{x}^{l}(t)\triangle L^{l}_{x^{\prime}}(t))\leq c_{4}w_{n}^{\gamma} for some positive constant c_{4}. Lastly, \sup_{k}|\hat{p}(\cdot|A_{k})|_{\infty} is bounded, for n large enough, by Lemma 2. ∎

We recall Lemma 8 of [61].

Lemma 4.

Fix \alpha>0, t_{0}>0 and \epsilon_{0}>0. Suppose that p is a density function that satisfies H4, and \hat{p} is an estimator such that |\hat{p}-p|_{\infty}<v_{1}. Let \hat{P} be a probability measure satisfying \sup_{t\geq t_{0}}|\hat{P}(L^{l}(t))-P(L^{l}(t))|<v_{2}. Define

\hat{t}^{\alpha}=\inf\{t\geq 0:\hat{P}(\hat{L}^{l}(t))\geq\alpha\}.

Assume that v_{1} and v_{2} are sufficiently small so that v_{1}+c_{1}^{-1/\gamma}v_{2}^{1/\gamma}\leq t^{\alpha}-t_{0} and c_{1}^{-1/\gamma}v_{2}^{1/\gamma}\leq\epsilon_{0}, where c_{1} and \gamma are the constants given in H4. Then

|\hat{t}^{\alpha}-t^{\alpha}|\leq v_{1}+c_{1}^{-1/\gamma}v_{2}^{1/\gamma}.

Moreover, for any \tilde{t}^{\alpha} such that |\tilde{t}^{\alpha}-\hat{t}^{\alpha}|\leq v_{3}, if 2v_{1}+c_{1}^{-1/\gamma}v_{2}^{1/\gamma}+v_{3}\leq\epsilon_{0}, then there are constants \xi_{1},\xi_{2} and \xi_{3} such that \nu(\hat{L}(\tilde{t}^{\alpha})\triangle L(t^{\alpha}))\leq\xi_{1}v_{1}+\xi_{2}v_{2}+\xi_{3}v_{3}.

Proof of Theorem 2

In what follows we assume that n is sufficiently large to satisfy (5). We will consider a sequence of compact sets \mathcal{M}_{n}\subset\mathcal{M} such that \inf_{x\in\mathcal{M}_{n}}\rho(x,\partial\mathcal{M})=c_{n}>0 and \sup_{x\in\partial\mathcal{M}}\rho(x,\mathcal{M}_{n})\leq 2c_{n}, where c_{n} is chosen such that h_{\gamma_{n}}/c_{\gamma_{n}}^{2}\to 0.

Throughout the proof we denote by a a generic positive constant. Write

\nu(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x))=\nu\big(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap\mathcal{M}_{\gamma_{n}}\big)+\nu\Big(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap(\mathcal{M}\setminus\mathcal{M}_{\gamma_{n}})\Big),

where \gamma_{n} is as in Lemma 3 and \mathcal{M}_{\gamma_{n}}\subset\mathcal{M} is a sequence of closed sets such that \inf_{x\in\mathcal{M}_{\gamma_{n}}}\rho(x,\partial\mathcal{M})=c_{\gamma_{n}} satisfies h_{\gamma_{n}}/c_{\gamma_{n}}\to 0.

We will first prove that there exists a>0 such that

\nu(\mathcal{M}\setminus\mathcal{M}_{\gamma_{n}})\leq ac_{\gamma_{n}}. (17)

Since \mathcal{M} is \mathcal{C}^{2}, it has positive reach, denoted by \tau_{\mathcal{M}}, as shown in Proposition 14 of [55]. Let A_{r}=\{x\in\mathcal{M}:\rho(x,\partial\mathcal{M})<r\}. From

\sup_{x\in\partial\mathcal{M}}\rho(x,\mathcal{M}_{n})\leq 2c_{n},

it follows that \mathcal{M}\setminus\mathcal{M}_{\gamma_{n}}\subset A_{2c_{\gamma_{n}}}.

We write m_{n}=2\tau_{\mathcal{M}}\sin(c_{\gamma_{n}}/\tau_{\mathcal{M}}), and from Proposition A.1 in [1], we know that \Upsilon_{\gamma_{n}}:=\{B(x,m_{n}):x\in\partial\mathcal{M}\} covers A_{2c_{\gamma_{n}}}. Since \partial\mathcal{M} has finite Minkowski content due to its positive reach (see Corollary 3 of [4]), and \sin(x)\approx x when x\to 0, there exists a>0 such that |\Upsilon_{\gamma_{n}}|/c_{\gamma_{n}}^{D-\ell-1}\leq a for all sufficiently large n, where |\Upsilon_{\gamma_{n}}| is the D-dimensional Lebesgue measure of \Upsilon_{\gamma_{n}}.

Thus, A_{2c_{\gamma_{n}}} can be covered by at most ac_{\gamma_{n}}^{-\ell+1} balls of radius 2m_{n} centered at \partial\mathcal{M}, and the \nu-measure of each of these balls is bounded from above by Bc_{\gamma_{n}}^{\ell} by Corollary 1 of [2]. From this, (17) follows.

Now, to bound \nu(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap\mathcal{M}_{\gamma_{n}}), we follow the approach used in the proof of Theorem 1 in [61]. We apply Lemma 4 to the density function p(y|x) and the empirical measure \hat{P}(\cdot|A_{k}), as well as the estimated density function \hat{p}(\cdot|A_{k}). Here, we provide a sketch of the main changes made to the proof. We denote by \hat{L} the upper level set of \hat{p}(\cdot|A_{k}).

Let \{i_{1},\dots,i_{n_{k}}\}=\{i:1\leq i\leq n,X_{i}\in A_{k}\}. From Lemma 3 in [61], conditioning on (i_{1},\dots,i_{n_{k}}),

\hat{L}\Big\{\hat{p}(X_{(i_{\alpha})},Y_{(i_{\alpha})}|A_{k})\Big\}\subset\hat{\mathbf{C}}_{n}(x)\subset\hat{L}\Big\{\hat{p}(X_{(i_{\alpha})},Y_{(i_{\alpha})}|A_{k})-(n_{k}h_{n_{k}})^{-1}\psi_{K}\Big\},

where \psi_{K}=\sup_{x,x^{\prime}}|K(x)-K(x^{\prime})| and (X_{(i_{\alpha})},Y_{(i_{\alpha})}) is the element of

\{(X_{i_{1}},Y_{i_{1}}),\dots,(X_{i_{n_{k}}},Y_{i_{n_{k}}})\}

such that \hat{p}(Y_{i_{\alpha}}|A_{k}) has rank \lfloor n_{k}\alpha\rfloor. Let \hat{t}^{\alpha}=\hat{p}(X_{(i_{\alpha})},Y_{(i_{\alpha})}). It is easy to check that

\hat{t}^{\alpha}=\inf\{t\geq 0:\hat{P}(L^{l}(t)|A_{k})\geq\alpha\}.

Consider the event

E=\Big\{\sup_{x}R_{n}(x)\leq 2\frac{h_{\gamma_{n}}}{c_{\gamma_{n}}},\ \sup_{x}V_{n}(x)\leq\xi_{2,\lambda}w_{n}\Big\}.

From nw_{n}^{d}h_{\gamma_{n}}^{\ell+3}/\log(\gamma_{n})\to\infty, it follows that w_{n}/h_{\gamma_{n}}\to 0. Then, from Lemmas 1 and 3, \mathbb{P}(E^{c})=\mathcal{O}(n^{-\lambda}). Let r_{n}:=h_{\gamma_{n}}/c_{\gamma_{n}}. Since w_{n}/h_{\gamma_{n}}\to 0, the event

E_{1}=\Big\{\sup_{x}R_{n}(x)\leq 2r_{n},\ \sup_{x}V_{n}(x)\leq\xi_{2,\lambda}r_{n}\Big\}

is such that, for all n large enough, \mathbb{P}(E_{1}^{c})=\mathcal{O}(n^{-\lambda}). From Lemma 4 with v_{1}=2r_{n} and v_{2}=\xi_{2,\lambda}r_{n}, we obtain that, for n large enough,

\mathbb{P}\Big(\sup_{x}\nu\Big(\hat{L}(\hat{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big)\geq\xi_{\lambda}r_{n}\Big)=\mathcal{O}(n^{-\lambda}). (18)

Let \tilde{t}^{\alpha}=\hat{t}^{\alpha}-(\gamma_{n}h_{\gamma_{n}})^{-1}\psi_{K} and v_{3}=(\gamma_{n}h_{\gamma_{n}})^{-1}\psi_{K}. From nw_{n}^{d}h_{\gamma_{n}}^{\ell+3}/\log(\gamma_{n})\to\infty, it follows that v_{3}\to 0. Applying Lemma 4, we get that

\mathbb{P}\Big(\sup_{x}\nu\Big(\hat{L}(\tilde{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big)\geq\xi^{1}_{\lambda}r_{n}+\xi^{2}_{\lambda}v_{3}\Big)=\mathcal{O}(n^{-\lambda}).

Since v_{3}\leq r_{n} for n large enough, we get that

\mathbb{P}\Big(\sup_{x}\nu\Big(\hat{L}(\tilde{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big)\geq 2\xi^{1}_{\lambda}r_{n}\Big)=\mathcal{O}(n^{-\lambda}). (19)

Lastly, we write

\nu(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap\mathcal{M}_{\gamma_{n}})\leq\nu\Big(\hat{L}(\hat{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big)+\nu\Big(\hat{L}(\tilde{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big),

and then the theorem follows from (17), (18) and (19).

Proof of Theorem 3

To prove (10), fix any ϵ>0\epsilon>0 and choose a sufficiently large compact smooth submanifold 𝒦=𝒦(ϵ)\mathcal{K}=\mathcal{K}(\epsilon)\subset\mathcal{M} fulfilling H0, such that (Y𝒦c)<ϵ.\mathbb{P}(Y\in\mathcal{K}^{c})<\epsilon. The existence of such a submanifold follows from the existence of smooth exhaustion functions (see Proposition 2.28 in [35]). Then, by a simple union bound,

(Yn+1𝐂^n(Xn+1)𝐂P(Xn+1))(Yn+1(𝐂^n(Xn+1)𝐂P(Xn+1))𝒦)+ϵ.\mathbb{P}\Bigl{(}Y_{n+1}\in\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\Bigr{)}\leq\\ \mathbb{P}\Bigl{(}Y_{n+1}\in\Bigl{(}\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\Bigr{)}\cap\mathcal{K}\Bigr{)}+\epsilon.

Now, fixing $\mathcal{K}$, we take, as in the proof of Theorem 2, a sequence of closed sets $\mathcal{K}_{\gamma_{n}}\subset\mathcal{K}$ such that

$$\inf_{x\in\mathcal{K}_{\gamma_{n}}}\rho(x,\partial\mathcal{K})=c_{\gamma_{n}}\to 0\quad\text{and}\quad\frac{h_{\gamma_{n}}}{c_{\gamma_{n}}}\to 0\quad\text{as }n\to\infty.$$
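
For concreteness, one admissible choice (a sketch, not necessarily the construction used in the proof of Theorem 2) is the inner parallel set

$$\mathcal{K}_{\gamma_{n}}=\{x\in\mathcal{K}:\rho(x,\partial\mathcal{K})\geq c_{\gamma_{n}}\},\qquad c_{\gamma_{n}}:=\sqrt{h_{\gamma_{n}}},$$

which satisfies $\inf_{x\in\mathcal{K}_{\gamma_{n}}}\rho(x,\partial\mathcal{K})\geq c_{\gamma_{n}}$ and $h_{\gamma_{n}}/c_{\gamma_{n}}\to 0$ whenever $h_{\gamma_{n}}\to 0$ and this set is nonempty.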

Then we decompose $\mathbb{P}\bigl(Y_{n+1}\in\bigl(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\bigr)\cap\mathcal{K}\bigr)$ as the sum

$$\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\mathcal{K}_{\gamma_{n}}\Big)+\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\big(\mathcal{K}\setminus\mathcal{K}_{\gamma_{n}}\big)\Big).$$

Since $\mathcal{K}$ is compact and, for all $x$, $p(y|x)$ is a continuous function of $y$, for $n$ large enough we have

$$\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\big(\mathcal{K}\setminus\mathcal{K}_{\gamma_{n}}\big)\Big)<\epsilon.$$

Finally, the proof that

$$\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\mathcal{K}_{\gamma_{n}}\Big)\to 0$$

follows from the continuity of $p(y|x)$, the fact that $\mathcal{K}$ satisfies the hypotheses of Theorem 2, and the fact, established in the proof of Theorem 2, that

$$\nu\Big(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap\mathcal{K}_{\gamma_{n}}\Big)\to 0$$

as $n\to\infty$.
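
To spell out this last step (a sketch under a slightly stronger reading of the assumptions; the constant $M_{\mathcal{K}}$ is introduced here only for illustration), suppose $p(y|x)\leq M_{\mathcal{K}}<\infty$ for all $y\in\mathcal{K}$ and all $x$, as happens when the continuous conditional density is bounded on the compact set uniformly in $x$. Then

$$\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\mathcal{K}_{\gamma_{n}}\Big)=\mathbb{E}\Big[\int_{(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1}))\cap\mathcal{K}_{\gamma_{n}}}p(y|X_{n+1})\,\nu(dy)\Big]\leq M_{\mathcal{K}}\,\mathbb{E}\Big[\nu\Big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\cap\mathcal{K}_{\gamma_{n}}\Big)\Big],$$

and the right-hand side vanishes by the convergence above, the bound $\nu(\mathcal{K}_{\gamma_{n}})\leq\nu(\mathcal{K})<\infty$, and dominated convergence.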

Acknowledgment

The authors are grateful to Tyrus Berry for his insightful comments on the results obtained in his work with Timothy Sauer. The research of the first and third authors has been partially supported by grant FCE-3-2022-1-172289 from ANII (Uruguay), grant 22MATH-07 from MATH-AmSud (France-Uruguay), and grant 22520220100031UD from CSIC (Uruguay).

References

  • [1] Aamari, E., C. Aaron, and C. Levrard (2023). Minimax boundary estimation and estimation with boundary. Bernoulli, 29(4), 3334–3368.
  • [2] Aaron, C. and A. Cholaquidis (2020). On boundary detection. Ann. Inst. H. Poincaré Probab. Statist., 56(3), 2028–2050.
  • [3] Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall.
  • [4] Ambrosio, L., A. Colesanti, and E. Villa (2008). Outer Minkowski content for some classes of closed sets. Mathematische Annalen, 342(4), 727–748.
  • [5] Balasubramanian, V., S.-S. Ho, and V. Vovk (2014). Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes.
  • [6] Berenfeld, C. and M. Hoffmann (2021). Density estimation on an unknown submanifold. Electronic Journal of Statistics, 15(1), 2179–2223.
  • [7] Berry, T. and T. Sauer (2017). Density estimation on manifolds with boundary. Computational Statistics & Data Analysis, 107, 1–17.
  • [8] Best, M. J. (2010). Portfolio optimization. CRC Press.
  • [9] Bouzebda, S. and N. Taachouche (2024a). Oracle inequalities and upper bounds for kernel conditional U-statistics estimators on manifolds and more general metric spaces associated with operators. Stochastics: An International Journal of Probability and Stochastic Processes, 96(8), 2135–2198.
  • [10] Bouzebda, S. and N. Taachouche (2024b). Rates of the Strong Uniform Consistency with Rates for Conditional U-Statistics Estimators with General Kernels on Manifolds. Mathematical Methods of Statistics, 33(2), 95–153.
  • [11] Chatterjee, S. (2021). A new coefficient of correlation. Journal of the American Statistical Association, 116(536), 2009–2022.
  • [12] Cholaquidis, A., R. Fraiman, and L. Moreno (2022). Level set and density estimation on manifolds. Journal of Multivariate Analysis, 189, 104925.
  • [13] Cholaquidis, A., R. Fraiman, and L. Moreno (2023). Level sets of depth measures in abstract spaces. TEST, 1–16.
  • [14] Cholaquidis, A., R. Fraiman, F. Gamboa, and L. Moreno (2023). Weighted lens depth: Some applications to supervised classification. Canadian Journal of Statistics, 51(2), 652–673.
  • [15] Cleanthous, G., A. G. Georgiadis, G. Kerkyacharian, P. Petrushev, and D. Picard (2020). Kernel and wavelet density estimators on manifolds and more general metric spaces. Bernoulli, 26(3), 1832–1862.
  • [16] Cleanthous, G., A. G. Georgiadis, and E. Porcu (2022). Oracle inequalities and upper bounds for kernel density estimators on manifolds and more general metric spaces. Journal of Nonparametric Statistics, 34(4), 734–757.
  • [17] Diquigiovanni, J., M. Fontana, and S. Vantini (2021). Distribution-free prediction bands for multivariate functional time series: an application to the Italian gas market. arXiv preprint arXiv:2107.00527.
  • [18] Diquigiovanni, J., M. Fontana, and S. Vantini (2022). Conformal prediction bands for multivariate functional data. Journal of Multivariate Analysis, 189, 104879.
  • [19] Farouki, R. T., T. N. Goodman, and T. Sauer (2003). Construction of orthogonal bases for polynomials in Bernstein form on triangular and simplex domains. Computer Aided Geometric Design, 20(4), 209–230.
  • [20] Fong, E. and C. C. Holmes (2021). Conformal Bayesian computation. Advances in Neural Information Processing Systems, 34, 18268–18279.
  • [21] Fontana, M., S. Vantini, M. Tavoni, and A. Gammerman (2020). A Conformal Approach for Distribution-free Prediction of Functional Data. In Functional and High-Dimensional Statistics and Related Fields 5, pp. 83–90. Springer.
  • [22] Fontana, M., G. Zeni, and S. Vantini (2023). Conformal prediction: A unified review of theory and new challenges. Bernoulli, 29(1), 1–23.
  • [23] Ghosal, S. (2001). Convergence rates for density estimation with Bernstein polynomials. The Annals of Statistics, 29(5), 1264–1280.
  • [24] Guigui, N., N. Miolane, X. Pennec, et al. (2023). Introduction to Riemannian Geometry and Geometric Statistics: from basic theory to implementation with Geomstats. Foundations and Trends® in Machine Learning, 16(3), 329–493.
  • [25] Guo, C., G. Pleiss, Y. Sun, and K. Q. Weinberger (2017). On Calibration of Modern Neural Networks. International Conference on Machine Learning.
  • [26] Hall, P., J. Racine, and Q. Li (2004). Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association, 99(468), 1015–1026.
  • [27] Hong, Y., R. Kwitt, N. Singh, N. Vasconcelos, and M. Niethammer (2016). Parametric regression on the Grassmannian. IEEE transactions on pattern analysis and machine intelligence, 38(11), 2284–2297.
  • [28] Huang, Y. and G. Kou (2014). A kernel entropy manifold learning approach for financial data analysis. Decision Support Systems, 64, 31–42.
  • [29] Izbicki, R. and A. B. Lee (2017). Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics, 11(2), 2800–2831.
  • [30] Izbicki, R., G. Shimizu, and R. B. Stern (2022). Cd-split and hpd-split: Efficient conformal regions in high dimensions. Journal of Machine Learning Research, 23(87), 1–32.
  • [31] Jammalamadaka, S. R. and Y. R. Sarma (1988). A correlation coefficient for angular variables. Statistical theory and data analysis II, 349–364.
  • [32] Koh, P. W., P. Liang, and J. Zhu (2019). Understanding Black-Box Predictions via Influence Functions. International Conference on Machine Learning.
  • [33] Koh, P. W., P. Liang, A. Bekasov, J. Fu, R. Kondor, and J. Zhu (2020). Probabilistic Uncertainty Quantification for Deep Learning. Advances in Neural Information Processing Systems, 33.
  • [34] Kuleshov, A., A. Bernstein, and E. Burnaev (2018). Conformal prediction in manifold learning. In Conformal and Probabilistic Prediction and Applications, pp. 234–253. PMLR.
  • [35] Lee, J. M. (2013). Introduction to Smooth Manifolds (2nd ed.), Volume 218 of Graduate Texts in Mathematics. Springer.
  • [36] Lei, J. (2014). Classification with confidence. Biometrika, 101(4), 755–769.
  • [37] Lei, J. and L. Wasserman (2016). The Adaptive and the Honest: A Dichotomy in Conformal Prediction and Its Implications. The Annals of Statistics, 44(6), 2293–2323.
  • [38] Lei, L. and E. J. Candès (2021). Conformal inference of counterfactuals and individual treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5), 911–938.
  • [39] Lei, J., J. Robins, and L. Wasserman (2013). Distribution-free prediction sets. Journal of the American Statistical Association, 108(501), 278–287.
  • [40] Lei, J. (2014). Classification with confidence. Biometrika, 101(4), 755–769.
  • [41] Lei, J., A. Rinaldo, and L. Wasserman (2015). A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74, 29–43.
  • [42] Leisch, F., E. Dimitriadou, K. Hornik, D. Meyer, and A. Weingessel (2007). mlbench: Machine Learning Benchmark Problems. https://CRAN.R-project.org/package=mlbench. R package version 2.1-3.
  • [43] Liu, H. and L. Wasserman (2016). Conformal Prediction for Multi-Class Classification. Journal of Machine Learning Research, 17, 1–28.
  • [44] Lugosi, G. and M. Matabuena (2024). Uncertainty quantification in metric spaces. arXiv preprint arXiv:2405.05110.
  • [45] Mardia, K. V. and P. E. Jupp (2000). Directional Statistics. Wiley.
  • [46] Papadopoulos, H., T. Poggio, and S. Vempala (2002). Combining Classifiers via Confidence. In Advances in Neural Information Processing Systems, pp. 705–712.
  • [47] Papadopoulos, H., K. Proedrou, V. Vovk, and A. Gammerman (2002). Inductive confidence machines for regression. In Machine Learning: ECML 2002, pp. 345–356. Springer.
  • [48] Pawlowsky-Glahn, V., J. J. Egozcue, and R. Tolosana-Delgado (2015). Modeling and analysis of compositional data. John Wiley & Sons.
  • [49] Pennec, X., S. Sommer, and T. Fletcher (2019). Riemannian geometric statistics in medical image analysis. Academic Press.
  • [50] Petersen, A. and H.-G. Müller (2019). Fréchet regression for random objects with Euclidean predictors. arXiv preprint arXiv:1608.03012.
  • [51] Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. The Annals of Statistics, 855–881.
  • [52] Rigollet, P. and R. Vert (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli, 14, 1154–1178.
  • [53] Romano, Y., E. Patterson, and E. Candes (2019). Conformalized quantile regression. Advances in neural information processing systems, 32.
  • [54] Shafer, G. (2008). A Mathematical Theory of Evidence. Princeton University Press.
  • [55] Thäle, C. (2008). 50 years sets with positive reach–a survey. Surveys in Mathematics and its Applications, 3, 123–165.
  • [56] Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. The Annals of Statistics, 25(3), 948–969.
  • [57] Vovk, V., A. Gammerman, and G. Shafer (1998). Algorithmic Learning in a Random World. Machine Learning, 30(2-3), 119–138.
  • [58] Vovk, V., A. Gammerman, and G. Shafer (2005). Algorithmic Learning in a Random World. Springer-Verlag.
  • [59] Vovk, V. (2009). Conditional Validity, Exchangeability, and Calibration of Probability Forecasts. Journal of Machine Learning Research, 10, 987–1016.
  • [60] Vyas, S. and L. Kumaranayake (2006). Constructing socio-economic status indices: how to use principal components analysis. Health policy and planning, 21(6), 459–468.
  • [61] Lei, J. and L. Wasserman (2014). Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 71–96.
  • [62] Wäschle, K., M. Bobak, S. Pölsterl, S. Gehrmann, D. Dettmering, J. Hornegger, and A. Maier (2014). Conformal Prediction for Regression. IEEE Transactions on Medical Imaging, 33(2), 407–418.
  • [63] Wu, H.-T. and N. Wu (2022). Strong Uniform Consistency with Rates for Kernel Density Estimators with General Kernels on Manifolds. Information and Inference: A Journal of the IMA, 11(2), 781–799.
  • [64] Xu, Z., Z. Zhao, and W. B. Wu (2020). Conformal Prediction for Time Series. Journal of the American Statistical Association, 115(530), 1585–1594.