
Conformal inference for regression on Riemannian Manifolds

Alejandro Cholaquidis, Fabrice Gamboa, Leonardo Moreno
Abstract

Regression on manifolds, and, more broadly, statistics on manifolds, has garnered significant importance in recent years due to the vast number of applications for non-Euclidean data. Circular data is a classic example, but so is data in the space of covariance matrices, or data on the Grassmannian manifold obtained as a result of principal component analysis, among many others. In this work we investigate prediction sets for regression scenarios when the response variable, denoted by Y, resides in a manifold, and the covariable, denoted by X, lies in a Euclidean space. This extends the concepts delineated in [61] to this novel context. Aligning with traditional principles in conformal inference, these prediction sets are distribution-free, meaning that no specific assumptions are imposed on the joint distribution of (X,Y), and they maintain a non-parametric character. We prove the asymptotic almost sure convergence of the empirical version of these regions on the manifold to their population counterparts. The efficiency of this method is shown through a comprehensive simulation study and an analysis involving real-world data.

1 Introduction

Conformal prediction is a powerful set of statistical tools that operates under minimal assumptions about the underlying model. It is primarily used to construct confidence sets that are applicable to a diverse range of problems in fields such as machine learning and statistics. It is unique in its ability to construct prediction regions that guarantee a specified level of coverage for finite samples, irrespective of the distribution of the data. This is particularly crucial in high-stakes decision-making scenarios where maintaining a certain level of coverage is critical.

Unlike other methods, conformal prediction does not require strong assumptions about the sample distribution, such as normality. The aim is to construct prediction regions as small as possible to yield informative and precise predictions, enhancing the tool’s utility in various applications.

The approach was first proposed by Vovk, Gammerman, and Shafer in the late 1990s, as referenced in [57]. Since its inception, it has been the subject of intense research activity. Originally formulated for binary classification problems, the method has since been expanded to accommodate regression, multi-class classification, functional data, functional time series, and anomaly detection, among others. Several applications of this method can be found in the book [5].

In the context of regression, conformal prediction has proven to be efficient in constructing prediction sets, as evidenced by works such as [61], [62], and [34]. To enhance the performance of these prediction sets, particularly to decrease the length of prediction intervals (when the output is one-dimensional), a combination of conformal inference and quantile regression was proposed in [53]. In [20], computationally efficient conformal inference methods for Bayesian models were proposed.

The method has also been extended to functional regression, where the predictors and responses are functions, rather than vectors, see for instance [41], [21] and [18]. In this case, the prediction regions take the form of functional envelopes that have a high probability of containing the true function.

In the field of classification, conformal prediction has been employed to tackle a broad spectrum of problems. These include image classification, as in [40], and text classification, as in [58]. For multi-class classification, a prevalent approach is to create prediction sets that have a high likelihood of encompassing the correct class. This can be realized via the use of ‘Venn prediction sets’, as outlined in [43]. These sets partition the label space into overlapping regions, each one corresponding to a distinct class.

In summary, conformal prediction provides powerful statistical tools and has been successfully applied to a wide range of problems in machine learning, statistics, and related fields. Its main advantage is its ability to provide distribution-free prediction regions that can be used in the presence of any underlying distribution of the data. As this field of research continues to evolve, it is expected to find even more applications in the future.

2 Conformal inference on manifolds

Although, as mentioned, conformal prediction has been extended to various scenarios, there are no proposals or extensions to the case where the output is in a Riemannian manifold. This case is of considerable importance because there are certain types of data that, due to their inherent characteristics, must be treated as data on manifolds. A prominent example is given by covariance matrices (for instance, the volatility of a portfolio, see [8] and [28]), which are data in the manifold of positive definite matrices (see [14]). Another example is given by vectorcardiograms, which are condensed into data on the Stiefel manifold (see [13]). The outcome of performing a principal component analysis results in data on the Grassmannian manifold (see [27]), and wind measurements can be represented as data on the cylinder (see [12]).

Other applications in image analysis are developed in [49] and more generally in machine learning in [24].

The extension to this context is not trivial, since when the output Y belongs to a Riemannian manifold \mathcal{M}, several arguments used in Euclidean conformal inference are not straightforwardly extendable. For instance, the Fréchet mean on the manifold can be defined, but it does not always exist, nor is it necessarily unique. Moreover, formulating a regression model on the manifold is not easy because of the absence of any additive structure, although there have been recent advances in this area (see for instance [50]).

We extend the methodologies outlined in [61] to the broader context of pairs (X,Y) where X\in\mathbb{R}^{d} and the response Y has support included in a smooth enough \ell-dimensional manifold \mathcal{M}. The work [61] focuses on the scenario where Y is a real-valued variable, constructing confidence sets (there referred to as confidence bands) via density estimators. One of the key tools to get the consistency of the empirical conformal region to its population counterpart, as in [61], is Theorem 1, which states that the kernel density estimator is uniformly consistent on manifolds. Once this is established, to adapt the ideas in [61] it must also be proved that the conditional density is uniformly bounded from above; this is done in Lemma 4. Finally, a further distinction from [61] is that, because the manifold may possess a nonempty boundary, points located well inside the manifold are handled differently from those situated near its edge. Here, we obtain an upper bound for the discrepancy between the empirical confidence set \hat{\mathbf{C}}_{n}(X) and its theoretical counterpart \mathbf{C}_{P}(X), in terms of \nu, the volume measure on \mathcal{M}. More precisely, as detailed in Theorem 2 (see also Remark 1), the probability that this discrepancy is larger than n^{-1/((\ell+3)(d+2))} is bounded from above by A_{\lambda}n^{-\lambda} for any \lambda>0, A_{\lambda} being a constant that depends only on \lambda, where \ell denotes the dimensionality of \mathcal{M} and X\in\mathbb{R}^{d}. Our main and stronger theorem is for compact manifolds; the case of non-compact manifolds is considered in Section 5.

The kernel-based density estimator proposed in [12] uses the Euclidean distance instead of the geodesic distance on the manifold, which may be unknown in some cases; see Equation (4). Due to the curse of dimensionality, kernel-based density estimators are generally unsuitable for high-dimensional problems. In Section 6, we discuss an approach to overcome this limitation that builds on the ideas introduced in [30]. In essence, instead of estimating the conditional density p(y|x) locally (i.e., using only sample points close to x), the proposed method uses all sample points (X_{i},Y_{i}) whose estimated conditional densities \hat{p}(Y_{i}|X_{i}) are similar (in a sense to be defined).

Conformal inference in metric spaces is tackled in [44]; however, given that level of generality, convergence rates such as those obtained in the present work are not provided, only consistency in probability, and only for the case where the confidence bands are constructed through regression.

2.1 Conformal inference in a nutshell

In this section, we briefly provide the basic foundations of conformal inference to facilitate the reading of the following sections. For a more detailed account see [22]. A key hypothesis in conformal inference is the exchangeability assumption, which means that for any permutation \pi of \{1,\dots,n\} the distribution of the sample (Z_{1},\dots,Z_{n}), where Z_{i} is a random element in some measurable space \mathbf{Z}, is the same as the distribution of (Z_{\pi(1)},\dots,Z_{\pi(n)}). Quoting [22], “A nonconformity measure A(B,z):\mathbf{Z}^{n}\times\mathbf{Z}\to\mathbb{R} is a way of scoring how different an example z is from a bag B=\{Z_{1},\dots,Z_{n}\}. Let us define, p_{z}:=|\{i=1,\ldots,n+1:R_{i}\geq R_{n+1}\}|/(n+1) where R_{i}:=A\left(\{Z_{1},\ldots,Z_{i-1},Z_{i+1},\ldots,Z_{n},z\},Z_{i}\right) and R_{n+1}:=A\left(\{Z_{1},\ldots,Z_{n}\},z\right).” For \alpha\in[0,1] “we define the prediction set \gamma^{\alpha}(Z_{1},\ldots,Z_{n}):=\{z\in\mathbf{Z}:p_{z}>\alpha\}.” The following proposition is given in [58].
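
To fix ideas, the following is a minimal sketch, in Python, of the generic construction just described. The nonconformity measure (distance to the sample mean) and the candidate grid are illustrative choices, not prescribed by [22]; for real data one would plug in a problem-specific score.

```python
import numpy as np

def conformal_p_value(sample, z, nonconformity):
    """Full-conformal p-value p_z of a candidate point z.

    `sample` plays the role of the bag Z_1,...,Z_n and
    `nonconformity(bag, point)` the role of A(B, z).  For i <= n, R_i is the
    score of Z_i against the augmented bag with Z_i removed; R_{n+1} is the
    score of z against the original bag.
    """
    n = len(sample)
    augmented = sample + [z]
    r = np.array([nonconformity(augmented[:i] + augmented[i + 1:], augmented[i])
                  for i in range(n + 1)])
    return np.mean(r >= r[-1])   # |{i : R_i >= R_{n+1}}| / (n + 1)

def prediction_set(sample, candidates, nonconformity, alpha):
    """gamma^alpha(Z_1,...,Z_n) approximated over a finite candidate grid."""
    return [z for z in candidates
            if conformal_p_value(sample, z, nonconformity) > alpha]

# toy usage with a hypothetical score: distance to the sample mean
rng = np.random.default_rng(0)
data = list(rng.normal(size=30))
score = lambda bag, point: abs(point - np.mean(bag))
grid = np.linspace(-4, 4, 161)
region = prediction_set(data, grid, score, alpha=0.1)
print(min(region), max(region))
```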

Proposition 1 (Proposition 2.1).

Under the exchangeability assumption,

\mathbb{P}(Z_{n+1}\not\in\gamma^{\alpha}(Z_{1},\ldots,Z_{n}))\leq\alpha\quad\text{ for any }\alpha.

In the regression setting, that is, when Z_{i}=(X_{i},Y_{i}) has distribution P, the previous construction builds a level set of the joint distribution (see Equation 4 in [61]). Nevertheless, controlling \mathbb{P}(Y\in\mathbf{C}_{P}(x)|X=x) is much more convenient in regression (see [61]). In this case we speak of “conditional coverage”. When there exists a conditional density p(y|x) (see hypothesis H1 in Section 3), the “conditional oracle set” \mathbf{C}_{P}(x) (in [61] it is called the conditional oracle band) is defined as

\mathbf{C}_{P}(x)=\{y:p(y|x)\geq t^{\alpha}_{x}\}, (1)

where t^{\alpha}_{x} satisfies

\int\mathbbm{1}_{\{p(y|x)\geq t^{\alpha}_{x}\}}p(y|x)\,dy=1-\alpha. (2)
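
As an illustration of (1) and (2), the sketch below computes the cut-off t^{\alpha}_{x} and the oracle set on a grid, with the manifold replaced by an interval and a known Gaussian conditional density; this is only a numerical toy, not part of the method.

```python
import numpy as np

def oracle_level(density_vals, weights, alpha):
    """Smallest t such that the mass of {y : p(y|x) >= t} is at least 1 - alpha.

    `density_vals` are p(y|x) on grid points y_j and `weights` are quadrature
    weights, so that sum(density_vals * weights) is approximately 1.
    """
    order = np.argsort(density_vals)[::-1]            # highest density first
    mass = np.cumsum(density_vals[order] * weights[order])
    j = np.searchsorted(mass, 1 - alpha)              # first index reaching 1 - alpha
    return density_vals[order][min(j, len(order) - 1)]

# toy usage: conditional density N(x, 1) at x = 0.3, alpha = 0.1
x, alpha = 0.3, 0.1
y = np.linspace(-6, 6, 2001)
w = np.full_like(y, y[1] - y[0])
p = np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2 * np.pi)
t = oracle_level(p, w, alpha)
C_P = y[p >= t]                                       # oracle set, here an interval
print(round(t, 4), round(C_P.min(), 2), round(C_P.max(), 2))
```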

The empirical version of \mathbf{C}_{P}(x), obtained from \aleph_{n}=\{(X_{i},Y_{i}):i=1,\dots,n\}, an i.i.d. sample of (X,Y) with distribution P, is called a conditionally valid set \mathbf{C}_{n}(x) (see [61]).

Definition 1.

Given \alpha\in(0,1) and x in the support of P_{X}, a set \mathbf{C}_{n}(x)\subset\mathcal{M} is said to be conditionally valid if

\mathbb{P}\big(Y\in\mathbf{C}_{n}(x)|X=x\big)\geq 1-\alpha.

This definition captures the notion of a set of possible values of Y that provides a specified level of coverage for a given input x.

As is shown in Lemma 1 of [61], non-trivial finite sample conditional validity for all x in the support of P_{X} is impossible for a continuous distribution. To overcome this limitation, the following notion of local validity is introduced.

Definition 2.

Let \mathcal{A}=\{A_{j}:j\geq 1\} be a partition of \textrm{supp}(P_{X}). A prediction set \mathbf{C}_{n} is locally valid with respect to \mathcal{A} if

\mathbb{P}(Y_{n+1}\in\mathbf{C}_{n}(X_{n+1})|X_{n+1}\in A_{j})\geq 1-\alpha\text{ for all }j\text{ and all }P. (3)

Whenever A_{k}\in\mathcal{A}, we will write p(y|A_{k}) for the conditional density, w.r.t. \nu, of Y given X\in A_{k}. We aim to prove (see Theorem 2) that locally valid sets converge to \mathbf{C}_{P} when Y is supported on a manifold.

3 Assumptions

In the following, \aleph_{n}=\{(X_{i},Y_{i}):i=1,\dots,n\} denotes an i.i.d. sample of (X,Y) with distribution P, where X\in\mathbb{R}^{d} and Y\in\mathcal{M}. Here (\mathcal{M},\rho) is a compact \ell-dimensional submanifold of \mathbb{R}^{D}. The case of non-compact manifolds is discussed in Section 5. We further denote by P_{X} the marginal distribution of X and by \textrm{supp}(P_{X}) its support. We denote the volume measure on \mathcal{M} by \nu and use \|\cdot\| to denote the Euclidean norm on \mathbb{R}^{D}. We denote by \mu the d-dimensional Lebesgue measure in \mathbb{R}^{d}.

We will now give the set of assumptions that we will require. H1 to H4 are also imposed in [61]. Hypotheses H0 and H5 are imposed to guarantee the uniform convergence of the kernel-based density estimator of the conditional density.

  • H0

    \mathcal{M} is a \mathcal{C}^{2} submanifold, and if \partial\mathcal{M}\neq\emptyset, then \partial\mathcal{M} is also a \mathcal{C}^{2} submanifold.

  • H1

    The joint distribution P has a density p_{X,Y}(x,y) w.r.t. \mu\times\nu, and the marginal distribution P_{X} has a density p_{X} (w.r.t. \mu). We denote by p(y|x)=p_{X,Y}(x,y)/p_{X}(x) the conditional density, where p(y|x)=0 if p_{X}(x)=0.

  • H2

    p_{X} is such that there exist b_{1},b_{2} with 0<b_{1}\leq p_{X}(x)\leq b_{2}<\infty for all x in the support of X.

  • H3

    p(y|x) is Lipschitz continuous as a function of x, i.e., there exists a constant L>0 such that \|p(\cdot|x)-p(\cdot|x^{\prime})\|_{\infty}\leq L|x-x^{\prime}|.

  • H4

    There are positive constants \epsilon_{0}, \gamma, c_{1}, and c_{2} such that for all x in the support of X,

    c_{1}\epsilon^{\gamma}\leq P(\{y:|p(y|x)-t_{x}^{\alpha}|<\epsilon\}|X=x)\leq c_{2}\epsilon^{\gamma},

    for all \epsilon\leq\epsilon_{0}, where t_{x}^{\alpha} is given by (2). Moreover, \inf_{x}t_{x}^{\alpha}\geq t_{0}>0.

  • H5

    p(y|A_{k}) has continuous partial second derivatives; for short, p(y|A_{k})\in\mathcal{C}^{2}.

Hypothesis H0 is satisfied by a very wide range of manifolds, among them the Stiefel and Grassmannian manifolds and the cone of positive definite matrices, among many others. It is not a restrictive hypothesis in applications.

Assumptions H1 to H5 impose regularity on the joint and conditional densities of the data. In particular, the Lipschitz continuity assumption H3 ensures that small changes in the input x lead to small changes in the conditional density of the output. Assumption H2 implies that X is compactly supported. Regarding Assumption H4, quoting [61], it “is related to the notion of the ‘\gamma-exponent’ condition that was introduced by [51], and widely used in the density level set literature ([56, 52]). It ensures that the conditional density p(\cdot|x) is neither too flat nor too steep near the contour at level t_{x}^{\alpha}, so the cut-off value t_{x}^{\alpha} and the conditional density level set \mathbf{C}_{P}(x)=L_{x}(t_{x}^{\alpha}) can be approximated from a finite sample […]. Assumption H4 also requires that the optimal cut-off values t_{x}^{\alpha} be bounded away from zero.” Assumption H5 is required to get the uniform convergence of the kernel-based estimator of p(y|A_{k}) on manifolds, see [7]. These assumptions play a crucial role in the development and analysis of conformal prediction methods.

Let us introduce some important sequences of positive real numbers that will play a key role throughout the manuscript.

  1. w_{n}=(\log(n)/n)^{1/(d+2)},

  2. \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor, where b_{1} is the positive constant introduced in H2. Observe that \gamma_{n}\to\infty.

  3. Given h_{n},c_{n}\to 0 as n\to\infty, we will consider the subsequences c_{\gamma_{n}} and h_{\gamma_{n}}, with \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor as before.

4 Locally valid sets from a kernel density estimator

In this section we introduce a slightly modified version of the estimated local marginal density \hat{p}^{(x,y)}(v|A_{k}) and of the local conformity rank \pi_{n,k}(x,y) originally introduced in [61]. We make the assumption that \textrm{supp}(P_{X})=[0,1]^{d}, and that \mathcal{A}=\{A_{k},k=1,\dots,T\} is a finite partition of [0,1]^{d} consisting of cubes with sides of length w_{n}. This is a quite common technical assumption in conformal inference; we could instead assume, for instance, that X is such that \textrm{supp}(P_{X})\subset[-R,R]^{d} for some R>0. This only changes the constants appearing in Theorem 2, which remains true.

Let n_{k}=\sum_{i=1}^{n}\mathbbm{1}_{\{X_{i}\in A_{k}\}}. Given a sequence h_{n}\to 0, a kernel function K(\cdot):\mathbb{R}\to\mathbb{R}, and h_{n_{k}}, we define

\hat{p}(v|A_{k})=\frac{1}{n_{k}h_{n_{k}}^{\ell}}\sum_{i=1}^{n}\mathbbm{1}_{\{X_{i}\in A_{k}\}}K\Bigg(\frac{\|Y_{i}-v\|}{h_{n_{k}}}\Bigg). (4)

We aim to prove that \hat{p}(v|A_{k}) provides a uniform estimate of p(y|A_{k}) across v and k. To achieve this, we need to ensure that there are sufficiently many sample points in each A_{k}. This is guaranteed by Lemma 9 of [61], which states that if we choose w_{n}=(\log(n)/n)^{1/(d+2)}, then, with probability one, for all n large enough,

\forall k:\quad b_{1}nw_{n}^{d}/2\leq n_{k}\leq 3b_{2}nw_{n}^{d}/2. (5)

Recall that b_{1} and b_{2} were defined in Assumption H2. In what follows, we assume that n is sufficiently large to ensure (5).
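
A minimal sketch of (4), assuming the responses are given in ambient coordinates in \mathbb{R}^{D} and taking a Gaussian kernel (the normalizing constant is an illustrative choice):

```python
import numpy as np

def local_kde(v, X, Y, in_cell, h, ell):
    """Kernel estimate (4) of p(v | A_k).

    X : (n, d) covariates, Y : (n, D) responses embedded in R^D,
    in_cell : boolean mask of the points with X_i in A_k,
    h : bandwidth h_{n_k}, ell : intrinsic dimension of the manifold.
    The ambient Euclidean distance ||Y_i - v|| is used, as in (4).
    """
    nk = in_cell.sum()
    dist = np.linalg.norm(Y[in_cell] - v, axis=1)
    K = np.exp(-0.5 * (dist / h) ** 2) / (2 * np.pi) ** (ell / 2)
    return K.sum() / (nk * h ** ell)

# toy usage: responses on the unit circle (ell = 1), one partition cell
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
Y = np.c_[np.cos(theta), np.sin(theta)]
X = rng.uniform(0, 1, 200)
print(local_kde(np.array([1.0, 0.0]), X, Y, X < 0.5, h=0.3, ell=1))
```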

The corresponding augmented estimate, based on \aleph_{n}\cup\{(x,y)\}, is, for any (x,y)\in A_{k}\times\mathcal{M},

\hat{p}^{(x,y)}(v|A_{k})=\frac{n_{k}}{n_{k}+1}\hat{p}(v|A_{k})+\frac{1}{(n_{k}+1)h_{n_{k}}^{\ell}}K\Bigg(\frac{\|y-v\|}{h_{n_{k}}}\Bigg).

For any (X_{n+1},Y_{n+1})=(x,y)\in A_{k}\times\mathcal{M}, consider the following local conformity rank

\pi_{n,k}(x,y)=\frac{1}{n_{k}+1}\sum_{i=1}^{n+1}\mathbbm{1}_{\{X_{i}\in A_{k}\}}\mathbbm{1}_{\{\hat{p}^{(x,y)}(Y_{i}|A_{k})\leq\hat{p}^{(x,y)}(Y_{n+1}|A_{k})\}}. (6)

As proved in Proposition 2 of [61], the set

\hat{\mathbf{C}}_{n}(x)=\{y:\pi_{n,k}(x,y)\geq\alpha\}, (7)

has finite sample local validity, i.e., it satisfies (3). This finite sample local validity requires no assumptions on P beyond exchangeability of the sample.
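
Building on the local_kde sketch above (or any conditional density estimate), the augmented estimate, the rank (6), and a grid approximation of (7) can be written as follows; the finite candidate grid on \mathcal{M} is an assumption made for illustration only.

```python
import numpy as np

def conformity_rank(x_new, y_new, X, Y, in_cell, h, ell):
    """Local conformity rank pi_{n,k}(x, y) of (6), for x in the cell A_k."""
    nk = in_cell.sum()

    def p_aug(v):
        # augmented estimate based on the sample plus the candidate point (x, y)
        base = nk / (nk + 1) * local_kde(v, X, Y, in_cell, h, ell)
        extra = np.exp(-0.5 * (np.linalg.norm(y_new - v) / h) ** 2) \
                / ((2 * np.pi) ** (ell / 2) * (nk + 1) * h ** ell)
        return base + extra

    ref = p_aug(y_new)
    scores = np.array([p_aug(yi) for yi in Y[in_cell]])
    return (1 + np.sum(scores <= ref)) / (nk + 1)   # the "+1" counts (x, y) itself

def conformal_set(x_new, grid, X, Y, in_cell, h, ell, alpha):
    """Approximation of the set in (7) over a finite grid of points on M."""
    return [y for y in grid
            if conformity_rank(x_new, y, X, Y, in_cell, h, ell) >= alpha]
```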

The following result is the key theorem. Its proof is deferred to the Appendix. The proof follows the ideas used to prove Theorem 1 of [12]. It states that p(y|A_{k}) can be estimated uniformly by (4) for all y that are far enough from the boundary of \mathcal{M}. Additionally, this estimation can be made uniformly across all k. We assume, as in [12], that K is a Gaussian kernel. This restriction is, as in [12], purely technical. More recent references on kernel density estimation on manifolds are [9, 10, 15, 16, 6, 63].

This result is of fundamental importance in conformal prediction, as it provides a way to estimate the conditional density of Y given A_{k} for any k in a non-parametric way. It implies that the estimation error is uniform across all partitions A_{k}, which is a key requirement for conformal prediction methods. In particular, it allows us to construct conformal prediction regions that are valid with a given level of confidence.

Theorem 1.

Let \mathcal{M}\subset\mathbb{R}^{D} be a compact \ell-dimensional manifold satisfying H0. Let \mathcal{M}_{n}\subset\mathcal{M} be a sequence of closed sets, and let h_{n}\to 0 be a sequence of bandwidths such that nh_{n}^{\ell+3}/\log(n)\to\infty. We assume that c_{n}:=\inf_{y\in\mathcal{M}_{n}}\rho(y,\partial\mathcal{M})\to 0 is such that h_{n}/c_{n}\to 0 monotonically. Additionally, we assume that p(y|A_{k}) satisfies H5. Then, we have

\sup_{k}\sup_{y\in\mathcal{M}_{n_{k}}}|\hat{p}(y|A_{k})-p(y|A_{k})|=o(h_{\gamma_{n}}/c_{\gamma_{n}})\quad\text{a.s.}, (8)

where \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor.

The following theorem is the main result of our paper. It states that \hat{\mathbf{C}}_{n}(X_{i}) consistently estimates \mathbf{C}_{P}(X_{i}) uniformly for all X_{i}\in\aleph_{n}. The proof, which is deferred to the Appendix (see Subsection 9.1), is based on some ideas from the proof of Theorem 1 of [61]. Specifically, Lemma 1 and Lemma 3 are adapted from [61], while Lemma 2 is new and is required to adapt the proof of Lemma 3. Lemma 4 is the same as Lemma 8 of [61]. Finally, the proof of Theorem 2 uses the adapted lemmas and considers separately the sample points that are close to \partial\mathcal{M} and those that are far away from this boundary. The first set of points is shown to be negligible with respect to \nu using techniques from geometric measure theory.

Theorem 2.

Assume H0 to H5. Let \hat{\mathbf{C}}_{n}(x) be given by (7) and \mathbf{C}_{P}(x) be given by (1). Let w_{n}=(\log(n)/n)^{1/(d+2)} and \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor. Then, for any \lambda>0, there exists A_{\lambda}>0 such that, for n large enough,

\mathbb{P}\Big(\sup_{X:(X,Y)\in\aleph_{n}}\nu\big(\hat{\mathbf{C}}_{n}(X)\triangle\mathbf{C}_{P}(X)\big)>A_{\lambda}c_{\gamma_{n}}\Big)=\mathcal{O}(n^{-\lambda}), (9)

where c_{\gamma_{n}} is such that h_{\gamma_{n}}/c_{\gamma_{n}}^{2}\to 0 and h_{\gamma_{n}}\to 0 is such that \gamma_{n}h_{\gamma_{n}}^{\ell+3}/\log(\gamma_{n})\to\infty.

Remark 1.

It can be easily seen that, up to logarithmic factors, the rate of c_{\gamma_{n}} is n^{-1/((\ell+3)(d+2))}.
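
For completeness, the following heuristic computation (ignoring logarithmic factors) lies behind Remark 1: taking h_{\gamma_{n}} as small as the condition \gamma_{n}h_{\gamma_{n}}^{\ell+3}/\log(\gamma_{n})\to\infty allows, and c_{\gamma_{n}} as small as h_{\gamma_{n}}/c_{\gamma_{n}}^{2}\to 0 allows,

\gamma_{n}\asymp nw_{n}^{d}\asymp n^{\frac{2}{d+2}},\qquad h_{\gamma_{n}}\asymp\gamma_{n}^{-\frac{1}{\ell+3}}\asymp n^{-\frac{2}{(\ell+3)(d+2)}},\qquad c_{\gamma_{n}}\asymp h_{\gamma_{n}}^{1/2}\asymp n^{-\frac{1}{(\ell+3)(d+2)}}.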

5 Non-compact manifolds

In this section, we discuss how to remove the compactness assumption imposed in Theorem 2, while retaining all the other hypotheses. We will prove that the set \hat{\mathbf{C}}_{n}(X_{n+1}) satisfies asymptotic conditional validity (see [30]). In other words, there exist random sets \Lambda_{n} such that

\mathbb{P}\bigl(X_{n+1}\in\Lambda_{n}|\Lambda_{n}\bigr)=1-o(1),

and

\sup_{x_{n+1}\in\Lambda_{n}}\Bigl|\mathbb{P}\bigl(Y_{n+1}\in\hat{\mathbf{C}}_{n}(X_{n+1})\,\big|\,X_{n+1}=x_{n+1}\bigr)-(1-\alpha)\Bigr|=o(1).

According to Theorem 6 in [30], to obtain asymptotic conditional validity it suffices to prove that

\mathbb{P}\bigl(Y_{n+1}\in\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\bigr)=o(1). (10)
Theorem 3.

Assume H0 to H5. Let \hat{\mathbf{C}}_{n}(x) be given by (7) and \mathbf{C}_{P}(x) by (1). Then, \hat{\mathbf{C}}_{n}(X_{n+1}) is asymptotically conditionally valid.

It is worth mentioning that asymptotic conditional validity is considerably weaker than (9) when p(y|x) is bounded from above.

6 Computational aspects

A different method for estimating the conformal region is introduced in [30]. Instead of using a partition of \mathbb{R}^{d} into cubes to obtain a partition of X_{1},\dots,X_{n} on which to build kernel-based estimators, this approach employs a new partition constructed as follows. Consider, as in [30], the functions

H(z|x)=\int_{\{y:p(y|x)\leq z\}}p(y|x)\,d\nu(y),\qquad\hat{H}(z|x)=\int_{\{y:\hat{p}(y|x)\leq z\}}\hat{p}(y|x)\,d\nu(y),

where \hat{p}(y|x) is any estimator of p(y|x), not necessarily kernel-based. Define the conditional \alpha-quantile of Y|X by q_{\alpha}(x)=H^{-1}(\alpha|x). Similarly, let \hat{q}_{\alpha}(x)=\hat{H}^{-1}(\alpha|x) be the estimate of the conditional \alpha-quantile of Y|X.

Let \mathcal{F} be a partition of \mathbb{R}^{+}. Then we create a partition \mathcal{A}=\{A_{1},\dots,A_{T}\} of X_{1},\dots,X_{n} by assigning X_{i} and X_{j} to the same set A_{r}\in\mathcal{A} if and only if \hat{q}_{\alpha}(X_{i}) and \hat{q}_{\alpha}(X_{j}) lie in the same element of \mathcal{F}. We then build the kernel-based estimator (4) according to this new partition. This approach is particularly suitable for high-dimensional feature spaces.
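
A minimal sketch of this partitioning step, assuming the estimated conditional \alpha-quantiles \hat{q}_{\alpha}(X_{i}) have already been computed; the equal-width binning of \mathbb{R}^{+} is an illustrative choice of \mathcal{F}.

```python
import numpy as np

def quantile_partition(q_hat, n_bins):
    """Group the X_i according to which element of a partition F of R^+ the
    estimated conditional quantile q_hat(X_i) falls into (equal-width bins)."""
    edges = np.linspace(q_hat.min(), q_hat.max(), n_bins + 1)
    labels = np.clip(np.digitize(q_hat, edges[1:-1]), 0, n_bins - 1)
    return labels   # labels[i] == labels[j]  iff  X_i and X_j share a cell A_r

# toy usage with placeholder quantile estimates
rng = np.random.default_rng(2)
q_hat = rng.gamma(2.0, 1.0, size=50)
print(quantile_partition(q_hat, n_bins=4))
```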

We propose either the set \hat{\mathbf{C}}_{n}(x) (as defined in (7)) or one of the algorithms presented in [30]. According to Theorem 25 in [30], to establish the asymptotic conditional validity of this proposal (in the sense of (10)), it is necessary to assume some smoothness of H(z,x) (see Assumption 23 in [30]), as well as the compactness of \mathcal{M} and the existence of sequences \eta_{n}=o(1) and \rho_{n}=o(1) satisfying

\mathbb{P}\Biggl(\mathbb{E}\Bigg[\sup_{y\in\mathcal{M}}\bigl(\hat{p}(y|X)-p(y|X)\bigr)^{2}\;\Big|\;\hat{p}\Bigg]\;\geq\;\eta_{n}\Biggr)\;\leq\;\rho_{n}.

This condition indeed holds in our case; however, proving it in detail would require rewriting the proof of Theorem 1 in [12]. We leave this verification to the reader. As in the non-compact case, the consistency result obtained is considerably weaker than (9).

Since the computational burden of approximating \hat{\mathbf{C}}_{n}(x) is high, [61] proposes replacing that set with \mathbf{C}_{n}^{+}(x), see (11), which, by including \hat{\mathbf{C}}_{n}(x), has coverage of at least 1-\alpha. Let A_{x}\in\mathcal{A} be the element that contains x. The construction proceeds in the three steps below (a code sketch follows the list).

  1. Let \hat{p}(x,y) be the joint density estimator based on the subsample (of cardinality n_{x}) of points for which X_{i}\in A_{x}.

  2. Let Z_{i}=(X_{i},Y_{i}) for i=1,\dots,n_{x}, and let Z_{(1)},Z_{(2)},\dots,Z_{(n_{x})} denote the sample ordered increasingly by \hat{p}(X_{i},Y_{i}).

  3. Let j=\lfloor n_{x}\alpha\rfloor and define

     \mathbf{C}_{n}^{+}(x)\;=\;\Bigl\{\,y:\;\hat{p}(x,y)\;\geq\;\hat{p}\bigl(X_{(j)},Y_{(j)}\bigr)\;-\;\frac{K(0)^{2}}{n_{x}\,h^{d+1}}\Bigr\}. (11)
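
A sketch of the three steps above, assuming a user-supplied joint density estimator and a finite grid of candidate points on \mathcal{M}; the value K(0) for a univariate Gaussian kernel is used purely for illustration.

```python
import numpy as np

def c_n_plus(x_new, X, Y, in_cell, joint_density, h, d, alpha, grid):
    """Grid approximation of the relaxed set C_n^+(x) of (11).

    in_cell : boolean mask of the sample points with X_i in A_x,
    joint_density(x, y) : any estimator p_hat of the joint density,
    h : bandwidth, d : covariate dimension, grid : candidate y values on M.
    """
    nx = in_cell.sum()
    scores = np.sort([joint_density(xi, yi)
                      for xi, yi in zip(X[in_cell], Y[in_cell])])
    j = max(int(np.floor(nx * alpha)) - 1, 0)   # 0-based index of the floor(nx*alpha)-th order statistic
    K0 = 1.0 / np.sqrt(2 * np.pi)               # K(0) for a Gaussian kernel (illustrative)
    slack = K0 ** 2 / (nx * h ** (d + 1))
    return [y for y in grid if joint_density(x_new, y) >= scores[j] - slack]
```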

7 Simulation examples

7.1 Toy model: Output on the sphere

We consider a regression model with output variable denoted by Y_{i} defined on the unit sphere S^{2}. The input variable X_{i} takes values in the interval [-1,1]. The model is given by the following probability distribution:

Y_{i}\sim\mathbb{F}\left(\frac{\eta+\beta X_{i}}{\|\eta+\beta X_{i}\|},\kappa\right),\quad i=1,2,\ldots,400.

Here, \mathbb{F} denotes the von Mises–Fisher distribution (see [45]), and X_{i}\sim\textrm{U}(-1,1) is an i.i.d. sample. For the simulation study, we set \eta=(1,0,0), \beta=(0,0,1), and \kappa=200.

To estimate the kernel density, we use the estimator given by Equation (4) with a bandwidth parameter h=0.5. Additionally, we partition the data into intervals A_{k}=\left(-1+(k-1)/2,-1+k/2\right), where k=1,\ldots,4.
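
A minimal simulation sketch of this data-generating mechanism, assuming a SciPy version providing scipy.stats.vonmises_fisher; the resulting sample can then be fed to the grid approximation of (7) sketched in Section 4.

```python
import numpy as np
from scipy.stats import vonmises_fisher   # assumes SciPy >= 1.11

rng = np.random.default_rng(3)
n = 400
eta, beta, kappa = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), 200.0

X = rng.uniform(-1, 1, n)
mu = eta + np.outer(X, beta)
mu /= np.linalg.norm(mu, axis=1, keepdims=True)     # mean directions on S^2
Y = np.array([np.asarray(vonmises_fisher(m, kappa).rvs(1, random_state=rng)).reshape(-1)
              for m in mu])

# index k of the interval A_k = (-1 + (k-1)/2, -1 + k/2) containing each X_i
k = np.minimum((X + 1) // 0.5, 3).astype(int) + 1
```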

Using the proposed method, we obtain a 90\% confidence set \hat{\mathbf{C}}_{400}(0), shown in green in Figure 1. In this particular case, the unobserved output was Y=(1,0,0), which is illustrated in purple in Figure 1. Points belonging to the boundary of the theoretical confidence set (1) for \alpha=0.1 are displayed in blue.

Figure 1: Red: a sample Y_{1},\ldots,Y_{400} on S^{2}. Green: the estimated 90\%-confidence set for Y if x=0. Purple: the point prediction for x=0, which is y=(1,0,0). Blue dots belong to the boundary of the theoretical confidence set (1) for \alpha=0.1.

7.2 Output in a Stiefel manifold.

It is common to summarize information from a set of variables using Principal Component Analysis (PCA) or Factor Analysis. This approach is frequently employed in constructing indices within the social sciences, as demonstrated by Vyas (2006). Subsequently, researchers often attempt to explain these indices using other covariates. To illustrate this scenario, we provide a ‘toy example’ using simulations.

We will define a regression model whose output lies on the Stiefel Riemannian manifold SO(3,2), represented by pairs of orthonormal vectors in \mathbb{R}^{3}. The construction of the regression is as follows.

The input variable, denoted by X, is uniformly distributed on the interval [0,1]. For each value of X, we sample 400 points from a 3-dimensional Gaussian distribution centered at the origin (0,0,0). The covariance matrix of this Gaussian distribution has eigenvalues 2, 1, and 1/2, along with the corresponding eigenvectors (2+X,2+2X,2-X), (-2+X,2-3X,-2+X), and (1,0,0), respectively.

After generating these 400 points for each value of X, the output Y\in SO(3,2) is obtained by applying a PCA to these points and retaining only the first two principal directions. To conduct the simulation analysis, we sample 500 points, denoted by X_{1},\dots,X_{500}.

In Figure 2, we display a sample of 500 output values, represented as gray arcs. The specific output value y_{0.1}\in SO(3,2), corresponding to the input X=0.1, is depicted as a purple dotted arc.

To estimate the density, we use the estimator given by Equation (4), with a bandwidth parameter h=1. Additionally, we divide the data into intervals A_{k}=\left((k-1)/5,k/5\right), where k=1,\ldots,5. It is important to note that the SO(3,2) data are embedded in \mathbb{R}^{6} (D=6), and we consider the Euclidean distance in this space. The dimension of the submanifold in this case is \ell=3.

To visualize the confidence set, we draw 10000 points within SO(3,2). The points falling within the confidence set are highlighted in orange in Figure 2.

Figure 2: A uniform sample of 10000 points in the 95\% confidence set (orange arcs) for Y\in SO(3,2) if x=0.1. The sample value for x=0.1 is the 3\times 2 matrix y_{x} whose first column is (0.62,0.5,0.6) and second column is (-0.3,0.86,-0.44), which is depicted in purple. The observations in the output sample are represented by gray arcs.

8 Real-data examples

8.1 Example 1: Regression between cylindrical manifolds with application to wind modeling

In this study we address a problem of substantial practical relevance for the development of wind energy infrastructure in Uruguay. Specifically, the goal is to construct prediction sets for wind intensity R and direction \theta at a meteorological station located in Montevideo, based on simultaneous observations from a nearby station in Maldonado. These variables together form a point on a cylinder \mathcal{D}=S^{1}\times\mathbb{R}^{+}\subset\mathbb{R}^{3}, where S^{1} represents the directional component (wind angle) and \mathbb{R}^{+} the non-negative intensity. Thus, the regression problem consists of learning a mapping from one cylindrical manifold to another.

This task is not only of theoretical interest due to the non-Euclidean structure of the input and output spaces, but also of strategic importance for national energy planning. Accurate prediction of wind behavior at candidate sites is a critical step in the optimal placement of wind turbines, which are a major source of renewable energy in Uruguay’s energy matrix.

We focus on moderate and extreme wind regimes (within the range 0 to 20 m/s), using hourly meteorological data recorded during the month of July between 2008 and 2016. The data were collected at two locations: the Laguna del Sauce Weather Station in Maldonado (Station 1) and the Carrasco International Airport Weather Station in Montevideo (Station 2), which are approximately 84 kilometers apart. To ensure reliability, only time points for which both stations reported valid measurements were retained, resulting in a total of 5,362 matched observations.

Our objective is to construct conformal prediction sets for wind conditions at Station 2 given observations from Station 1. These sets aim to quantify predictive uncertainty while respecting the cylindrical geometry of the variables involved, offering robust tools for risk assessment and decision-making in wind farm siting.

Figures 3 and 4 show that there is a correlation in both wind direction and intensity between the two stations. The \xi-correlation, see [11], between the intensities is 0.45. In the case of the directions, the angular correlation is 0.63. The angular correlation is the linear correlation between the variables \sin\left(\theta_{1}-\bar{\theta}_{1}\right) and \sin\left(\theta_{2}-\bar{\theta}_{2}\right), see [31].
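
The angular correlation used above can be computed as follows; here \bar{\theta} is taken to be the circular mean direction, which is an assumption on the centering used in [31].

```python
import numpy as np

def angular_correlation(theta1, theta2):
    """Linear correlation between sin(theta1 - mean direction) and
    sin(theta2 - mean direction), cf. [31]."""
    def mean_direction(theta):
        return np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())
    s1 = np.sin(theta1 - mean_direction(theta1))
    s2 = np.sin(theta2 - mean_direction(theta2))
    return np.corrcoef(s1, s2)[0, 1]

# sanity check: a constant angular shift gives correlation equal to 1
t = np.random.default_rng(4).uniform(0, 2 * np.pi, 100)
print(angular_correlation(t, t + 0.3))
```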

Figure 3: Scatter plot between the angular directions (in radians) of the winds recorded at Station 1 and Station 2 in the months of July between 2008 and 2016.
Figure 4: Scatter plot between the intensities (in m/s) of the winds recorded at Station 1 and Station 2 in the months of July between 2008 and 2016.

In this example, the partition is

A_{i,j}=\Big\{(\theta,R)\in\mathcal{D}:\theta\in[i\pi/4,(i+1)\pi/4]\textrm{ and }R\in[4j,4(j+1)]\Big\}

with i=0,1,\ldots,7 and j=0,1,\ldots,4. For the kernel density estimator we choose h_{k}=0.4.
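
A small helper, assuming angles in [0,2\pi) and intensities below 20 m/s, that returns the indices (i,j) of the cell A_{i,j} containing an observation:

```python
import numpy as np

def cylinder_cell(theta, R):
    """Indices (i, j) of the partition element A_{i,j} containing (theta, R):
    angular sectors of width pi/4 and intensity bands of width 4 m/s."""
    i = int(theta // (np.pi / 4)) % 8
    j = min(int(R // 4), 4)
    return i, j

print(cylinder_cell(2.3, 5.1))   # cell containing the (2.3 rad, 5.1 m/s) observation from Station 1
```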

Figure 5 shows the confidence set (at 80\%) obtained by our method for (\theta_{1},R_{1})=(2.3\textrm{ radians},5.1\textrm{ m/s}), recorded at Station 1 on 2008-07-01 at 7 pm. On the same date and time, the data recorded at Station 2 were (\theta_{2},R_{2})=(2.4\textrm{ radians},6.6\textrm{ m/s}) (point depicted in purple in Figure 5).

Figure 5: Winds (not exceeding 20 m/s) at Station 2 in the months of July between 2008 and 2016 (red). The green area is the 80\% confidence set for the data (2.3\textrm{ radians},5.1\textrm{ m/s}) recorded at Station 1 on 2008-07-01 at 7 pm. The purple dot represents the wind recorded at the same time at Station 2, (\theta_{2},R_{2})=(2.4\textrm{ radians},6.6\textrm{ m/s}).

8.2 Example 2: Regression on the simplex with application to multiclass classification

Regression models whose outputs lie in the (d-1)-dimensional unit simplex \Delta^{d-1} play an essential role in many applied contexts (see, for example, [3, 48]). In supervised classification, a variety of machine-learning algorithms produce vectors of class-membership probabilities, i.e., elements of \Delta^{d-1}, which we denote by Y_{i}. Here, each Y_{i} is obtained by using Random Forest (RF). Because these vectors reside in \Delta^{d-1}, their components naturally estimate the probability of membership in each class. The final class prediction is then chosen as the one with the highest estimated probability, equivalent to identifying the Voronoi cell of the simplex in which Y_{i} falls (see Figure 6).

Our goal is to construct a 100(1-\alpha)\% confidence region for observations on the simplex \Delta^{d-1} by adapting Algorithm 1 from [30] (CD-split) to this setting. In particular, we follow [29] to estimate the conditional density, employing an appropriate Fourier-type basis on the simplex, such as the Bernstein polynomial basis [19, 23].

The intersection of this confidence region with the Voronoi tessellation of \Delta^{d-1} yields the conformal class set, i.e., the subset of classes that are statistically consistent with the observation at level \alpha. Moreover, this approach provides insight into how the predictive uncertainty is distributed among the classes, by analyzing the proportion of the confidence region that falls within each Voronoi cell.
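
A sketch of this last step: the Voronoi cell of a probability vector on the simplex (with the class vertices as sites) is determined by its largest entry, so the decomposition of a conformal region among classes reduces to counting which cell each point of the region falls in. The region points and class names below are illustrative.

```python
import numpy as np

def class_decomposition(region_points, class_names):
    """Fraction of a conformal region on the simplex lying in each Voronoi
    cell; the cell of a probability vector is given by its largest component
    (equivalently, its nearest simplex vertex)."""
    cells = np.argmax(region_points, axis=1)
    return {c: float(np.mean(cells == k)) for k, c in enumerate(class_names)}

# toy usage with a few hypothetical points of a conformal region on Delta^2
region = np.array([[0.85, 0.05, 0.10], [0.70, 0.05, 0.25], [0.45, 0.10, 0.45]])
print(class_decomposition(region, ["bus", "opel", "van"]))
```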

Figure 6: Test set predictions for vehicle classification. Colors denote the true class labels, and Voronoi regions indicate the predicted class according to Random Forest.

An application of the proposed methodology is carried out on a real-world dataset, where the covariate space is 5-dimensional and the response variable lies on \Delta^{2}, representing class probabilities obtained, as mentioned above, by means of RF. To this end, we consider the publicly available Vehicle dataset from the mlbench package in R [42].

The Vehicle dataset originates from a benchmark study conducted at the Turing Institute in the early 1990s. It was designed to evaluate the performance of classification algorithms in distinguishing between different types of road vehicles based on geometric features extracted from digitized images of their silhouettes. Each observation in the dataset corresponds to a vehicle image and is characterized by 5 numerical covariates, such as aspect ratio, edge count, and moment-based shape descriptors.

The categorical response variable indicates the type of vehicle. For the purposes of this study, we restrict attention to a subset of the data containing three of the four classes (bus, opel, and van), so that the associated output probabilities lie on \Delta^{2}. This structure is ideal for illustrating our simplex-based predictive modelling approach. The dataset contains a total of 628 observations, and its moderate size, combined with its multi-class nature, makes it well-suited for benchmarking methods involving compositional or probabilistic responses.

The dataset is randomly split into two equal parts: 50% of the observations are used for training the model, and the remaining 50% are reserved as a test set for evaluating predictive performance.

Figure 7 displays the test data projected onto \Delta^{2}. Each point corresponds to a predicted probability vector, and its color indicates the true class label. The predicted class for a point is given by the Voronoi cell that contains it.

We assess confidence levels of 90\% and 95\% for the covariate vector \mathbf{x}_{n+1}=(84,45,66,150,65). The corresponding class probabilities predicted by the Random Forest are \mathbf{y}_{n+1}=(0.845,0.01,0.145), highlighted in purple in Figure 7. The resulting class band, restricted to the “bus” and “van” categories, offers a detailed decomposition of predictive uncertainty by indicating the fraction of the confidence region attributable to each class. At the 95\% confidence level, approximately 93\% of the band resides within the “bus” region, while the remaining 7\% falls within the “van” region.

Figure 7: Left panel: 90\% conformal prediction region; right panel: 95\% conformal prediction region for y_{n+1}=(0.845,\ 0.01,\ 0.145) (shown in green). The background colors represent the Voronoi regions of the predicted classes over the 2-dimensional simplex \Delta^{2}. Classification was performed using a Random Forest model.

9 Conclusions and future work

We work within the conformal inference framework, where the response variable takes values on a Riemannian manifold while the covariates lie in \mathbb{R}^{d}. Under assumptions analogous to those in [61], we extend their results to show that the conditional oracle set \mathbf{C}_{P}(x) can be estimated consistently. A central tool in our analysis is the almost sure, uniform consistency of the classical kernel density estimator on manifolds, as proved in [12]. We also cover the cases of manifolds with boundary and non-compact manifolds, and we discuss computational strategies for handling high-dimensional input data.

Our theoretical contributions are demonstrated through two simulation studies and two real-world data examples.

Finally, one could explore alternative nonconformity scores beyond the KDE-based criterion we employ—for example, quantile-based or regression-based scores—each of which demands a nontrivial adaptation of the Euclidean theory to the manifold setting.

Appendix

Proof of Theorem 1.

Since n_{k}h_{n_{k}}^{\ell+1}/\log(n_{k})\to\infty, from Theorem 1 of [12] we have that, for all \epsilon>0, with probability one, for all n large enough,

\frac{c_{n_{k}}}{h_{n_{k}}}\sup_{y\in\mathcal{M}_{n_{k}}}|\hat{p}(y|A_{k})-p(y|A_{k})|\leq\epsilon. (12)

Note that the index n_{0} for which (12) holds depends on the choice of k. However, from (5), we know that with probability one, n_{k}\geq\gamma_{n} for all k when n is sufficiently large. Since h_{n}/c_{n}\to 0 monotonically, it follows that for all k, h_{n_{k}}/c_{n_{k}}\leq h_{\gamma_{n}}/c_{\gamma_{n}}. Hence, we can conclude that (8) holds for all k. ∎

9.1 Proof of Theorem 2

The proof of Theorem 2 is based on some technical lemmas. To state them, let us define L_{x}(t) as the set of y such that p(y|x)\geq t, and L_{x}^{l}(t) as the set of y such that p(y|x)\leq t. We also define \hat{L}_{x}(t) and \hat{L}_{k}^{l}(t) as the corresponding sets for \hat{p}(\cdot|A_{k}), given by (4).

Lemma 1.

Under the hypotheses of Theorem 1, assume also H0 to H3. Let R_{n}(x)=\sup_{y\in\mathcal{M}_{\gamma_{n}}}|\hat{p}(y|A_{k})-p(y|x)|. Then, with probability one, for n sufficiently large, we have

\sup_{x\in\textrm{supp}(P_{X})}R_{n}(x)\leq h_{\gamma_{n}}/c_{\gamma_{n}}+Lw_{n}\sqrt{d}, (13)

where \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor and \inf_{x\in\mathcal{M}_{\gamma_{n}}}\rho(x,\partial\mathcal{M})=c_{\gamma_{n}}.

Proof.

From (5), we can assume that n is large enough so that for all k, b_{1}nw_{n}^{d}\leq n_{k}\leq b_{2}nw_{n}^{d}. From (8), we obtain, with probability one, for n large enough,

\sup_{y\in\mathcal{M}_{\gamma_{n}}}|\hat{p}(y|A_{k})-p(y|A_{k})|\leq h_{n_{k}}/c_{n_{k}}. (14)

Using assumption H3 and (5), we have

\sup_{x\in\textrm{supp}(P_{X})}R_{n}(x)\leq\sup_{x\in\textrm{supp}(P_{X})}\sup_{y\in\mathcal{M}_{\gamma_{n}}}|\hat{p}(y|A_{k})-p(y|x)|
\leq\sup_{x\in\textrm{supp}(P_{X})}\sup_{y\in\mathcal{M}_{\gamma_{n}}}|\hat{p}(y|A_{k})-p(y|A_{k})|+\sup_{x\in\textrm{supp}(P_{X})}\sup_{y\in\mathcal{M}_{\gamma_{n}}}|p(y|A_{k})-p(y|x)|
\leq h_{n_{k}}/c_{n_{k}}+L\sup_{x^{\prime}\in A_{k}}|x-x^{\prime}|
\leq h_{\gamma_{n}}/c_{\gamma_{n}}+Lw_{n}\sqrt{d},

where the last inequality follows from (5) and the fact that

\inf_{x\in\mathcal{M}_{\gamma_{n}},\,y\in\mathcal{M}}|x-y|\geq\rho(\mathcal{M}_{\gamma_{n}},\partial\mathcal{M}).

Thus, we have shown (13). ∎

Lemma 2.

Assume H0 to H4. Let w_{n}=(\log(n)/n)^{1/(d+2)}, and assume that \gamma_{n} and h_{n} are as in Lemma 1. Then, there exists C>0 such that, with probability one, for n large enough,

\sup_{k}|\hat{p}(\cdot|A_{k})|_{\infty}\leq C.
Proof.

Recall that we are assuming K to be a Gaussian kernel. We assume that n is large enough so that (5) holds. Following the proof of Theorem 1 of [12], we define

W_{j}(y)=(1/(n_{k}h_{n_{k}}^{\ell}))\mathbbm{1}_{\{X_{j}\in A_{k}\}}K(|Y_{j}-y|/h_{n_{k}}),

V_{j}(y)=W_{j}(y)-\mathbb{E}(W_{j}(y)), and S_{n}(y)=\sum_{j=1}^{n}V_{j}(y). Let \eta>\ell+1 and consider a covering B(p_{1},h^{\eta}),\dots,B(p_{l},h^{\eta}) of \mathcal{M}. Then,

\sup_{y}|S_{n}(y)|\leq\max_{1\leq j\leq l}\sup_{y\in B(p_{j},h^{\eta})}\frac{1}{n_{k}h_{n_{k}}^{\ell}}|S_{n}(y)-S_{n}(p_{j})|+\max_{1\leq j\leq l}\frac{1}{n_{k}h_{n_{k}}^{\ell}}|S_{n}(p_{j})|=I_{1}+I_{2}. (15)

Since K is Lipschitz, we have I_{1}\leq ch^{\eta-\ell-1} for some constant c>0. Applying Bernstein’s inequality to S_{n}(p_{j}) for fixed p_{j} and A_{k}, we get

\mathbb{P}\Big(\frac{1}{n_{k}h_{n_{k}}^{\ell}}|S_{n}(p_{j})|>\epsilon\Big)\leq c_{1}\exp(-c_{2}n_{k}h_{n_{k}}^{\ell}\epsilon^{2}),

where c_{1} and c_{2} are positive constants.

Then, if \kappa_{n}=\lfloor 3b_{2}nw_{n}^{d}\rfloor,

\mathbb{P}\Big(\sup_{k}\sup_{y}|S_{n}(y)|>\epsilon\Big)\leq c_{3}w_{n}^{-d}\exp(-c_{3}nw_{n}^{d}h_{\kappa_{n}}^{\ell}).

By the Borel–Cantelli lemma, it follows that \sup_{k}\sup_{y}|S_{n}(y)|\to 0 almost surely.

Next, we will bound \sup_{y}\mathbb{E}(V_{j}(y)). First, we bound 1/h_{n_{k}}^{\ell}\leq 1/h_{\gamma_{n}}^{\ell},

\mathbb{E}(V_{j}(y))=\mathbb{E}[\mathbb{E}(V_{j}(y)|X)]\leq\int_{[0,1]^{d}}\int_{\mathcal{M}}\frac{1}{h_{\gamma_{n}}^{\ell}}K(|z-y|/h_{\kappa_{n}})p(z|x)\nu(dz)p_{X}(x)dx.

Since p_{X}(x) and p(z|x) are bounded for all z and x, it is enough to bound from above

\sup_{y}\int_{\mathcal{M}}\frac{1}{h_{\gamma_{n}}^{\ell}}K(|z-y|/h_{\gamma_{n}})\nu(dz),

which is bounded because we assumed that K is a Gaussian kernel. ∎

The following lemma is a slightly modified version of Lemma 6 of [61].

Lemma 3.

Assume H0 to H5. Let \gamma_{n}=\lfloor b_{1}nw_{n}^{d}/2\rfloor, h_{\gamma_{n}}\to 0, and let \mathcal{M}_{\gamma_{n}}\subset\mathcal{M} be a sequence of closed sets such that \inf_{x\in\mathcal{M}_{\gamma_{n}}}\rho(x,\partial\mathcal{M})/h_{\gamma_{n}}\to\infty. Then, for any \lambda>0, there exists \xi_{2,\lambda} such that, for n large enough,

\mathbb{P}\Bigg(\sup_{x\in\aleph_{n}}V_{n}(x)>\xi_{2,\lambda}w_{n}\Bigg)=\mathcal{O}(n^{-\lambda}), (16)

where V_{n}(x)=\sup_{t\geq t_{0}}|\hat{P}(L_{x}^{l}(t)|A_{k})-P(L_{x}^{l}(t)|x)|. Here, \hat{P}(\cdot|A_{k}) is the empirical distribution of Y|X\in A_{k}, t_{0} is given in H4, and w_{n}=(\log(n)/n)^{1/(d+2)}.

Proof.

We will provide a sketch of the proof of this lemma, since it follows essentially the same idea used to prove Lemma 6 of [61].

Let x\in A_{k} be fixed. Note that \{L_{x}^{l}(t):t\geq t_{0}\} is a nested class of sets with Vapnik–Chervonenkis dimension 2. Then, for all B>0 and \lambda>0,

\mathbb{P}\Bigg(\sup_{t}\Big|\hat{P}(L_{x}^{l}(t)|A_{k})-P(L_{x}^{l}(t)|A_{k})\Big|>B\sqrt{w_{n}}\Bigg)=\mathcal{O}(n^{-\lambda}),

where we used that w_{n}=(\log(n)/n)^{1/(d+2)}. On the other hand,

|P(L_{x}^{l}(t)|A_{k})-P(L_{x}^{l}(t)|x)|\leq Lw_{n}\nu(L_{x}(t))\sqrt{d}\leq Lw_{n}\nu(L_{x}(t_{0}))\sqrt{d}\leq L\nu(\mathcal{M})\sqrt{d}\,w_{n}.

Let x^{\prime}\in A_{k}. Then

|\hat{P}(L^{l}_{x^{\prime}}(t)|A_{k})-P(L^{l}_{x^{\prime}}(t)|x^{\prime})|\leq|\hat{p}(\cdot|A_{k})|_{\infty}\nu(L_{x}^{l}(t)\triangle L^{l}_{x^{\prime}}(t))+V_{n}(x)+|G_{x}(t)-G_{x^{\prime}}(t)|,

where G_{x}(t)=P(L_{x}^{l}(t)|x). From (30) in [61],

|G_{x}(t)-G_{x^{\prime}}(t)|\leq c_{3}w_{n}^{1\wedge\gamma}.

Here, c_{3} is a positive constant and \gamma is as in H4. From (29) in [61], \nu(L_{x}^{l}(t)\triangle L^{l}_{x^{\prime}}(t))\leq c_{4}w_{n}^{\gamma} for some positive constant c_{4}. Lastly, \sup_{k}|\hat{p}(\cdot|A_{k})|_{\infty} is bounded, for n large enough, by Lemma 2. ∎

We recall Lemma 8 of [61].

Lemma 4.

Fix \alpha>0, t_{0}>0 and \epsilon_{0}>0. Suppose that p is a density function that satisfies H4, and \hat{p} is an estimator such that |\hat{p}-p|_{\infty}<v_{1}. Let \hat{P} be a probability measure satisfying \sup_{t\geq t_{0}}|\hat{P}(L^{l}(t))-P(L^{l}(t))|<v_{2}. Define

\hat{t}^{\alpha}=\inf\{t\geq 0:\hat{P}(\hat{L}^{l}(t))\geq\alpha\}.

Assume that v_{1} and v_{2} are sufficiently small so that v_{1}+c_{1}^{-1/\gamma}v_{2}^{1/\gamma}\leq t^{\alpha}-t_{0} and c_{1}^{-1/\gamma}v_{2}^{1/\gamma}\leq\epsilon_{0}, where c_{1} and \gamma are the constants given in H4. Then

|\hat{t}^{\alpha}-t^{\alpha}|\leq v_{1}+c_{1}^{-1/\gamma}v_{2}^{1/\gamma}.

Moreover, for any \tilde{t}^{\alpha} such that |\tilde{t}^{\alpha}-\hat{t}^{\alpha}|\leq v_{3}, if 2v_{1}+c_{1}^{-1/\gamma}v_{2}^{1/\gamma}+v_{3}\leq\epsilon_{0}, then there are constants \xi_{1},\xi_{2} and \xi_{3} such that \nu(\hat{L}(\tilde{t}^{\alpha})\triangle L(t^{\alpha}))\leq\xi_{1}v_{1}+\xi_{2}v_{2}+\xi_{3}v_{3}.

Proof of Theorem 2

In what follows we assume that n is sufficiently large to satisfy (5). We will consider a sequence of compact sets \mathcal{M}_{n}\subset\mathcal{M} such that \inf_{x\in\mathcal{M}_{n}}\rho(x,\partial\mathcal{M})=c_{n}>0 and \sup_{x\in\partial\mathcal{M}}\rho(x,\mathcal{M}_{n})\leq 2c_{n}, where c_{n} is chosen such that h_{\gamma_{n}}/c_{\gamma_{n}}^{2}\to 0.

Throughout the proof we denote by a a generic positive constant. Write

\nu(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x))=\nu\big(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap\mathcal{M}_{\gamma_{n}}\big)+\nu\Big(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap(\mathcal{M}\setminus\mathcal{M}_{\gamma_{n}})\Big),

where \gamma_{n} is as in Lemma 3 and \mathcal{M}_{\gamma_{n}}\subset\mathcal{M} is a sequence of closed sets such that \inf_{x\in\mathcal{M}_{\gamma_{n}}}\rho(x,\partial\mathcal{M})=c_{\gamma_{n}} satisfies h_{\gamma_{n}}/c_{\gamma_{n}}\to 0.

We will first prove that there exists a>0 such that

\nu(\mathcal{M}\setminus\mathcal{M}_{\gamma_{n}})\leq ac_{\gamma_{n}}. (17)

Since \mathcal{M} is \mathcal{C}^{2}, it has positive reach, denoted by \tau_{\mathcal{M}}, as shown in Proposition 14 of [55]. Let A_{r}=\{x\in\mathcal{M}:\rho(x,\partial\mathcal{M})<r\}. From

\sup_{x\in\partial\mathcal{M}}\rho(x,\mathcal{M}_{n})\leq 2c_{n},

it follows that \mathcal{M}\setminus\mathcal{M}_{\gamma_{n}}\subset A_{2c_{\gamma_{n}}}.

We write m_{n}=2\tau_{\mathcal{M}}\sin(c_{\gamma_{n}}/\tau_{\mathcal{M}}), and from Proposition A.1 in [1], we know that \Upsilon_{\gamma_{n}}:=\{B(x,m_{n}):x\in\partial\mathcal{M}\} covers A_{2c_{\gamma_{n}}}. Since \partial\mathcal{M} has finite Minkowski content due to its positive reach (see Corollary 3 of [4]), and \sin(x)\approx x when x\to 0, there exists a>0 such that |\Upsilon_{\gamma_{n}}|/c_{\gamma_{n}}^{D-\ell-1}\leq a for all sufficiently large n, where |\Upsilon_{\gamma_{n}}| is the D-dimensional Lebesgue measure of \Upsilon_{\gamma_{n}}.

Thus, A_{2c_{\gamma_{n}}} can be covered by at most ac_{\gamma_{n}}^{-\ell+1} balls of radius 2m_{n} centered at \partial\mathcal{M}, and the \nu-measure of each of these balls is bounded from above by Bc_{\gamma_{n}}^{\ell} by Corollary 1 of [2]. From this, (17) follows.

Now, to bound \nu(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap\mathcal{M}_{\gamma_{n}}), we follow the approach used in the proof of Theorem 1 in [61]. We apply Lemma 4 to the density function p(y|x) and the empirical measure \hat{P}(\cdot|A_{k}), as well as the estimated density function \hat{p}(\cdot|A_{k}). Here, we provide a sketch of the main changes made to the proof. We denote by \hat{L} the upper level set of \hat{p}(\cdot|A_{k}).

Let \{i_{1},\dots,i_{n_{k}}\}=\{i:1\leq i\leq n,X_{i}\in A_{k}\}. From Lemma 3 in [61], conditioning on (i_{1},\dots,i_{n_{k}}),

\hat{L}\Big\{\hat{p}(X_{(i_{\alpha})},Y_{(i_{\alpha})}|A_{k})\Big\}\subset\hat{\mathbf{C}}_{n}(x)\subset\hat{L}\Big\{\hat{p}(X_{(i_{\alpha})},Y_{(i_{\alpha})}|A_{k})-(n_{k}h_{n_{k}})^{-1}\psi_{K}\Big\},

where \psi_{K}=\sup_{x,x^{\prime}}|K(x)-K(x^{\prime})| and (X_{(i_{\alpha})},Y_{(i_{\alpha})}) is the element of

\{(X_{i_{1}},Y_{i_{1}}),\dots,(X_{i_{n_{k}}},Y_{i_{n_{k}}})\}

such that \hat{p}(Y_{i_{\alpha}}|A_{k}) has rank \lfloor n_{k}\alpha\rfloor. Let \hat{t}^{\alpha}=\hat{p}(X_{(i_{\alpha})},Y_{(i_{\alpha})}). It is easy to check that

\hat{t}^{\alpha}=\inf\{t\geq 0:\hat{P}(L^{l}(t)|A_{k})\geq\alpha\}.

Consider the event

E=\Big\{\sup_{x}R_{n}(x)\leq 2\frac{h_{\gamma_{n}}}{c_{\gamma_{n}}},\ \sup_{x}V_{n}(x)\leq\xi_{2,\lambda}w_{n}\Big\}.

From nw_{n}^{d}h_{\gamma_{n}}^{\ell+3}/\log(\gamma_{n})\to\infty, it follows that w_{n}/h_{\gamma_{n}}\to 0. Then, from Lemmas 1 and 3, \mathbb{P}(E^{c})=\mathcal{O}(n^{-\lambda}). Let r_{n}:=h_{\gamma_{n}}/c_{\gamma_{n}}. Since w_{n}/h_{\gamma_{n}}\to 0, the event

E_{1}=\Big\{\sup_{x}R_{n}(x)\leq 2r_{n},\ \sup_{x}V_{n}(x)\leq\xi_{2,\lambda}r_{n}\Big\}

is such that, for all n large enough, \mathbb{P}(E_{1}^{c})=\mathcal{O}(n^{-\lambda}). From Lemma 4 with v_{1}=2r_{n} and v_{2}=\xi_{2,\lambda}r_{n}, we obtain that, for n large enough,

\mathbb{P}\Big(\sup_{x}\nu\Big(\hat{L}(\hat{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big)\geq\xi_{\lambda}r_{n}\Big)=\mathcal{O}(n^{-\lambda}). (18)

Let \tilde{t}^{\alpha}=\hat{t}^{\alpha}-(\gamma_{n}h_{\gamma_{n}})^{-1}\psi_{K} and v_{3}=(\gamma_{n}h_{\gamma_{n}})^{-1}\psi_{K}. From nw_{n}^{d}h_{\gamma_{n}}^{\ell+3}/\log(\gamma_{n})\to\infty, it follows that v_{3}\to 0. Applying Lemma 4, we get that

\mathbb{P}\Big(\sup_{x}\nu\Big(\hat{L}(\tilde{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big)\geq\xi^{1}_{\lambda}r_{n}+\xi^{2}_{\lambda}v_{3}\Big)=\mathcal{O}(n^{-\lambda}).

Since v_{3}\leq r_{n} for n large enough, we get that

\mathbb{P}\Big(\sup_{x}\nu\Big(\hat{L}(\tilde{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big)\geq 2\xi^{1}_{\lambda}r_{n}\Big)=\mathcal{O}(n^{-\lambda}). (19)

Lastly, we write

\nu(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap\mathcal{M}_{\gamma_{n}})\leq\nu\Big(\hat{L}(\hat{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big)+\nu\Big(\hat{L}(\tilde{t}^{\alpha})\triangle L_{x}(t^{\alpha})\cap\mathcal{M}_{\gamma_{n}}\Big),

and then the theorem follows from (17), (18) and (19).

Proof of Theorem 3

To prove (10), fix any ϵ>0\epsilon>0 and choose a sufficiently large compact smooth submanifold 𝒦=𝒦(ϵ)\mathcal{K}=\mathcal{K}(\epsilon)\subset\mathcal{M} fulfilling H0, such that (Y𝒦c)<ϵ.\mathbb{P}(Y\in\mathcal{K}^{c})<\epsilon. The existence of such a submanifold follows from the existence of smooth exhaustion functions (see Proposition 2.28 in [35]). Then, by a simple union bound,

(Yn+1𝐂^n(Xn+1)𝐂P(Xn+1))(Yn+1(𝐂^n(Xn+1)𝐂P(Xn+1))𝒦)+ϵ.\mathbb{P}\Bigl{(}Y_{n+1}\in\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\Bigr{)}\leq\\ \mathbb{P}\Bigl{(}Y_{n+1}\in\Bigl{(}\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\Bigr{)}\cap\mathcal{K}\Bigr{)}+\epsilon.

Now, fixing $\mathcal{K}$, we take, as in the proof of Theorem 2, a sequence of closed sets $\mathcal{K}_{\gamma_{n}}\subset\mathcal{K}$ such that

$$\inf_{x\in\mathcal{K}_{\gamma_{n}}}\rho(x,\partial\mathcal{K})=c_{\gamma_{n}}\to 0\quad\text{and}\quad\frac{h_{\gamma_{n}}}{c_{\gamma_{n}}}\to 0\quad\text{as }n\to\infty.$$
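
For concreteness, one admissible choice (a sketch, not necessarily the construction used in the proof of Theorem 2) is the inner parallel set

$$\mathcal{K}_{\gamma_{n}}=\{x\in\mathcal{K}:\rho(x,\partial\mathcal{K})\geq c_{\gamma_{n}}\},\qquad c_{\gamma_{n}}:=\sqrt{h_{\gamma_{n}}},$$

which satisfies $\inf_{x\in\mathcal{K}_{\gamma_{n}}}\rho(x,\partial\mathcal{K})\geq c_{\gamma_{n}}$ and $h_{\gamma_{n}}/c_{\gamma_{n}}\to 0$ whenever $h_{\gamma_{n}}\to 0$ and this set is nonempty.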

Then we decompose $\mathbb{P}\bigl(Y_{n+1}\in\bigl(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\bigr)\cap\mathcal{K}\bigr)$ as the sum

$$\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\mathcal{K}_{\gamma_{n}}\Big)+\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\big(\mathcal{K}\setminus\mathcal{K}_{\gamma_{n}}\big)\Big).$$

Since $\mathcal{K}$ is compact and, for all $x$, $p(y|x)$ is a continuous function of $y$, for $n$ large enough we have

$$\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\big(\mathcal{K}\setminus\mathcal{K}_{\gamma_{n}}\big)\Big)<\epsilon.$$

Finally, the proof that

$$\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\mathcal{K}_{\gamma_{n}}\Big)\to 0$$

follows from the continuity of $p(y|x)$, the fact that $\mathcal{K}$ satisfies the hypotheses of Theorem 2, and the fact, established in the proof of Theorem 2, that

$$\nu\Big(\hat{\mathbf{C}}_{n}(x)\triangle\mathbf{C}_{P}(x)\cap\mathcal{K}_{\gamma_{n}}\Big)\to 0$$

as $n\to\infty$.
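
To spell out this last step (a sketch under a slightly stronger reading of the assumptions; the constant $M_{\mathcal{K}}$ is introduced here only for illustration), suppose $p(y|x)\leq M_{\mathcal{K}}<\infty$ for all $y\in\mathcal{K}$ and all $x$, as happens when the continuous conditional density is bounded on the compact set uniformly in $x$. Then

$$\mathbb{P}\Big(Y_{n+1}\in\big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\big)\cap\mathcal{K}_{\gamma_{n}}\Big)=\mathbb{E}\Big[\int_{(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1}))\cap\mathcal{K}_{\gamma_{n}}}p(y|X_{n+1})\,\nu(dy)\Big]\leq M_{\mathcal{K}}\,\mathbb{E}\Big[\nu\Big(\hat{\mathbf{C}}_{n}(X_{n+1})\triangle\mathbf{C}_{P}(X_{n+1})\cap\mathcal{K}_{\gamma_{n}}\Big)\Big],$$

and the right-hand side vanishes by the convergence above, the bound $\nu(\mathcal{K}_{\gamma_{n}})\leq\nu(\mathcal{K})<\infty$, and dominated convergence.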

Acknowledgment

The authors are grateful to Tyrus Berry for his insightful comments on the results obtained in his work with Timothy Sauer. The research of the first and third authors has been partially supported by grant FCE-3-2022-1-172289 from ANII (Uruguay), grant 22MATH-07 from MATH-AmSud (France-Uruguay), and grant 22520220100031UD from CSIC (Uruguay).

References

  • [1] Aamari, E., C. Aaron, and C. Levrard (2023). Minimax boundary estimation and estimation with boundary. Bernoulli, 29(4), 3334–3368.
  • [2] Aaron, C. and A. Cholaquidis (2020). On boundary detection. Ann. Inst. H. Poincaré Probab. Statist., 56(3), 2028–2050.
  • [3] Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall.
  • [4] Ambrosio, L., A. Colesanti, and E. Villa (2008). Outer Minkowski content for some classes of closed sets. Mathematische Annalen, 342(4), 727–748.
  • [5] Balasubramanian, V., S.-S. Ho, and V. Vovk (2014). Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes.
  • [6] Berenfeld, C. and M. Hoffmann (2021). Density estimation on an unknown submanifold. Electronic Journal of Statistics, 15(1), 2179–2223.
  • [7] Berry, T. and T. Sauer (2017). Density estimation on manifolds with boundary. Computational Statistics & Data Analysis, 107, 1–17.
  • [8] Best, M. J. (2010). Portfolio optimization. CRC Press.
  • [9] Bouzebda, S. and N. Taachouche (2024a). Oracle inequalities and upper bounds for kernel conditional U-statistics estimators on manifolds and more general metric spaces associated with operators. Stochastics: An International Journal of Probability and Stochastic Processes, 96(8), 2135–2198.
  • [10] Bouzebda, S. and N. Taachouche (2024b). Rates of the Strong Uniform Consistency with Rates for Conditional U-Statistics Estimators with General Kernels on Manifolds. Mathematical Methods of Statistics, 33(2), 95–153.
  • [11] Chatterjee, S. (2021). A new coefficient of correlation. Journal of the American Statistical Association, 116(536), 2009–2022.
  • [12] Cholaquidis, A., R. Fraiman, and L. Moreno (2022). Level set and density estimation on manifolds. Journal of Multivariate Analysis, 189, 104925.
  • [13] Cholaquidis, A., R. Fraiman, and L. Moreno (2023). Level sets of depth measures in abstract spaces. TEST, 1–16.
  • [14] Cholaquidis, A., R. Fraiman, F. Gamboa, and L. Moreno (2023). Weighted lens depth: Some applications to supervised classification. Canadian Journal of Statistics, 51(2), 652–673.
  • [15] Cleanthous, G., A. G. Georgiadis, G. Kerkyacharian, P. Petrushev, and D. Picard (2020). Kernel and wavelet density estimators on manifolds and more general metric spaces. Bernoulli, 26(3), 1832–1862.
  • [16] Cleanthous, G., A. G. Georgiadis, and E. Porcu (2022). Oracle inequalities and upper bounds for kernel density estimators on manifolds and more general metric spaces. Journal of Nonparametric Statistics, 34(4), 734–757.
  • [17] Diquigiovanni, J., M. Fontana, and S. Vantini (2021). Distribution-free prediction bands for multivariate functional time series: an application to the Italian gas market. arXiv preprint arXiv:2107.00527.
  • [18] Diquigiovanni, J., M. Fontana, and S. Vantini (2022). Conformal prediction bands for multivariate functional data. Journal of Multivariate Analysis, 189, 104879.
  • [19] Farouki, R. T., T. N. Goodman, and T. Sauer (2003). Construction of orthogonal bases for polynomials in Bernstein form on triangular and simplex domains. Computer Aided Geometric Design, 20(4), 209–230.
  • [20] Fong, E. and C. C. Holmes (2021). Conformal Bayesian computation. Advances in Neural Information Processing Systems, 34, 18268–18279.
  • [21] Fontana, M., S. Vantini, M. Tavoni, and A. Gammerman (2020). A Conformal Approach for Distribution-free Prediction of Functional Data. In Functional and High-Dimensional Statistics and Related Fields 5, pp. 83–90. Springer.
  • [22] Fontana, M., G. Zeni, and S. Vantini (2023). Conformal prediction: A unified review of theory and new challenges. Bernoulli, 29(1), 1–23.
  • [23] Ghosal, S. (2001). Convergence rates for density estimation with Bernstein polynomials. The Annals of Statistics, 29(5), 1264–1280.
  • [24] Guigui, N., N. Miolane, X. Pennec, et al. (2023). Introduction to Riemannian Geometry and Geometric Statistics: from basic theory to implementation with Geomstats. Foundations and Trends® in Machine Learning, 16(3), 329–493.
  • [25] Guo, C., G. Pleiss, Y. Sun, and K. Q. Weinberger (2017). On Calibration of Modern Neural Networks. International Conference on Machine Learning.
  • [26] Hall, P., J. Racine, and Q. Li (2004). Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association, 99(468), 1015–1026.
  • [27] Hong, Y., R. Kwitt, N. Singh, N. Vasconcelos, and M. Niethammer (2016). Parametric regression on the Grassmannian. IEEE transactions on pattern analysis and machine intelligence, 38(11), 2284–2297.
  • [28] Huang, Y. and G. Kou (2014). A kernel entropy manifold learning approach for financial data analysis. Decision Support Systems, 64, 31–42.
  • [29] Izbicki, R. and A. B. Lee (2017). Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics, 11(2), 2800–2831.
  • [30] Izbicki, R., G. Shimizu, and R. B. Stern (2022). Cd-split and hpd-split: Efficient conformal regions in high dimensions. Journal of Machine Learning Research, 23(87), 1–32.
  • [31] Jammalamadaka, S. R. and Y. R. Sarma (1988). A correlation coefficient for angular variables. Statistical theory and data analysis II, 349–364.
  • [32] Koh, P. W., P. Liang, and J. Zhu (2019). Understanding Black-Box Predictions via Influence Functions. International Conference on Machine Learning.
  • [33] Koh, P. W., P. Liang, A. Bekasov, J. Fu, R. Kondor, and J. Zhu (2020). Probabilistic Uncertainty Quantification for Deep Learning. Advances in Neural Information Processing Systems, 33.
  • [34] Kuleshov, A., A. Bernstein, and E. Burnaev (2018). Conformal prediction in manifold learning. In Conformal and Probabilistic Prediction and Applications, pp. 234–253. PMLR.
  • [35] Lee, J. M. (2013). Introduction to Smooth Manifolds (2nd ed.), Volume 218 of Graduate Texts in Mathematics. Springer.
  • [36] Lei, J. (2014). Classification with confidence. Biometrika, 101(4), 755–769.
  • [37] Lei, J. and L. Wasserman (2016). The Adaptive and the Honest: A Dichotomy in Conformal Prediction and Its Implications. The Annals of Statistics, 44(6), 2293–2323.
  • [38] Lei, L. and E. J. Candès (2021). Conformal inference of counterfactuals and individual treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5), 911–938.
  • [39] Lei, J., J. Robins, and L. Wasserman (2013). Distribution-free prediction sets. Journal of the American Statistical Association, 108(501), 278–287.
  • [40] Lei, J. (2014). Classification with confidence. Biometrika, 101(4), 755–769.
  • [41] Lei, J., A. Rinaldo, and L. Wasserman (2015). A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74, 29–43.
  • [42] Leisch, F., E. Dimitriadou, K. Hornik, D. Meyer, and A. Weingessel (2007). mlbench: Machine Learning Benchmark Problems. https://CRAN.R-project.org/package=mlbench. R package version 2.1-3.
  • [43] Liu, H. and L. Wasserman (2016). Conformal Prediction for Multi-Class Classification. Journal of Machine Learning Research, 17, 1–28.
  • [44] Lugosi, G. and M. Matabuena (2024). Uncertainty quantification in metric spaces. arXiv preprint arXiv:2405.05110.
  • [45] Mardia, K. V. and P. E. Jupp (2000). Directional Statistics. Wiley.
  • [46] Papadopoulos, H., T. Poggio, and S. Vempala (2002). Combining Classifiers via Confidence. In Advances in Neural Information Processing Systems, pp. 705–712.
  • [47] Papadopoulos, H., K. Proedrou, V. Vovk, and A. Gammerman (2002). Inductive confidence machines for regression. In Machine Learning: ECML 2002, pp. 345–356. Springer.
  • [48] Pawlowsky-Glahn, V., J. J. Egozcue, and R. Tolosana-Delgado (2015). Modeling and analysis of compositional data. John Wiley & Sons.
  • [49] Pennec, X., S. Sommer, and T. Fletcher (2019). Riemannian geometric statistics in medical image analysis. Academic Press.
  • [50] Petersen, A. and H.-G. Müller (2019). Fréchet regression for random objects with Euclidean predictors. arXiv preprint arXiv:1608.03012.
  • [51] Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. The Annals of Statistics, 855–881.
  • [52] Rigollet, P. and R. Vert (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli, 14, 1154–1178.
  • [53] Romano, Y., E. Patterson, and E. Candes (2019). Conformalized quantile regression. Advances in neural information processing systems, 32.
  • [54] Shafer, G. (2008). A Mathematical Theory of Evidence. Princeton University Press.
  • [55] Thäle, C. (2008). 50 years sets with positive reach–a survey. Surveys in Mathematics and its Applications, 3, 123–165.
  • [56] Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. The Annals of Statistics, 25(3), 948–969.
  • [57] Vovk, V., A. Gammerman, and G. Shafer (1998). Algorithmic Learning in a Random World. Machine Learning, 30(2-3), 119–138.
  • [58] Vovk, V., A. Gammerman, and G. Shafer (2005). Algorithmic Learning in a Random World. Springer-Verlag.
  • [59] Vovk, V. (2009). Conditional Validity, Exchangeability, and Calibration of Probability Forecasts. Journal of Machine Learning Research, 10, 987–1016.
  • [60] Vyas, S. and L. Kumaranayake (2006). Constructing socio-economic status indices: how to use principal components analysis. Health policy and planning, 21(6), 459–468.
  • [61] Lei, J. and L. Wasserman (2014). Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 71–96.
  • [62] Wäschle, K., M. Bobak, S. Pölsterl, S. Gehrmann, D. Dettmering, J. Hornegger, and A. Maier (2014). Conformal Prediction for Regression. IEEE Transactions on Medical Imaging, 33(2), 407–418.
  • [63] Wu, H.-T. and N. Wu (2022). Strong Uniform Consistency with Rates for Kernel Density Estimators with General Kernels on Manifolds. Information and Inference: A Journal of the IMA, 11(2), 781–799.
  • [64] Xu, Z., Z. Zhao, and W. B. Wu (2020). Conformal Prediction for Time Series. Journal of the American Statistical Association, 115(530), 1585–1594.