

Dimension Reduction for Fréchet Regression

Qi Zhang, Lingzhou Xue, and Bing Li
Department of Statistics, Pennsylvania State University
Abstract

With the rapid development of data collection techniques, complex data objects that are not in Euclidean space are frequently encountered in new statistical applications. The Fréchet regression model (Petersen & Müller 2019) provides a promising framework for regression analysis with metric space-valued responses. In this paper, we introduce a flexible sufficient dimension reduction (SDR) method for Fréchet regression to achieve two purposes: to mitigate the curse of dimensionality caused by high-dimensional predictors and to provide a visual inspection tool for Fréchet regression. Our approach is flexible enough to turn any existing SDR method for Euclidean $(X,Y)$ into one for Euclidean $X$ and metric space-valued $Y$. The basic idea is to first map the metric space-valued random object $Y$ to a real-valued random variable $f(Y)$ using a class of functions and then perform classical SDR on the transformed data. If the class of functions is sufficiently rich, then we are guaranteed to uncover the Fréchet SDR space. We show that such a class, which we call an ensemble, can be generated by a universal kernel ($cc$-universal kernel). We establish the consistency and asymptotic convergence rate of the proposed methods. The finite-sample performance of the proposed methods is illustrated through simulation studies for several commonly encountered metric spaces, including the Wasserstein space, the space of symmetric positive definite matrices, and the sphere. We illustrate the data visualization aspect of our method using human mortality distribution data from the United Nations Databases.

Keywords: Ensembled sufficient dimension reduction, Inverse regression, Statistical objects, Universal kernel, Wasserstein space.

1 Introduction

With the rapid development of data collection techniques, complex data objects that are not in Euclidean space are frequently encountered in new statistical applications, such as the graph Laplacians of networks, the covariance or correlation matrices for brain functional connectivity in neuroscience (Ferreira & Busatto, 2013), and probability distributions in CT hematoma density data (Petersen & Müller, 2019). These data objects, also known as random objects, do not obey the operations of a vector space with an inner product or a norm but instead reside in a general metric space. In a prescient paper, Fréchet (1948) proposed the Fréchet mean as a natural generalization of the expectation of a random vector. By extending the Fréchet mean to the conditional Fréchet mean, Petersen & Müller (2019) introduced the Fréchet regression model with random objects as the response and Euclidean vectors as predictors, which provides a promising framework for regression analysis with metric space-valued responses. Dubey & Müller (2019) showed the consistency of the sample Fréchet mean using the results of Petersen & Müller (2019), derived a central limit theorem for the sample Fréchet variance that quantifies the variation around the Fréchet mean, and further developed the Fréchet analysis of variance for random objects. Dubey & Müller (2020) designed a method for change-point detection and inference in a sequence of metric space-valued data objects.

The Fréchet regression of Petersen & Müller (2019) employs global least squares and local linear or polynomial regression to fit the conditional Fréchet mean. It is well known that global least squares is based on a restrictive assumption about the regression relation. Although local regression is more flexible, it is effective only when the dimension of the predictor is relatively low. As this dimension grows, its accuracy drops significantly, a phenomenon known as the curse of dimensionality. To address this issue, it is essential to reduce the dimension of the predictor without losing information about the response. For classical regression, this task is performed by sufficient dimension reduction (SDR; see Li 1991; Cook & Weisberg 1991; Cook 1996; and Li 2018, among others). SDR works by projecting the high-dimensional predictor onto a low-dimensional subspace that preserves the information about the response in the sense of sufficiency.

Besides assisting regression in overcoming the curse of dimensionality, another important function of SDR for classical regression is to provide a data visualization tool to gain insights into how the regression surface looks in high-dimensional space before even fitting a model. By inspecting the sufficient plots of the response objects against the sufficient predictors, we can gain insights into the general trends of the response as the most informative part of the predictor varies, whether there are outlying observations, and whether there are subjects with high leverage that have undue influence on the regression estimates: the usual information a statistician looks for in the exploratory and model-checking stages of regression analysis. This function is also needed in Fréchet regression. In fact, it can be argued that data visualization is even more important for the regression of random objects, as the regression relation may be even more difficult to discern among the complex details of the objects.

To fulfill these demands, in this paper, we systematically develop the theory and methodology of sufficient dimension reduction for Fréchet regression. To set the stage, we first give an outline of SDR for classical regression. Let $X$ be a $p$-dimensional random vector in $\mathbb{R}^p$ and $Y$ a random variable in $\mathbb{R}$. Classical SDR aims to find a dimension reduction subspace $\mathcal{S}$ of $\mathbb{R}^p$ such that $Y$ and $X$ are independent conditioning on $P_{\mathcal{S}}X$, that is, $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X$, where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$ with respect to the usual inner product in $\mathbb{R}^p$. In this way, $P_{\mathcal{S}}X$ can be used as the synthetic predictor without loss of regression information about the response $Y$. Under mild conditions, the intersection of all such dimension reduction subspaces is also a dimension reduction subspace, and this intersection is called the central subspace, denoted by $\mathcal{S}_{Y|X}$ (Cook, 1996; Yin et al., 2008). For the situation where the primary interest is in estimating the regression function, Cook & Li (2002) introduced a weaker form of SDR, the mean dimension reduction subspace. A subspace $\mathcal{S}$ of $\mathbb{R}^p$ is a mean SDR subspace if it satisfies $E(Y|X)=E(Y|P_{\mathcal{S}}X)$, and the intersection of all such subspaces, if it is still a mean SDR subspace, is the central mean subspace, denoted by $\mathcal{S}_{E(Y|X)}$. The central mean subspace $\mathcal{S}_{E(Y|X)}$ is always contained in the central subspace $\mathcal{S}_{Y|X}$ when they both exist. Many methods for estimating the central subspace and the central mean subspace have been developed over the past decades. For example, for the central subspace, we have the sliced inverse regression (SIR; Li 1991), the sliced average variance estimate (SAVE; Cook & Weisberg 1991), the contour regression (CR; Li et al. 2005), and the directional regression (DR; Li & Wang 2007). For the central mean subspace, we have the ordinary least squares (OLS; Li & Duan 1989), the principal Hessian directions (PHD; Li 1992), the iterative Hessian transformation (IHT; Cook & Li 2002), and the outer product of gradients (OPG) and the minimum average variance estimator (MAVE) of Xia et al. (2002).

SDR has been extended to accommodate some complex data structures in the past, for example, to functional data (Ferré & Yao 2003; Hsing & Ren 2009; Li & Song 2017), to tensorial data (Li et al. 2010; Ding & Cook 2015), and to panel data (Fan et al., 2017; Yu et al., 2020; Luo et al., 2021). Most recently, Ying & Yu (2022) extended SIR to the case where the response takes values in a metric space. Taking a substantial step forward, in this paper, we introduce a comprehensive and flexible method that can adapt any existing SDR estimators to metric space-valued responses.

The basic idea of our method stems from the ensemble SDR for Euclidean $X$ and $Y$ of Yin & Li (2011), which recovers the central subspace $\mathcal{S}_{Y|X}$ by repeatedly estimating the central mean subspace $\mathcal{S}_{E[f(Y)|X]}$ for a family $\mathfrak{F}$ of functions $f$ that is rich enough to determine the conditional distribution of $Y|X$. Such a family $\mathfrak{F}$ is called an ensemble and satisfies $\mathcal{S}_{Y|X}=\cup\{\mathcal{S}_{E[f(Y)|X]}:f\in\mathfrak{F}\}$. Using this relation, we can turn any method for estimating the central mean subspace into one that estimates the central subspace.

While borrowing the idea of the ensemble, our goal is different from Yin & Li (2011): we are not interested in turning an estimator for the central mean subspace into one for the central subspace. Instead, we are interested in turning any existing SDR method for Euclidean $(X,Y)$ into one for Euclidean $X$ and metric space-valued $Y$. Let $X$ be a random vector in $\mathbb{R}^p$ and $Y$ a random object that takes values in a metric space $(\Omega_Y,d)$. We still use the symbol $\mathcal{S}_{Y|X}$ to represent the intersection of all subspaces $\mathcal{S}$ of $\mathbb{R}^p$ satisfying $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X$. We call $\mathcal{S}_{Y|X}$ the central subspace for Fréchet SDR, or simply the Fréchet central subspace. Let $\mathfrak{F}$ be a family of functions $f:\Omega_Y\to\mathbb{R}$ that are measurable with respect to the Borel $\sigma$-field on the metric space. We use two types of ensembles to connect classical SDR with Fréchet SDR:

  • Central Mean Space ensemble (CMS-ensemble): a family $\mathfrak{F}$ that is rich enough so that $\mathcal{S}_{Y|X}=\cup\{\mathcal{S}_{E[f(Y)|X]}:f\in\mathfrak{F}\}$. Note that we know how to estimate the spaces $\mathcal{S}_{E[f(Y)|X]}$ using existing SDR methods, since $f(Y)$ is real-valued. We use this ensemble to turn an SDR method that targets the central mean subspace into one that targets the Fréchet central subspace. We will focus on two forward regression methods, OPG and MAVE, and three moment estimators of the CMS.

  • Central Space ensemble (CS-ensemble): a family $\mathfrak{F}$ that is rich enough so that $\mathcal{S}_{Y|X}=\cup\{\mathcal{S}_{f(Y)|X}:f\in\mathfrak{F}\}$. We use this ensemble to turn an SDR method that targets the central subspace for a real-valued response into one that targets the Fréchet central subspace. We will focus on three inverse regression methods: SIR, SAVE, and DR.

A key step in implementing both of the above schemes is to construct an ensemble $\mathfrak{F}$ in each case. For this purpose, we assume that the metric space $(\Omega_Y,d)$ is continuously embeddable into a Hilbert space. Under this assumption, one can construct a universal reproducing kernel, which leads to an $\mathfrak{F}$ that satisfies the required characterizing property.

As with classical SDR, Fréchet SDR can also be used to assist data visualization. To illustrate this aspect, we consider an application involving factors that influence the mortality distributions of 162 countries (see Section 7 for details). For each country, the response is a histogram of the numbers of deaths for each five-year period from age 0 to age 100, which is smoothed to produce a density estimate, as shown in panel (a) of Figure 1. We considered nine predictors characterizing each country’s demography, economy, labor market, health care, and environment. Using our ensemble method, we obtained a set of sufficient predictors. In panel (b) of Figure 1, we show the mortality densities plotted against the first sufficient predictor. A clear pattern emerges in the plot: for countries with low values of the first sufficient predictor, the modes of the mortality distributions are at lower ages, and there are upticks at age 0, indicating high infant mortality rates; for countries with high values of the first sufficient predictor, the modes of the mortality distributions are significantly higher, and there are no upticks at age 0, indicating very low infant mortality rates. The information provided by the plot is clearly useful, and many further insights can be gained about what affects the mortality distribution by taking a careful look at the loadings of the first sufficient predictor, as will be detailed in Section 7.

[Figure 1 with panels (a) and (b); see caption below.]
Figure 1: Data visualization in Fréchet regression for mortality distributions of 162 countries. Panel (a) plots mortality densities that are placed in random order, and Panel (b) plots mortality densities versus the first sufficient predictor estimated by our ensemble method.

The rest of this paper is organized as follows. Section 2 defines the Fréchet SDR problem and provides sufficient conditions for a family $\mathfrak{F}$ to characterize the central subspace. Section 3 then constructs ensembles $\mathfrak{F}$ for the Wasserstein space of univariate distributions, the space of symmetric positive definite matrices, and a special Riemannian manifold, the sphere. Section 4 proposes the CMS-ensembles, by extending five SDR methods that target the central mean subspace for a real-valued response (OLS, PHD, IHT, OPG, and MAVE), and the CS-ensembles, by extending three SDR methods that target the central subspace for a real-valued response (SIR, SAVE, and DR). Section 5 establishes the convergence rates of the proposed methods. Section 6 uses simulation studies to examine the numerical performance of different ensemble estimators in different settings, including distributional responses and covariance matrix responses. In Section 7, we analyze the mortality distribution data to demonstrate the usefulness of our methods. Section 8 offers a few concluding remarks and discussion. All proofs, additional simulation studies, and a further real application are presented in the Supplementary Material.

2 Characterization of the Fréchet Central Subspace

Let $(\Omega,\mathcal{F},P)$ be a probability space. Let $(\Omega_Y,d)$ be a metric space with metric $d$ and $\mathcal{B}_Y$ the Borel $\sigma$-field generated by the open sets in $\Omega_Y$. Let $\Omega_X$ be a subset of $\mathbb{R}^p$ and $\mathcal{B}_X$ the Borel $\sigma$-field generated by the open sets in $\Omega_X$. Let $(X,Y)$ be a random element mapping from $\Omega$ to $\Omega_X\times\Omega_Y$, measurable with respect to the product $\sigma$-field $\mathcal{B}_X\times\mathcal{B}_Y$. We denote the marginal distributions of $X$ and $Y$ by $P_X$ and $P_Y$, respectively, and the conditional distributions of $Y|X$ and $X|Y$ by $P_{Y|X}$ and $P_{X|Y}$, respectively. We formulate the Fréchet SDR problem as finding a subspace $\mathcal{S}$ of $\mathbb{R}^p$ such that $Y$ and $X$ are independent conditioning on $P_{\mathcal{S}}X$:

$Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X,$   (1)

where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$ with respect to the inner product in $\mathbb{R}^p$. As in classical SDR, the intersection of all such subspaces $\mathcal{S}$ still satisfies (1) under mild conditions (Cook & Li, 2002). Indeed, this does not require any structure on the space $\Omega_Y$. A sufficient condition, shown in Yin et al. (2008), is that $X$ is supported on a matching set. For example, if the support of $X$ is convex, then this sufficient condition is satisfied. We call this intersection the Fréchet central subspace and denote it by $\mathcal{S}_{Y|X}$. Similar to Cook (1996), it can be shown that if the support of $X$ is open and convex, the Fréchet central subspace $\mathcal{S}_{Y|X}$ satisfies (1).

2.1 Two types of ensembles and their sufficient conditions

Let $\mathfrak{F}$ be a family of measurable functions $f:\Omega_Y\to\mathbb{R}$, and for an $f\in\mathfrak{F}$, let $\mathcal{S}_{E[f(Y)|X]}$ be the central mean subspace of $f(Y)$ versus $X$. As mentioned in Section 1, we use two types of ensembles to recover the Fréchet central subspace. The first type is any $\mathfrak{F}$ that satisfies

$\mathrm{span}\{\mathcal{S}_{E(f(Y)|X)}:f\in\mathfrak{F}\}=\mathcal{S}_{Y|X}.$   (2)

This is the same ensemble as that in Yin & Li (2011), except that, here, the right-hand side is the Fréchet central subspace. Relation (2) allows us to recover the Fréchet central subspace $\mathcal{S}_{Y|X}$ by a collection of classical central mean subspaces. We call a class $\mathfrak{F}$ that satisfies (2) a CMS-ensemble. The second type of ensemble is any family $\mathfrak{F}$ that satisfies

$\mathrm{span}\{\mathcal{S}_{f(Y)|X}:f\in\mathfrak{F}\}=\mathcal{S}_{Y|X},$   (3)

which we call a CS-ensemble. Proposition 1 shows that a CMS-ensemble is a CS-ensemble.

PROPOSITION 1.

If $\mathfrak{F}$ is a CMS-ensemble, then it is a CS-ensemble.

We next develop a sufficient condition for an $\mathfrak{F}$ to be a CMS-ensemble and hence also a CS-ensemble. Let $\mathfrak{B}=\{I_B:B\ \text{is a Borel set in}\ \Omega_Y\}$ be the family of measurable indicator functions on $\Omega_Y$, and let $\mathrm{span}(\mathfrak{F})=\{\sum_{i=1}^k\alpha_i f_i:k\in\mathbb{N},\ \alpha_1,\dots,\alpha_k\in\mathbb{R},\ f_1,\dots,f_k\in\mathfrak{F}\}$ be the linear span of $\mathfrak{F}$, where $\mathbb{N}=\{1,2,\dots\}$. Yin & Li (2011) showed that if $\mathfrak{F}$ is a subset of $L_2(P_Y)$ that is dense in $\mathfrak{B}$, then (2) holds for the classical $\mathcal{S}_{Y|X}$. Here, we generalize that result to our setting by requiring only $\mathrm{span}(\mathfrak{F})$ to be dense in $\mathfrak{B}$.

LEMMA 1.

If $\mathfrak{F}$ is a subset of $L_2(P_Y)$ and $\mathrm{span}(\mathfrak{F})$ is dense in $\mathfrak{B}$ with respect to the $L_2(P_Y)$-metric, then $\mathfrak{F}$ is a CMS-ensemble and hence also a CS-ensemble.

2.2 Construction of the CMS-ensemble

To construct a CMS-ensemble, we resort to the notion of the universal kernel. Let $C(\Omega_Y)$ be the family of continuous real-valued functions on $\Omega_Y$. When $\Omega_Y$ is compact, Steinwart (2001) defined a continuous kernel $\kappa$ as universal (we refer to it as c-universal) if its associated RKHS $\mathcal{H}_Y$ is dense in $C(\Omega_Y)$ under the uniform norm. To relax the compactness assumption, Micchelli et al. (2006) proposed the following notion of universality, which is referred to as $cc$-universal in Sriperumbudur et al. (2011). For any compact set $K\subseteq\Omega_Y$, let $\mathcal{H}_Y(K)$ be the RKHS generated by $\{\kappa(\cdot,y):y\in K\}$. We should note that a member $f$ of $\mathcal{H}_Y(K)$ is supported on $\Omega_Y$, rather than on $K$. Let $f|K$ denote the restriction of $f$ to $K$, and $C(K)$ the class of all continuous functions with respect to the topology of $(\Omega_Y,d)$ restricted to $K$.

DEFINITION 1.

(Micchelli et al., 2006) We say that $\kappa$ is $cc$-universal if, for any compact set $K\subseteq\Omega_Y$, any member $f$ of $C(K)$, and any $\epsilon>0$, there is an $h\in\mathcal{H}_Y(K)$ such that $\|f-(h|K)\|_\infty=\sup_{y\in K}|f(y)-h(y)|<\epsilon$.

When $\Omega_Y$ is compact, Sriperumbudur et al. (2011) showed that the two notions of universality are equivalent. In the following, we look into the conditions under which a metric space has a $cc$-universal kernel and how to construct such a kernel when it does.

Micchelli et al. (2006) showed that when $\Omega_Y=\mathbb{R}^d$, many standard kernels, including Laplacian kernels and Gaussian RBF kernels, are $cc$-universal. Unfortunately, when $\Omega_Y$ is a general metric space, direct extensions of these types of kernels, for example, $\kappa(y,y')=\exp(-\gamma d(y,y')^2)$, are no longer guaranteed to be $cc$-universal. Christmann & Steinwart (2010) showed that, for compact $\Omega_Y$, if there exists a separable Hilbert space $\mathcal{H}$ and a continuous injection $\rho:\Omega_Y\to\mathcal{H}$, then for any analytic function $F:\mathbb{R}\to\mathbb{R}$ whose Taylor series at zero has strictly positive coefficients, the function $\kappa(y,y')=F(\langle\rho(y),\rho(y')\rangle_{\mathcal{H}})$ defines a c-universal kernel on $\Omega_Y$. They also provided an analogous construction of the Gaussian-type kernel in this setting. We extend this result to construct $cc$-universal kernels on non-compact metric spaces. The proof is given in the Supplementary Material.

PROPOSITION 2.

Suppose $(\Omega_Y,d)$ is a complete and separable metric space, and there exist a separable Hilbert space $\mathcal{H}$ and a continuous injection $\rho:\Omega_Y\to\mathcal{H}$. If $F:\mathbb{R}\to\mathbb{R}$ is an analytic function of the form $F(t)=\sum_{n=0}^\infty a_n t^n$, $a_n\geq 0$ for all $n\geq 1$, then the function $\kappa:\Omega_Y\times\Omega_Y\to\mathbb{R}$ defined by $\kappa(y,y')=F(\langle\rho(y),\rho(y')\rangle_{\mathcal{H}})$ is a positive definite kernel. Furthermore, if $a_n>0$ for all $n\geq 1$, then $\kappa$ is a $cc$-universal kernel on $\Omega_Y$.

As an example, Corollary 1 shows that the Gaussian-type kernel is $cc$-universal on $\Omega_Y$.

COROLLARY 1.

Suppose the conditions in Proposition 2 are satisfied. Then the Gaussian-type kernel $\kappa_\gamma(y,y')=\exp(-\gamma\|\rho(y)-\rho(y')\|_{\mathcal{H}}^2)$, where $\gamma>0$, is $cc$-universal. Furthermore, if the continuous injection $\rho:\Omega_Y\to\mathcal{H}$ is isometric, that is, $d(y,y')=\|\rho(y)-\rho(y')\|_{\mathcal{H}}$, then the Gaussian-type kernel $\kappa_\gamma(y,y')=\exp(-\gamma d^2(y,y'))$ is $cc$-universal.

The second part of Corollary 1 is straightforward since an isometry is an injection. Similar results can be established for the Laplacian-type kernel $\kappa_\gamma(y,y')=\exp(-\gamma\|\rho(y)-\rho(y')\|_{\mathcal{H}})$.
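To make the construction concrete, here is a minimal sketch (in Python, with hypothetical inputs) that builds Gaussian- and Laplacian-type Gram matrices from a matrix of pairwise distances, assuming those distances arise from an isometric embedding into a separable Hilbert space so that Corollary 1 applies.

import numpy as np

def rbf_gram_matrices(D, gamma=1.0):
    """Gaussian- and Laplacian-type Gram matrices from a pairwise distance
    matrix D, where D[i, j] = d(y_i, y_j) is assumed to come from an
    isometric embedding into a separable Hilbert space (Corollary 1)."""
    K_gauss = np.exp(-gamma * D ** 2)  # kappa_G(y, y') = exp(-gamma d(y, y')^2)
    K_lap = np.exp(-gamma * D)         # kappa_L(y, y') = exp(-gamma d(y, y'))
    return K_gauss, K_lap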

The continuous embedding condition in Proposition 2 covers several metric spaces often encountered in statistical applications. Section 3 employs it to construct $cc$-universal kernels on the space of univariate distributions endowed with the Wasserstein-2 distance, the space of symmetric positive definite matrices endowed with the Frobenius distance, and the sphere endowed with the geodesic distance.

By using the notion of a regular probability measure, we connect the $cc$-universal kernel on $(\Omega_Y,d)$ with the CMS-ensemble, which is the theoretical foundation of our method. Recall that a measure $P_Y$ on $(\Omega_Y,d)$ is regular if, for any Borel subset $B\subseteq\Omega_Y$ and any $\varepsilon>0$, there are a compact set $K\subseteq B$ and an open set $G\supseteq B$ such that $P(G\backslash K)<\varepsilon$.

THEOREM 1.

Suppose, on the metric space $(\Omega_Y,d)$, (1) $\kappa$ is a bounded $cc$-universal kernel and (2) $P_Y$ is a regular probability measure. Then the family $\mathfrak{F}=\{\kappa(\cdot,y):y\in\Omega_Y\}$ is a CMS-ensemble.

The proof of Theorem 1 is given in the Supplementary Material. Condition (2), which requires $P_Y$ to be regular, is quite mild: it is known that any Borel measure on a complete and separable metric space is regular (see Granirer (1970, Chapter 2: Theorem 1.2, Theorem 3.2)). Thus, a sufficient condition for Condition (2) is that $(\Omega_Y,d)$ is complete and separable, which is satisfied by all the metric spaces we consider. Specifically, note that if $M$ is separable and complete, then so is the Wasserstein-2 space $W_2(M)$ (Panaretos & Zemel, 2020, Proposition 2.2.8, Theorem 2.2.7). Therefore, $W_2(\mathbb{R})$ is complete and separable. Similarly, the SPD matrix space endowed with the Frobenius distance and the sphere endowed with the geodesic distance are both complete and separable metric spaces. Furthermore, the Gaussian and Laplacian kernels we consider satisfy Condition (1) in Theorem 1.

Thus, Proposition 2 and Theorem 1 provide a general mechanism to construct a CMS-ensemble over any separable and complete metric space without a linear structure, provided it can be continuously embedded in a separable Hilbert space. For the case where multiple $cc$-universal kernels exist, we design a cross-validation framework in Section 6 to choose the kernel type and the bandwidth $\gamma$.

3 Important Metric Spaces and Their CMS-Ensembles

This section gives the construction of CMS-ensembles for three commonly used metric spaces.

3.1 Wasserstein space

Let $I$ be $\mathbb{R}$ or a closed interval of $\mathbb{R}$, $\mathcal{B}(I)$ the $\sigma$-field of Borel subsets of $I$, and $\mathcal{P}(I)$ the collection of all probability measures on $(I,\mathcal{B}(I))$. The Wasserstein space $\mathcal{W}_2(I)$ is defined as the subset of $\mathcal{P}(I)$ with finite second moments, that is, $\mathcal{W}_2(I)=\{\mu\in\mathcal{P}(I):\int_I t^2\,d\mu(t)<\infty\}$, endowed with the quadratic Wasserstein distance $d_{\mathrm{W}}(\mu_1,\mu_2)=(\int_0^1[F_{\mu_1}^{-1}(s)-F_{\mu_2}^{-1}(s)]^2\,ds)^{1/2}$, where $\mu_1$ and $\mu_2$ are members of $\mathcal{W}_2(I)$ and $F_{\mu_1}^{-1}$ and $F_{\mu_2}^{-1}$ are the quantile functions of $\mu_1$ and $\mu_2$, which we assume to be well defined. This distance can be equivalently written as $d_{\mathrm{W}}(\mu_1,\mu_2)=(\int_I[F_{\mu_1}^{-1}\circ F_{\mu_2}(t)-t]^2\,d\mu_2(t))^{1/2}$. The set $\mathcal{W}_2(I)$ endowed with $d_{\mathrm{W}}$ is a metric space with a formal Riemannian structure (Ambrosio et al., 2004).

Here, we present some basic results that characterize $\mathcal{W}_2(I)$, whose proofs can be found, for example, in Ambrosio et al. (2004) and Bigot et al. (2017). For $\mu_1,\mu_2\in\mathcal{W}_2(I)$, we say that a $\mathcal{B}(I)$-measurable map $r:I\to I$ transports $\mu_1$ to $\mu_2$ if $\mu_2=\mu_1\circ r^{-1}$. This relation is often written as $\mu_2=r_{\#}\mu_1$. Let $\mu_0\in\mathcal{W}_2(I)$ be a reference measure with a continuous $F_{\mu_0}$. The tangent space at $\mu_0$ is $T_{\mu_0}=\mathrm{cl}_{L_2(\mu_0)}\{\lambda(F_\mu^{-1}\circ F_{\mu_0}-\mathrm{id}):\mu\in\mathcal{W}_2(I),\lambda>0\}$, where, for a set $A\subseteq L_2(\mu_0)$, $\mathrm{cl}_{L_2(\mu_0)}(A)$ denotes the $L_2(\mu_0)$-closure of $A$. The exponential map $\exp_{\mu_0}$ from $T_{\mu_0}$ to $\mathcal{W}_2(I)$, defined by $\exp_{\mu_0}(r)=(r+\mathrm{id})_{\#}\mu_0$, is surjective. Therefore its inverse, $\log_{\mu_0}:\mathcal{W}_2(I)\to T_{\mu_0}$, defined by $\log_{\mu_0}(\mu)=F_\mu^{-1}\circ F_{\mu_0}-\mathrm{id}$, is well defined on $\mathcal{W}_2(I)$. It is well known that the exponential map restricted to the image of the $\log$ map, denoted by $\exp_{\mu_0}|_{\log_{\mu_0}(\mathcal{W}_2(I))}$, is an isometric homeomorphism (Bigot et al., 2017). Therefore, $\log_{\mu_0}$ is a continuous injection from $\mathcal{W}_2(I)$ to $L_2(\mu_0)$. We can then construct CMS-ensembles using the general constructive method provided by Theorem 1 and Proposition 2. The next proposition gives two such constructions, where the subscripts "G" and "L" for the two kernels refer to "Gaussian" and "Laplacian", respectively.

PROPOSITION 3.

For $I\subseteq\mathbb{R}$, $\kappa_G(y,y')=\exp(-\gamma\|\log_{\mu_0}(y)-\log_{\mu_0}(y')\|_{L_2(\mu_0)}^2)=\exp(-\gamma d_{\mathrm{W}}(y,y')^2)$ and $\kappa_L(y,y')=\exp(-\gamma\|\log_{\mu_0}(y)-\log_{\mu_0}(y')\|_{L_2(\mu_0)})=\exp(-\gamma d_{\mathrm{W}}(y,y'))$ are both $cc$-universal kernels on $\mathcal{W}_2(I)$. Consequently, the families $\mathfrak{F}_G=\{\exp(-\gamma d_{\mathrm{W}}(\cdot,t)^2):t\in\mathcal{W}_2(I)\}$ and $\mathfrak{F}_L=\{\exp(-\gamma d_{\mathrm{W}}(\cdot,t)):t\in\mathcal{W}_2(I)\}$ are CMS-ensembles.
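As a minimal numerical sketch, assuming each distribution is represented by its quantile function evaluated on a common uniform grid of $(0,1)$ (a hypothetical discretization, not part of the proposition), the Wasserstein-2 distance and the ensemble member $\kappa_G(\cdot,\mu')$ of Proposition 3 can be computed as follows.

import numpy as np

def wasserstein2(q1, q2, ds):
    """Quadratic Wasserstein distance between two univariate distributions,
    each given by quantile-function values on a uniform grid of (0, 1) with
    spacing ds: d_W^2 = int_0^1 [F1^{-1}(s) - F2^{-1}(s)]^2 ds."""
    return np.sqrt(np.sum((q1 - q2) ** 2) * ds)

def kappa_G(q, q_prime, ds, gamma=1.0):
    # Gaussian-type ensemble member kappa_G(mu, mu') = exp(-gamma d_W(mu, mu')^2).
    return np.exp(-gamma * wasserstein2(q, q_prime, ds) ** 2)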

3.2 Space of symmetric positive definite matrices

We first introduce some notation. Let $\mathrm{Sym}(r)$ be the set of $r\times r$ symmetric matrices with real entries and $\mathrm{Sym}^+(r)$ the set of $r\times r$ symmetric positive definite (SPD) matrices. For any $Y\in\mathbb{R}^{r\times r}$, the matrix exponential of $Y$ is defined by the infinite power series $\exp(Y)=\sum_{k=0}^\infty Y^k/k!$. For any $X\in\mathrm{Sym}^+(r)$, the matrix logarithm of $X$ is defined as any $r\times r$ matrix $Y$ such that $\exp(Y)=X$ and is denoted by $\log(X)$.

Let $d_{\mathrm{F}}$ be the Frobenius metric. Then $(\mathrm{Sym}^+(r),d_{\mathrm{F}})$ is a metric space continuously embedded by the identity mapping into $\mathrm{Sym}(r)$, which is a Hilbert space with the Frobenius inner product $\langle A,B\rangle=\mathrm{tr}(A^{\mathsf{T}}B)$. Moreover, the identity mapping $\mathrm{id}:\mathrm{Sym}^+(r)\to\mathrm{Sym}(r)$ is obviously isometric. Therefore, by Corollary 1, the two types of radial basis function kernels for the Wasserstein space can be similarly extended to $\mathrm{Sym}^+(r)$. That is, let $\kappa_G(y,y')=\exp(-\gamma d_{\mathrm{F}}(y,y')^2)$ and $\kappa_L(y,y')=\exp(-\gamma d_{\mathrm{F}}(y,y'))$; then $\mathfrak{F}_G=\{\kappa_G(\cdot,y'):y'\in\mathrm{Sym}^+(r)\}$ and $\mathfrak{F}_L=\{\kappa_L(\cdot,y'):y'\in\mathrm{Sym}^+(r)\}$ are CMS-ensembles.

Another widely used metric on $\mathrm{Sym}^+(r)$ is the log-Euclidean distance, defined as $d_{\log}(Y_1,Y_2)=\|\log(Y_1)-\log(Y_2)\|_{\mathrm{F}}$. Essentially, it pulls the Frobenius metric on $\mathrm{Sym}(r)$ back to $\mathrm{Sym}^+(r)$ through the matrix logarithm map. The matrix logarithm $\log(\cdot)$ is a continuous injection into the Hilbert space $\mathrm{Sym}(r)$. By Corollary 1, the two types of radial basis function kernels $\kappa_{G,\log}(y,y')=\exp(-\gamma d_{\log}(y,y')^2)$ and $\kappa_{L,\log}(y,y')=\exp(-\gamma d_{\log}(y,y'))$ are $cc$-universal. Then, $\mathfrak{F}_{G,\log}=\{\kappa_{G,\log}(\cdot,y'):y'\in\mathrm{Sym}^+(r)\}$ and $\mathfrak{F}_{L,\log}=\{\kappa_{L,\log}(\cdot,y'):y'\in\mathrm{Sym}^+(r)\}$ are CMS-ensembles.
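For illustration, a minimal sketch of the log-Euclidean distance and the corresponding Gaussian-type kernel, computing the matrix logarithm of an SPD matrix through its eigendecomposition (a standard linear algebra identity, not specific to this paper):

import numpy as np

def spd_log(Y):
    """Matrix logarithm of an SPD matrix Y via its eigendecomposition:
    log(Y) = V diag(log w) V^T, where Y = V diag(w) V^T."""
    w, V = np.linalg.eigh(Y)
    return (V * np.log(w)) @ V.T

def d_log(Y1, Y2):
    """Log-Euclidean distance d_log(Y1, Y2) = ||log(Y1) - log(Y2)||_F."""
    return np.linalg.norm(spd_log(Y1) - spd_log(Y2), "fro")

def kappa_G_log(Y1, Y2, gamma=1.0):
    # Gaussian-type kernel pulled back through the matrix logarithm.
    return np.exp(-gamma * d_log(Y1, Y2) ** 2)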

3.3 The sphere

Consider a random vector taking values in the sphere $\mathbb{S}^n=\{x\in\mathbb{R}^n:\|x\|=1\}$. To respect the nonzero curvature of $\mathbb{S}^n$, the geodesic distance $d_g(Y_1,Y_2)=\arccos(Y_1^{\mathsf{T}}Y_2)$, which is derived from its Riemannian geometry, is often used rather than the Euclidean distance. However, the popular Gaussian-type RBF kernel $\kappa_G(y,y')=\exp(-\gamma d_g(y,y')^2)$ is not positive definite on $\mathbb{S}^n$ (Jayasumana et al., 2013). In fact, Feragen et al. (2015) proved that, for a complete Riemannian manifold $M$ with its associated geodesic distance $d_g$, $\kappa_G(y,y')=\exp(-\gamma d_g(y,y')^2)$ is positive semidefinite only if $M$ is isometric to a Euclidean space. Honeine & Richard (2010) and Jayasumana et al. (2013) proved that the Laplacian-type kernel $\kappa_L(y,y')=\exp(-\gamma d_g(y,y'))$ is positive definite on the sphere $\mathbb{S}^n$. We show in the following proposition that $\kappa_L(y,y')$ is $cc$-universal.

PROPOSITION 4.

The Laplacian-type kernel $\kappa_L:\mathbb{S}^n\times\mathbb{S}^n\to\mathbb{R}$, defined by $\kappa_L(y,y')=\exp(-\gamma d_g(y,y'))$, where $d_g$ is the geodesic distance on $\mathbb{S}^n$, is a $cc$-universal kernel for any $\gamma>0$. Consequently, $\mathfrak{F}_L=\{\exp(-\gamma d_g(\cdot,t)):t\in\mathbb{S}^n\}$ is a CMS-ensemble.
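A minimal sketch of this ensemble member follows; the clipping of the inner product to $[-1,1]$ is a numerical safeguard we add, not part of the definition.

import numpy as np

def d_geo(y1, y2):
    """Geodesic distance on the unit sphere: d_g(y1, y2) = arccos(y1^T y2)."""
    return np.arccos(np.clip(np.dot(y1, y2), -1.0, 1.0))

def kappa_L_sphere(y1, y2, gamma=1.0):
    # Laplacian-type kernel exp(-gamma d_g), cc-universal by Proposition 4.
    return np.exp(-gamma * d_geo(y1, y2))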

4 Fréchet Sufficient Dimension Reduction

In this section, we develop the Fréchet SDR estimators based on the CMS-ensembles and CS-ensembles and establish their Fisher consistency.

4.1 Ensembled moment estimators via CMS ensembles

We first develop a general class of Fréchet SDR estimators based on the ensembled moment estimators of the CMS, such as OLS, PHD, and IHT. Let $\mathcal{P}_{XY}$ be the collection of all distributions of $(X,Y)$, and let $M:\mathcal{P}_{XY}\to\mathbb{R}^{p\times p}$ be a measurable function to be used as an estimator of the Fréchet central subspace $\mathcal{S}_{Y|X}$. A function defined on $\mathcal{P}_{XY}$ is called a statistical functional; see, for example, Chapter 9 of Li (2018). In the SDR literature, such a function is also called a candidate matrix (Ye & Weiss, 2003). Let $F_{XY}$ be a generic member of $\mathcal{P}_{XY}$, $F_{XY}^{(0)}$ the true distribution of $(X,Y)$, and $\hat F_{XY}^{(n)}$ the empirical distribution of $(X,Y)$ based on an i.i.d. sample $(X_1,Y_1),\ldots,(X_n,Y_n)$. Extending the terminology of classical SDR (see, for example, Li 2018, Chapter 2), we say that the estimate $M(\hat F_{XY}^{(n)})$ is unbiased if $\mathrm{span}\{M(F_{XY}^{(0)})\}\subseteq\mathcal{S}_{Y|X}$, exhaustive if $\mathrm{span}\{M(F_{XY}^{(0)})\}\supseteq\mathcal{S}_{Y|X}$, and Fisher consistent if $\mathrm{span}\{M(F_{XY}^{(0)})\}=\mathcal{S}_{Y|X}$. We refer to $M$ as the Fréchet candidate matrix.

Suppose we are given a CMS-ensemble $\mathfrak{F}$. Let $M_0:\mathcal{P}_{XY}\times\mathfrak{F}\to\mathbb{R}^{p\times p}$ be a function to be used as an estimator of $\mathcal{S}_{E[f(Y)|X]}$ for each $f$. This is not a statistical functional in the classical sense, as it involves an additional set $\mathfrak{F}$. So, we redefine unbiasedness, exhaustiveness, and Fisher consistency for this type of augmented statistical functional.

DEFINITION 2.

We say that $M_0$ is unbiased for estimating $\{\mathcal{S}_{E[f(Y)|X]}:f\in\mathfrak{F}\}$ if, for each $f\in\mathfrak{F}$, $\mathrm{span}\{M_0(F_{XY}^{(0)},f)\}\subseteq\mathcal{S}_{E[f(Y)|X]}$. Exhaustiveness and Fisher consistency of $M_0$ are defined by replacing $\subseteq$ in the above by $\supseteq$ and $=$, respectively.

Note that $M_0(\cdot,f)$ is an estimator of the classical central mean subspace $\mathcal{S}_{E[f(Y)|X]}$, as $f(Y)$ is a real-valued random variable rather than a random object. We refer to $M_0$ as the ensemble candidate matrix or, when confusion is possible, the CMS-ensemble candidate matrix. Our goal is to construct a Fréchet candidate matrix $M:\mathcal{P}_{XY}\to\mathbb{R}^{p\times p}$ from the ensemble candidate matrix $M_0:\mathcal{P}_{XY}\times\mathfrak{F}\to\mathbb{R}^{p\times p}$. To do so, we assume $\mathfrak{F}$ is of the form $\{\kappa(\cdot,y):y\in\Omega_Y\}$, where $\kappa:\Omega_Y\times\Omega_Y\to\mathbb{R}$ is a $cc$-universal kernel. Given such an $\mathfrak{F}$ and $M_0$, we define $M$ as follows:

$M(F_{XY})=\int_{\Omega_Y}M_0(F_{XY},\kappa(\cdot,y))\,dF_Y(y),$

where $F_Y$ is the distribution of $Y$ derived from $F_{XY}$.

We now adapt several estimators of the classical central mean subspace to the estimation of the Fréchet central subspace: the ordinary least squares (OLS; Li & Duan 1989), the principal Hessian directions (PHD; Li 1992), and the iterative Hessian transformation (IHT; Cook & Li 2002). These estimators are based on sample moments and require additional conditions on the predictor $X$ for their unbiasedness. Specifically, we make the following assumptions:

ASSUMPTION 1.

1. Linear Conditional Mean (LCM): $E(X|\beta^{\mathsf{T}}X)$ is a linear function of $\beta^{\mathsf{T}}X$, where $\beta$ is a basis matrix of the Fréchet central subspace $\mathcal{S}_{Y|X}$;
2. Constant Conditional Variance (CCV): $\mathrm{var}(X|\beta^{\mathsf{T}}X)$ is a nonrandom matrix.

Under the LCM condition, the ensemble OLS and IHT are unbiased for estimating the Fréchet central subspace; under both conditions, the ensemble PHD is unbiased for estimating $\mathcal{S}_{Y|X}$. More detailed discussions of the unbiasedness and Fisher consistency of the ensemble estimators are presented in Section 4.4. In practice, the two conditions above cannot be checked directly since we do not know $\beta$. However, as was shown by Eaton (1986), the LCM condition holds for all $\beta$ if and only if the distribution of $X$ is elliptical. If, in addition, $X$ is multivariate normal, then the CCV condition is satisfied. Currently, the scatter plot matrix is the most commonly used empirical method to check the elliptical distribution assumption. If non-elliptical features are observed, one can use marginal transformations of the predictors, such as the Box-Cox transformation, to mitigate the non-ellipticity problem. Furthermore, in practice, the SDR methods that require ellipticity usually still work reasonably well even when the elliptical distribution assumption is violated, particularly when the dimension $p$ of $X$ is high. See Hall & Li (1993) and Li & Yin (2007) for the theoretical support. Our simulation results in Section 6 support this phenomenon.

It is most convenient to construct these ensemble estimators using standardized predictors. The theoretical basis for doing so is an equivariant property of the Fréchet central subspace, as stated in the next proposition.

PROPOSITION 5.

If $\mathcal{S}_{Y|X}$ is the Fréchet central subspace, $A\in\mathbb{R}^{p\times p}$ is a non-singular matrix, and $b$ is a vector in $\mathbb{R}^p$, then $\mathcal{S}_{Y|AX+b}=A^{-\mathsf{T}}\mathcal{S}_{Y|X}$, where $A^{-\mathsf{T}}=(A^{-1})^{\mathsf{T}}$.

The proof is essentially the same as that for the classical central subspace (see, for example, Li, 2018, page 24) and is omitted. Using this property, we first transform $X$ to $Z=\mathrm{var}(X)^{-1/2}(X-EX)$, estimate the Fréchet central subspace $\mathcal{S}_{Y|Z}$, and then transform it by $\mathrm{var}(X)^{-1/2}\mathcal{S}_{Y|Z}$, which is the same as $\mathcal{S}_{Y|X}$. The candidate matrices $M_0$ and $M$ for the ensemble OLS, PHD, and IHT are formulated in Remark 1. Detailed motivation for each can be found in Li (2018, Chapter 8). The sample estimates can then be constructed by replacing the expectations in $M_0$ and $M$ with sample moments whenever possible. Algorithm 1 summarizes the steps to implement an ensembled moment estimator, where $\kappa_c(y,y')$ stands for the centered kernel $\kappa(y,y')-E_n\kappa(Y,y')$.

Step 1. Standardize the predictors: compute the sample mean $\hat\mu=E_n(X)$ and sample variance $\hat\Sigma=\mathrm{var}_n(X)$, and let $Z_i=\hat\Sigma^{-1/2}(X_i-\hat\mu)$.
Step 2. Compute $\hat M_0(y)$ for $y=Y_1,\dots,Y_n$, with the specific forms given in Remark 1.
Step 3. Compute $\hat M=\frac{1}{n}\sum_{i=1}^n\hat M_0(Y_i)$.
Step 4. Let $\hat v_1,\dots,\hat v_{d_0}$ be the leading $d_0$ eigenvectors of $\hat M$, and let $\hat u_k=\hat\Sigma^{-1/2}\hat v_k$ for $k=1,\dots,d_0$. Then use $\{\hat u_1,\dots,\hat u_{d_0}\}$ as an estimated basis of the Fréchet central subspace $\mathcal{S}_{Y|X}$.
Algorithm 1 Fréchet OLS, PHD, IHT, SIR, SAVE, and DR
REMARK 1.

The candidate matrices $M_0(y)$ for Fréchet OLS, PHD, and IHT are:

  1. (Fréchet OLS) $M_0(y)=C(y)C(y)^{\mathsf{T}}$, where $C(y)=\mathrm{cov}[Z,\kappa(Y,y)]$;

  2. (Fréchet PHD) $M_0(y)=E[ZZ^{\mathsf{T}}\kappa_c(Y,y)]$;

  3. (Fréchet IHT) $M_0(y)=W(y)W(y)^{\mathsf{T}}$, where $H(y)=E[ZZ^{\mathsf{T}}\kappa_c(Y,y)]$ and $W(y)=(C(y),H(y)C(y),\ldots,H(y)^r C(y))$.
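As a concrete illustration, the following minimal sketch implements Algorithm 1 with the Fréchet OLS candidate matrix of Remark 1; the Gram matrix K of a chosen $cc$-universal kernel on the responses is assumed to be precomputed (for instance, by the kernel snippets in Sections 2 and 3).

import numpy as np

def frechet_ols(X, K, d0):
    """Minimal sketch of Algorithm 1 with the Fréchet OLS candidate matrix.
    X: (n, p) predictor matrix; K: (n, n) Gram matrix K[i, k] = kappa(Y_i, Y_k)
    for a cc-universal kernel on the response space; d0: structural dimension."""
    n, p = X.shape
    # Step 1: standardize the predictors.
    Xc = X - X.mean(axis=0)
    w, V = np.linalg.eigh(Xc.T @ Xc / n)
    Sigma_inv_half = (V / np.sqrt(w)) @ V.T      # Sigma_hat^{-1/2}
    Z = Xc @ Sigma_inv_half
    # Steps 2-3: average M0(Y_k) = C(Y_k) C(Y_k)^T over k, where
    # C(y) = cov[Z, kappa(Y, y)] is estimated by its sample analogue.
    Kc = K - K.mean(axis=0)                      # center kappa(., Y_k) over i
    C = Z.T @ Kc / n                             # column k is C_hat(Y_k)
    M = C @ C.T / n                              # M_hat = (1/n) sum_k M0_hat(Y_k)
    # Step 4: leading eigenvectors, mapped back to the original X scale.
    _, U = np.linalg.eigh(M)
    return Sigma_inv_half @ U[:, -d0:]           # estimated basis of S_{Y|X}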

4.2 Ensembled forward regression estimators via CMS ensembles

In this subsection, we adapt OPG (Xia et al. 2002), a popular method for estimating the classical CMS based on nonparametric forward regression, to the estimation of the Fréchet central subspace; this method does not require the LCM and CCV conditions. The adaptation of another forward regression method, MAVE, is similar and is presented in Section S.3.2 of the Supplementary Material. The framework of the statistical functional $M_0(F_{XY},f)$ is no longer sufficient to cover this case because we now have a tuning parameter. So, we adopt the notion of a tuned statistical functional in Section 11.2 of Li (2018) to accommodate a tuning parameter.

Let $\mathcal{P}_{XY}$, $F_{XY}$, $F_{XY}^{(0)}$, and $\hat F_{XY}^{(n)}$ be as defined in Section 4.1. For simplicity, we assume the tuning parameter $h$ to be a scalar, but it could also be a vector. Given a CMS-ensemble $\mathfrak{F}$, let $T_0:\mathcal{P}_{XY}\times\mathfrak{F}\times\mathbb{R}\to\mathbb{R}^{p\times p}$ be a tuned functional to be used as an estimator of $\mathcal{S}_{E[f(Y)|X]}$ for each $f$. We refer to $T_0$ as the ensemble-tuned candidate matrix. The unbiasedness, exhaustiveness, and Fisher consistency of $T_0$ are defined as follows.

DEFINITION 3.

We say that $T_0$ is unbiased for estimating $\{\mathcal{S}_{E[f(Y)|X]}:f\in\mathfrak{F}\}$ if, for each $f\in\mathfrak{F}$, $\mathrm{span}\{\lim_{h\to 0}T_0(F_{XY}^{(0)},f,h)\}\subseteq\mathcal{S}_{E[f(Y)|X]}$. Exhaustiveness and Fisher consistency of $T_0$ are defined by replacing $\subseteq$ in the above by $\supseteq$ and $=$, respectively.

Given $\mathfrak{F}=\{\kappa(\cdot,y):y\in\Omega_Y\}$ and $T_0$, we define the tuned Fréchet candidate matrix $T:\mathcal{P}_{XY}\times\mathbb{R}\to\mathbb{R}^{p\times p}$ as $T(F_{XY},h)=\int_{\Omega_Y}T_0(F_{XY},\kappa(\cdot,y),h)\,dF_Y(y)$. We say that the estimate $T(\hat F_{XY}^{(n)},h)$ is unbiased if $\mathrm{span}(\lim_{h\to 0}T(F_{XY}^{(0)},h))\subseteq\mathcal{S}_{Y|X}$, exhaustive if $\mathrm{span}(\lim_{h\to 0}T(F_{XY}^{(0)},h))\supseteq\mathcal{S}_{Y|X}$, and Fisher consistent if $\mathrm{span}(\lim_{h\to 0}T(F_{XY}^{(0)},h))=\mathcal{S}_{Y|X}$.

For a function $h(x)$, we use $\partial h(X)/\partial X$ to denote $\partial h(x)/\partial x$ evaluated at $x=X$. The OPG aims to estimate the central mean subspace $\mathcal{S}_{E[\kappa(Y,y)|X]}$ by $E\left[\frac{\partial E(\kappa(Y,y)|X)}{\partial X}\frac{\partial E(\kappa(Y,y)|X)}{\partial X^{\mathsf{T}}}\right]$, where the gradient $\partial E(\kappa(Y,y)|X)/\partial X$ is estimated by local linear approximation as follows. Let $K_0:\mathbb{R}\to[0,\infty)$ be a kernel function as used in kernel estimation. For any $v\in\mathbb{R}^p$ and bandwidth $h>0$, let $K_h(v)=h^{-p}K_0(\|v\|/h)$. At the population level, for fixed $x\in\Omega_X$ and $y\in\Omega_Y$, we minimize the objective function

$E\{[\kappa(Y,y)-a-b^{\mathsf{T}}(X-x)]^2 K_h(X-x)\}/E[K_h(X-x)]$   (4)

over all $a\in\mathbb{R}$ and $b\in\mathbb{R}^p$. The minimizer depends on $x$ and $y$, and we write it as $(a_h(x,y),b_h(x,y))$. The ensemble-tuned candidate matrix for estimating the central mean subspace $\mathcal{S}_{E[\kappa(Y,y)|X]}$ is $T_0(F_{XY},\kappa(\cdot,y),h)=E[b_h(X,y)b_h(X,y)^{\mathsf{T}}]$, and the tuned Fréchet candidate matrix is $T(F_{XY},h)=E[b_h(X,Y)b_h(X,Y)^{\mathsf{T}}]$.

At the sample level, we minimize, for each $j,k=1,\dots,n$, the empirical objective function

$\sum_{i=1}^n w_h(X_i,X_j)\left[\kappa_\gamma(Y_i,Y_k)-a_{jk}-b_{jk}^{\mathsf{T}}(X_i-X_j)\right]^2$   (5)

over $a_{jk}\in\mathbb{R}$ and $b_{jk}\in\mathbb{R}^p$, where $w_h(X_i,X_j)=K_h(X_i-X_j)/\sum_{l=1}^n K_h(X_l-X_j)$. Following Xia et al. (2002), we take the bandwidth to be $h=c_0 n^{-1/(p_0+6)}$, where $p_0=\max\{p,3\}$ and $c_0=2.34$; this bandwidth is slightly larger than the optimal $n^{-1/(p+4)}$ in terms of the mean integrated squared error. As proposed in Li (2018, Lemma 11.6), instead of solving (5) for $b_{jk}$ $n^2$ times, we solve a multivariate weighted least squares problem to obtain $b_{j1},\dots,b_{jn}$ simultaneously. Computation details are given in Section S.3 of the Supplementary Material.
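The following minimal sketch illustrates this simultaneous solve at one base point $X_j$; the Gaussian choice of $K_0$ and the small ridge term are our own illustrative choices, not prescriptions of the paper.

import numpy as np

def opg_local_fits(X, K, h, j):
    """Minimal sketch of the sample-level OPG step (5): a local linear fit
    at base point X_j, returning b_{j1}, ..., b_{jn} from one weighted least
    squares solve. X: (n, p) predictors; K: (n, n) Gram matrix with
    K[i, k] = kappa_gamma(Y_i, Y_k); h: bandwidth."""
    n, p = X.shape
    D = X - X[j]                                      # rows are X_i - X_j
    w = np.exp(-0.5 * np.sum(D ** 2, axis=1) / h ** 2)
    w = w / w.sum()                                   # weights w_h(X_i, X_j)
    A = np.hstack([np.ones((n, 1)), D])               # design columns for (a, b)
    AtW = A.T * w                                     # A^T diag(w)
    G = AtW @ A + 1e-8 * np.eye(p + 1)                # small ridge for stability
    Theta = np.linalg.solve(G, AtW @ K)               # column k holds (a_jk, b_jk)
    return Theta[1:, :]                               # the b_jk vectors as columns

Averaging the outer products $b_{jk}b_{jk}^{\mathsf{T}}$ over $j$ and $k$ then yields the matrix $\hat\Lambda$ of Algorithm 2 below.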

We can further enhance the performance of the Fréchet OPG by projecting the original predictors onto the directions produced by the previous pass to re-estimate $\mathcal{S}_{Y|X}$. Specifically, after the first round, we form the matrix $\hat B=(\hat v_1,\dots,\hat v_d)$, replace the kernel $K_h(X_j-X_i)$ by $K_h(\hat B^{\mathsf{T}}(X_j-X_i))$ with an updated bandwidth $h$, and complete the next round of iteration, which leads to an updated $\hat B$. We then iterate this process until convergence. In this way, we reduce the dimension of the kernel from $p$ to $d_0$ and mitigate the curse of dimensionality. To avoid confusion, we refer to this refined version of the Fréchet OPG as FOPG in the following. The algorithm for FOPG is summarized in Algorithm 2.

Step 1. Standardize $X_1,\dots,X_n$ by $Z_a^i=(X_a^i-\bar X^i)/\hat\sigma_i$, where $X_a^i$ denotes the $i$th component of $X_a$ and $\hat\sigma_i=\sqrt{\mathrm{var}_n(X^i)}$. Set $\hat B^{(0)}=I_p$ as the initial value, set the maximum number of iterations to 10, and set the iteration counter $t$ to 1.
Step 2. For each $j=1,\dots,n$ and iteration $t$, solve for the vectors $\hat b_{j1}^{(t)},\dots,\hat b_{jn}^{(t)}$ from (5).
Step 3. Compute $\hat\Lambda^{(t)}=\frac{1}{n^2}\sum_{j,k=1}^n\hat b_{jk}^{(t)}\bigl(\hat b_{jk}^{(t)}\bigr)^{\mathsf{T}}$. Then perform an eigendecomposition of $\hat\Lambda^{(t)}$ and obtain the $d_0$ eigenvectors $\hat v_1,\dots,\hat v_{d_0}$ corresponding to its largest eigenvalues. Let $\hat B^{(t)}=(\hat v_1,\dots,\hat v_{d_0})$.
Step 4. If $t\leq 10$, reset $h_t=\max(r_n h_{t-1},c_0 n^{-1/(d+4)})$, where $r_n=n^{-1/2(p_0+6)}$, increase $t$ by 1, and return to Step 2. Otherwise, let $\hat D$ be the diagonal matrix $\mathrm{diag}(\hat\sigma_1^2,\dots,\hat\sigma_p^2)$. A basis of the central subspace $\mathcal{S}_{Y|X}$ is then estimated by $\{\hat D^{-1/2}\hat v_1,\dots,\hat D^{-1/2}\hat v_{d_0}\}$.
Algorithm 2 Fréchet OPG

4.3 Ensembled inverse regression estimators via CS ensembles

In this subsection, we adapt several well-known estimators of the classical central subspace to Fréchet SDR, including SIR (Li, 1991), SAVE (Cook & Weisberg, 1991), and DR (Li & Wang, 2007). We use the CS-ensemble to combine these classical estimates through (3). Let $\mathfrak{F}=\{\kappa(\cdot,y):y\in\Omega_Y\}$ be a CS-ensemble, where $\kappa$ is a $cc$-universal kernel. Let $M_0:\mathcal{P}_{XY}\times\mathfrak{F}\to\mathbb{R}^{p\times p}$ be a CS-ensemble candidate matrix. Let $M(F_{XY})=\int M_0(F_{XY},\kappa(\cdot,y))\,dF_Y(y)$ be the Fréchet candidate matrix.

Again, we work with the standardized predictor $Z$. The candidate matrices $M_0(y)$ for the ensemble SIR, SAVE, and DR are formulated in Remark 2. Detailed motivation for each can be found in Li (2018, Chapters 3, 5, and 6). At the sample level, we replace any unconditional moment $E$ by the sample average $E_n$, and replace any conditional moment, such as $E(Z|\kappa(Y,y))$, by the corresponding slice mean. The algorithms are also included in Algorithm 1. A more detailed algorithm is given by Algorithm 5 in the Supplementary Material.

REMARK 2.

The candidate matrices $M_0(y)$ for Fréchet SIR, SAVE, and DR are $\mathrm{var}[E(Z|\kappa(Y,y))]$, $[I_p-\mathrm{var}(Z|\kappa(Y,y))]^2$, and $2E\{E[ZZ^{\mathsf{T}}|\kappa(Y,y)]\}^2+2E^2\{E[Z|\kappa(Y,y)]E[Z^{\mathsf{T}}|\kappa(Y,y)]\}+2E\{E[Z^{\mathsf{T}}|\kappa(Y,y)]E[Z|\kappa(Y,y)]\}\,E\{E[Z|\kappa(Y,y)]E[Z^{\mathsf{T}}|\kappa(Y,y)]\}-2I_p$, respectively.
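For one ensemble member, a minimal sketch of the sample-level Fréchet SIR candidate matrix follows; the number of slices is a hypothetical tuning choice, not fixed by the paper.

import numpy as np

def frechet_sir_candidate(Z, kappa_y, n_slices=10):
    """Sample-level sketch of M0(y) = var[E(Z | kappa(Y, y))] for one ensemble
    member: sort the sample on kappa(Y_i, y), slice it, and average Z within
    slices. Z: (n, p) standardized predictors; kappa_y: (n,) values kappa(Y_i, y)."""
    n, p = Z.shape
    order = np.argsort(kappa_y)
    M0 = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)             # slice mean, estimating E(Z | slice)
        M0 += (len(idx) / n) * np.outer(m, m)
    return M0

# The Fréchet candidate matrix averages M0 over the observed responses:
# M_hat = (1/n) * sum over k of frechet_sir_candidate(Z, K[:, k]).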

REMARK 3.

Regarding the time complexity of the Fréchet SDR methods: by construction, the ensemble estimator requires $n$ times the computing time of the original estimator, because it reapplies the original estimator for each $\kappa(\cdot,y_i)$, $i=1,\ldots,n$. For example, if SAVE is used as the original estimator, then the largest matrix multiplication is $A_{p\times n}B_{n\times p}$, which requires $p^2 n$ basic computation units; the largest matrix to invert or eigendecompose is a $p\times p$ matrix, which requires $p^3$ basic computation units. So the net computational complexity is $n\times\max(O(np^2),O(p^3))$.

4.4 Fisher consistency

In this subsection, we establish the unbiasedness and Fisher consistency of the tuned Fréchet candidate matrix. As a special case, the Fréchet candidate matrix constructed by any of the moment-based methods in Section 4.1 can be viewed as a tuned Fréchet candidate matrix with the tuning parameter $h$ taken to be 0. The next theorem shows that if $T_0$ is unbiased (or Fisher consistent), then $T$ is unbiased (or Fisher consistent). In the following, we say that a measure $\mu$ on $\Omega_Y$ is strictly positive if and only if $\mu(U)>0$ for any nonempty open set $U\subseteq\Omega_Y$. For a matrix $A$, $\|A\|$ represents the operator norm.

THEOREM 2.

Suppose $\mathfrak{F}=\{\kappa(\cdot,y):y\in\Omega_Y\}$ is a CMS-ensemble, where $\kappa$ is a $cc$-universal kernel. We have the following results regarding the unbiasedness and Fisher consistency of $T$.

  1. If $T_0$ is unbiased for $\{\mathcal{S}_{E[\kappa(Y,y)|X]}:y\in\Omega_Y\}$ and $\|T_0(F_{XY}^{(0)},\kappa(\cdot,Y'),h)\|\leq G(Y')$, where $G(Y')$ is a real-valued function with $E[G(Y')]<\infty$, then $T$ is unbiased for $\mathcal{S}_{Y|X}$;

  2. If (a) $T_0$ is Fisher consistent for $\{\mathcal{S}_{E[\kappa(Y,y)|X]}:y\in\Omega_Y\}$, (b) $T_0(F_{XY},\kappa(\cdot,y),h)$ is positive semidefinite for each $y\in\Omega_Y$, $h\in\mathbb{R}$, and $F_{XY}\in\mathcal{P}_{XY}$, (c) $\limsup_{h\to 0}\|T_0(F_{XY}^{(0)},\kappa(\cdot,Y'),h)\|\leq G(Y')$ with $E[G(Y')]<\infty$, (d) $F_Y$ is strictly positive on $\Omega_Y$, and (e) the mapping $y'\mapsto\lim_{h\to 0}T_0(F_{XY},\kappa(\cdot,y'),h)$ is continuous, then $T$ is Fisher consistent for $\mathcal{S}_{Y|X}$.

We similarly develop Fisher consistency for Fréchet SDR based on the CS-ensemble, including the methods in Section 4.3. The next corollary says that if $M_0$ is Fisher consistent for $\{\mathcal{S}_{\kappa(Y,y)|X}:y\in\Omega_Y\}$, then $M$ is Fisher consistent for $\mathcal{S}_{Y|X}$. The proof is similar to that of Theorem 2 and is omitted.

COROLLARY 2.

Suppose $\mathfrak{F}=\{\kappa(\cdot,y):y\in\Omega_Y\}$ is a CS-ensemble, where $\kappa$ is a $c$-universal kernel. We have the following results regarding unbiasedness and Fisher consistency for $M$.

  1. If $M_0$ is unbiased for $\{\mathcal{S}_{\kappa(Y,y)|X}:y\in\Omega_Y\}$, then $M$ is unbiased for $\mathcal{S}_{Y|X}$;

  2. If $M_0$ is Fisher consistent for $\{\mathcal{S}_{\kappa(Y,y)|X}:y\in\Omega_Y\}$, $M_0(F_{XY},\kappa(\cdot,y))$ is positive semidefinite for each $y\in\Omega_Y$ and $F_{XY}\in\mathcal{P}_{XY}$, $F_Y$ is strictly positive, and the mapping $y'\mapsto M_0(F_{XY},\kappa(\cdot,y'))$ is continuous, then $M$ is Fisher consistent for $\mathcal{S}_{Y|X}$.

The unbiasedness and Fisher consistency of $T_0$ or $M_0$ are guaranteed by different sets of sufficient conditions for the moment-based and forward-regression-based estimators. We outline these conditions below.

  1. Most of the ensembled moment estimators in Section 4.1 and the ensembled inverse regression estimators in Section 4.3 are unbiased under either the LCM assumption alone or the LCM and CCV assumptions together. For example, the unbiasedness of SIR, OLS, and IHT requires the LCM assumption, whereas the unbiasedness of SAVE, DR, and PHD requires both the LCM and the CCV assumptions. The estimators SIR, OLS, IHT, and PHD are generally not exhaustive (recall that unbiasedness together with exhaustiveness is equivalent to Fisher consistency), but the sufficient conditions for SAVE and DR to be exhaustive are reasonably mild (see Li & Wang (2007) and Li (2018, Chapter 6)).

  2. Sufficient conditions for the Fisher consistency of OPG are given in Li (2018, Section 11.2). Specifically, it requires that (a) the smoothing kernel $K_0$ is a spherically contoured p.d.f. with finite fourth moments, and (b) the p.d.f. of $X$ is supported on $\mathbb{R}^p$ and has continuous, bounded second derivatives. Note that neither the LCM nor the CCV assumption is needed for the OPG estimator.

5 Convergence Rates of the Ensemble Estimates

In this section, we develop the convergence rates of the ensemble estimates for Fréchet SDR. To save space, we consider only the CMS-ensemble; the results for the CS-ensemble are largely parallel. To simplify the asymptotic development, we make a slight modification of the ensemble estimator, which does not result in any significant numerical difference from the ensembles developed in the previous sections. For each $i=1,\ldots,n$, let $\hat{F}_{XY}^{(-i)}$ be the empirical distribution based on the sample with the $i$th subject removed: $\{(X_1,Y_1),\ldots,(X_n,Y_n)\}\setminus\{(X_i,Y_i)\}$. Our modified ensemble estimate is of the form

$$T(\hat{F}_{XY}^{(n)},h_n)=n^{-1}\sum_{i=1}^{n}T_0(\hat{F}_{XY}^{(-i)},\kappa(\cdot,Y_i),h_n).$$

The purpose of this modification is to break the dependence between the ensemble member $\kappa(\cdot,Y_i)$ and the CMS estimate, which substantially simplifies the asymptotic argument. Here, we let the tuning parameter $h_n$ depend on $n$. Again, the Fréchet candidate matrix constructed by the moment-based methods can be considered as a special case with $h_n=0$.

Rather than deriving the convergence rate of each ensemble estimate case by case, we show that, under mild conditions, the convergence rate of the ensemble estimate is the same as that of the corresponding CMS estimate. Since the convergence rates of many CMS estimates are well established, including all the forward-regression and sample-moment-based estimates mentioned earlier, our general result covers all the CMS-ensemble estimates.

In the following, for a matrix $A$, $\|A\|$ denotes the operator norm and $\|A\|_{\mathrm{F}}$ the Frobenius norm. If $\{a_n\}$ and $\{b_n\}$ are sequences of positive numbers, we write $a_n\prec b_n$ if $\lim_{n\to\infty}a_n/b_n=0$, and $a_n\preceq b_n$ if $a_n/b_n$ is bounded. We write $b_n\succ a_n$ (or $b_n\succeq a_n$) if $a_n\prec b_n$ (or $a_n\preceq b_n$), and $a_n\asymp b_n$ if $a_n\preceq b_n$ and $b_n\preceq a_n$. Let $T_0^{*}(F_{XY}^{(0)},\kappa(\cdot,y))=\lim_{h\to 0}T_0(F_{XY}^{(0)},\kappa(\cdot,y),h)$ and $T^{*}(F_{XY}^{(0)})=\lim_{h\to 0}T(F_{XY}^{(0)},h)$.

THEOREM 3.

Let $C_n(y)=E\|T_0(\hat{F}_{XY}^{(n)},\kappa(\cdot,y),h_n)-T_0^{*}(F_{XY}^{(0)},\kappa(\cdot,y))\|$, and let $\{a_n\}$ be a sequence of positive numbers satisfying $a_{n+1}/a_n\asymp 1$ and $a_n\succeq n^{-1/2}$. Suppose the entries of $T_0^{*}(F_{XY}^{(0)},\kappa(\cdot,Y))$ have finite variances. If $E[C_n(Y)]=O(a_n)$, then $\|T(\hat{F}_{XY}^{(n)},h_n)-T^{*}(F_{XY}^{(0)})\|=O_P(a_n)$.

The above theorem says that, under some conditions, the convergence rate of an ensemble Fréchet SDR estimator is the same as the corresponding CMS estimator. This covers all the estimators developed in Section 4. Specifically:

  1. For all the moment-based ensemble methods, such as OLS, PHD, IHT, SIR, SAVE, and DR, the ensemble candidate matrices can be written in the form $M_0(\hat{F}_{XY}^{(n)},\kappa(\cdot,y))=\hat{\Lambda}(y)\hat{\Lambda}(y)^{\mathsf{T}}$, where $\hat{\Lambda}(y)$ is a matrix possessing a second-order von Mises expansion, implying $E[C_n(Y)]=O(n^{-1/2})$; see, for example, Li (2018).

  2. For the nonparametric forward-regression ensemble methods OPG and MAVE, the convergence rate of $C_n(y)$ was reported in Xia (2007) as $O(h_n^{2}+h_n^{-1}\delta_n^{2})$, where $\delta_n=\sqrt{(\log n)/(nh_n^{p})}$. Although this rate was established in terms of convergence in probability, under mild conditions such as uniform integrability, the same rate holds for $E[C_n(Y)]$.

6 Simulations

We evaluate the performance of the proposed Fréchet SDR methods with distributions and symmetric positive definite matrices as responses. For space considerations, an additional simulation with spherical data is presented in the Supplementary Material.

6.1 Computational Details

Choice of tuning parameters and kernel types. We first implement a unified cross-validation procedure to select the kernel type and the bandwidth $\gamma$ in the kernel. For both the distributional response and the symmetric positive definite matrix response, we consider the Gaussian radial basis kernel $\kappa_{\mathrm{G}}(y,y')=\exp(-\gamma d(y,y')^2)$ and the Laplacian radial basis kernel $\kappa_{\mathrm{L}}(y,y')=\exp(-\gamma d(y,y'))$ as candidates for constructing the ensembles. For the bandwidth $\gamma$, we set the default value as

$$\gamma_{\mathrm{G}}=\frac{\rho_Y}{2\sigma_{\mathrm{G}}^{2}},\quad\text{where}\quad\sigma_{\mathrm{G}}^{2}=\binom{n}{2}^{-1}\sum_{i<j}d(Y_i,Y_j)^{2},\quad\rho_Y=1,\qquad(6)$$

in the Gaussian radial basis kernel, and

$$\gamma_{\mathrm{L}}=\frac{\rho_Y}{2\sigma_{\mathrm{L}}},\quad\text{where}\quad\sigma_{\mathrm{L}}=\binom{n}{2}^{-1}\sum_{i<j}d(Y_i,Y_j),\quad\rho_Y=1,$$

in the Laplacian radial basis kernel. The same choices were used in Lee et al. (2013) and Li & Song (2017). We then fine-tune $\rho_Y$ and the kernel type together via $k$-fold cross-validation as follows. Randomly split the whole sample into $k$ subsets of roughly equal size, say $D_1,\ldots,D_k$. For each $i=1,\ldots,k$, use $D_i$ as the test set and its complement as the training set. We first use the training set to implement the Fréchet SDR with an initial dimension $d$, say 5; this choice of a relatively large dimension helps to guarantee the unbiasedness of the estimated Fréchet central subspace. We then apply the estimated $\hat{\beta}$ to the test set to produce the sufficient predictor $\hat{\beta}^{\mathsf{T}}X$ and fit a global Fréchet regression model (Petersen & Müller, 2019) to predict the response in the test set. We compute the prediction error for each $i$ and aggregate the errors over the folds $i=1,\ldots,k$, which yields an overall cross-validation error. This overall error depends on the tuning parameter $\rho_Y$ and the kernel type; it is minimized over the grid $\{10^{-2},10^{-1},1,10\}\times\{\kappa_{\mathrm{G}},\kappa_{\mathrm{L}}\}$ to obtain the optimal combination, as sketched in the code below.
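A minimal sketch of this tuning step follows, assuming two hypothetical routines: fit_sdr, which returns a $p\times d$ basis estimate from the training data, and frechet_prediction_error, which computes the global Fréchet regression prediction error on a test fold. Only the default bandwidths in (6) and the grid-search logic are spelled out.

```python
import numpy as np

def default_gammas(D):
    """Default bandwidths (rho_Y = 1) from the pairwise response distances D,
    as in (6): gamma_G = 1 / (2 sigma_G^2) and gamma_L = 1 / (2 sigma_L)."""
    iu = np.triu_indices_from(D, k=1)
    sigma_g2 = np.mean(D[iu] ** 2)   # choose(n,2)^{-1} * sum_{i<j} d(Y_i,Y_j)^2
    sigma_l = np.mean(D[iu])         # choose(n,2)^{-1} * sum_{i<j} d(Y_i,Y_j)
    return 1.0 / (2.0 * sigma_g2), 1.0 / (2.0 * sigma_l)

def cv_select(X, D, fit_sdr, frechet_prediction_error, k=5, d_init=5, seed=0):
    """k-fold CV over the grid {1e-2, 1e-1, 1, 10} x {Gaussian, Laplacian}."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    gam_g, gam_l = default_gammas(D)
    best, best_err = None, np.inf
    for rho in (1e-2, 1e-1, 1.0, 10.0):
        for name, K in (("gaussian", np.exp(-rho * gam_g * D**2)),
                        ("laplacian", np.exp(-rho * gam_l * D))):
            err = 0.0
            for test in folds:
                train = np.setdiff1d(np.arange(n), test)
                beta = fit_sdr(X[train], K[np.ix_(train, train)], d_init)
                err += frechet_prediction_error(X[train] @ beta,
                                                X[test] @ beta, train, test)
            if err < best_err:
                best, best_err = (name, rho), err
    return best
```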

Estimation of the dimension. For the ensemble estimators that possess a candidate matrix (such as the ensemble moment estimators in Section 4.1), recently developed order-determination methods, such as the ladle estimate (Luo & Li, 2016) and the predictor augmentation estimator (Luo & Li, 2021), can be applied directly to estimate $d_0$. In addition, the BIC criterion introduced by Zhu et al. (2006) can be used for this purpose.

In this paper, we adapt the predictor augmentation estimator to the current setting. A detailed introduction to the predictor augmentation method and further simulation results are included in the Supplementary Material. For the predictor augmentation estimator, we take the number of augmentations to be $s=10$ and the dimension of the augmented predictors to be $r=\lceil p/2\rceil$, where $p$ is the original dimension of the predictors.

Estimation error assessment. We use the error measure for subspace estimation of Li et al. (2005): if $\mathcal{S}_1$ and $\mathcal{S}_2$ are two subspaces of $\mathbb{R}^p$ of the same dimension, their distance is defined as $d(\mathcal{S}_1,\mathcal{S}_2)=\|P_{\mathcal{S}_1}-P_{\mathcal{S}_2}\|_{\mathrm{F}}$, where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$ and $\|\cdot\|_{\mathrm{F}}$ is the Frobenius matrix norm. If $B_1$ and $B_2$ are matrices whose columns form bases of $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively, this distance can be written equivalently as $\|B_1(B_1^{\mathsf{T}}B_1)^{-1}B_1^{\mathsf{T}}-B_2(B_2^{\mathsf{T}}B_2)^{-1}B_2^{\mathsf{T}}\|_{\mathrm{F}}$. This distance is coordinate-free, as it is invariant to the choice of the basis matrices.
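The distance is straightforward to compute from any pair of basis matrices; a minimal sketch:

```python
import numpy as np

def subspace_distance(B1, B2):
    """Frobenius distance between the projections onto span(B1) and span(B2).
    The value does not depend on the choice of the basis matrices."""
    P1 = B1 @ np.linalg.solve(B1.T @ B1, B1.T)
    P2 = B2 @ np.linalg.solve(B2.T @ B2, B2.T)
    return np.linalg.norm(P1 - P2, ord="fro")
```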

To facilitate the comparison, we also include a benchmark error, defined as the expectation of the above distance when $B_1$ is a basis matrix of the true central subspace and the entries of $B_2$ are generated as i.i.d. $N(0,1)$. This expectation is computed by Monte Carlo with 1,000 repetitions, as in the sketch below.
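Using the subspace_distance function sketched above, the benchmark can be approximated as follows.

```python
import numpy as np

def benchmark_error(B0, n_rep=1000, seed=0):
    """Monte Carlo estimate of the expected distance between span(B0) and a
    random subspace whose basis entries are i.i.d. N(0, 1)."""
    rng = np.random.default_rng(seed)
    p, d = B0.shape
    return float(np.mean([subspace_distance(B0, rng.standard_normal((p, d)))
                          for _ in range(n_rep)]))
```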

6.2 Scenario I: Fréchet SDR for distributions

Let $(\Omega_Y,d_{\mathrm{W}})$ be the metric space of univariate distributions endowed with the Wasserstein metric $d_{\mathrm{W}}$, as described in Section 3. The construction of the ensembles requires computing the Wasserstein distances $d_{\mathrm{W}}(Y_i,Y_j)$ for $i,j=1,\ldots,n$. However, the distributions $Y_1,\ldots,Y_n$ are usually not fully observed in practice, so we need to estimate them when implementing the proposed methods. There are multiple ways to do so, such as estimating the c.d.f.'s, the quantile functions (Parzen, 1979), or the p.d.f.'s (Petersen & Müller, 2016; Chen et al., 2021). For computational simplicity, we use the Wasserstein distances between the empirical measures. Specifically, suppose we observe $(X_1,\{W_{1j}\}_{j=1}^{m_1}),\ldots,(X_n,\{W_{nj}\}_{j=1}^{m_n})$, where $\{W_{ij}\}_{j=1}^{m_i}$ is an independent sample from the distribution $Y_i$. Let $\hat{Y}_i$ be the empirical measure $m_i^{-1}\sum_{j=1}^{m_i}\delta_{W_{ij}}$, where $\delta_a$ is the Dirac measure at $a$; we then estimate $d_{\mathrm{W}}(Y_i,Y_k)$ by $d_{\mathrm{W}}(\hat{Y}_i,\hat{Y}_k)$. For the theoretical justification, see Fournier & Guillin (2015) and Lei (2020). For simplicity, we assume the sample sizes $m_1,\ldots,m_n$ are equal and denote the common sample size by $m$. The distance between the empirical measures $\hat{Y}_i$ and $\hat{Y}_k$ is then a simple function of the order statistics: $d_{\mathrm{W}}(\hat{Y}_i,\hat{Y}_k)=\{\sum_{j=1}^{m}(W_{i(j)}-W_{k(j)})^2\}^{1/2}$, where $W_{i(j)}$ is the $j$th order statistic of the sample $W_{i1},\ldots,W_{im}$.
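In code, the estimated distance reduces to sorting the two samples; a minimal sketch:

```python
import numpy as np

def wasserstein_empirical(W_i, W_k):
    """Distance between two empirical measures with m atoms each, computed
    from the order statistics as { sum_j (W_i(j) - W_k(j))^2 }^{1/2}."""
    return float(np.sqrt(np.sum((np.sort(W_i) - np.sort(W_k)) ** 2)))
```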

Let $\beta_1^{\mathsf{T}}=(1,1,0,\ldots,0)$, $\beta_2^{\mathsf{T}}=(0,\ldots,0,1,1)$, $\beta_3^{\mathsf{T}}=(1,2,0,\ldots,0,2)$, and $\beta_4^{\mathsf{T}}=(0,0,1,2,2,\ldots,0)$. To generate the univariate distributional response $Y$, we let $Y=N(\mu_Y,\sigma_Y^2)$, where $\mu_Y$ and $\sigma_Y^2$ are random variables depending on $X$, with $\sigma_Y>0$ almost surely, defined by the following models:

  1. I-1: $\mu_Y|X\sim N(\exp(\beta_1^{\mathsf{T}}X),1)$ and $\sigma_Y=1$.

  2. I-2: $\mu_Y|X\sim N(\exp(\beta_1^{\mathsf{T}}X),1)$ and $\sigma_Y=10^{-1}\cdot 1\{\varsigma(X)<10^{-1}\}+\varsigma(X)\cdot 1\{10^{-1}\leq\varsigma(X)\leq 10\}+10\cdot 1\{\varsigma(X)>10\}$, where $\varsigma(X)=\exp(\beta_2^{\mathsf{T}}X)$.

  3. I-3: $\mu_Y|X\sim N(3(\beta_3^{\mathsf{T}}X),0.5^2)$ and $\sigma_Y|X\sim\mathrm{Gamma}((2+2\beta_4^{\mathsf{T}}X)^2/\nu,\nu/(2+2\beta_4^{\mathsf{T}}X))$ truncated to the range $(10^{-1},10)$, with $\nu=0.5$.

  4. I-4: $\mu_Y|X\sim N(3\sin(\beta_3^{\mathsf{T}}X),0.5^2)$ and $\sigma_Y|X\sim\mathrm{Gamma}((2+2\beta_4^{\mathsf{T}}X)^2/\nu,\nu/(2+2\beta_4^{\mathsf{T}}X))$ truncated to the range $(10^{-1},10)$, with $\nu=0.5$.

To generate the predictor $X$, we consider scenarios in which Assumption 1 is satisfied as well as scenarios in which it is violated. Specifically, for Models I-1 and I-2, $X$ is generated under the following two settings:

  (a) $X\sim N(0,I_p)$; in this case, both the LCM and the CCV conditions in Assumption 1 are satisfied;

  (b) we first generate $U_1,\ldots,U_p$ from an AR(1) model with mean 0 and covariance matrix $\Sigma=(0.5^{|i-j|})_{i,j}$, and then generate $X=(\sin(U_1),|U_2|,U_3,\ldots,U_p)$; for this setting, both the LCM and the CCV conditions are violated.

For Models I-3 and I-4, we generate $X=\Phi(U)$, where $\Phi$ is the multivariate c.d.f. of $N(0,I_p)$ and $U$ is generated as in (b). For these models, both the LCM and the CCV conditions are violated.

Ying & Yu (2022) considered models similar to Models I-1 and I-2. In Models I-3 and I-4, the error depends on $X$, which means that the Fréchet central subspace contains directions beyond the conditional Fréchet mean function. For Model I-1, $B_0=\beta_1$ and $d_0=1$; for Model I-2, $B_0=(\beta_1,\beta_2)$ and $d_0=2$; for Models I-3 and I-4, $B_0=(\beta_3,\beta_4)$ and $d_0=2$. In the simulation, we first generate $X_1,\ldots,X_n$ and then $(\mu_{Y_1},\sigma_{Y_1}),\ldots,(\mu_{Y_n},\sigma_{Y_n})$; for each $i=1,\ldots,n$, we generate $W_{i1},\ldots,W_{im}$ independently from $N(\mu_{Y_i},\sigma_{Y_i}^2)$. We take $(n,p)=(200,10),(400,20)$ and $m=100$. A sketch of this data-generating procedure for Model I-1 is given below.
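For concreteness, here is a sketch of the data-generating step for Model I-1 under predictor setting (a); the sample sizes and seed are illustrative.

```python
import numpy as np

def generate_model_I1(n=200, p=10, m=100, seed=1):
    """Model I-1 with setting (a): X ~ N(0, I_p),
    mu_Y | X ~ N(exp(beta_1' X), 1), and sigma_Y = 1."""
    rng = np.random.default_rng(seed)
    beta1 = np.zeros(p)
    beta1[:2] = 1.0                                # beta_1 = (1, 1, 0, ..., 0)
    X = rng.standard_normal((n, p))
    mu = rng.normal(np.exp(X @ beta1), 1.0)        # conditional mean of Y_i
    W = rng.normal(mu[:, None], 1.0, size=(n, m))  # m observations from each Y_i
    return X, W
```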

We compare the performance of the CMS-ensemble and CS-ensemble methods described in Section 4, including FOLS, FPHD, FIHT, FSIR, FSAVE, FDR, and FOPG (with refinement). We first implement the predictor augmentation (PA) method to estimate the dimension of the Fréchet central subspace; then, with the estimated $\hat{d}$, we evaluate the estimation error. For FOPG, the number of iterations is set to 5, which is large enough to guarantee numerical convergence. For FSIR and FSAVE, the number of slices is $\lfloor n/2p\rfloor$; for FDR, it is $\lfloor n/6p\rfloor$. We also implement the weighted inverse regression ensemble (WIRE) method proposed by Ying & Yu (2022) for comparison. We repeat each experiment 100 times and report the proportion of correct order identifications and the mean and standard deviation of the estimation error in Tables 1 and 2. A smaller distance indicates a more accurate estimate, and the estimate with the smallest distance is shown in boldface. The benchmark distances are given at the bottom of each table.

Model (p,n)(p,n) FOLS FPHD FIHT FSIR FSAVE FDR FOPG WIRE
I-1-(a) 100% 87% 100% 98% 75% 87% 100% 100%
(10,200) 0.343 0.667 0.349 0.271 0.561 0.407 0.167 0.235
(0.088) (0.229) (0.089) (0.128) (0.316) (0.261) (0.057) (0.054)
100% 84% 100% 100% 83% 90% 100% 100%
(20,400) 0.359 0.705 0.365 0.262 0.511 0.393 0.220 0.249
(0.073) (0.214) (0.073) (0.044) (0.27) (0.225) (0.052) (0.041)
I-1-(b) 94% 95% 93% 86% 75% 91% 100% 86%
(10,200) 0.415 0.663 0.437 0.33 0.504 0.338 0.135 0.3
(0.18) (0.155) (0.187) (0.283) (0.341) (0.228) (0.036) (0.294)
95% 95% 96% 89% 73% 95% 99% 84%
(20,400) 0.402 0.667 0.415 0.303 0.511 0.297 0.196 0.314
(0.156) (0.141) (0.142) (0.254) (0.329) (0.176) (0.089) (0.309)
I-2-(a) 100% 57% 100% 99% 97% 100% 100% 100%
(10,200) 0.413 1.048 0.416 0.387 0.537 0.418 0.253 0.316
(0.092) (0.206) (0.092) (0.101) (0.161) (0.099) (0.054) (0.056)
100% 48% 100% 100% 99% 100% 100% 100%
(20,400) 0.419 1.131 0.423 0.367 0.552 0.433 0.290 0.318
(0.072) (0.175) (0.073) (0.05) (0.1) (0.062) (0.051) (0.039)
I-2-(b) 100% 56% 100% 97% 98% 100% 99% 99%
(10,200) 0.552 1.135 0.558 0.477 0.627 0.504 0.288 0.379
(0.111) (0.143) (0.109) (0.147) (0.139) (0.107) (0.097) (0.11)
100% 56% 100% 100% 100% 100% 100% 100%
(20,400) 0.570 1.161 0.576 0.469 0.658 0.534 0.337 0.391
(0.072) (0.121) (0.072) (0.066) (0.089) (0.07) (0.049) (0.054)
Table 1: Percentages of correct order determination and the mean (standard deviation) of the estimation error measured by $\|P_{B_0}-P_{\hat{B}}\|_{\mathrm{F}}$ for Models I-1 and I-2 under settings (a) and (b). The benchmarks for Model I-1 with $p=10,20$ are 1.334 and 1.373, respectively; for Model I-2 with $p=10,20$, they are 1.785 and 1.893, respectively. The boldfaced number indicates the best performer.
Model (p,n)(p,n) FOLS FPHD FIHT FSIR FSAVE FDR FOPG WIRE
I-3 100% 13% 100% 100% 75% 99% 100% 93%
(10,200) 0.269 1.150 0.269 0.342 0.741 0.408 0.242 0.311
(0.05) (0.11) (0.051) (0.067) (0.303) (0.117) (0.047) (0.199)
100% 11% 100% 100% 96% 100% 100% 100%
(20,400) 0.279 1.165 0.279 0.333 0.674 0.407 0.259 0.273
(0.046) (0.092) (0.046) (0.061) (0.177) (0.071) (0.042) (0.045)
I-4 100% 39% 100% 100% 69% 99% 99% 67%
(10,200) 0.383 1.303 0.384 0.431 0.897 0.535 0.311 0.583
(0.08) (0.172) (0.08) (0.104) (0.248) (0.136) (0.104) (0.322)
100% 47% 100% 100% 86% 99% 100% 95%
(20,400) 0.408 1.349 0.409 0.443 0.843 0.536 0.328 0.419
(0.072) (0.152) (0.072) (0.078) (0.234) (0.106) (0.048) (0.156)
Table 2: Percentages of correct order determination and the mean (standard deviation) of the estimation error measured by $\|P_{B_0}-P_{\hat{B}}\|_{\mathrm{F}}$ for Models I-3 and I-4. The benchmarks for Models I-3 and I-4 with $p=10,20$ are 1.787 and 1.895, respectively. The boldfaced number indicates the best performer.

For Models I-1 and I-2, the best performer, FOPG, achieves 100% correct order determination and has the smallest estimation error. The moment-based ensemble methods are slightly less accurate than FOPG. Compared with the benchmark, all methods except FPHD successfully estimate the true central subspace. Comparing the results under predictor settings (a) and (b), most moment-based and inverse-regression-based methods have larger estimation errors and lower percentages of correct order determination under setting (b), but FOPG, which is free of the elliptical assumption on the predictors, still gives the most precise estimates. Overall, relative to the benchmark error, the correlation among the predictors and the non-ellipticity do not affect the results much. In Models I-3 and I-4, there exist directions that lie in the central subspace but are not contained in the conditional Fréchet mean; in these cases, the best performer is still FOPG.

We also plot $Y$ versus the sufficient predictors obtained by FOPG for Model I-3. From Figure 2, we see a strong relation between $Y$ and the first two estimated sufficient predictors $\hat{\beta}_1^{\mathsf{T}}X$ and $\hat{\beta}_2^{\mathsf{T}}X$, compared with $Y$ versus the individual predictor $X_3$. We also observe that the first sufficient predictor captures the location variation and the second captures the scale variation.

Figure 2: 3D plots of the distributional responses versus (a) the first sufficient predictor $\hat{\beta}_1^{\mathsf{T}}X$; (b) the second sufficient predictor $\hat{\beta}_2^{\mathsf{T}}X$; and (c) $X_3$, for Model I-2 with $(n,p)=(200,10)$.

6.3 Scenario II: Fréchet SDR for positive definite matrices

Let $\Omega_Y$ be the space $\mathrm{Sym}^+(r)$ endowed with the Frobenius distance $d_{\mathrm{F}}(Y_1,Y_2)=\|Y_1-Y_2\|_{\mathrm{F}}$. To accommodate anatomical intersubject variability, Schwartzman (2006) introduced the symmetric matrix-variate normal distributions, which we adopt to construct the regression model with correlation-matrix responses. We say that $Z\in\mathrm{Sym}(r)$ has the standard symmetric matrix-variate normal distribution $N_{rr}(0;I_r)$ if it has density $\varphi(Z)=(2\pi)^{-q/2}\exp(-\mathrm{tr}(Z^2)/2)$ with respect to Lebesgue measure on $\mathbb{R}^q$, where $q=r(r+1)/2$. As pointed out in Schwartzman (2006), this definition is equivalent to a symmetric matrix with independent $N(0,1)$ diagonal entries and $N(0,1/2)$ off-diagonal entries. We say that $Y\in\mathrm{Sym}(r)$ has the symmetric matrix-variate normal distribution $N_{rr}(M;\Sigma)$ if $Y=GZG^{\mathsf{T}}+M$, where $M\in\mathrm{Sym}(r)$, $G\in\mathbb{R}^{r\times r}$ is nonsingular, and $\Sigma=G^{\mathsf{T}}G$. As a special case, we write $Y\sim N_{rr}(M;\sigma^2)$ if $Y=\sigma Z+M$; a sampler based on this characterization is sketched below.
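A draw from $N_{rr}(M;\sigma^2)$ can be sketched directly from this characterization; the function name is ours.

```python
import numpy as np

def sample_sym_normal(M, sigma, rng=np.random.default_rng(2)):
    """Y = sigma * Z + M with Z ~ N_rr(0; I_r). Setting Z = (G + G') / 2,
    where G has i.i.d. N(0, 1) entries, gives N(0, 1) diagonal entries and
    N(0, 1/2) off-diagonal entries, as required."""
    r = M.shape[0]
    G = rng.standard_normal((r, r))
    return sigma * (G + G.T) / 2.0 + M
```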

We generate the predictor $X$ as in settings (a) and (b) of Scenario I. We generate $\log(Y)$ from $N_{rr}(\log\{D(X)\};0.25)$, where $\log(\cdot)$ is the matrix logarithm defined in Section 3 and $D(X)$ is specified by the following models:

  1. II-1: $D(X)=\begin{pmatrix}1&\rho(X)\\ \rho(X)&1\end{pmatrix}$, where $\rho(X)=[\exp(\beta_1^{\mathsf{T}}X)-1]/[\exp(\beta_1^{\mathsf{T}}X)+1]$.

  2. II-2: $D(X)=\begin{pmatrix}1&\rho_1(X)&\rho_2(X)\\ \rho_1(X)&1&\rho_1(X)\\ \rho_2(X)&\rho_1(X)&1\end{pmatrix}$, where $\rho_1(X)=0.4[\exp(\beta_1^{\mathsf{T}}X)-1]/[\exp(\beta_1^{\mathsf{T}}X)+1]$ and $\rho_2(X)=0.4\sin(\beta_3^{\mathsf{T}}X)$.

In Model II-1, $B_0=\beta_1$ and $d_0=1$; in Model II-2, $B_0=(\beta_1,\beta_3)$ and $d_0=2$. Note that $D(x)$ is not necessarily the conditional Fréchet mean of $Y$ given $X$, but it still measures the central tendency of the conditional distribution of $Y|X$. We again compare the performance of the CMS-ensemble and CS-ensemble methods, with $(n,p)=(200,10),(400,20)$. The experiments are repeated 100 times, and the proportions of correct order identifications and the means and standard deviations of the estimation errors are summarized in Table 3. A sketch of the response-generation step is given below.
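A sketch of the generation of the Model II-1 responses, reusing the sample_sym_normal function above together with the matrix logarithm and exponential (since $\sigma^2=0.25$, we take $\sigma=0.5$):

```python
import numpy as np
from scipy.linalg import expm, logm

def generate_model_II1(X, beta1, sigma=0.5, seed=3):
    """log(Y_i) ~ N_rr(log D(X_i); sigma^2) with the 2 x 2 correlation
    matrix D(X) of Model II-1."""
    rng = np.random.default_rng(seed)
    Ys = []
    for x in X:
        t = np.exp(beta1 @ x)
        rho = (t - 1.0) / (t + 1.0)
        D = np.array([[1.0, rho], [rho, 1.0]])
        logY = sample_sym_normal(np.real(logm(D)), sigma, rng)
        Ys.append(expm(logY))            # map back to Sym^+(2)
    return Ys
```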

Model (p,n)(p,n) FOLS FPHD FIHT FSIR FSAVE FDR FOPG WIRE
II-1-(a) 100% 68% 100% 100% 91% 99% 100% 100%
(10,200) 0.153 0.836 0.153 0.158 0.264 0.159 0.150 0.150
(0.043) (0.296) (0.043) (0.037) (0.245) (0.095) (0.04) (0.041)
100% 69% 100% 100% 89% 99% 100% 100%
(20,400) 0.160 0.905 0.160 0.169 0.283 0.165 0.151 0.158
(0.029) (0.267) (0.029) (0.026) (0.262) (0.088) (0.026) (0.028)
II-1-(b) 97% 64% 97% 95% 53% 64% 93% 98%
(10,200) 0.248 1.017 0.249 0.276 0.726 0.525 0.251 0.218
(0.154) (0.276) (0.155) (0.186) (0.342) (0.387) (0.219) (0.133)
100% 54% 100% 97% 43% 63% 98% 100%
(20,400) 0.233 1.117 0.233 0.252 0.779 0.533 0.216 0.213
(0.049) (0.247) (0.049) (0.144) (0.349) (0.388) (0.121) (0.045)
II-2-(a) 100% 19% 100% 99% 78% 99% 100% 100%
(10,200) 0.295 1.185 0.295 0.310 0.619 0.371 0.149 0.293
(0.085) (0.139) (0.085) (0.113) (0.293) (0.133) (0.034) (0.084)
100% 24% 100% 100% 81% 100% 100% 100%
(20,400) 0.312 1.231 0.312 0.325 0.626 0.382 0.181 0.311
(0.065) (0.155) (0.065) (0.064) (0.233) (0.084) (0.032) (0.065)
II-2-(b) 100% 32% 100% 100% 60% 76% 100% 100%
(10,200) 0.675 1.461 0.675 0.697 1.193 0.920 0.245 0.671
(0.183) (0.195) (0.183) (0.186) (0.246) (0.249) (0.071) (0.182)
100% 43% 100% 100% 57% 90% 100% 100%
(20,400) 0.697 1.492 0.697 0.714 1.253 0.936 0.331 0.694
(0.136) (0.182) (0.136) (0.13) (0.206) (0.208) (0.075) (0.137)
Table 3: Percentages of correct order determination and the mean (standard deviation) of the estimation error measured by $\|P_{B_0}-P_{\hat{B}}\|_{\mathrm{F}}$ for Scenario II. The benchmarks for Model II-1 with $p=10,20$ are 1.334 and 1.373, respectively; for Model II-2 with $p=10,20$, they are 1.785 and 1.893, respectively. The boldfaced number indicates the best performer.

We conclude that all the ensemble methods give reasonable estimates except FPHD, and FOPG performs best in all settings except Model II-1-(b). To illustrate the relation between the response and the estimated sufficient predictor $\hat{\beta}_1^{\mathsf{T}}X$, we adopt the ellipsoidal representation of SPD matrices: each $A\in\mathrm{Sym}^+(r)$ is associated with the ellipsoid centered at the origin, $\mathcal{E}_A=\{x:x^{\mathsf{T}}A^{-1}x\leq 1\}$. Figure 3 plots the response ellipsoids versus the estimated sufficient predictor in panel (a) and versus the predictor $X_{10}$ in panel (b) for Model II-1-(a). A clear pattern of change in the shape and rotation of the response ellipsoids emerges as $\hat{\beta}_1^{\mathsf{T}}X$ varies; the computation of each ellipse is sketched below.
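The boundary of each ellipse can be traced from the spectral decomposition of $A$; a minimal sketch:

```python
import numpy as np

def ellipsoid_boundary(A, num=200):
    """Boundary of E_A = {x : x' A^{-1} x <= 1} for a 2 x 2 SPD matrix A.
    If A = U diag(lam) U', the boundary is U diag(sqrt(lam)) applied to the
    unit circle, so the semi-axis lengths are sqrt(eigenvalues of A)."""
    lam, U = np.linalg.eigh(A)
    t = np.linspace(0.0, 2.0 * np.pi, num)
    circle = np.vstack([np.cos(t), np.sin(t)])
    return U @ (np.sqrt(lam)[:, None] * circle)   # 2 x num boundary points
```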

Figure 3: Ellipsoidal plots of the SPD matrix responses versus (a) the FOPG predictor $\hat{\beta}_1^{\mathsf{T}}X$ and (b) $X_{10}$, for Model II-1-(a) with $(n,p)=(200,10)$. Each horizontal ellipse is the ellipsoidal representation of an SPD matrix, and the vertical axis gives the value of (a) $\hat{\beta}_1^{\mathsf{T}}X$ or (b) $X_{10}$.

7 Application to the Human Mortality Data

This section presents an application concerning human life spans. Another application concerning intracerebral hemorrhage is presented in Section S.6 of the Supplementary Material.

Compared with summary statistics such as the crude death rate, viewing entire age-at-death distributions as data objects gives a more comprehensive picture of human longevity and health conditions. Mortality distributions are affected by many factors, including economic, health care, social, and environmental factors. To investigate the potential factors related to the mortality distributions across countries, we collected the nine predictors listed below, covering demographic, economic, labor market, nutrition, health, and environmental factors in 2015: (1) Population Density: population per square kilometer; (2) Sex Ratio: number of males per 100 females in the population; (3) Mean Childbearing Age: the average age of mothers at the birth of their children; (4) Gross Domestic Product (GDP) per Capita; (5) Gross Value Added (GVA) by Agriculture: the percentage of gross value added from agriculture, hunting, forestry, and fishing activities; (6) Consumer Price Index, with 2010 as the base year; (7) Unemployment Rate; (8) Expenditure on Health (percentage of GDP); (9) Arable Land (percentage of total land area). The data were collected from the United Nations Databases (http://data.un.org/) and the UN World Population Prospects 2019 Databases (https://population.un.org/wpp/Download). For each country, the life table records the number of deaths $d(x,n)$ aggregated over 5-year age intervals. We treat these data as histograms of age at death with bin widths of 5 years. We smooth the histograms for the 162 countries in 2015 using the 'frechet' package (https://cran.r-project.org/web/packages/frechet/index.html) to obtain probability density functions, and then compute the Wasserstein distances between them, along the lines of the sketch below.
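The following sketch conveys the flavor of this computation under a simplification: instead of the density smoothing performed by the 'frechet' package, it linearly interpolates the c.d.f. of each 5-year histogram and computes the Wasserstein distance as the L2 distance between quantile functions. The function names are ours.

```python
import numpy as np

def quantile_from_histogram(counts, edges, grid):
    """Quantile function of a histogram of death counts over age bins,
    by linear interpolation of the cumulative distribution function."""
    cdf = np.cumsum(counts) / np.sum(counts)
    return np.interp(grid, np.concatenate(([0.0], cdf)),
                     np.asarray(edges, dtype=float))

def wasserstein_from_histograms(c1, c2, edges, n_grid=200):
    """2-Wasserstein distance between two histograms as the L2 distance
    between their quantile functions on a common probability grid."""
    grid = (np.arange(n_grid) + 0.5) / n_grid
    q1 = quantile_from_histogram(c1, edges, grid)
    q2 = quantile_from_histogram(c2, edges, grid)
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))
```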

We use the Gaussian kernel $\kappa(y,y')=\exp(-\gamma d_{\mathrm{W}}^2(y,y'))$, where $\gamma$ is chosen according to (6) in Section 6.1. We standardize all covariates separately and use the predictor augmentation method combined with FOPG to estimate the dimension of the Fréchet central subspace, which yields an estimated dimension of 3. The first three directions obtained by FOPG are

$$\hat{\beta}_1=(0.416,\,-0.424,\,0.114,\,0.770,\,0.027,\,-0.146,\,-0.020,\,0.130,\,-0.053)^{\mathsf{T}},$$
$$\hat{\beta}_2=(0.186,\,0.155,\,-0.159,\,-0.135,\,-0.576,\,-0.714,\,-0.083,\,0.198,\,0.096)^{\mathsf{T}},$$
$$\hat{\beta}_3=(-0.108,\,0.498,\,0.507,\,0.135,\,0.487,\,-0.403,\,0.139,\,0.211,\,0.045)^{\mathsf{T}}.$$

A plot of the mortality densities versus the first sufficient predictor $\hat{\beta}_1^{\mathsf{T}}X$ is shown in Figure 4(a). Clear and useful patterns emerge: the mode of the mortality distribution shifts from right to left (with left indicating a longer life span) as the value of the first sufficient predictor increases. Moreover, there is a significant uptick at the right-most end as the first sufficient predictor decreases, indicating high infant mortality. Meanwhile, the loading of the first sufficient predictor is strongly positive on GDP per capita, which reflects a country's level of economic development and health care, with larger values associated with more developed countries and smaller values with less developed countries. From Figure 4(b), we see that the mean of the mortality distribution increases, and the standard deviation decreases, with the value of the first sufficient predictor. This also makes sense: the mean life span increases with the level of development, consistent with Figure 4(a). The standard deviation decreases with the first sufficient predictor because, as the development level increases, the life span becomes increasingly concentrated at high values; moreover, the high infant mortality in the lower region of the first sufficient predictor also contributes to the larger standard deviation there. The plots of the mean and standard deviation versus the second sufficient predictor in Figure 4(b) likewise indicate an increase in the mean and a decrease in the standard deviation as the second sufficient predictor increases.

Figure 4: (a) Mortality distributions versus the first sufficient predictor; (b) mean and standard deviation of the mortality distributions versus the first two sufficient predictors.

8 Discussion

In the classical regression setting, sufficient dimension reduction has been used as a tool for exploratory data analysis, regression diagnostics, and a mechanism to overcome the curse of dimensionality in regression. As a regression tool, it can help us to effectively treat collinearity in the predictor, detect heteroscedasticity in the response, find the most important linear combinations of predictors, and understand the general shape of the regression surface without fitting an elaborate regression model. All the above items can be revealed by the sufficient plot, where the response is plotted against the sufficient predictors obtained by SDR. Also, because the direction of a vector is an easier target to estimate than the vector itself, it helps us to mitigate the effect of the curse of dimensionality usually accompanying high-dimensional regression. In the same vein, it is easier to estimate a subspace than a specific set of vectors that span the subspace.

Although regression with metric space-valued random objects is a new problem, as a regression problem it shares the same set of issues, such as the need for exploratory analysis before fitting a regression, for model diagnostics afterward, and for mitigating the curse of dimensionality. As shown in Figure 1 in the paper, the first sufficient predictor clearly reveals a general trend in the mortality distributions among countries: at the lower end of the sufficient predictor, the mortality distribution shifts toward higher longevity, whereas at the higher end it shifts toward lower longevity, with a visible uptick near age 0 caused by infant mortality.

The proposed methodology is flexible and versatile: it can turn any existing SDR method into one that handles metric space-valued responses, and it applies to any separable and complete metric space of negative type with an explicit CMS ensemble. It significantly broadens the scope of sufficient dimension reduction and provides a useful set of tools for Fréchet regression.

References

  • Ambrosio et al. (2004) Ambrosio, L., Gigli, N. & Savaré, G. (2004), ‘Gradient flows with metric and differentiable structures, and applications to the Wasserstein space’, Atti Accad. Naz. Lincei Cl. Sci. Fis. Mat. Natur. Rend. Lincei (9) Mat. Appl 15(3-4), 327–343.
  • Bigot et al. (2017) Bigot, J., Gouet, R., Klein, T. & López, A. (2017), ‘Geodesic PCA in the Wasserstein space by convex PCA’, Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 53, 1–26.
  • Chen et al. (2021) Chen, Y., Lin, Z. & Müller, H.-G. (2021), ‘Wasserstein regression’, J. Amer. Statist. Assoc.
  • Christmann & Steinwart (2010) Christmann, A. & Steinwart, I. (2010), Universal kernels on non-standard input spaces, in ‘Advances in Neural Information Processing Systems’, pp. 406–414.
  • Cook (1996) Cook, R. D. (1996), ‘Graphics for regressions with a binary response’, J. Amer. Statist. Assoc 91, 983–992.
  • Cook & Li (2002) Cook, R. D. & Li, B. (2002), ‘Dimension reduction for conditional mean in regression’, The Annals of Statistics 30(2), 455–474.
  • Cook & Weisberg (1991) Cook, R. D. & Weisberg, S. (1991), ‘Sliced inverse regression for dimension reduction: Comment’, Journal of the American Statistical Association 86(414), 328–332.
  • Ding & Cook (2015) Ding, S. & Cook, R. D. (2015), ‘Tensor sliced inverse regression’, J. Multivar. Anal. 133, 216–231.
  • Dubey & Müller (2019) Dubey, P. & Müller, H.-G. (2019), ‘Fréchet analysis of variance for random objects’, Biometrika 106(4), 803–821.
  • Dubey & Müller (2020) Dubey, P. & Müller, H.-G. (2020), ‘Fréchet change-point detection’, Ann. Stat. 48(6), 3312–3335.
  • Eaton (1986) Eaton, M. L. (1986), ‘A characterization of spherical distributions’, J. Multivar. Anal. 20(2), 272–276.
  • Fan et al. (2017) Fan, J., Xue, L. & Yao, J. (2017), ‘Sufficient forecasting using factor models’, Journal of Econometrics 201(2), 292–306.
  • Feragen et al. (2015) Feragen, A., Lauze, F. & Hauberg, S. (2015), Geodesic exponential kernels: when curvature and linearity conflict, in ‘Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition’, pp. 3032–3042.
  • Ferré & Yao (2003) Ferré, L. & Yao, A.-F. (2003), ‘Functional sliced inverse regression analysis’, Statistics 37(6), 475–488.
  • Ferreira & Busatto (2013) Ferreira, L. K. & Busatto, G. F. (2013), ‘Resting-state functional connectivity in normal brain aging’, Neuroscience & Biobehavioral Reviews 37(3), 384–400.
  • Fournier & Guillin (2015) Fournier, N. & Guillin, A. (2015), ‘On the rate of convergence in Wasserstein distance of the empirical measure’, Probability Theory and Related Fields 162(3-4), 707–738.
  • Fréchet (1948) Fréchet, M. (1948), ‘Les éléments aléatoires de nature quelconque dans un espace distancié’, Annales de l’Institut Henri Poincaré, pp. 215–310.
  • Granirer (1970) Granirer, E. E. (1970), ‘Probability Measures on Metric Spaces, by K. R. Parthasarathy (Academic Press, New York and London, 1967)’, Canadian Mathematical Bulletin 13(2), 290–291.
  • Hall & Li (1993) Hall, P. & Li, K.-C. (1993), ‘On almost linearity of low dimensional projections from high dimensional data’, The Annals of Statistics pp. 867–889.
  • Honeine & Richard (2010) Honeine, P. & Richard, C. (2010), The angular kernel in machine learning for hyperspectral data classification, in ‘2010 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing’, IEEE, pp. 1–4.
  • Hsing & Ren (2009) Hsing, T. & Ren, H. (2009), ‘An RKHS formulation of the inverse regression dimension-reduction problem’, The Annals of Statistics 37(2), 726–755.
  • Jayasumana et al. (2013) Jayasumana, S., Hartley, R., Salzmann, M., Li, H. & Harandi, M. (2013), Combining multiple manifold-valued descriptors for improved object recognition, in ‘2013 International Conference on Digital Image Computing: Techniques and Applications (DICTA)’, IEEE, pp. 1–6.
  • Lee et al. (2013) Lee, K.-Y., Li, B. & Chiaromonte, F. (2013), ‘A general theory for nonlinear sufficient dimension reduction: Formulation and estimation’, The Annals of Statistics 41(1), 221–249.
  • Lei (2020) Lei, J. (2020), ‘Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces’, Bernoulli 26(1), 767–798.
  • Li (2018) Li, B. (2018), Sufficient Dimension Reduction: Methods and Applications with R, CRC Press.
  • Li et al. (2010) Li, B., Kim, M. K. & Altman, N. (2010), ‘On dimension folding of matrix-or array-valued statistical objects’, The Annals of Statistics 38(2), 1094–1121.
  • Li & Song (2017) Li, B. & Song, J. (2017), ‘Nonlinear sufficient dimension reduction for functional data’, The Annals of Statistics 45(3), 1059–1095.
  • Li & Wang (2007) Li, B. & Wang, S. (2007), ‘On directional regression for dimension reduction’, Journal of the American Statistical Association 102(479), 997–1008.
  • Li & Yin (2007) Li, B. & Yin, X. (2007), ‘On surrogate dimension reduction for measurement error regression: an invariance law’, The Annals of Statistics 35(5), 2143–2172.
  • Li et al. (2005) Li, B., Zha, H. & Chiaromonte, F. (2005), ‘Contour regression: a general approach to dimension reduction’, The Annals of Statistics 33(4), 1580–1616.
  • Li (1991) Li, K.-C. (1991), ‘Sliced inverse regression for dimension reduction’, J. Amer. Statist. Assoc 86, 316–327.
  • Li (1992) Li, K.-C. (1992), ‘On principal Hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma’, Journal of the American Statistical Association 87(420), 1025–1039.
  • Li & Duan (1989) Li, K.-C. & Duan, N. (1989), ‘Regression analysis under link violation’, Ann. Stat. 17(3), 1009–1052.
  • Luo & Li (2016) Luo, W. & Li, B. (2016), ‘Combining eigenvalues and variation of eigenvectors for order determination’, Biometrika 103(4), 875–887.
  • Luo & Li (2021) Luo, W. & Li, B. (2021), ‘On order determination by predictor augmentation’, Biometrika 108(3), 557–574.
  • Luo et al. (2021) Luo, W., Xue, L., Yao, J. & Yu, X. (2021), ‘Inverse moment methods for sufficient forecasting using high-dimensional predictors’, Biometrika 109(2), 473––487.
  • Micchelli et al. (2006) Micchelli, C. A., Xu, Y. & Zhang, H. (2006), ‘Universal kernels.’, J. Mach. Learn. Res. 7(12), 2651–2667.
  • Panaretos & Zemel (2020) Panaretos, V. M. & Zemel, Y. (2020), An Invitation to Statistics in Wasserstein Space, Springer Nature.
  • Parzen (1979) Parzen, E. (1979), ‘Nonparametric statistical data modeling’, J. Amer. Statist. Assoc. 74(365), 105–121.
  • Petersen & Müller (2016) Petersen, A. & Müller, H.-G. (2016), ‘Functional data analysis for density functions by transformation to a hilbert space’, The Annals of Statistics 44(1), 183–218.
  • Petersen & Müller (2019) Petersen, A. & Müller, H.-G. (2019), ‘Fréchet regression for random objects with euclidean predictors’, The Annals of Statistics 47(2), 691–719.
  • Schwartzman (2006) Schwartzman, A. (2006), Random ellipsoids and false discovery rates: Statistics for diffusion tensor imaging data, PhD thesis, Stanford University.
  • Sriperumbudur et al. (2011) Sriperumbudur, B. K., Fukumizu, K. & Lanckriet, G. R. (2011), ‘Universality, characteristic kernels and RKHS embedding of measures’, Journal of Machine Learning Research 12(7), 2389–2410.
  • Steinwart (2001) Steinwart, I. (2001), ‘On the influence of the kernel on the consistency of support vector machines’, Journal of Machine Learning Research 2(Nov), 67–93.
  • Xia (2007) Xia, Y. (2007), ‘A constructive approach to the estimation of dimension reduction directions’, The Annals of Statistics 35(6), 2654–2690.
  • Xia et al. (2002) Xia, Y., Tong, H., Li, W. K. & Zhu, L.-X. (2002), ‘An adaptive estimation of dimension reduction space (with discussion)’, Journal of the Royal Statistical Society. Series B 64(3), 363–410.
  • Ye & Weiss (2003) Ye, Z. & Weiss, R. E. (2003), ‘Using the bootstrap to select one of a new class of dimension reduction methods’, Journal of the American Statistical Association 98(464), 968–979.
  • Yin & Li (2011) Yin, X. & Li, B. (2011), ‘Sufficient dimension reduction based on an ensemble of minimum average variance estimators’, The Annals of Statistics 39(6), 3392–3416.
  • Yin et al. (2008) Yin, X., Li, B. & Cook, R. D. (2008), ‘Successive direction extraction for estimating the central subspace in a multiple-index regression’, Journal of Multivariate Analysis 99(8), 1733–1757.
  • Ying & Yu (2022) Ying, C. & Yu, Z. (2022), ‘Fréchet sufficient dimension reduction for random objects’, Biometrika, in press.
  • Yu et al. (2020) Yu, X., Yao, J. & Xue, L. (2020), ‘Nonparametric estimation and conformal inference of the sufficient forecasting with a diverging number of factors’, J. Bus. Econ. Stat. 40(1), 342–354.
  • Zhu et al. (2006) Zhu, L., Miao, B. & Peng, H. (2006), ‘On sliced inverse regression with high-dimensional covariates’, Journal of the American Statistical Association 101(474), 630–643.