
Adjusting inverse regression for predictors with clustered distribution

Wei Luo (weiluo@zju.edu.cn) and Yan Guo (yan_guo@zju.edu.cn), Center for Data Science, Zhejiang University, China 310058
Abstract

A major family of sufficient dimension reduction (SDR) methods, called inverse regression, commonly requires the distribution of the predictor $X$ to have a linear $E(X|\beta^{\mathsf{T}}X)$ and a degenerate $\mathrm{var}(X|\beta^{\mathsf{T}}X)$ for the desired reduced predictor $\beta^{\mathsf{T}}X$. In this paper, we adjust the first- and second-order inverse regression methods by modeling $E(X|\beta^{\mathsf{T}}X)$ and $\mathrm{var}(X|\beta^{\mathsf{T}}X)$ under the mixture model assumption on $X$, which allows these terms to convey more complex patterns and is most suitable when $X$ has a clustered sample distribution. The proposed SDR methods build a natural path between inverse regression and the localized SDR methods, and in particular inherit the advantages of both; that is, they are $n^{1/2}$-consistent, efficiently implementable, directly adjustable to high-dimensional settings, and capable of fully recovering the desired reduced predictor. These findings are illustrated by simulation studies and a real data example at the end, which also suggest the effectiveness of the proposed methods for non-clustered data.

Keywords: exhaustive SDR estimation, high-dimensional SDR, inverse regression, mixture model.

1 Introduction

Sufficient dimension reduction (SDR) has attracted intensive research interest in recent years due to the need for high-dimensional statistical data analysis. Usually, SDR assumes that the predictor $X$ affects the response $Y$ only through a lower-dimensional $\beta^{\mathsf{T}}X$; that is,

\[
Y\perp\!\!\!\perp X\mid\beta^{\mathsf{T}}X, \tag{1}
\]

where $\perp\!\!\!\perp$ denotes independence. The research interest of SDR is to find a qualified $\beta^{\mathsf{T}}X$ to serve as the subsequent predictor in a model-free manner, i.e. without parametric assumptions on $Y|\beta^{\mathsf{T}}X$. For identifiability, Cook (1998) introduced the central subspace, denoted by $\mathcal{S}_{Y|X}$, as the parameter of interest; it is the smallest subspace whose arbitrary basis matrix $\beta$ satisfies (1). This space is unique under fairly general conditions (Yin, Li and Cook, 2008).

When the research interest focuses on regression analysis, SDR is adjusted to find $\beta^{\mathsf{T}}X$ that preserves only the information about the regression, i.e. such that $E(Y|X)$ is measurable with respect to $\beta^{\mathsf{T}}X$. Similar to $\mathcal{S}_{Y|X}$, the unique parameter of interest in this scenario is called the central mean subspace and is denoted by $\mathcal{S}_{E(Y|X)}$ (Cook and Li, 2002). For ease of presentation, we do not distinguish the two parameters $\mathcal{S}_{E(Y|X)}$ and $\mathcal{S}_{Y|X}$ unless otherwise specified, and uniformly call both the central SDR subspace. Let $d$ be the dimension of the central SDR subspace and $\beta_0\in\mathbb{R}^{p\times d}$ be an arbitrary basis matrix of it.

In the literature, a major family of estimators of the central SDR subspace is called inverse regression, which includes ordinary least squares (OLS; Li and Duan, 1989) and principal Hessian directions (pHd; Li, 1992), which estimate $\mathcal{S}_{E(Y|X)}$, as well as sliced inverse regression (SIR; Li, 1991), the sliced average variance estimator (SAVE; Cook and Weisberg, 1991), and directional regression (DR; Li and Wang, 2007), which estimate $\mathcal{S}_{Y|X}$. A common feature of the inverse regression methods is that they first use the moments of $X|Y$ to construct a matrix-valued parameter $\Omega$, whose columns fall in the central SDR subspace under certain parametric assumptions on $X$, and then use an estimate of $\Omega$ to recover the central SDR subspace. Suppose $X$ has zero mean and covariance matrix $\Sigma_X$. The forms of $\Omega$ for OLS, SIR, and SAVE are, respectively,

\[
\Sigma_X^{-1}E(XY),\quad \Sigma_X^{-1}E\{E^{\otimes 2}(X|Y)\}\Sigma_X^{-1},\quad \Sigma_X^{-1}E[\{\Sigma_X-\mathrm{var}(X|Y)\}^{\otimes 2}]\Sigma_X^{-1}, \tag{2}
\]

where $v^{\otimes 2}$ denotes $vv^{\mathsf{T}}$ for any matrix $v$. Clearly, $\Omega$ is readily estimable when $Y$ is discrete with a limited number of outcomes. To ease the estimation for continuous $Y$ (except for OLS), the slicing technique is commonly applied, which replaces a univariate $Y$ in $\Omega$ with $Y_S=\sum_{i=1}^{H}iI(q_{i-1}<Y\leq q_i)$, where $H$ is a fixed integer and $q_i$ is the $(i/H)$th quantile of $Y$; the case of multivariate $Y$ follows similarly. Because the moments of $X|Y_S$ can be estimated by the sample moments within each slice $(q_{i-1},q_i]$, a consistent estimator of $\Omega$, denoted by $\widehat{\Omega}$, can be readily constructed. The central SDR subspace is then estimated by the linear span of the leading left singular vectors of $\widehat{\Omega}$.
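To fix ideas, the following R sketch computes the SIR candidate matrix $\widehat{\Omega}$ of (2) with $H$ equal-sized slices and extracts its leading left singular vectors; the function name and slicing scheme are our own illustrative choices, not code from the authors' released implementation.

    # A minimal sketch of SIR with H equal-sized slices (illustrative only).
    sir_omega <- function(X, y, H = 5, d = 1) {
      n <- nrow(X); p <- ncol(X)
      Xc <- scale(X, center = TRUE, scale = FALSE)     # center the predictor
      Sigma <- cov(Xc)
      slice <- cut(rank(y, ties.method = "first"),      # Y_S: slice label in 1..H
                   breaks = H, labels = FALSE)
      # E{E^{(2)}(X|Y_S)}: weighted outer products of within-slice means
      M <- matrix(0, p, p)
      for (h in seq_len(H)) {
        idx <- which(slice == h)
        mh  <- colMeans(Xc[idx, , drop = FALSE])
        M   <- M + (length(idx) / n) * tcrossprod(mh)
      }
      Omega <- solve(Sigma, M) %*% solve(Sigma)         # Sigma^{-1} M Sigma^{-1}
      svd(Omega)$u[, 1:d, drop = FALSE]                 # leading left singular vectors
    }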

Despite their popularity in applications, the inverse regression methods are also commonly criticized for the aforementioned parametric assumptions on $X$. For example, the first-order inverse regression methods, i.e. OLS and SIR, which only involve $E(X|Y)$ or its average $E(XY)$, require the linearity condition

\[
E(X|\beta_0^{\mathsf{T}}X)=P^{\mathsf{T}}(\Sigma_X,\beta_0)X, \tag{3}
\]

where $P(\Sigma_X,\beta)=\beta(\beta^{\mathsf{T}}\Sigma_X\beta)^{-1}\beta^{\mathsf{T}}\Sigma_X$ is the projection matrix onto the column space of $\beta$ under the inner product $\langle u,v\rangle=u^{\mathsf{T}}\Sigma_X v$. The second-order inverse regression methods, which additionally involve $\mathrm{var}(X|Y)$, e.g. pHd, SAVE, and directional regression, require both the linearity condition (3) and the constant variance condition

\[
\mathrm{var}(X|\beta_0^{\mathsf{T}}X)=\Sigma_X-\Sigma_X P(\Sigma_X,\beta_0). \tag{4}
\]

Since both conditions impose simple parametric forms on the moments of $X|\beta_0^{\mathsf{T}}X$, they can be violated in practice. Moreover, as the central SDR subspace is unknown, both (3) and (4) are often strengthened to hold for every $\beta\in\mathbb{R}^{p\times d}$. The strengthened linearity condition requires $X$ to be elliptically distributed (Li, 1991), and, together with the strengthened constant variance condition, it forces $X$ to have a multivariate normal distribution (Cook and Weisberg, 1991). Hall and Li (1993) justified the approximate satisfaction of (3) for general $X$ as $p$ grows, but their theory is developed for diverging $p$, and, more importantly, it is built in a Bayesian sense in which $\beta_0$ follows a continuous prior distribution on $\mathbb{R}^{p\times d}$. In the literature of high-dimensional SDR (Chen, Zou and Cook, 2010; Lin, Zhao and Liu, 2016; Zeng, Mai and Zhang, 2022), sparsity is commonly assumed on the central SDR subspace, so that $\beta_0$ only has a few nonzero rows. Thus, Hall and Li's result is often inapplicable.
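For intuition, the multivariate normal case can be verified directly: if $X\sim N(0,\Sigma_X)$, then for every $\beta\in\mathbb{R}^{p\times d}$, the joint normality of $(X,\beta^{\mathsf{T}}X)$ gives
\[
E(X|\beta^{\mathsf{T}}X)=\Sigma_X\beta(\beta^{\mathsf{T}}\Sigma_X\beta)^{-1}\beta^{\mathsf{T}}X=P^{\mathsf{T}}(\Sigma_X,\beta)X,\quad
\mathrm{var}(X|\beta^{\mathsf{T}}X)=\Sigma_X-\Sigma_X\beta(\beta^{\mathsf{T}}\Sigma_X\beta)^{-1}\beta^{\mathsf{T}}\Sigma_X=\Sigma_X-\Sigma_X P(\Sigma_X,\beta),
\]
which are exactly the strengthened versions of (3) and (4).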

To enhance the applicability of inverse regression, multiple methods have been proposed to relax the linearity condition (3). Li and Dong (2009) and Dong and Li (2010) allow $E(X|\beta_0^{\mathsf{T}}X)$ to lie in a general finite-dimensional functional space, and they accordingly reformulate the inverse regression methods as minimizing certain objective functions. However, the non-convexity of the objective functions may complicate the implementation, and, unlike the connection between the linearity condition and the ellipticity of $X$, the relaxed linearity condition lacks interpretability. Another effort, specific to SIR, can be found in Guan, Xie and Zhu (2017), which assumes $X$ to follow a mixture of skew-elliptical distributions with identical shape parameters up to multiplicative scalars. Guan, Xie and Zhu's method can be easily implemented once the mixture model of $X$ is consistently fitted.

In a certain sense, the minimum average variance estimator (MAVE; Xia et al., 2002), a state-of-the-art SDR method that estimates the central mean subspace $\mathcal{S}_{E(Y|X)}$ by local linear regression, can also be regarded as relaxing the linearity condition (3) for OLS. Recall that OLS is commonly used to estimate the coefficients, i.e. the one-dimensional $\mathcal{S}_{E(Y|X)}$, of a linear model for $E(Y|\beta_0^{\mathsf{T}}X)$. Thus, the linearity condition (3) on $X$ can be regarded as a surrogate for the assumption of linear $E(Y|\beta_0^{\mathsf{T}}X)$ in SDR estimation. As MAVE adopts and fits a linear $E(Y|\beta_0^{\mathsf{T}}X)$ in each local neighborhood of $X$, it equivalently adopts the linearity condition in these neighborhoods, which only requires the first-order Taylor expansion of $E(X|\beta_0^{\mathsf{T}}X)$ and thus is fairly general.

Following the same spirit as MAVE, aggregate dimension reduction (ADR; Wang et al., 2020) conducts SIR in local neighborhoods of $X$ and thus also relaxes the linearity condition. Recently, the idea of localization was also studied by Fertl and Bura (2022), who proposed both the conditional variance estimator (CVE) to recover $\mathcal{S}_{E(Y|X)}$ and the ensemble CVE (eCVE) to recover $\mathcal{S}_{Y|X}$. For convenience, we uniformly call MAVE, ADR, CVE, eCVE, and other SDR methods based on localized estimation the MAVE-type methods. By the nature of localized estimation, these methods share a common limitation: they quickly lose consistency as $p$ grows.

Generally, the interplay between the parametric assumptions on $X|\beta_0^{\mathsf{T}}X$, i.e. (3) and (4), and the parametric assumptions on $Y|\beta_0^{\mathsf{T}}X$ can be formulated under the unified SDR framework proposed in Ma and Zhu (2012). At the population level, each SDR method in this framework solves the estimating equation

\[
E[\{\alpha(X)-\mu_{\alpha}(\beta^{\mathsf{T}}X)\}\{g(Y,\beta^{\mathsf{T}}X)-\mu_{g}(\beta^{\mathsf{T}}X)\}]=0 \tag{5}
\]

for certain $\alpha(X)$ and $g(Y,\beta^{\mathsf{T}}X)$, where $\mu_{\alpha}(\beta^{\mathsf{T}}X)$ and $\mu_{g}(\beta^{\mathsf{T}}X)$ are working models for $E\{\alpha(X)|\beta^{\mathsf{T}}X\}$ and $E\{g(Y,\beta^{\mathsf{T}}X)|\beta^{\mathsf{T}}X\}$, respectively. Ma and Zhu (2012) justified a double-robustness property: as long as either $\mu_{\alpha}(\beta_0^{\mathsf{T}}X)$ or $\mu_{g}(\beta_0^{\mathsf{T}}X)$ truly specifies the corresponding regression function when $\beta$ is fixed at $\beta_0$, any minimal solution to (5) must be included in the central SDR subspace. Up to asymptotically negligible errors, (5) incorporates a large family of SDR methods. For example, with $\alpha(X)=X$ and the linearity condition (3), one can take a linear $\mu_{\alpha}(\beta^{\mathsf{T}}X)$ and force $\mu_{g}(\beta^{\mathsf{T}}X)$ to be zero; solving (5) then delivers OLS if $g(Y,\beta^{\mathsf{T}}X)$ is $Y$ and delivers SIR if it is $(I(Y_S=1),\ldots,I(Y_S=H))$.
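To make the OLS case concrete (a sketch of the algebra, using the assumption $E(X)=0$): with $\alpha(X)=X$, $\mu_{\alpha}(\beta^{\mathsf{T}}X)=P^{\mathsf{T}}(\Sigma_X,\beta)X$, $g(Y,\beta^{\mathsf{T}}X)=Y$, and $\mu_g\equiv 0$, (5) becomes
\[
E[\{X-P^{\mathsf{T}}(\Sigma_X,\beta)X\}Y]=\{\mathrm{I}_p-P^{\mathsf{T}}(\Sigma_X,\beta)\}E(XY)=0.
\]
Since $\{\mathrm{I}_p-P^{\mathsf{T}}(\Sigma_X,\beta)\}v=0$ exactly when $v$ lies in the column space of $\Sigma_X\beta$, the equation holds if and only if $\Sigma_X^{-1}E(XY)\in\mathcal{S}(\beta)$, i.e. if and only if $\beta$ recovers the OLS direction in (2).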

The double-robustness property of (5) illuminates a tradeoff for general SDR methods: if we choose to avoid parametric assumptions or nonparametric estimation on $Y|\beta_0^{\mathsf{T}}X$, then we must adopt them on $X|\beta_0^{\mathsf{T}}X$ instead. In principle, like the rich literature on modeling $Y|\beta_0^{\mathsf{T}}X$, numerous assumptions can be adopted on $X|\beta_0^{\mathsf{T}}X$ to represent different balances between parsimony and flexibility. Nonetheless, under the unified framework (5), only the two extremes have been widely studied in the SDR literature: the most aggressive parametric models (3) and (4) in inverse regression, and the most conservative nonparametric model in MAVE. Although a compromise between the two was discussed by Li and Dong (2009) and Dong and Li (2010) as mentioned above, there is a lack of parametric learning of $X|\beta_0^{\mathsf{T}}X$ that specifies easily interpretable conditional moments and is often satisfied in practice.

In this paper, we adjust both the first- and second-order inverse regression methods by approximating the distribution of $X$ with mixture models, where we allow each mixture component to differ in either center or shape but require it to satisfy the linearity condition (3) and the constant variance condition (4). The mixture model approximation has been widely used in applications to fit generally unknown distributions (Lindsay, 1995; Everitt, 2013), especially those with clustered sample support, and it permits estimating each of $E(X|\beta_0^{\mathsf{T}}X)$ and $\mathrm{var}(X|\beta_0^{\mathsf{T}}X)$ by one of multiple commonly seen parametric models chosen in a data-driven manner. As supported by numerical studies, the proposed methods also perform consistently under more complex settings, for example, when $X$ has a unimodal but curved sample support. Compared with the MAVE-type methods, our approach is more robust to the dimensionality and in particular remains effective under high-dimensional settings. Compared with Guan, Xie and Zhu (2017), it allows the mixture components to have different shapes and is applicable to all the second-order inverse regression methods. It also alleviates the issue of non-exhaustive recovery of the central SDR subspace that persists in OLS and SIR, in a similar fashion to MAVE and ADR, which capture the local patterns of data.

The rest of the paper is organized as follows. Throughout the theoretical development, we assume $X$ to be continuous and to follow a mixture model, under which the parametric forms of the moments of $X|\beta_0^{\mathsf{T}}X$ are studied in Section 2. We adjust SIR in Section 3, and further modify it towards sparsity under the high-dimensional settings in Section 4. Section 5 is devoted to the adjustment of SAVE. The simulation studies are presented in Section 6, where we evaluate the performance of the proposed methods under various settings, including cases where $X$ does not follow a mixture model. A real data example is investigated in Section 7. The adjustment for OLS is omitted for its similarity to SIR. The adjustments of other second-order inverse regression methods, as well as the detailed implementation and complementary simulation studies, are deferred to the Appendix. R code for implementing the proposed methods can be downloaded at https://github.com/Yan-Guo1120/IRMN. For convenience, we focus on continuous $Y$ whose sample slices have an equal size, and we regard the dimension $d$ of the central SDR subspace as known a priori. The generality of the latter is briefly justified in Section 2, given the consistent estimation of $d$ discussed in Sections 3 and 5.

2 Preliminaries on mixture models

We first clarify the notation. Suppose $X=(X_1,\ldots,X_p)^{\mathsf{T}}$. For any matrix $\beta\in\mathbb{R}^{p\times d}$, let $\mathcal{S}(\beta)$ be the column space of $\beta$ and let $\mathrm{vec}(\beta)$ be the vector that stacks the columns of $\beta$ together. Like $\beta^{\otimes 2}$ in (2), we sometimes also denote $\beta$ itself by $\beta^{\otimes 1}$. We extend $P(\Sigma_X,\beta)$ in (3) to $P(\Lambda,\beta)$ for any positive definite matrix $\Lambda$, and, with $\mathrm{I}_p$ being the identity matrix, we denote $\mathrm{I}_p-P(\Lambda,\beta)$ by $Q(\Lambda,\beta)$. For a set of matrices $\{\Omega_i:i\in\{1,\ldots,k\}\}$ with an equal number of rows, we denote the ensemble matrix $(\Omega_1,\ldots,\Omega_k)$ by $(\Omega_i)_{i\in\{1,\ldots,k\}}$.

In the literature, the mixture model has been widely employed to approximate unknown distributions (Lindsay, 1995; Everitt, 2013), both clustered and non-clustered, as it provides a wide range of choices to balance the estimation efficiency against the modeling flexibility, especially for the local patterns of data. Application fields include, for example, astronomy, bioinformatics, ecology, and economics (Marin, Mengersen and Robert, 2005, Subsection 1.1). Given a parametric family of probability density functions $\mathcal{F}=\{f(x,\alpha)\}$ on $\mathbb{R}^p$, a mixture model for $X$ means

\[
X=\textstyle\sum_{i=1}^{q}I(W=i)Z^{(i)}, \tag{6}
\]

where $q$ is an unknown integer, each mixture component $Z^{(i)}$ is generated from $f(\cdot,\alpha_i)\in\mathcal{F}$, $W$ is a latent variable that follows a multinomial distribution with support $\{1,\ldots,q\}$ and probabilities $\pi_1,\ldots,\pi_q$, and $W,Z^{(1)},\ldots,Z^{(q)}$ are mutually independent. For the identifiability of (6), we require each $\pi_i$ to be positive and each $\alpha_i$ to be distinct, as well as additional regularity conditions on $\mathcal{F}$; the latter can be found in Titterington, Smith and Makov (1985), details omitted. As a special case, when $\mathcal{F}$ is the family of multivariate normal distributions, (6) becomes a mixture multivariate normal distribution with $\alpha_i$ being $(\mu_i,\Sigma_i)$, the mean and the covariance matrix of the $i$th mixture component. Here, we allow the $\Sigma_i$'s to differ.

For maximal flexibility of the mixture model (6), one can allow the number of mixture components $q$ to be relatively large or even to diverge with $n$ in practice, so that it can "facilitate much more careful description of complex systems" (Marin, Mengersen and Robert, 2005, Subsection 1.2.3), including distributions with non-clustered sample supports. An extreme case is kernel density estimation, which uses an average of $n$ multivariate normal densities to approximate a general distribution and can be regarded as a mixture multivariate normal distribution as above but with $q=n$. For simplicity, we set $q$ to be fixed as $n$ grows throughout the theoretical development, by which (6) is a parametric analogue of kernel density estimation, with each mixture component being a "global" neighborhood of $X$. The setting of non-clustered $X$ approximated by a mixture model is investigated numerically in Section 6.
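As a concrete instance of (6), the following R sketch draws a sample from a balanced two-component multivariate normal mixture; the parameter values are the ones used for the upper-left panel of Figure 1 below, and MASS::mvrnorm is used to sample each $Z^{(i)}$.

    # A minimal sketch of generating X from the mixture model (6) with q = 2.
    library(MASS)                                      # for mvrnorm()
    set.seed(1)
    n <- 500; p <- 2
    pi_w <- c(.5, .5)                                  # mixing probabilities pi_i
    mu  <- list(c(0, 2), c(0, -2))                     # component means mu_i
    Sig <- list(matrix(c(1, .5, .5, 1), 2),            # component covariances Sigma_i
                matrix(c(1, -.5, -.5, 1), 2))
    W <- sample(1:2, n, replace = TRUE, prob = pi_w)   # latent label W
    X <- t(vapply(W, function(w) mvrnorm(1, mu[[w]], Sig[[w]]), numeric(p)))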

Given that $\mathcal{F}$ truly specifies the distributions of the mixture components, the consistency of the mixture model (6) hinges on the correct determination of $q$. This can be achieved by multiple methods, such as the Bayesian information criterion (BIC; Zhao, Hautamaki and Fränti, 2008), the integrated completed likelihood (Biernacki, Celeux and Govaert, 2000), the sequential testing procedure (McLachlan, Lee and Rathnayake, 2019a), and the recently proposed method based on data augmentation (Luo, 2022). In the presence of these estimators, we regard $q$ as known a priori throughout the theoretical study to ease the presentation. The generality of doing so is justified in the following lemma, which shows that the asymptotic properties of any statistic that involves $q$ are invariant if $q$ is replaced with an arbitrary consistent estimator $\widehat{q}$. The same invariance property also holds for using the true value of the dimension $d$ of the central SDR subspace in place of its consistent estimator, which justifies the generality of assuming $d$ known, as mentioned at the end of the Introduction.

Lemma 1.

Suppose $\widehat{q}$ is a consistent estimator of $q$ in the sense that $\mathrm{prob}(\widehat{q}=q)\rightarrow 1$. For any statistic $S_n$ that involves $\widehat{q}$, let $R_n$ be constructed in the same way as $S_n$ but with $\widehat{q}$ replaced by $q$. Then we have $S_n-R_n=o_P(n^{-r})$ for any $r>0$; that is, $R_n$ is asymptotically equivalent to $S_n$.

Proof.

For any $r>0$ and $\delta>0$, we have
\begin{align*}
\mathrm{prob}(|S_n-R_n|>n^{-r}\delta)
&=\mathrm{prob}(|S_n-R_n|>n^{-r}\delta,\widehat{q}=q)+\mathrm{prob}(|S_n-R_n|>n^{-r}\delta,\widehat{q}\neq q)\\
&=\mathrm{prob}(|S_n-R_n|>n^{-r}\delta,\widehat{q}\neq q)\leq\mathrm{prob}(\widehat{q}\neq q)\rightarrow 0.
\end{align*}
The second equality above is due to $S_n=R_n$ whenever $\widehat{q}=q$. Thus, we have $S_n-R_n=o_P(n^{-r})$. This completes the proof. ∎

Given $q$, the marginal density of $X$ is $\sum_{i=1}^{q}\pi_i f(x,\alpha_i)$ up to the unknown $(\pi_i,\alpha_i)$'s, which motivates a maximum likelihood approach that estimates these parameters using the EM algorithm; see Cai, Ma and Zhang (2019) and McLachlan, Lee and Rathnayake (2019b) for the related literature. Hereafter, we take the estimators $\widehat{\pi}_i$ and $\widehat{\alpha}_i$ as given, and assume their $n^{1/2}$-consistency and asymptotic normality whenever needed. The same applies to the $\mu_i$ and $\Sigma_i$ defined above, where $\widehat{\mu}_i$ and $\widehat{\Sigma}_i$ for a general parametric family $\mathcal{F}$ can be derived by appropriate transformations of $\widehat{\alpha}_i$.
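In the mixture multivariate normal case, such EM estimates are available from standard software; below is a minimal R sketch using the mclust package (a convenience assumption on our part, not necessarily the implementation used in the paper), continuing with the simulated X above. It also selects $q$ by BIC.

    # Fit a multivariate normal mixture by EM and extract the quantities used below.
    library(mclust)
    fit <- Mclust(X, G = 1:5, modelNames = "VVV")  # BIC over q in 1..5, unrestricted Sigma_i
    q_hat   <- fit$G                               # selected number of components
    pi_hat  <- fit$parameters$pro                  # pi_i estimates
    mu_hat  <- fit$parameters$mean                 # p x q matrix of mu_i estimates
    Sig_hat <- fit$parameters$variance$sigma       # p x p x q array of Sigma_i estimates
    post    <- fit$z                               # n x q matrix of pi_i(X_j) = prob(W=i|X_j)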

To connect the mixture model (6) with a parametric model on the moments of $X|\beta_0^{\mathsf{T}}X$, we assume that the linearity condition and the constant variance condition hold for each mixture component; that is, for $i\in\{1,\ldots,q\}$,

\begin{align}
E(X-\mu_i|\beta_0^{\mathsf{T}}X,W=i)&=P^{\mathsf{T}}(\Sigma_i,\beta_0)(X-\mu_i), \tag{7}\\
\mathrm{var}(X|\beta_0^{\mathsf{T}}X,W=i)&=\Sigma_i-\Sigma_i P(\Sigma_i,\beta_0). \tag{8}
\end{align}

Referring to the discussion about (3) and (4) in the Introduction, this includes but is not limited to the cases where each mixture component of $X$ is multivariate normal. For clarity, hereafter we call (3) the overall linearity condition and (7) the mixture component-wise linearity condition whenever necessary, and likewise call (4) and (8) the overall and the mixture component-wise constant variance conditions, respectively. Because $W$ is latent, we next provide two useful consequences of (7). For each $i\in\{1,\ldots,q\}$, let $\pi_i(X)$ be $\mathrm{prob}(W=i|X)$, which is a known function up to the $(\pi_i,\alpha_i)$'s, and let $\pi_i(\beta^{\mathsf{T}}X)$ be $\mathrm{prob}(W=i|\beta^{\mathsf{T}}X)$ for any $\beta\in\mathbb{R}^{p\times d}$. Generally, $\pi_i(\beta^{\mathsf{T}}X)$ differs from $\pi_i(X)$ except in the extreme case $W\perp\!\!\!\perp X|\beta^{\mathsf{T}}X$; the latter is not assumed in this article. The estimators $\widehat{\pi}_i(X)$ and $\widehat{\pi}_i(\beta^{\mathsf{T}}X)$ are readily derived from the mixture model fit.

Lemma 2.

Suppose $X$ follows a mixture model that satisfies (7). For $i\in\{1,\ldots,q\}$, let $E_i(X|\beta_0^{\mathsf{T}}X)$ denote $E\{(X-\mu_i)\pi_i(X)|\beta_0^{\mathsf{T}}X\}$. We have
\begin{align}
E_i(X|\beta_0^{\mathsf{T}}X)&=\pi_i(\beta_0^{\mathsf{T}}X)P^{\mathsf{T}}(\Sigma_i,\beta_0)(X-\mu_i), \tag{9}\\
E(X|\beta_0^{\mathsf{T}}X)&=\textstyle\sum_{i=1}^{q}\pi_i(\beta_0^{\mathsf{T}}X)\{P^{\mathsf{T}}(\Sigma_i,\beta_0)(X-\mu_i)+\mu_i\}. \tag{10}
\end{align}
Proof.

To show (9), we have

\begin{align*}
E\{(X-\mu_i)\pi_i(X)|\beta_0^{\mathsf{T}}X\}
&=E\{(X-\mu_i)I(W=i)|\beta_0^{\mathsf{T}}X\}\\
&=\textstyle\sum_{j=1}^{q}\pi_j(\beta_0^{\mathsf{T}}X)E\{(X-\mu_i)I(W=i)|\beta_0^{\mathsf{T}}X,W=j\}\\
&=\pi_i(\beta_0^{\mathsf{T}}X)E(X-\mu_i|\beta_0^{\mathsf{T}}X,W=i)\\
&=\pi_i(\beta_0^{\mathsf{T}}X)P^{\mathsf{T}}(\Sigma_i,\beta_0)(X-\mu_i),
\end{align*}
where the first equality holds because $\pi_i(X)=E\{I(W=i)|X\}$ and the tower property applies.

To show (10), we have

\[
E(X|\beta_0^{\mathsf{T}}X)=E\{E(X|\beta_0^{\mathsf{T}}X,W)|\beta_0^{\mathsf{T}}X\}=\textstyle\sum_{i=1}^{q}\pi_i(\beta_0^{\mathsf{T}}X)\{P^{\mathsf{T}}(\Sigma_i,\beta_0)(X-\mu_i)+\mu_i\}.
\]

This completes the proof. ∎

From the proof of Lemma 2, $\pi_i(X)$ in $E_i(X|\beta_0^{\mathsf{T}}X)$ can be replaced with the indicator $I(W=i)$, so (9) resembles the mixture component-wise linearity condition (7) for a single mixture component of $X$. Similarly, (10) is an ensemble of (7) across all the mixture components of $X$.

Depending on how $\pi_i(\beta_0^{\mathsf{T}}X)$ is distributed, that is, how $W$ is associated with $\beta_0^{\mathsf{T}}X$, the functional form of $E(X|\beta_0^{\mathsf{T}}X)$ in (10) can be approximated by various parametric models. In this sense, (10) relaxes the overall linearity condition (3) to numerous parametric models, among which the most appropriate is automatically chosen by the data. For example, if the mixture components of $\beta_0^{\mathsf{T}}X$ are well separated, then each $\pi_i(\beta_0^{\mathsf{T}}X)$ is close to a Bernoulli random variable, and thus $E(X|\beta_0^{\mathsf{T}}X)$ is nearly a linear spline of $\beta_0^{\mathsf{T}}X$. This is illustrated by $E(X_1|X_2)$ in the upper-left and upper-middle panels of Figure 1, where $X$ is a balanced mixture of $N((0,2)^{\mathsf{T}},(r_1^{|i-j|}))$ and $N((0,-2)^{\mathsf{T}},(r_2^{|i-j|}))$, with $r_1=.5$ and $r_2=-.5$ in the former and both equal to $.5$ in the latter. In the other extreme, if the $\pi_i(\beta_0^{\mathsf{T}}X)$'s are degenerate, which occurs if the mixture components of $X$ are identical along the directions in $\beta_0$, then $E(X|\beta_0^{\mathsf{T}}X)$ reduces to satisfying the overall linearity condition (3). When the $\pi_i(\beta_0^{\mathsf{T}}X)$'s are continuously distributed, $E(X|\beta_0^{\mathsf{T}}X)$ conveys other forms, such as a quadratic function if we change $\mu_1$ and $\mu_2$ in the upper-left panel to $\mu_1=-\mu_2=(0,.2)^{\mathsf{T}}$, as depicted in the upper-right panel of Figure 1.
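As a numerical check of (10) in the upper-left setting of Figure 1, the following R sketch takes $\beta_0=(0,1)^{\mathsf{T}}$, so that $\beta_0^{\mathsf{T}}X=X_2$, and reuses mu and Sig from the generative sketch below (6); plotting the result reproduces the nearly-linear-spline shape of the solid curve.

    # Evaluate E(X1 | X2 = x2) from (10) for the balanced two-component normal mixture.
    cond_mean_x1 <- function(x2, mu, Sig, pw = c(.5, .5)) {
      # pi_i(x2): posterior weights from the marginal normals of X2
      f <- vapply(1:2, function(i) pw[i] * dnorm(x2, mu[[i]][2], sqrt(Sig[[i]][2, 2])),
                  numeric(length(x2)))
      w <- f / rowSums(f)
      # component-wise linear parts: mu_{i1} + Sig_{i,12} / Sig_{i,22} * (x2 - mu_{i2})
      m <- vapply(1:2, function(i) mu[[i]][1] + Sig[[i]][1, 2] / Sig[[i]][2, 2] *
                    (x2 - mu[[i]][2]), numeric(length(x2)))
      rowSums(w * m)                                   # the ensemble in (10)
    }
    x2 <- seq(-5, 5, .1)
    plot(x2, cond_mean_x1(x2, mu, Sig), type = "l")    # nearly a linear spline of x2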

Similar to (7), the mixture component-wise constant variance condition (8) can also be modified in two directions towards practical use, with the aid of $\pi_i(X)$ and $\pi_i(\beta_0^{\mathsf{T}}X)$, respectively.

Figure 1: In the upper panels, the solid curve in each plot is $E(X_1|X_2)$, and the dashed curve is a symmetric linear spline, a monotone linear spline, and a quadratic function in the left, middle, and right panels, respectively. In the lower panels, the solid curve in each plot is $\mathrm{var}(X_1|X_2)$, which approximates a piecewise constant function, approximates a trigonometric function, and is exactly quadratic, respectively.
Lemma 3.

Suppose $X$ follows a mixture model that satisfies (7) and (8). Let $\mathrm{var}_i(X|\beta_0^{\mathsf{T}}X)=E\{(X-\mu_i)^{\otimes 2}\pi_i(X)|\beta_0^{\mathsf{T}}X\}-\{E_i(X|\beta_0^{\mathsf{T}}X)\}^{\otimes 2}$ for $i=1,\ldots,q$, and let $\mu_k(\beta_0^{\mathsf{T}}X)=\textstyle\sum_{i=1}^{q}\pi_i(\beta_0^{\mathsf{T}}X)\{P^{\mathsf{T}}(\Sigma_i,\beta_0)(X-\mu_i)+\mu_i\}^{\otimes k}$ for $k=1,2$. We have
\begin{align}
\mathrm{var}_i(X|\beta_0^{\mathsf{T}}X)&=\pi_i(\beta_0^{\mathsf{T}}X)[\Sigma_i Q(\Sigma_i,\beta_0)+\{1-\pi_i(\beta_0^{\mathsf{T}}X)\}\{P^{\mathsf{T}}(\Sigma_i,\beta_0)(X-\mu_i)\}^{\otimes 2}], \tag{11}\\
\mathrm{var}(X|\beta_0^{\mathsf{T}}X)&=\textstyle\sum_{i=1}^{q}\pi_i(\beta_0^{\mathsf{T}}X)\Sigma_i Q(\Sigma_i,\beta_0)+\mu_2(\beta_0^{\mathsf{T}}X)-\mu_1^{\otimes 2}(\beta_0^{\mathsf{T}}X). \tag{12}
\end{align}
Proof.

To show (11), we have

\begin{align*}
E\{(X-\mu_i)^{\otimes 2}\pi_i(X)|\beta_0^{\mathsf{T}}X\}
&=E[E\{(X-\mu_i)^{\otimes 2}I(W=i)|\beta_0^{\mathsf{T}}X,W\}|\beta_0^{\mathsf{T}}X]\\
&=\pi_i(\beta_0^{\mathsf{T}}X)E\{(X-\mu_i)^{\otimes 2}|\beta_0^{\mathsf{T}}X,W=i\}\\
&=\pi_i(\beta_0^{\mathsf{T}}X)\{\mathrm{var}(X|\beta_0^{\mathsf{T}}X,W=i)+E^{\otimes 2}(X-\mu_i|\beta_0^{\mathsf{T}}X,W=i)\}\\
&=\pi_i(\beta_0^{\mathsf{T}}X)[\Sigma_i Q(\Sigma_i,\beta_0)+\{P^{\mathsf{T}}(\Sigma_i,\beta_0)(X-\mu_i)\}^{\otimes 2}].
\end{align*}

Together with (9), this implies (11).

To show (12), note that $\mathrm{var}\{E(X|\beta_0^{\mathsf{T}}X,W)|\beta_0^{\mathsf{T}}X\}=\mu_2(\beta_0^{\mathsf{T}}X)-\mu_1^{\otimes 2}(\beta_0^{\mathsf{T}}X)$, by which we have
\begin{align*}
\mathrm{var}(X|\beta_0^{\mathsf{T}}X)
&=E\{\mathrm{var}(X|\beta_0^{\mathsf{T}}X,W)|\beta_0^{\mathsf{T}}X\}+\mathrm{var}\{E(X|\beta_0^{\mathsf{T}}X,W)|\beta_0^{\mathsf{T}}X\}\\
&=\textstyle\sum_{i=1}^{q}\pi_i(\beta_0^{\mathsf{T}}X)\Sigma_i Q(\Sigma_i,\beta_0)+\mu_2(\beta_0^{\mathsf{T}}X)-\mu_1^{\otimes 2}(\beta_0^{\mathsf{T}}X).
\end{align*}

This completes the proof. ∎

As with $E(X|\beta_0^{\mathsf{T}}X)$ in (10), the functional form of $\mathrm{var}(X|\beta_0^{\mathsf{T}}X)$ in (12) can be approximated by various commonly seen parametric models. This is illustrated in the lower panels of Figure 1, where we again set $X$ to be a balanced mixture of $N(\mu_1,(r_1^{|i-j|}))$ and $N(-\mu_1,(r_2^{|i-j|}))$, with $(\mu_1^{\mathsf{T}},r_1,r_2)$ being $((0,2),.4,.8)$, $((0,.8),.5,-.5)$, and $((0,0),.4,.8)$, respectively. The resulting $\mathrm{var}(X_1|X_2)$ approximates a piecewise constant function and a trigonometric function in the first two cases. In the third case, $\pi_i(X_2)$ is degenerate and, by (12), $\mathrm{var}(X_1|X_2)$ is exactly a quadratic function. As mentioned below Lemma 2, the overall linearity condition (3) is satisfied for $E(X_1|X_2)$ in this case. A related discussion of the inverse regression methods under linear $E(X|\beta_0^{\mathsf{T}}X)$ and quadratic $\mathrm{var}(X|\beta_0^{\mathsf{T}}X)$, but with elliptically distributed $X$, can be found in Luo (2018).
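To see the exact quadratic form in the third case, note that both mixture components there have $X_2\sim N(0,1)$, so $\pi_i(X_2)\equiv 1/2$; with $\beta_0=(0,1)^{\mathsf{T}}$, (12) reduces to
\[
\mathrm{var}(X_1|X_2)=\Big\{1-\tfrac{r_1^2+r_2^2}{2}\Big\}+\Big\{\tfrac{r_1^2+r_2^2}{2}-\Big(\tfrac{r_1+r_2}{2}\Big)^2\Big\}X_2^2=0.6+0.04\,X_2^2
\]
for $(r_1,r_2)=(.4,.8)$, which is the quadratic solid curve in the lower-right panel.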

Compared with the general parametric families of $E(X|\beta_0^{\mathsf{T}}X)$ discussed in Li and Dong (2009) and Dong and Li (2010), the proposed parametric form (10) has a clear interpretation; that is, it is most suitable when the sample distribution of $X$ conveys a clustered pattern. In addition, because we model $E(X|\beta_0^{\mathsf{T}}X)$ by averaging a few linear functions of $\beta_0^{\mathsf{T}}X$, the result is relatively robust to outliers of $X$. The functional form of $\mathrm{var}(X|\beta_0^{\mathsf{T}}X)$ is also relaxed in our work, whereas it is fixed as a constant in Dong and Li (2010).

Recall that the MAVE-type methods adopt the linearity condition in local neighborhoods of $X$, and that each mixture component in the mixture model (6) can be regarded as a global neighborhood of $X$. Since we adopt the linearity condition and the constant variance condition for each mixture component of $X$, a corresponding modification of the inverse regression methods naturally serves as a bridge, or a compromise, that connects inverse regression with the MAVE-type methods. This modification also inherits the advantages of both sides, including the $n^{1/2}$-consistency and the exhaustive recovery of the central SDR subspace, as seen next.

3 Adjusting SIR for $X$ under a mixture model

Based on the discussions in Section 2, there are two intuitive strategies for adjusting inverse regression when $X$ follows a mixture model. First, one can conduct inverse regression on each mixture component of $X$ using (9) and (11), and then merge the SDR results. Second, one can construct and solve an appropriate estimating equation (5) with the aid of the functional forms (10) for $E(X|\beta_0^{\mathsf{T}}X)$ and (12) for $\mathrm{var}(X|\beta_0^{\mathsf{T}}X)$. Because these two strategies use the terms $\pi_i(X)$ and $\pi_i(\beta_0^{\mathsf{T}}X)$, respectively, which characterize the "global" neighborhoods of $X$ and $\beta_0^{\mathsf{T}}X$, they can be paralleled with MAVE and its refined version (Xia et al., 2002). We start with adjusting SIR in this section, and discuss both strategies sequentially.

To formulate the first strategy, we first parametrize the SDR result specified for an individual mixture component of $X$ as the conditional central subspace given $W=i$, which is the smallest subspace of $\mathbb{R}^p$ whose arbitrary basis matrix $\beta$ satisfies
\[
Y\perp\!\!\!\perp X\mid(\beta^{\mathsf{T}}X,W=i). \tag{13}
\]
Denote this space by $\mathcal{S}_{Y|X}^{(i)}$. For a general non-latent variable $W$, the space spanned by the union of the $\mathcal{S}_{Y|X}^{(i)}$'s, denoted by $\mathcal{S}(\bigcup_{i=1}^{q}\mathcal{S}_{Y|X}^{(i)})$, is called the partial central subspace in Chiaromonte, Cook and Li (2002). Naturally, the first strategy aims to estimate $\mathcal{S}(\bigcup_{i=1}^{q}\mathcal{S}_{Y|X}^{(i)})$, so it recovers $\mathcal{S}_{Y|X}$ if
\[
\mathcal{S}_{Y|X}=\mathcal{S}\Big(\textstyle\bigcup_{i=1}^{q}\mathcal{S}_{Y|X}^{(i)}\Big). \tag{14}
\]

Because $W$ is a latent variable constructed for the convenience of modeling the distribution of $X$, we can naturally assume that $W$ is uninformative about $Y$ given $X$. Intuitively, this means that the term $W=i$ can be removed from (13), by which $\mathcal{S}_{Y|X}^{(i)}$ becomes identical to $\mathcal{S}_{Y|X}$. The only issue is that the support of $X|W=i$ can be a proper subset of the support of $X$, in which case $\mathcal{S}_{Y|X}^{(i)}$ may only capture a local pattern of $Y|X$ and become a proper subspace of $\mathcal{S}_{Y|X}$. Nonetheless, since the supports of $X|W=i$ for $i=1,\ldots,q$ together cover the support of $X$, the union of the $\mathcal{S}_{Y|X}^{(i)}$'s must span the entire $\mathcal{S}_{Y|X}$. This reasoning is rigorized in the following lemma, which justifies (14) and thus the consistency of the first strategy.

Lemma 4.

Suppose $Y\perp\!\!\!\perp W|X$. Then the coincidence between $\mathcal{S}_{Y|X}$ and $\mathcal{S}(\bigcup_{i=1}^{q}\mathcal{S}_{Y|X}^{(i)})$ in (14) always holds. In addition, if the support of $X|W=i$ coincides with the support of $X$ for some $i\in\{1,\ldots,q\}$, then $\mathcal{S}_{Y|X}$ further coincides with the $\mathcal{S}_{Y|X}^{(i)}$ specified for the $i$th mixture component of $X$.

Proof.

Since $Y\perp\!\!\!\perp W|X$, $\mathcal{S}_{Y|X}$ clearly satisfies (13) for each $i=1,\ldots,q$. Since $\mathcal{S}_{Y|X}^{(i)}$ is by definition the intersection of all the subspaces of $\mathbb{R}^p$ that satisfy (13) (Chiaromonte, Cook and Li, 2002), it must be a subspace of $\mathcal{S}_{Y|X}$. Thus, we have
\[
\mathcal{S}\Big(\textstyle\bigcup_{i=1}^{q}\mathcal{S}_{Y|X}^{(i)}\Big)\subseteq\mathcal{S}_{Y|X}. \tag{15}
\]
Conversely, let $\gamma$ be an arbitrary basis matrix of $\mathcal{S}(\bigcup_{i=1}^{q}\mathcal{S}_{Y|X}^{(i)})$. By definition, $\gamma$ satisfies (13) for each $i=1,\ldots,q$, which means that, for any $g(Y)\in L_2(Y)$,
\[
E\{g(Y)|X,W\}=E\{g(Y)|\gamma^{\mathsf{T}}X,W\}. \tag{16}
\]
Since $Y\perp\!\!\!\perp W|X$, we have $E\{g(Y)|X,W\}=E\{g(Y)|X\}$. Thus, (16) implies
\[
E\{g(Y)|X\}=E\{g(Y)|\gamma^{\mathsf{T}}X,W\}, \tag{17}
\]
which means that $E\{g(Y)|\gamma^{\mathsf{T}}X,W\}$ is invariant as $W$ varies and is thus equal to $E\{g(Y)|\gamma^{\mathsf{T}}X\}$. Hence, (17) implies $E\{g(Y)|X\}=E\{g(Y)|\gamma^{\mathsf{T}}X\}$, which, by the arbitrariness of $g(\cdot)$, means $Y\perp\!\!\!\perp X|\gamma^{\mathsf{T}}X$. By the definition of $\mathcal{S}_{Y|X}$, we have $\mathcal{S}_{Y|X}\subseteq\mathcal{S}(\gamma)$. Together with (15), we have (14).

Suppose the support of $X|W=i$ is identical to the support of $X$. Then, for any $g(Y)\in L_2(Y)$, $E\{g(Y)|X,W=i\}$ coincides with $E\{g(Y)|X\}$ in terms of both the functional form and the domain. The proof of $\mathcal{S}_{Y|X}=\mathcal{S}_{Y|X}^{(i)}$ thus follows as above. ∎

The coincidence between the overall support of $X$ and that of an individual mixture component occurs, for example, if $X$ has a mixture multivariate normal distribution. However, such coincidence is often unreliable in practice, as the clustered pattern of $X$ suggests that each individual mixture component is likely concentrated within a specific region of the support of $X$. Thus, we recommend using the ensemble $\mathcal{S}(\bigcup_{i=1}^{q}\mathcal{S}_{Y|X}^{(i)})$ as a conservative choice rather than an individual $\mathcal{S}_{Y|X}^{(i)}$, although the former means more parameters to estimate.

To recover $\mathcal{S}(\bigcup_{i=1}^{q}\mathcal{S}_{Y|X}^{(i)})$ by conducting SIR within the mixture components of $X$, we first formulate the slicing technique in SIR mentioned in the Introduction by replacing $Y$ in (2) with $\vec{Y}_S\equiv(I(Y_S=1),\ldots,I(Y_S=H))$, which generates
\[
\Sigma_X^{-1}E(X\vec{Y}_S) \tag{18}
\]
and again spans a subspace of $\mathcal{S}_{Y|X}$ under the overall linearity condition (3). In light of (9), which reformulates the mixture component-wise linearity condition (7), we incorporate the grouping information for the $i$th mixture component, i.e. $\pi_i(X)$, into (18) and merge the results across all the mixture components. This delivers
\[
\Omega_1\equiv E[\{\Sigma_i^{-1}(X-\mu_i)\pi_i(X)\vec{Y}_S\}_{i\in\{1,\ldots,q\}}], \tag{19}
\]
whose linear span $\mathcal{S}(\Omega_1)$ is employed to recover $\mathcal{S}_{Y|X}$. Because $\mathcal{S}(\Omega_1)$ is constructed under the mixture model (6), we call any method that recovers $\mathcal{S}(\Omega_1)$ $\mathrm{SIR}_{\mathrm{M}}$, with M in the subscript standing for the mixture model. As seen in the next theorem, the population-level consistency of $\mathrm{SIR}_{\mathrm{M}}$ is immediately implied by Lemma 2, under the mixture component-wise linearity condition (7).

Theorem 1.

Suppose $X$ follows a mixture model that satisfies (7). Then $\mathcal{S}(\Omega_1)$ is always a subspace of $\mathcal{S}_{Y|X}$. In addition, if, for any $v\in\mathcal{S}_{Y|X}$, there exist $i\in\{1,\ldots,q\}$ and $h\in\{1,\ldots,H\}$ such that $v\in\mathcal{S}(\Sigma_i^{-1}E(X-\mu_i|W=i,Y_S=h))$, then $\mathcal{S}(\Omega_1)$ is identical to $\mathcal{S}_{Y|X}$.

Proof.

For each $i=1,\ldots,q$, we have
\begin{align*}
E\{(X-\mu_i)\pi_i(X)\vec{Y}_S\}
&=E[E\{(X-\mu_i)\pi_i(X)|\beta_0^{\mathsf{T}}X,\vec{Y}_S\}\vec{Y}_S]\\
&=E[E\{(X-\mu_i)\pi_i(X)|\beta_0^{\mathsf{T}}X\}\vec{Y}_S],
\end{align*}
which, by (9), means that each column of $\Sigma_i^{-1}E\{(X-\mu_i)\pi_i(X)\vec{Y}_S\}$ falls in $\mathcal{S}_{Y|X}$. This implies $\mathcal{S}(\Omega_1)\subseteq\mathcal{S}_{Y|X}$. The second statement of the theorem is straightforward, so its proof is omitted. ∎

Referring to the generality of the mixture model discussed in Section 2, we regard $\mathrm{SIR}_{\mathrm{M}}$ as building a path that connects SIR and ADR. In one extreme, it coincides with SIR if $X$ has only one mixture component or, more generally, if both $\beta_0^{\mathsf{T}}\Sigma_i\beta_0$ and $\beta_0^{\mathsf{T}}\mu_i$ are invariant across the mixture components of $X$; the coincidence in the latter case is readily implied by (9), under which $\Omega_1$ is a list of $q$ identical copies of (18). In the other extreme, if we let $q$ diverge with $n$, then the mixture components will resemble the local neighborhoods of $X$, and thus $\mathrm{SIR}_{\mathrm{M}}$ will essentially resemble ADR. Generally, depending on the underlying distribution of $X$, $\mathrm{SIR}_{\mathrm{M}}$ provides a data-driven balance between the estimation efficiency and the reliability of the SDR results.

Similar to the exhaustiveness of ADR, an important advantage of $\mathrm{SIR}_{\mathrm{M}}$ over SIR is that its exhaustiveness holds more generally, as justified in Theorem 1. In particular, it can detect symmetric associations between $Y$ and $X$, which cannot be captured by SIR. To see this, suppose $X$ follows a mixture model that is symmetric with respect to the origin, and
\[
Y=X_1^2+\epsilon, \tag{20}
\]
where $\epsilon$ is a random error. Then $\mathrm{SIR}_{\mathrm{M}}$ can clearly detect $X_1$ as long as $X_1|Y$ is asymmetric with respect to zero for at least one mixture component, which occurs if one of the $\mu_i$'s has a nonzero first entry. Figure 2 illustrates this with $X_1$ being a balanced mixture of $N(-2,1)$ and $N(2,1)$. In general, the asymmetry of $\beta_0^{\mathsf{T}}X|Y$ is satisfied for at least one mixture component of $X$ if the $\alpha_i$'s differ along the directions in $\mathcal{S}_{Y|X}$, in which case the multiplicative weight $\pi_i(X)$ in $\Omega_1$ disturbs any symmetric pattern between $\beta_0^{\mathsf{T}}X$ and $Y$ into asymmetry. Since the $\alpha_i$'s are regulated to differ from each other, with a sufficiently large $q$ and assuming a continuous prior distribution for $\mathcal{S}(\beta_0)$ (Hall and Li, 1993), the exhaustiveness of $\mathrm{SIR}_{\mathrm{M}}$ is satisfied with probability one in a Bayesian sense.

Figure 2: The scatter plot of (X1,X12)(X_{\scriptscriptstyle 1},X_{\scriptscriptstyle 1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}) with X1.5N(2,1)+.5N(2,1)X_{\scriptscriptstyle 1}\sim.5N(-2,1)+.5N(2,1) and n=100n=100: “\circ” for the first mixture component, and “\triangle” for the second mixture component.
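To see this mechanism numerically, the following is a minimal R sketch under the setting of Figure 2, with the component parameters treated as known and with Y itself used in place of the slice indicators for simplicity; the final moment mimics an entry of Ω1 up to the leading inverse-covariance factor.

set.seed(1)
n <- 1e5
w <- rbinom(n, 1, .5)                           # latent component labels
x1 <- rnorm(n, mean = ifelse(w == 1, -2, 2))    # X1 ~ .5 N(-2,1) + .5 N(2,1)
y <- x1^2 + rnorm(n)                            # Model (20)
pi1 <- dnorm(x1, -2, 1) / (dnorm(x1, -2, 1) + dnorm(x1, 2, 1))  # pi_1(X1)
cov(x1, y)                  # approximately 0: the symmetry blinds SIR
mean((x1 + 2) * pi1 * y)    # approximately -2: nonzero, so SIR_M detects X1

The weight pi_1(X1) effectively restricts attention to the component centered at -2, within which the association between X1 and Y is monotone, so the weighted moment is bounded away from zero.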

Related to this, another advantage of SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} over SIR is that it allows more flexibility in choosing the number of slices HH. When implementing SIR in practice, researchers need to address a tradeoff in selecting HH: while a small HH means better estimation accuracy due to the smaller number of parameters, it elevates the risk of non-exhaustive recovery of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}, as SIR can only recover up to H1H-1 directions. By contrast, SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} can recover up to q(H1)q(H-1) directions, granting the freedom to choose a coarse slicing that enhances the estimation accuracy. This advantage is crucial when YY is discrete or categorical with a small number of categories, in which case HH must also be small, and the issue of non-exhaustiveness is less worrisome for SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}. The same advantage is also enjoyed by ADR, i.e. the localized SIR, which sets H=2H=2 automatically. Again, this is consistent with the connection between SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and ADR discussed below Theorem 1.

To implement SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} at the sample level, we can replace μi\mu_{\scriptscriptstyle i}, Σi\Sigma_{\scriptscriptstyle i}, and πi(X)\pi_{\scriptscriptstyle i}(X) in Ω1\Omega_{\scriptscriptstyle 1} with μ^i\widehat{\mu}_{\scriptscriptstyle i}, Σ^i\widehat{\Sigma}_{\scriptscriptstyle i}, and π^i(X)\widehat{\pi}_{\scriptscriptstyle i}(X) mentioned in Section 2, and replace the population mean in Ω1\Omega_{\scriptscriptstyle 1} with the sample mean. The resulting Ω^1\widehat{\Omega}_{\scriptscriptstyle 1} is n1/2n^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}-consistent and asymptotically normal. By Luo and Li (2016), this permits using the ladle estimator on Ω^1\widehat{\Omega}_{\scriptscriptstyle 1} to determine the dimension dd of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}, given the identity between 𝒮(Ω1){\mathcal{S}}({\Omega_{\scriptscriptstyle 1}}) and 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} discussed above. When the sample size is limited compared with the dimension pp or the number of mixture components qq, we also recommend using the predictor augmentation estimator (PAE; Luo and Li, 2021) to determine dd, which does not require the asymptotic normality of Ω^1\widehat{\Omega}_{\scriptscriptstyle 1}. As discussed above Lemma 1 in Section 2, these estimators permit us to safely assume dd to be known throughout the theoretical study. Given Ω^1\widehat{\Omega}_{\scriptscriptstyle 1}, we use its first dd left singular vectors to span an n1/2n^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}-consistent estimator of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. The proofs of these results are simple, so they are omitted. The details of the implementation of SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} are deferred to Appendix B.
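For concreteness, the following R sketch assembles the estimate of Ω1 from hypothetical data X (an n by p matrix) and Y, with q, d, and H assumed known; it is a schematic version only, with the clustering stage delegated to the mixtools package and the finer details of Appendix B omitted.

library(mixtools)

sir_m <- function(X, Y, q, d, H) {
  n <- nrow(X)
  fit <- mvnormalmixEM(X, k = q)                # clustering stage by EM
  # n x H matrix of slice indicators for the sliced response
  Ys <- outer(cut(Y, quantile(Y, 0:H / H), include.lowest = TRUE,
                  labels = FALSE), 1:H, "==") * 1
  Omega1 <- NULL
  for (i in 1:q) {
    Xc <- sweep(X, 2, fit$mu[[i]])              # X minus the i-th fitted mean
    wi <- fit$posterior[, i]                    # fitted pi_i(X_j), j = 1, ..., n
    Mi <- crossprod(Xc * wi, Ys) / n            # sample E{(X - mu_i) pi_i(X) Y_S^T}
    Omega1 <- cbind(Omega1, solve(fit$sigma[[i]], Mi))  # premultiply by Sigma_i^{-1}
  }
  svd(Omega1)$u[, 1:d, drop = FALSE]            # first d left singular vectors
}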

When the dimension of XX is large, the estimation bias in the clustering stage, especially in estimating Σi1\Sigma_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}, will adversely affect SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}. One way to address this issue is to impose additional constraints on the distribution of XX; see Section 4 later. Alternatively, observe that β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X follows a mixture model in d\mathbb{R}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle d$}}}. The low dimensionality of πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) makes its estimation more accurate than that of πi(X)\pi_{\scriptscriptstyle i}(X), i.e. regardless of the dimension of XX, given a consistent initial estimator of β0\beta_{\scriptscriptstyle 0}. As πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) can be rewritten as E{πi(X)|β0TX}E\{\pi_{\scriptscriptstyle i}(X)|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X\}, the enhanced estimation power can also be understood as a benefit of targeting the conditional mean of πi(X)\pi_{\scriptscriptstyle i}(X), which is intuitively smoother than πi(X)\pi_{\scriptscriptstyle i}(X) itself. In addition, (10) implies that E(X|β0TX)E(X|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) can be estimated by linearly regressing XX on {πi(β0TX),πi(β0TX)β0TX}i{1,,q}\{\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X),\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X\}_{\scriptscriptstyle i\in\{1,\ldots,q\}}, which does not involve Σi\Sigma_{\scriptscriptstyle i}’s and thus is robust to the dimension of XX.

Based on these observations, we now turn to the second strategy of estimating 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} mentioned at the beginning of this section, which serves as a refinement of SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}. Namely, we use (10) to construct an appropriate estimating equation (5) for 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. The consistency of this approach is endorsed by Ma and Zhu (2012), as E(α(X)|βTX)E(\alpha(X)|\beta^{\mbox{\tiny{\sf T}}}X) in (5) needs to be correctly specified only when 𝒮(β){\mathcal{S}}({\beta}) coincides with 𝒮(β0){\mathcal{S}}({\beta_{\scriptscriptstyle 0}}).

Assume that πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) is known a priori; in practice, we can replace 𝒮(β0){\mathcal{S}}({\beta_{\scriptscriptstyle 0}}) with a consistent estimator such as SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}. Let

RX(𝒮(β))i=1qπi(β0TX){PT(Σi,β)(Xμi)+μi}.\displaystyle R_{\scriptscriptstyle X}({\mathcal{S}}({\beta}))\equiv\textstyle\sum_{\scriptscriptstyle i=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\{P^{\mbox{\tiny{\sf T}}}(\Sigma_{\scriptscriptstyle i},\beta)(X-\mu_{\scriptscriptstyle i})+\mu_{\scriptscriptstyle i}\}. (21)

By (10), RX(𝒮Y|X)R_{\scriptscriptstyle X}({\mathcal{S}}_{\scriptscriptstyle Y|X}) coincides with E(X|β0TX)E(X|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X). We explain later why we use β0\beta_{\scriptscriptstyle 0} instead of the argument β\beta in πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X). To measure the relative “importance” of an outcome to the SDR estimation, we introduce

D(β0TX)i=1qj=1q(β0TΣjβ0)1E{β0T(Xμj)YSπi(β0TX)πj(β0TX)}πi(β0TX).\displaystyle D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\equiv\textstyle\sum_{\scriptscriptstyle i=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\textstyle\sum_{\scriptscriptstyle j=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle j}\beta_{\scriptscriptstyle 0})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}E\{\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}(X-\mu_{\scriptscriptstyle j})\vec{Y}_{\scriptscriptstyle S}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\pi_{\scriptscriptstyle j}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X). (22)

This term is best interpretable when β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X follows a mixture normal distribution: by Stein’s Lemma, it can be rewritten as a weighted average of the derivative of the regression function E(YS|β0TX)E(\vec{Y}_{\scriptscriptstyle S}|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) with respect to β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X, i.e.

D(β0TX)=i=1qE[{E(YS|β0TX)πi(β0TX)}/β0TX]πi(β0TX),\displaystyle D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)=\textstyle\sum_{\scriptscriptstyle i=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}E[\partial\{E(\vec{Y}_{\scriptscriptstyle S}|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\}/\partial\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X]\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X), (23)

which reveals the strength of the point-wise effect of β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X in the regression of YS\vec{Y}_{\scriptscriptstyle S}.

We now plug RX(𝒮(β))R_{\scriptscriptstyle X}({\mathcal{S}}({\beta})) and D(β0TX)D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) into (5), with μg(βTX)\mu_{\scriptscriptstyle g}(\beta^{\mbox{\tiny{\sf T}}}X) set at zero as permitted by the double-robust property. To avoid complications caused by dimensionality (Ma and Zhu, 2013), we rewrite (5) as minimizing the magnitude of its left-hand side over all the dd-dimensional subspaces of p\mathbb{R}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle p$}}}; that is, we minimize

G1(𝒮(β))E[{XRX(𝒮(β))}vecT{D(β0TX)diag(YS)}]F2,\displaystyle{\mathrm{G}}_{\scriptscriptstyle 1}({\mathcal{S}}({\beta}))\equiv\left\|E[\{X-R_{\scriptscriptstyle X}({\mathcal{S}}({\beta}))\}{\mathrm{vec}}^{\mbox{\tiny{\sf T}}}\{D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X){\mathrm{diag}}(\vec{Y}_{\scriptscriptstyle S})\}]\right\|_{\scriptscriptstyle F}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}, (24)

where F\|\cdot\|_{\scriptscriptstyle F} denotes the Frobenius norm of a matrix and diag(v){\mathrm{diag}}(v) transforms any vector vv to the diagonal matrix with vv being the diagonal. As with πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X), 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} in D(β0TX)D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) will be replaced with its SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} estimate; the explanation for using D(β0TX)D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) instead of D(βTX)D(\beta^{\mbox{\tiny{\sf T}}}X) is deferred to later. Clearly, the minimum value of G1(){\mathrm{G}}_{\scriptscriptstyle 1}(\cdot) is zero, and can be reached by 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. Following Ma and Zhu (2012), under fairly general conditions, the intersection of all the minimizers of G1(){\mathrm{G}}_{\scriptscriptstyle 1}(\cdot) also minimizes G1(){\mathrm{G}}_{\scriptscriptstyle 1}(\cdot), which means that it is the unique minimizer of G1(){\mathrm{G}}_{\scriptscriptstyle 1}(\cdot) of the smallest dimension, and it is always a subspace of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} under the mixture component-wise linearity condition (7). We call any method that recovers this space SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}, where the R in the subscript stands for refined.

Similar to SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}, the exhaustiveness of SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} depends on the variability of the association between YY and β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X across different mixture components of β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X. In the special case that the Σi\Sigma_{\scriptscriptstyle i}’s are identical to each other, the following theorem justifies a sufficient condition for the exhaustiveness of SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}.

Theorem 2.

Suppose XX follows a mixture model that has identical Σi\Sigma_{\scriptscriptstyle i}’s and satisfies (7). If [E{i=1qπi(β0TX)πj(β0TX)β0T(Xμi)YS}]j{1,,q}[E\{\textstyle\sum_{\scriptscriptstyle i=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\pi_{\scriptscriptstyle j}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}(X-\mu_{\scriptscriptstyle i})\vec{Y}_{\scriptscriptstyle S}\}]_{\scriptscriptstyle j\in\{1,\ldots,q\}} has rank dd, then SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} exhaustively recovers 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}.

Proof.

For any β\beta such that 𝒮(β){\mathcal{S}}({\beta}) is a proper subspace of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}, suppose 𝒮(β){\mathcal{S}}({\beta}) is kk-dimensional, β\beta is of full column rank, β0\beta_{\scriptscriptstyle 0} is a basis of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}, and, without loss of generality, (β,γ)=β0(\beta,\gamma)=\beta_{\scriptscriptstyle 0} for some semi-orthogonal γp×(dk)\gamma\in\mathbb{R}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle p\times(d-k)$}}}. We have P(Σi,β)γ=β(βTΣiβ)1(βTΣiγ)P(\Sigma_{\scriptscriptstyle i},\beta)\gamma=\beta(\beta^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\beta)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}(\beta^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\gamma) for i{1,,q}i\in\{1,\ldots,q\}, which, by (21), means

γT{XRX(𝒮(β))}\displaystyle\gamma^{\mbox{\tiny{\sf T}}}\{X-R_{\scriptscriptstyle X}({\mathcal{S}}({\beta}))\} =iqπi(β0TX){γT(γTΣiβ)(βTΣiβ)1βT}(Xμi)\displaystyle=\textstyle\sum_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\{\gamma^{\mbox{\tiny{\sf T}}}-(\gamma^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\beta)(\beta^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\beta)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\beta^{\mbox{\tiny{\sf T}}}\}(X-\mu_{\scriptscriptstyle i})
=iqπi(β0TX)(0,Ai)(β0TΣiβ0)1β0T(Xμi)\displaystyle=\textstyle\sum_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)(0,A_{\scriptscriptstyle i})(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\beta_{\scriptscriptstyle 0})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}(X-\mu_{\scriptscriptstyle i})

where Ai=γTΣiγ(γTΣiβ)(βTΣiβ)1(βTΣiγ)A_{\scriptscriptstyle i}=\gamma^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\gamma-(\gamma^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\beta)(\beta^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\beta)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}(\beta^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\gamma). Since Σi\Sigma_{\scriptscriptstyle i} is invariant of ii, so is AiA_{\scriptscriptstyle i}. Denote the former by Σ\Sigma and the latter by AA, and write G1(𝒮(β)){\mathrm{G}}_{\scriptscriptstyle 1}({\mathcal{S}}({\beta})) as Ψ(𝒮(β))F2\|\Psi({\mathcal{S}}({\beta}))\|_{\scriptscriptstyle F}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}. We then have

γTΨ(𝒮(β))\displaystyle\gamma^{\mbox{\tiny{\sf T}}}\Psi({\mathcal{S}}({\beta})) =E[{i=1qπi(β0TX)(0,A)(β0TΣiβ0)1β0T(Xμi)YS}DT(β0TX)]\displaystyle=E[\{\textstyle\sum_{\scriptscriptstyle i=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)(0,A)(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\beta_{\scriptscriptstyle 0})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}(X-\mu_{\scriptscriptstyle i})\vec{Y}_{\scriptscriptstyle S}\}D^{\mbox{\tiny{\sf T}}}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)]
=(0,A)j=1qE2[i=1qπi(β0TX)πj(β0TX)(β0TΣiβ0)1β0T(Xμi)YS]\displaystyle=(0,A)\textstyle\sum_{\scriptscriptstyle j=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}E^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}[\textstyle\sum_{\scriptscriptstyle i=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)\pi_{\scriptscriptstyle j}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}\Sigma_{\scriptscriptstyle i}\beta_{\scriptscriptstyle 0})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}(X-\mu_{\scriptscriptstyle i})\vec{Y}_{\scriptscriptstyle S}]
(0,A)j=1qMj2.\displaystyle\equiv(0,A)\textstyle\sum_{\scriptscriptstyle j=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}M_{\scriptscriptstyle j}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}.

Because Σ\Sigma is invertible, AA must be nonzero. Hence, if j=1qMj2=(Mj)j{1,,q}2\textstyle\sum_{\scriptscriptstyle j=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}M_{\scriptscriptstyle j}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}=(M_{\scriptscriptstyle j})_{\scriptscriptstyle j\in\{1,\ldots,q\}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}} is invertible, which is the condition adopted in this theorem, then γTΨ(𝒮(β))\gamma^{\mbox{\tiny{\sf T}}}\Psi({\mathcal{S}}({\beta})) is always nonzero, and, as G1(𝒮(β))γγTΨ(𝒮(β))F2{\mathrm{G}}_{\scriptscriptstyle 1}({\mathcal{S}}({\beta}))\geq\|\gamma\gamma^{\mbox{\tiny{\sf T}}}\Psi({\mathcal{S}}({\beta}))\|_{\scriptscriptstyle F}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}, G1(𝒮(β)){\mathrm{G}}_{\scriptscriptstyle 1}({\mathcal{S}}({\beta})) must also be nonzero for any β\beta that spans a proper subspace of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. Thus, SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} exhaustively recovers 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. This completes the proof. ∎

The exhaustiveness condition in Theorem 2 essentially requires that, for any direction vv in 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}, the association between vTXv^{\mbox{\tiny{\sf T}}}X and YY is asymmetric in at least one mixture component of β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X. Again, this is likely to hold in a Bayesian sense if the number of mixture components qq is sufficiently large in the mixture model of XX, and it will be trivially true if we let qq diverge with nn, e.g. in the scenario of approximating a non-clustered distribution of XX by the mixture model. Like SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}, depending on the complexity of WW, SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} builds a path from SIR, which adopts the aggressive parametric assumption on XX, to the fully nonparametric SDR methods, i.e. those of MAVE-type, which require the weakest assumption on XX. At one extreme, where WW is degenerate, D(β0TX)D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) will also be degenerate and RX(𝒮(β))R_{\scriptscriptstyle X}({\mathcal{S}}({\beta})) will reduce to a linear E(X|βTX)E(X|\beta^{\mbox{\tiny{\sf T}}}X), which together imply the identity between SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} and SIR. At the other extreme, where WW is identical to β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X, the integrand in (24) would resemble the efficient score for homoscedastic data (Luo, Li and Yin, 2014), had β0\beta_{\scriptscriptstyle 0} in the integrand been replaced with β\beta and E{YS|βTX}E\{\vec{Y}_{\scriptscriptstyle S}|\beta^{\mbox{\tiny{\sf T}}}X\} been consistently estimated instead of being misspecified at zero.

The concern behind the use of β0\beta_{\scriptscriptstyle 0} instead of the argument β\beta in πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) and D(β0TX)D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) is as follows. If we replace β0\beta_{\scriptscriptstyle 0} with the argument β\beta in both πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) in RX(𝒮(β))R_{\scriptscriptstyle X}({\mathcal{S}}({\beta})) and D(β0TX)D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X), then the unique minimizer of (24) of the smallest dimension, if it exists, will still fall in 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. However, the corresponding exhaustiveness will require stronger assumptions. For example, this space will be trivially zero-dimensional in Model (20), whereas SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} still recovers the one-dimensional 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. For this reason, we decide not to do so when constructing G1(){\mathrm{G}}_{\scriptscriptstyle 1}(\cdot).

To implement SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}, we assume its exhaustiveness as well as the exhaustiveness of SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}, and we start by estimating πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X), rewriting this term as E(πi(X)|β0TX)E(\pi_{\scriptscriptstyle i}(X)|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) and running a kernel regression of π^i(X)\widehat{\pi}_{\scriptscriptstyle i}(X) on β~TX\widetilde{\beta}^{\mbox{\tiny{\sf T}}}X, where 𝒮(β~){\mathcal{S}}({\widetilde{\beta}}) denotes the result of SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}. By plugging μ^i\widehat{\mu}_{\scriptscriptstyle i}’s, Σ^i\widehat{\Sigma}_{\scriptscriptstyle i}’s, and the estimates of πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)’s into (24), we obtain G^1()\widehat{\mathrm{G}}_{\scriptscriptstyle 1}(\cdot), which consistently estimates G1(){\mathrm{G}}_{\scriptscriptstyle 1}(\cdot). SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} is then the unique minimizer of G^1()\widehat{\mathrm{G}}_{\scriptscriptstyle 1}(\cdot), which is n1/2n^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}-consistent and asymptotically normal. Referring to the discussions above, it can outperform SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} at the sample level, which is analogous to the relative effectiveness of the refined MAVE with respect to MAVE (Xia et al., 2002). For continuity of exposition, we leave the detailed form of G^1()\widehat{\mathrm{G}}_{\scriptscriptstyle 1}(\cdot) and the algorithm for its minimization to Appendix B.
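As an illustration of the first step, below is a minimal R sketch of this kernel regression, with hypothetical inputs pi_hat (the n by q matrix of fitted posterior weights) and beta_tilde (the SIRM estimate), and a rule-of-thumb bandwidth; the fitted values are then plugged into the estimated objective as described above, with the minimization itself deferred to Appendix B.

nw_pi <- function(X, pi_hat, beta_tilde) {
  Z <- X %*% beta_tilde                                 # reduced predictor, n x d
  h <- apply(Z, 2, sd) * nrow(Z)^(-1 / (4 + ncol(Z)))   # rule-of-thumb bandwidths
  K <- matrix(1, nrow(Z), nrow(Z))
  for (k in 1:ncol(Z))                                  # product Gaussian kernel
    K <- K * dnorm(outer(Z[, k], Z[, k], "-") / h[k])
  sweep(K %*% pi_hat, 1, rowSums(K), "/")  # Nadaraya-Watson fit of E{pi_i(X) | beta^T X}
}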

As mentioned in the Introduction, when XX follows a mixture skew-elliptical distribution, Guan, Xie and Zhu (2017) proposed the generalized Stein’s Lemma based method (StI) to consistently recover 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. Due to the violation of the mixture component-wise linearity condition (7), both SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} are theoretically inconsistent under Guan, Xie and Zhu’s settings. Nonetheless, while StI requires identical Σi\Sigma_{\scriptscriptstyle i}’s up to multiplicative scalars and uses the average of the SDR results across different mixture components, we allow Σi\Sigma_{\scriptscriptstyle i}’s to vary freely and we collect the information for SDR from all the mixture components to form the final result. Therefore, our approaches are sometimes the only applicable ones, and they are more likely to recover 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} exhaustively. As seen next, our approaches can also be extended to the high-dimensional settings, as well as to the second-order inverse regression methods.

4 An extension towards the high-dimensional sparse settings

In the literature, SIR has been widely studied under the high-dimensional settings, with the aid of the sparsity assumption on 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} that only a few rows of β0\beta_{\scriptscriptstyle 0} are nonzero. Representative works include Chen, Zou and Cook (2010), Li (2007), Lin, Zhao and Liu (2019), and Tan et al. (2018). A common strategy employed in these works is to reformulate SIR as a least squares method and then incorporate penalty functions. Accordingly, the overall linearity condition (3) is commonly adopted, which, as mentioned in the Introduction, can be restrictive in practice due to the sparsity assumption.

We now modify SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} towards sparsity under the high-dimensional settings. For the applicability of existing clustering analysis, here we follow Cai, Ma and Zhang (2019) to assume that XX has a mixture multivariate normal distribution with the number of mixture components qq known a priori and all Σi\Sigma_{\scriptscriptstyle i}’s identical to some invertible Σ\Sigma, and that

Σ1(μiμj) is sparse for all ij,\displaystyle\Sigma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}(\mu_{\scriptscriptstyle i}-\mu_{\scriptscriptstyle j})\mbox{ is sparse for all }i\neq j, (25)

under which the consistent estimators μ^i\widehat{\mu}_{\scriptscriptstyle i}’s, Σ^\widehat{\Sigma}, and π^i(X)\widehat{\pi}_{\scriptscriptstyle i}(X)’s have been derived in Cai, Ma and Zhang (2019). Given the mixture model fit, we modify SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} following the spirit of Tan et al. (2018) for its computational efficiency.

Following Tan et al. (2018), we change the parametrization in SDR to Π=β2\Pi=\beta^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}} and further extend the parameter space to the set of pp-dimensional positive semi-definite matrices, denoted by {\mathcal{M}}. Let

ΩE(X|Y)[[E{(Xμi)πi(X)YS}]i{1,,q}]2,\displaystyle{\Omega}_{\scriptscriptstyle E(X|Y)}\equiv\left[[E\{(X-\mu_{\scriptscriptstyle i})\pi_{\scriptscriptstyle i}(X)\vec{Y}_{\scriptscriptstyle S}\}]_{\scriptscriptstyle i\in\{1,\ldots,q\}}\right]^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}, (26)

and Ω^E(X|Y)\widehat{\Omega}_{\scriptscriptstyle E(X|Y)} be its estimator using μ^i\widehat{\mu}_{\scriptscriptstyle i}’s and π^i(X)\widehat{\pi}_{\scriptscriptstyle i}(X)’s. We minimize

tr(Ω^E(X|Y)Π)+ρΠ1 subject to tr(Σ^1/2ΠΣ^1/2)d,Σ^1/2ΠΣ^1/2sp1,\displaystyle-{\mathrm{tr}}(\widehat{\Omega}_{\scriptscriptstyle E(X|Y)}\Pi)+\rho\|\Pi\|_{\scriptscriptstyle 1}\mbox{ subject to }{\mathrm{tr}}(\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}\Pi\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}})\leq d,\|\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}\Pi\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}\|_{\scriptscriptstyle\mathrm{sp}}\leq 1, (27)

over Π\Pi\in{\mathcal{M}}, where ρ\rho is a tuning parameter, tr(){\mathrm{tr}}(\cdot) and sp\|\cdot\|_{\scriptscriptstyle\mathrm{sp}} denote the trace and the spectral norm of a matrix, respectively, and A1=i,j|aij|\|A\|_{\scriptscriptstyle 1}=\textstyle\sum_{\scriptscriptstyle i,j}|a_{\scriptscriptstyle ij}| for any matrix A=(aij)A=(a_{\scriptscriptstyle ij})\in{\mathcal{M}}.
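For illustration, the program (27) can be handed directly to a generic convex solver; the R sketch below uses the CVXR package, with hypothetical inputs Omega_hat, Sigma_half (the square root of the estimated covariance), d, and rho. This direct formulation does not scale to large pp, for which the ADMM algorithm of Tan et al. (2018) is the practical choice.

library(CVXR)

p <- nrow(Omega_hat)
Pi <- Variable(p, p, PSD = TRUE)         # optimize over positive semi-definite matrices
S <- Sigma_half %*% Pi %*% Sigma_half    # Sigma-hat^{1/2} Pi Sigma-hat^{1/2}
prob <- Problem(Minimize(-matrix_trace(Omega_hat %*% Pi) + rho * sum(abs(Pi))),
                list(matrix_trace(S) <= d, sigma_max(S) <= 1))
Pi_hat <- solve(prob)$getValue(Pi)       # a minimizer of (27)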

Because (27) is a convex minimization problem, it has a unique minimizer, which we denote by Π^\widehat{\Pi}. By simple algebra, Π^\widehat{\Pi} must be on the boundary of the constraints in (27), and must have the form β^2\widehat{\beta}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}} for some β^p×d\widehat{\beta}\in\mathbb{R}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle p\times d$}}} with Σ^1/2β^\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}\widehat{\beta} being semi-orthogonal. We estimate 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} by 𝒮(β^){\mathcal{S}}({\widehat{\beta}}), and call this estimator the sparse SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} or simply SIRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}}, due to the natural correspondence between ΩE(X|Y){\Omega}_{\scriptscriptstyle E(X|Y)} and Ω1\Omega_{\scriptscriptstyle 1} defined in (19). Its consistency is justified in the following theorem, where we use

δ(β^,β0)P(Ip,β^)P(Ip,β0)F\displaystyle\delta(\widehat{\beta},\beta_{\scriptscriptstyle 0})\equiv\|P(\mathrm{I}_{\scriptscriptstyle p},\widehat{\beta})-P(\mathrm{I}_{\scriptscriptstyle p},\beta_{\scriptscriptstyle 0})\|_{\scriptscriptstyle F} (28)

to measure the distance between 𝒮(β^){\mathcal{S}}({\widehat{\beta}}) and 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}.

Theorem 3.

Let s1=maxi,j{1,,q}Σ1(μiμj)s_{\scriptscriptstyle 1}=\max_{\scriptscriptstyle i,j\in\{1,\ldots,q\}}\|\Sigma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}(\mu_{\scriptscriptstyle i}-\mu_{\scriptscriptstyle j})\|_{\scriptscriptstyle\infty} under the sparsity assumption (25), let s2s_{\scriptscriptstyle 2} be the number of nonzero rows of β0\beta_{\scriptscriptstyle 0} under the sparse SDR assumption, and let λd\lambda_{\scriptscriptstyle d} be the ddth largest eigenvalue of Σ1/2ΩE(X|Y)Σ1/2\Sigma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1/2$}}}{\Omega}_{\scriptscriptstyle E(X|Y)}\Sigma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1/2$}}}. Under the regularity conditions (C1)-(C3) (see Appendix A), if d2s1s2logp=o(n)d^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}s_{\scriptscriptstyle 1}s_{\scriptscriptstyle 2}\log p=o(n), s1s22logp=o(n)s_{\scriptscriptstyle 1}s_{\scriptscriptstyle 2}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}\log p=o(n), ds1logp=o(nρλd)ds_{\scriptscriptstyle 1}\log p=o(n\rho\lambda_{\scriptscriptstyle d}), and ρs2=o(λd)\rho s_{\scriptscriptstyle 2}=o(\lambda_{\scriptscriptstyle d}), then we have

δ(β^,β0)=OP(ds1logp/(nρλd)+ρs2/λd+d(s1s2logp/n)1/2).\displaystyle\delta(\widehat{\beta},\beta_{\scriptscriptstyle 0})=O_{\scriptscriptstyle P}(ds_{\scriptscriptstyle 1}\log p/(n\rho\lambda_{\scriptscriptstyle d})+\rho s_{\scriptscriptstyle 2}/\lambda_{\scriptscriptstyle d}+d(s_{\scriptscriptstyle 1}s_{\scriptscriptstyle 2}\log p/n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}). (29)
Proof.

Theorem 3 can be proved by incorporating the theoretical results of Cai, Ma and Zhang (2019) into the proof of Theorem 1 in Tan et al. (2018). Hence, we only provide a sketch of the proof here for the case q=2q=2. Let γ=Σ1(μ1μ2)\gamma=\Sigma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}(\mu_{\scriptscriptstyle 1}-\mu_{\scriptscriptstyle 2}). Under Condition (C1) in Appendix A of this article and Lemma 3.1, Lemma 3.2, and Theorem 3.1 in Cai, Ma and Zhang (2019), we have

max(|π^1π1|,μ^1μ12,μ^2μ22,γ^γ2)=OP((s1logp/n)1/2).\displaystyle\max(|\widehat{\pi}_{\scriptscriptstyle 1}-\pi_{\scriptscriptstyle 1}|,\|\widehat{\mu}_{\scriptscriptstyle 1}-\mu_{\scriptscriptstyle 1}\|_{\scriptscriptstyle 2},\|\widehat{\mu}_{\scriptscriptstyle 2}-\mu_{\scriptscriptstyle 2}\|_{\scriptscriptstyle 2},\|\widehat{\gamma}-\gamma\|_{\scriptscriptstyle 2})=O_{\scriptscriptstyle P}((s_{\scriptscriptstyle 1}\log p/n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}).

Together with the sub-Gaussian condition of XX implied by its mixture multivariate normal distribution and Conditions (C2)-(C3) in Appendix A of this article, it is easy to modify the proofs of Lemma 1 and Proposition 1 in Tan et al. (2018) to derive

Ω^E(X|Y)ΩE(X|Y)max=OP((s1logp/n)1/2) and Σ^Σmax=OP((s1logp/n)1/2),\displaystyle\|\widehat{\Omega}_{\scriptscriptstyle E(X|Y)}\!-\!{\Omega}_{\scriptscriptstyle E(X|Y)}\|_{\scriptscriptstyle\max}\!=\!O_{\scriptscriptstyle P}((s_{\scriptscriptstyle 1}\!\log p/n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}})\mbox{ and }\|\widehat{\Sigma}\!-\!\Sigma\|_{\scriptscriptstyle\max}\!=\!O_{\scriptscriptstyle P}((s_{\scriptscriptstyle 1}\!\log p/n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}),

where Amax=max{|aij|:i,j{1,,p}}\|A\|_{\scriptscriptstyle\max}=\max\{|a_{\scriptscriptstyle ij}|:i,j\in\{1,\ldots,p\}\} for any matrix A=(aij)p×pA=(a_{\scriptscriptstyle ij})\in\mathbb{R}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle p\times p$}}}. Similarly, in our scenario, Lemma S1 in the supplementary material for Tan et al. (2018) can be modified to

Λ~ΛspC(s1s2logp/n)1/2 and Π~ΠFCd(s1s2logp/n)1/2\displaystyle\|\tilde{\Lambda}-\Lambda\|_{\scriptscriptstyle\mathrm{sp}}\leq C(s_{\scriptscriptstyle 1}s_{\scriptscriptstyle 2}\log p/n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}\mbox{ and }\|\tilde{\Pi}-\Pi\|_{\scriptscriptstyle F}\leq Cd(s_{\scriptscriptstyle 1}s_{\scriptscriptstyle 2}\log p/n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}} (30)

where Λ\Lambda is defined in Equation (4) of Tan et al. (2018), Π=β02\Pi=\beta_{\scriptscriptstyle 0}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}, and the definitions of Λ~\tilde{\Lambda} and Π~\tilde{\Pi} can be found above Lemma S1 in the supplementary material for Tan et al. (2018); Lemma S5 in the supplementary material for Tan et al. (2018) can be modified so that the term {slog(ep)/n}2\{s\log(ep)/n\}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}} is changed to {s1s2log(ep)/n}1/2\{s_{\scriptscriptstyle 1}s_{\scriptscriptstyle 2}\log(ep)/n\}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}; Lemma S6 in the supplementary material for Tan et al. (2018) can be modified so that the convergence rate in its conclusion becomes C(s1logp/n)1/2C(s_{\scriptscriptstyle 1}\log p/n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}; the rest of the lemmas, including Lemma S2, Lemma S3, and Lemma S4, will still hold. Consequently, in the proof of Tan et al. (2018)’s Theorem 1 (see their supplementary material), the condition s1s22logp=o(n)s_{\scriptscriptstyle 1}s_{\scriptscriptstyle 2}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}\log p=o(n) is needed to ensure (S25), and the term γ\gamma (which is a different term from γ=Σ1(μ1μ2)\gamma=\Sigma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}(\mu_{\scriptscriptstyle 1}-\mu_{\scriptscriptstyle 2}) defined above) is of order OP((ds1s2logp/n)1/2)O_{\scriptscriptstyle P}((ds_{\scriptscriptstyle 1}s_{\scriptscriptstyle 2}\log p/n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}). A modification of equation (S29) in their proof then yields

ΔF\displaystyle\|\Delta\|_{\scriptscriptstyle F} =OP(γ/λd+γ2/(ρs2λd)+ρs2/λd)=OP(γ2/(ρs2λd)+ρs2/λd)\displaystyle=O_{\scriptscriptstyle P}(\gamma/\lambda_{\scriptscriptstyle d}+\gamma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}/(\rho s_{\scriptscriptstyle 2}\lambda_{\scriptscriptstyle d})+\rho s_{\scriptscriptstyle 2}/\lambda_{\scriptscriptstyle d})=O_{\scriptscriptstyle P}(\gamma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}/(\rho s_{\scriptscriptstyle 2}\lambda_{\scriptscriptstyle d})+\rho s_{\scriptscriptstyle 2}/\lambda_{\scriptscriptstyle d})
=OP(ds1logp/(nρλd)+ρs2/λd),\displaystyle=O_{\scriptscriptstyle P}(ds_{\scriptscriptstyle 1}\log p/(n\rho\lambda_{\scriptscriptstyle d})+\rho s_{\scriptscriptstyle 2}/\lambda_{\scriptscriptstyle d}), (31)

where the definition of Δ\Delta can be found below Lemma S1 in the supplementary material for Tan et al. (2018). By the subsequent logic flow of their proof, (30) and (31) together imply (29). This completes the proof. ∎

In Theorem 3, we allow both the number of variables of XX that are uniquely informative about YY and the dimension of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} to diverge. (29) suggests that the optimal value of ρ\rho is of order (ds1logp)1/2/(s2n)1/2(ds_{\scriptscriptstyle 1}\log p)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}/(s_{\scriptscriptstyle 2}n)^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}. In practice, we follow Tan et al. (2018) and suggest tuning ρ\rho by cross validation; details omitted.

To modify SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} towards sparsity, we follow Theorem 2 to assume the equality of Σi\Sigma_{\scriptscriptstyle i}’s, under which SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} delivers the dd leading eigenvectors of Σ1ΩrSIRΣ1\Sigma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}{\Omega}_{\scriptscriptstyle\mathrm{rSIR}}\Sigma^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}} at the population level, with

ΩrSIRE2[{i=1qπi(β0TX)(Xμi)}vecT{D(β0TX)diag(YS)}].\displaystyle{\Omega}_{\scriptscriptstyle\mathrm{rSIR}}\equiv E^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}\left[\{\textstyle\sum_{\scriptscriptstyle i=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)(X-\mu_{\scriptscriptstyle i})\}{\mathrm{vec}}^{\mbox{\tiny{\sf T}}}\{D(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X){\mathrm{diag}}(\vec{Y}_{\scriptscriptstyle S})\}\right]. (32)

Let Ω^rSIR\widehat{\Omega}_{\scriptscriptstyle\mathrm{rSIR}} be a consistent estimator of ΩrSIR{\Omega}_{\scriptscriptstyle\mathrm{rSIR}} using μ^i\widehat{\mu}_{\scriptscriptstyle i}’s to replace μi\mu_{\scriptscriptstyle i}’s and using SIRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}} to replace 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. Similar to (27), we minimize

tr(Ω^rSIRΠ)+ρΠ1 subject to tr(Σ^1/2ΠΣ^1/2)d,Σ^1/2ΠΣ^1/2sp1,\displaystyle-{\mathrm{tr}}(\widehat{\Omega}_{\scriptscriptstyle\mathrm{rSIR}}\Pi)+\rho\|\Pi\|_{\scriptscriptstyle 1}\mbox{ subject to }{\mathrm{tr}}(\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}\Pi\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}})\leq d,\|\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}\Pi\widehat{\Sigma}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}\|_{\scriptscriptstyle\mathrm{sp}}\leq 1, (33)

over Π\Pi\in{\mathcal{M}}, which again has a unique minimizer of the form β^R2\widehat{\beta}_{\scriptscriptstyle R}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}} for some β^Rp×d\widehat{\beta}_{\scriptscriptstyle R}\in\mathbb{R}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle p\times d$}}} that satisfies β^RTΣ^β^R=Id\widehat{\beta}_{\scriptscriptstyle R}^{\mbox{\tiny{\sf T}}}\widehat{\Sigma}\widehat{\beta}_{\scriptscriptstyle R}=\mathrm{I}_{\scriptscriptstyle d}. We call 𝒮(β^R){\mathcal{S}}({\widehat{\beta}_{\scriptscriptstyle R}}) the sparse SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} or simply SIRRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}}. By a reasoning similar to that of Theorem 3, SIRRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}} converges to 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} at the same rate as SIRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}}. Because it does not involve πi(X)\pi_{\scriptscriptstyle i}(X), it can be more robust than SIRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}} against the estimation bias in fitting the mixture model of XX caused by a large pp or the non-normality of the mixture components.

As mentioned above (25), the implementation of both SIRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}} and SIRRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}} requires qq to be correctly specified a priori in the clustering stage. Because the existing literature lacks consistent estimators of qq under the high-dimensional settings, qq can potentially be underestimated in practice, causing inconsistency of SIRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}} and SIRRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}}. Nonetheless, as these methods reduce to the sparse SIR proposed in Tan et al. (2018) when qq is forced to be one, which is the worst-case scenario for estimating qq, they are still expected to outperform the existing sparse SIR if qq is underestimated to be some q^>1\widehat{q}>1. To determine dd under the high-dimensional settings, which coincides with the rank of both ΩE(X|Y)\Omega_{\scriptscriptstyle E(X|Y)} and ΩrSIR{\Omega}_{\scriptscriptstyle\mathrm{rSIR}}, we recommend applying the aforementioned PAE (Luo and Li, 2021) to either Ω^E(X|Y)\widehat{\Omega}_{\scriptscriptstyle E(X|Y)} or Ω^rSIR\widehat{\Omega}_{\scriptscriptstyle\mathrm{rSIR}}; details omitted.

5 Adjusting SAVE for XX under mixture model

We now apply the two strategies proposed above, in parallel, to adjust SAVE for XX under the mixture model (6). The results can be developed similarly for the other second-order inverse regression methods, e.g. pHd and directional regression; see Appendix A for details.

The first strategy amounts to conducting SAVE within each mixture component of XX and then taking the ensemble of the results. For h{1,,H}h\in\{1,\ldots,H\}, let YS,hY_{\scriptscriptstyle S,h} be the hhth entry of YS\vec{Y}_{\scriptscriptstyle S}, i.e. YS,hY_{\scriptscriptstyle S,h} is one if and only if YSY_{\scriptscriptstyle S} equals hh. For i{1,,q}i\in\{1,\ldots,q\}, we introduce

E{πi(X)YS,h}IpΣi1[E{(Xμi)2πi(X)YS,h}HE2{(Xμi)πi(X)YS,h}],\displaystyle E\{\pi_{\scriptscriptstyle i}(X)Y_{\scriptscriptstyle S,h}\}\mathrm{I}_{\scriptscriptstyle p}-\Sigma_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\left[E\{(X-\mu_{\scriptscriptstyle i})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}\pi_{\scriptscriptstyle i}(X)Y_{\scriptscriptstyle S,h}\}-HE^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}\{(X-\mu_{\scriptscriptstyle i})\pi_{\scriptscriptstyle i}(X)Y_{\scriptscriptstyle S,h}\}\right],

denoted by Ωh,i\Omega_{\scriptscriptstyle h,i}, whose ensemble over hh mimics SAVE for the iith mixture component of XX. We recover 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} by the column space of Ω2(Ωh,i)h{1,,H},i{1,,q}\Omega_{\scriptscriptstyle 2}\equiv(\Omega_{\scriptscriptstyle h,i})_{\scriptscriptstyle h\in\{1,\ldots,H\},i\in\{1,\ldots,q\}}, and call the corresponding SDR method SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}}. When XX has only one mixture component, SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}} reduces to SAVE. Its general consistency is justified in the following theorem.

Theorem 4.

Suppose XX follows a mixture model that satisfies (7) and (8). Then Ω2\Omega_{\scriptscriptstyle 2} always spans a subspace of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. In addition, Ω2\Omega_{\scriptscriptstyle 2} spans 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} if, for any v𝒮Y|Xv\in{\mathcal{S}}_{\scriptscriptstyle Y|X}, E{(vTX,vTX2v)πi(X)YS,h}E\{(v^{\mbox{\tiny{\sf T}}}X,v^{\mbox{\tiny{\sf T}}}X^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}v)\pi_{\scriptscriptstyle i}(X)Y_{\scriptscriptstyle S,h}\} varies with hh for at least one i{1,,q}i\in\{1,\ldots,q\}.

Proof.

For each h{1,,H}h\in\{1,\ldots,H\}, we can rewrite Ωh,i\Omega_{\scriptscriptstyle h,i} as

Ωh,i\displaystyle\Omega_{\scriptscriptstyle h,i} =[E{πi(X)|YS,h=1}/H]Ip\displaystyle=[E\{\pi_{\scriptscriptstyle i}(X)|Y_{\scriptscriptstyle S,h}=1\}/H]\mathrm{I}_{\scriptscriptstyle p}-
Σi1[E{(Xμi)2πi(X)|YS,h=1}E2{(Xμi)πi(X)|YS,h=1}]/H\displaystyle\hskip 11.38092pt\Sigma_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}[E\{(X-\mu_{\scriptscriptstyle i})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}\pi_{\scriptscriptstyle i}(X)|Y_{\scriptscriptstyle S,h}=1\}-E^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}\{(X-\mu_{\scriptscriptstyle i})\pi_{\scriptscriptstyle i}(X)|Y_{\scriptscriptstyle S,h}=1\}]/H
=[E{πi(X)|YS,h=1}IpΣi1var{(Xμi)I(W=i)|YS,h=1}]/H\displaystyle=[E\{\pi_{\scriptscriptstyle i}(X)|Y_{\scriptscriptstyle S,h}=1\}\mathrm{I}_{\scriptscriptstyle p}-\Sigma_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\mathrm{var}\{(X-\mu_{\scriptscriptstyle i})I(W=i)|Y_{\scriptscriptstyle S,h}=1\}]/H
=E{πi(X)IpΣi1vari(X|β0TX)|YS,h=1}/H\displaystyle=E\{\pi_{\scriptscriptstyle i}(X)\mathrm{I}_{\scriptscriptstyle p}-\Sigma_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\mathrm{var}_{\scriptscriptstyle i}(X|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)|Y_{\scriptscriptstyle S,h}=1\}/H
Σi1var{Ei(X|β0TX)|YS,h=1}/H.\displaystyle\hskip 11.38092pt-\Sigma_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\mathrm{var}\{E_{\scriptscriptstyle i}(X|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)|Y_{\scriptscriptstyle S,h}=1\}/H.

Thus, by (9) and (11), we have Ωh,i=β0Ah,i\Omega_{\scriptscriptstyle h,i}=\beta_{\scriptscriptstyle 0}A_{\scriptscriptstyle h,i} for some Ah,id×pA_{\scriptscriptstyle h,i}\in\mathbb{R}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle d\times p$}}}, which implies 𝒮(Ω2)𝒮Y|X{\mathcal{S}}({\Omega_{\scriptscriptstyle 2}})\subseteq{\mathcal{S}}_{\scriptscriptstyle Y|X}. The condition for the coincidence between 𝒮(Ω2){\mathcal{S}}({\Omega_{\scriptscriptstyle 2}}) and 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} can be derived straightforwardly following the proofs of Theorems 3 and 4 in Li and Wang (2007). ∎

By Theorem 4, the exhaustiveness of SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}} requires each informative direction of XX to be captured by SAVE in at least one mixture component of XX, so it is quite general (Li and Wang, 2007). The implementation of SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}} involves replacing the parameters of the mixture model in Ω2\Omega_{\scriptscriptstyle 2} with the corresponding estimators in Section 2 and replacing the population moments with the sample moments. The resulting Ω^2\widehat{\Omega}_{\scriptscriptstyle 2} is n1/2n^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}-consistent and asymptotically normal, by which the ladle estimator (Luo and Li, 2016) can be applied again to determine dd. The leading dd left singular vectors of Ω^2\widehat{\Omega}_{\scriptscriptstyle 2} then span an n1/2n^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}-consistent estimator of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. These results are straightforward, so we omit the details.
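For concreteness, a minimal R sketch of one sample-level block of Ω2 is given below, reusing the mixture fit and the slice indicators Ys from the earlier SIR adjustment sketch; stacking the blocks column-wise over h and i yields the full estimate, whose leading d left singular vectors are then extracted as above.

omega_hi <- function(X, Ys, mu_i, Sigma_i, wi, h) {
  n <- nrow(X); H <- ncol(Ys)
  Xc <- sweep(X, 2, mu_i)                # X minus the i-th fitted mean
  a <- wi * Ys[, h]                      # pi_i-hat(X) times Y_{S,h}
  m1 <- colMeans(Xc * a)                 # sample E{(X - mu_i) pi_i(X) Y_{S,h}}
  m2 <- crossprod(Xc * a, Xc) / n        # sample E{(X - mu_i)^{otimes 2} pi_i(X) Y_{S,h}}
  mean(a) * diag(ncol(X)) - solve(Sigma_i, m2 - H * tcrossprod(m1))
}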

Under framework (5), SAVE can be reformulated by taking

g(Y,βTX)=ϕ(βTX)[E(X2YS|βTX){var(X|βTX)+E2(X|βTX)}YS]\displaystyle g(Y,\beta^{\mbox{\tiny{\sf T}}}X)=\phi(\beta^{\mbox{\tiny{\sf T}}}X)\left[E(X^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}\otimes\vec{Y}_{\scriptscriptstyle S}|\beta^{\mbox{\tiny{\sf T}}}X)-\{\mathrm{var}(X|\beta^{\mbox{\tiny{\sf T}}}X)+E^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}(X|\beta^{\mbox{\tiny{\sf T}}}X)\}\otimes\vec{Y}_{\scriptscriptstyle S}\right]

with ϕ(βTX)\phi(\beta^{\mbox{\tiny{\sf T}}}X) set at a constant for simplicity, α(X)=E{g(Y,βTX)|X}\alpha(X)=E\{g(Y,\beta^{\mbox{\tiny{\sf T}}}X)|X\}, μg(βTX)\mu_{\scriptscriptstyle g}(\beta^{\mbox{\tiny{\sf T}}}X) set at zero because E{g(Y,β0TX)|β0TX}E\{g(Y,\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X)|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X\} is zero, and μα(βTX)\mu_{\scriptscriptstyle\alpha}(\beta^{\mbox{\tiny{\sf T}}}X) also set at zero due to the double-robust property. To adjust for the mixture model of XX, we apply the functional forms (10) for E(X|βTX)E(X|\beta^{\mbox{\tiny{\sf T}}}X) and (12) for var(X|βTX)\mathrm{var}(X|\beta^{\mbox{\tiny{\sf T}}}X), and we set ϕ(βTX)\phi(\beta^{\mbox{\tiny{\sf T}}}X) at πi2(βTX)\pi_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}(\beta^{\mbox{\tiny{\sf T}}}X) for each i{1,,q}i\in\{1,\ldots,q\} and then merge the estimating equations. For simplicity, one may also set ϕ(βTX)\phi(\beta^{\mbox{\tiny{\sf T}}}X) at a constant as in SAVE. The resulting objective function is

G2(𝒮(β))i=1qh=1H\displaystyle{\mathrm{G}}_{\scriptscriptstyle 2}({\mathcal{S}}({\beta}))\equiv\textstyle\sum_{\scriptscriptstyle i=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}\textstyle\sum_{\scriptscriptstyle h=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle H$}}} E{πi2(βTX)X2YS,h}E{πi2(βTX)μ2(βTX)YS,h}\displaystyle\left\|E\{\pi_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}(\beta^{\mbox{\tiny{\sf T}}}X)X^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\otimes 2$}}}Y_{\scriptscriptstyle S,h}\}-E\{\pi_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}(\beta^{\mbox{\tiny{\sf T}}}X)\mu_{\scriptscriptstyle 2}(\beta^{\mbox{\tiny{\sf T}}}X)Y_{\scriptscriptstyle S,h}\}\right.
j=1qE{πi2(βTX)πj(βTX)YS,h}ΣjQ(Σj,β)F2,\displaystyle\hskip 5.69046pt\left.-\textstyle\sum_{\scriptscriptstyle j=1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle q$}}}E\{\pi_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}(\beta^{\mbox{\tiny{\sf T}}}X)\pi_{\scriptscriptstyle j}(\beta^{\mbox{\tiny{\sf T}}}X)Y_{\scriptscriptstyle S,h}\}\Sigma_{\scriptscriptstyle j}Q(\Sigma_{\scriptscriptstyle j},\beta)\right\|_{\scriptscriptstyle F}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}},

where μ2(βTX)\mu_{\scriptscriptstyle 2}(\beta^{\mbox{\tiny{\sf T}}}X) is defined in Lemma 3. As with G1(){\mathrm{G}}_{\scriptscriptstyle 1}(\cdot), there exists a unique minimizer of G2(){\mathrm{G}}_{\scriptscriptstyle 2}(\cdot) of the smallest dimension, which is always a subspace of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} under the mixture component-wise linearity and constant variance conditions (7) and (8). The exhaustiveness of this space roughly follows that of SAVE, although πi(βTX)\pi_{\scriptscriptstyle i}(\beta^{\mbox{\tiny{\sf T}}}X) is used instead of πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) in G2(){\mathrm{G}}_{\scriptscriptstyle 2}(\cdot).

Let G^2()\widehat{\mathrm{G}}_{\scriptscriptstyle 2}(\cdot) be the estimator of G2(){\mathrm{G}}_{\scriptscriptstyle 2}(\cdot) using the model fitting results in Section 2. We call the unique dd-dimensional minimizer of G^2()\widehat{\mathrm{G}}_{\scriptscriptstyle 2}(\cdot) the refined SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}} or simply SAVERM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{RM}}, whose n1/2n^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}}-consistency can be easily proved. The implementation of SAVERM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{RM}} is deferred to Appendix B. Again, the advantage of SAVERM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{RM}} over SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}} comes from the fact that the mixture components are embedded in the low-dimensional βTX\beta^{\mbox{\tiny{\sf T}}}X rather than the ambient XX, comparable to the advantage of the refined MAVE over MAVE.

6 Simulation studies

We now use simulation models to evaluate the effectiveness of the proposed methods. We will start by generating XX from some simple mixture multivariate normal models, under which we examine the performance of the proposed SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}. To better comprehend the applicability of SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}, we will then assess their robustness under more complex mixture models of XX and under the violation of the mixture model assumption itself. The adjusted sparse SIR, i.e. SIRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}} and SIRRMS\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle\mathrm{S}$}}}, will be evaluated afterwards under the high-dimensional settings where pnp\geq n, and the adjustments of SAVE will be assessed last. The section is divided into four subsections in this order. A complementary simulation study is presented in Appendix C, where we evaluate how the estimation error in fitting the mixture model and a hypothetical misspecification of qq impact the proposed SDR results.

6.1 Adjusted SIR for mixture normal XX

We first evaluate the performance of the proposed SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} in comparison with SIR and the MAVE-type methods, i.e. MAVE, ADR, and eCVE, in the case that XX follows a mixture multivariate normal distribution. The R packages meanMAVE and CVarE are used to implement MAVE and eCVE, respectively, where the tuning parameters are automatically selected. To implement ADR, we follow the suggestion in Wang et al. (2020) to set H=2H=2 and characterize the local neighborhoods of XX by the kk-nearest neighbors with k=4pk=4p. To fit the mixture model in implementing SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}, we use maximum likelihood estimation, with qq determined by BIC, both available from the R package mixtools. For SIR, SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}, and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}, we set H=5H=5 uniformly unless YY is discrete.
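A sketch of this clustering stage is given below; the BIC formula counts the free parameters of a q-component normal mixture, the grid 1:q_max and the single-component branch (via the mvtnorm package) are our own conventions for illustration, and the exact implementation may differ from the one used in the simulations.

library(mixtools)

select_q <- function(X, q_max = 4) {
  n <- nrow(X); p <- ncol(X)
  bic <- sapply(1:q_max, function(q) {
    if (q == 1) {                               # a single normal component
      ll <- sum(mvtnorm::dmvnorm(X, colMeans(X), cov(X), log = TRUE))
      npar <- p + p * (p + 1) / 2
    } else {
      fit <- mvnormalmixEM(X, k = q, maxit = 500)
      ll <- fit$loglik
      npar <- (q - 1) + q * (p + p * (p + 1) / 2)
    }
    -2 * ll + log(n) * npar                     # BIC for q components
  })
  which.min(bic)                                # selected number of components
}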

We apply these SDR methods to the following four models, where ε\varepsilon is an independent error with standard normal distribution. In the first three models, XX is a balanced mixture of two multivariate normal distributions, i.e. π1N(μ1,Σ1)+π2N(μ2,Σ2)\pi_{\scriptscriptstyle 1}N(\mu_{\scriptscriptstyle 1},\Sigma_{\scriptscriptstyle 1})+\pi_{\scriptscriptstyle 2}N(\mu_{\scriptscriptstyle 2},\Sigma_{\scriptscriptstyle 2}) with π1=π2=.5\pi_{\scriptscriptstyle 1}=\pi_{\scriptscriptstyle 2}=.5, and we set μ2=μ1\mu_{\scriptscriptstyle 2}=-\mu_{\scriptscriptstyle 1} and Σi=[ri]p\Sigma_{\scriptscriptstyle i}=[r_{\scriptscriptstyle i}]_{\scriptscriptstyle p} for some scalar rir_{\scriptscriptstyle i} for i=1,2i=1,2. Here, [ri]p[r_{\scriptscriptstyle i}]_{\scriptscriptstyle p} denotes the pp-dimensional square matrix whose (j,k)(j,k)th entry is ri|jk|r_{\scriptscriptstyle i}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle|j-k|$}}}, and we allow r1r2r_{\scriptscriptstyle 1}\neq r_{\scriptscriptstyle 2}. The case of unbalanced XX is studied in Model 44 and further in Subsection 6.2 later.

Model 1:Y=sgn(X1+1)log(|X22+ε|),\displaystyle\mbox{Model 1:}\quad Y=\mathrm{sgn}(X_{\scriptscriptstyle 1}+1)\,\log(|X_{\scriptscriptstyle 2}-2+\varepsilon|),
μ1=(1.5,1.5,1.5,0,,0)T,(r1,r2)=(.5,.5),\displaystyle\qquad\qquad\quad\mu_{\scriptscriptstyle 1}=(1.5,1.5,1.5,0,\cdots,0)^{\mbox{\tiny{\sf T}}},(r_{\scriptscriptstyle 1},r_{\scriptscriptstyle 2})=(.5,.5),
Model 2:Y=sgn(X1)(5X12+X22)+.5ε,\displaystyle\mbox{Model 2:}\quad Y=\mathrm{sgn}(X_{\scriptscriptstyle 1})\,(5-X_{\scriptscriptstyle 1}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}+X_{\scriptscriptstyle 2}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}})+.5\varepsilon,
μ1=(.8,.8,.8,0,,0)T,(r1,r2)=(.5,.5),\displaystyle\qquad\qquad\quad\mu_{\scriptscriptstyle 1}=(.8,.8,.8,0,\ldots,0)^{\mbox{\tiny{\sf T}}},(r_{\scriptscriptstyle 1},r_{\scriptscriptstyle 2})=(.5,-.5),
Model 3:Y=(X1+X2)2+exp(X3+X4)ε,\displaystyle\mbox{Model 3:}\quad Y=(X_{\scriptscriptstyle 1}+X_{\scriptscriptstyle 2})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 2$}}}+\exp(X_{\scriptscriptstyle 3}+X_{\scriptscriptstyle 4})\varepsilon,
μ1=(2,2,,2)T/p1/2,(r1,r2)=(.5,.5),\displaystyle\qquad\qquad\quad\mu_{\scriptscriptstyle 1}=(2,2,\cdots,2)^{\mbox{\tiny{\sf T}}}/p^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle 1/2$}}},(r_{\scriptscriptstyle 1},r_{\scriptscriptstyle 2})=(.5,-.5),
Model 4:YBernoulli[{1+exp(2X1+2X2+4)}1],\displaystyle\mbox{Model 4:}\quad Y\sim\mbox{Bernoulli}[\{1+\exp(2X_{\scriptscriptstyle 1}+2X_{\scriptscriptstyle 2}+4)\}^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}],
X.25N(μ1,[.3]p)+.5N(0,Ip)+.25N(μ3,[.3]p),\displaystyle\qquad\qquad\quad X\sim.25N(\mu_{\scriptscriptstyle 1},[.3]_{\scriptscriptstyle p})+.5N(0,I_{\scriptscriptstyle p})+.25N(\mu_{\scriptscriptstyle 3},[-.3]_{\scriptscriptstyle p}),
μ1=(2,2,0,,0)T,μ3=(2,2,0,,0)T.\displaystyle\qquad\qquad\quad\mu_{\scriptscriptstyle 1}=(2,2,0,\cdots,0)^{\mbox{\tiny{\sf T}}},\mu_{\scriptscriptstyle 3}=(-2,2,0,\cdots,0)^{\mbox{\tiny{\sf T}}}.
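For reproducibility, the following R sketch generates one data set from Model 1 under the (n,p)=(500,10) setting used below; the helper ar_cov builds the covariance matrix [r]p, and the other models can be generated analogously.

library(MASS)

ar_cov <- function(r, p) r^abs(outer(1:p, 1:p, "-"))  # (j, k)th entry r^{|j - k|}
n <- 500; p <- 10
mu1 <- c(1.5, 1.5, 1.5, rep(0, p - 3))
w <- rbinom(n, 1, .5)                                 # balanced component labels
X <- matrix(0, n, p)
X[w == 1, ] <- mvrnorm(sum(w), mu1, ar_cov(.5, p))
X[w == 0, ] <- mvrnorm(n - sum(w), -mu1, ar_cov(.5, p))
Y <- sign(X[, 1] + 1) * log(abs(X[, 2] - 2 + rnorm(n)))  # Model 1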

In Models 131-3, YY is continuous and the central subspace is two-dimensional. The monotonicity or asymmetry between β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X and YY varies in these models: while both X1X_{\scriptscriptstyle 1} and X2X_{\scriptscriptstyle 2} have monotone effects on YY in Model 1, the monotonicity only applies to X1X_{\scriptscriptstyle 1} in Model 2, and neither effect is asymmetric in Model 3. Consequently, these models favor SIR from the most to the least, if we temporarily ignore the impact from the distribution of XX. By contrast, the asymmetry between β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X and YY becomes stronger when specified for the individual mixture components in all the three models. The deviation between μi\mu_{\scriptscriptstyle i}’s also varies in these models, representing different degrees of separation between the mixture components of XX. In Model 4, XX consists of three unbalanced mixture components, and YY is binary with a one-dimensional central subspace. The monotonicity between β0TX\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X and YY in Model 44 suggests that this model favors SIR.

For each model, we set (n,p) at (500,10) and generate 200 independent copies. To measure the accuracy of an SDR estimator ${\mathcal S}(\widetilde\beta)$ for each model, we use the sample mean and sample standard deviation of $\delta(\widetilde\beta,\beta_0)$ defined in (28). The performance of the aforementioned SDR methods under this measure is recorded in Table 1. For reference, if $\widetilde\beta$ is randomly generated from the uniform distribution on $\{\beta\in\mathbb{R}^{p\times d}:\beta^{\sf T}\beta=I_d\}$, then $E\{\delta(\widetilde\beta,\beta_0)\}$ is $\{2d(p-1)/p\}^{1/2}$, which is 1.34 for d=1 and 1.90 for d=2 given p=10.
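As a point of reference for this measure, the sketch below computes one common subspace distance in the SDR literature, the Frobenius norm between the two projection matrices; whether this coincides exactly with $\delta$ in (28) is an assumption here, so treat it as an illustrative stand-in:

# Illustrative subspace distance: Frobenius norm between projection
# matrices, a common choice in the SDR literature. Whether it matches
# delta in (28) of the main text is an assumption.
import numpy as np

def proj(beta):
    # projection matrix onto span(beta), for beta with full column rank
    return beta @ np.linalg.solve(beta.T @ beta, beta.T)

def delta(beta_hat, beta0):
    return np.linalg.norm(proj(beta_hat) - proj(beta0), "fro")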

From Table 1, SIR fails to capture the central subspace in Models 1-3. The MAVE-type methods have an uneven performance across the models. In particular, ADR fails to capture the local patterns of Model 3, where these patterns vary dramatically with the local neighborhoods of X; MAVE fails to capture the non-continuous effect of $X_1$ in Model 1 and the effect of X that falls outside the central mean subspace in Model 3; and eCVE shows an inconsistency in Model 1 similar to that of MAVE. These findings comply with the intrinsic limitations of the corresponding SDR methods. By contrast, both $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ are consistent in all the models, and they mostly outperform SIR and the MAVE-type methods even when the latter are also consistent. Compared with $\mathrm{SIR}_{\mathrm{M}}$, $\mathrm{SIR}_{\mathrm{RM}}$ generally gives a slight improvement.

Table 1: Performance of SDR methods for mixture normal X

Method                          Model 1      Model 2      Model 3      Model 4
SIR                             .908(.136)   .910(.128)   1.08(.179)   .634(.127)
ADR                             .611(.176)   .473(.138)   1.31(.113)   .645(.189)
MAVE                            .745(.229)   .223(.056)   .895(.278)   .326(.089)
eCVE                            1.42(.126)   .452(.128)   .609(.222)   1.29(.226)
$\mathrm{SIR}_{\mathrm{M}}$     .590(.151)   .436(.094)   .553(.111)   .374(.105)
$\mathrm{SIR}_{\mathrm{RM}}$    .555(.146)   .402(.085)   .502(.109)   .338(.098)

  • In each cell of Columns 2-5, a(b) is the sample mean (sample standard deviation) of $\delta(\widehat\beta,\beta_0)$ for the corresponding SDR method, based on 200 replications.

6.2 Robustness of adjusted SIR

We now evaluate the performance of $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ more carefully when the distribution of X deviates from the simple balanced mixture multivariate normal distribution. These deviations include the case of unbalanced mixture components, the case of skewed distributions for the individual mixture components, and the case of non-clustered X that violates the mixture model assumption itself.

To generate an unbalanced mixture model for X, we set $(\pi_1,\pi_2)$ in Models 1-3 in Subsection 6.1 above to each of (.1,.9) and (.3,.7), representing severe and mild unbalance between the mixture components, respectively. The performance of the SDR methods is summarized in Table 2, in the same format as Table 1. Because phenomena similar to those in Table 1 can be observed, Table 2 again suggests the consistency of $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ when X has a mixture multivariate normal distribution. However, when $\pi_1$ becomes smaller, the advantage of these methods over the existing SDR methods becomes less substantial, which can be anticipated in theory, as a smaller $\pi_1$ can be regarded as a weaker clustered pattern of X. Note that, for each model, the overall strength of the SDR pattern differs when we change the distribution of X. Thus, it is not meaningful to compare the performance of an individual SDR method in an individual model across different values of $(\pi_1,\pi_2)$, i.e. across Table 1 and Table 2.

Table 2: Performance of SDR methods for unbalanced X

$\pi_1$   Method                          Model 1      Model 2      Model 3
.1        SIR                             .729(.100)   .692(.140)   .937(.149)
          ADR                             .572(.152)   .485(.126)   1.15(.228)
          MAVE                            .395(.109)   .211(.054)   .868(.283)
          eCVE                            .503(.105)   .309(.055)   .702(.335)
          $\mathrm{SIR}_{\mathrm{M}}$     .461(.098)   .541(.131)   .458(.094)
          $\mathrm{SIR}_{\mathrm{RM}}$    .449(.106)   .531(.132)   .436(.100)
.3        SIR                             .713(.117)   .954(.255)   1.36(.127)
          ADR                             .586(.142)   .451(.109)   1.23(.214)
          MAVE                            .409(.102)   .218(.058)   .898(.263)
          eCVE                            .841(.117)   .436(.045)   .594(.210)
          $\mathrm{SIR}_{\mathrm{M}}$     .503(.099)   .528(.133)   .481(.101)
          $\mathrm{SIR}_{\mathrm{RM}}$    .490(.097)   .498(.126)   .481(.098)

  • The meanings of the numbers in each cell follow those in Table 1.

Next, we add skewness to the mixture components of X. As mentioned in Section 4, this case was studied in Guan, Xie and Zhu (2017), where StI was specifically proposed to recover ${\mathcal S}_{Y|X}$. Because both $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ are theoretically inconsistent in this case due to the violation of the mixture component-wise linearity condition (7), we evaluate their sample-level effectiveness using SIR as a reference and StI as the benchmark. For simplicity, we use three simulation models from Guan, Xie and Zhu (2017), i.e. their Models (4.1), (4.3) and (4.5):

\mbox{Model (4.1):}\quad Y=\beta_{1}^{\sf T}X+\varepsilon,
\mbox{Model (4.3):}\quad Y=\sin(\beta_{1}^{\sf T}X/3+\varepsilon),
\mbox{Model (4.5):}\quad Y=(\beta_{1}^{\sf T}X+.3\varepsilon)\,/\,(2+|\beta_{2}^{\sf T}X-4+\varepsilon|),

where $\beta_1=(1,1,1,1,0,\ldots,0)^{\sf T}$, $\beta_2=(1,1,0,0,1,1,0,\ldots,0)^{\sf T}$, and $\varepsilon$ is the same random error as in Subsection 6.1 above. For all three models, X is generated by $.5SN_p(\mu_1,I_p,C_1)+.3SN_p(\mu_2,I_p,C_2)+.2SN_p(\mu_3,I_p,C_3)$, with

\mu_{1}=(3,3,3,0,\ldots,0)^{\sf T},\quad \mu_{2}=(3,0,3,3,0,\ldots,0)^{\sf T},\quad \mu_{3}=(3,0,0,3,3,0,\ldots,0)^{\sf T},
C_{1}=\beta_{1},\quad C_{2}=3\beta_{1},\quad C_{3}=(1,1,0,0,0,0,1,1,0,0)^{\sf T}

in the first two models, and with

\mu_{1}=(3,3,3,3,0,\ldots,0)^{\sf T},\quad \mu_{2}=(3,0,0,3,3,0,\ldots,0)^{\sf T},\quad \mu_{3}=(3,3,0,0,3,3,0,\ldots,0)^{\sf T},
C_{1}=2\beta_{1},\quad C_{2}=3\beta_{1},\quad C_{3}=(-1,-1,0,0,0,0,-1,-1,0,0)^{\sf T}

in the third model. Here, $SN(\mu,\Sigma,C)$ denotes a multivariate skew normal distribution, with C being the skewness parameter whose magnitude indicates the severity of skewness; see Guan, Xie and Zhu (2017) for more details. Roughly, the skewness of the mixture components of X is more severe in Model (4.5) than in the other two models.

Following Guan, Xie and Zhu (2017), we set (n,p)=(400,10). To permit a direct transfer of their simulation results, we also use $\delta^2(\widehat\beta,\beta_0)$ instead of $\delta(\widehat\beta,\beta_0)$ to measure the estimation error of an SDR method, which is a linear transformation of their working measure. The performance of all the SDR methods under $\delta^2(\widehat\beta,\beta_0)$ is summarized in Table 3. From this table, the proposed $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ perform similarly to StI relative to SIR for the first two models, and they are sub-optimal to StI but still substantially outperform SIR for the third model, where again the skewness in the mixture components of X is more severe. Overall, these results suggest that both $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ are robust against mild skewness in the mixture components of X if X follows a mixture model, and $\mathrm{SIR}_{\mathrm{M}}$ is slightly better than $\mathrm{SIR}_{\mathrm{RM}}$ in this context.

Table 3: Performance of SDR methods for mixture skew normal X

Method                          Model (4.1)   Model (4.3)   Model (4.5)
SIR                             .156(.041)    .219(.112)    1.20(.072)
$\mathrm{SIR}_{\mathrm{M}}$     .026(.012)    .050(.046)    .816(.201)
$\mathrm{SIR}_{\mathrm{RM}}$    .062(.044)    .078(.050)    .826(.232)
StI                             .026(.010)    .038(.022)    .168(.084)

  • In each cell of Columns 2-4, a(b) is the sample mean (sample standard deviation) of $\delta^2(\widehat\beta,\beta_0)$ for the corresponding SDR method, based on 200 replications.

To evaluate the applicability of $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ for more complex data, we next generate X such that it conveys a unimodal, continuous, but curved pattern. Namely, we first generate $(X_1,X_3,\ldots,X_p)$ under $N(0,[.3]_{p-1})$, and, given $X_1$, we generate $X_2$ in each of the following ways:

X_{2}=|X_{1}|+\varepsilon_{1},\qquad X_{2}=\cos(2X_{1})+\varepsilon_{1},    (34)

where $\varepsilon_1$ is an independent error distributed as $N(0,.1^2)$. The sample support of $(X_1,X_2)$ conveys a V shape and a W shape, respectively. For each case, we generate Y from Model 1, Model 2, and Model 4 in Subsection 6.1 above. The corresponding performance of the SDR methods is summarized in Table 4, again measured by $\delta(\widetilde\beta,\beta_0)$ in (28). The results for Model 3 are omitted because $X_1+X_2$ is close to zero with a large probability under (34), which causes an intrinsic difficulty in recovering its effect for this model. The phenomena in Table 4, similar to those in Table 1, suggest the robustness of both $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ against the violation of the mixture model assumption on X, which means that these methods can have a wider application in practice.
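For concreteness, the two curved designs in (34) can be generated as in the following sketch (function names ours):

# A minimal sketch of the curved, non-clustered designs in (34):
# V-shaped: X2 = |X1| + eps1; W-shaped: X2 = cos(2 X1) + eps1.
import numpy as np
from scipy.linalg import toeplitz

def draw_curved_X(n, p, shape, rng):
    # (X1, X3, ..., Xp) ~ N(0, [.3]_{p-1}), with [r]_m as in Subsection 6.1
    cov = toeplitz(.3 ** np.arange(p - 1))
    Z = rng.multivariate_normal(np.zeros(p - 1), cov, size=n)
    x1 = Z[:, 0]
    eps1 = rng.normal(0.0, .1, size=n)
    x2 = np.abs(x1) + eps1 if shape == "V" else np.cos(2 * x1) + eps1
    return np.column_stack([x1, x2, Z[:, 1:]])   # (X1, X2, X3, ..., Xp)

X = draw_curved_X(500, 10, "V", np.random.default_rng(0))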

Table 4: Performance of SDR methods for V- and W-shaped X

Shape      Method                          Model 1      Model 2      Model 4
V-shaped   SIR                             .278(.059)   1.14(.220)   1.26(.099)
           ADR                             .654(.180)   .641(.231)   1.01(.296)
           MAVE                            .087(.019)   .064(.013)   1.08(.358)
           eCVE                            1.41(.007)   1.42(.010)   1.33(.114)
           $\mathrm{SIR}_{\mathrm{M}}$     .292(.093)   .132(.112)   .357(.163)
           $\mathrm{SIR}_{\mathrm{RM}}$    .166(.109)   .115(.130)   .328(.185)
W-shaped   SIR                             .317(.090)   .289(.067)   .196(.040)
           ADR                             .353(.096)   .310(.067)   .442(.110)
           MAVE                            .076(.021)   .061(.013)   .372(.092)
           eCVE                            1.36(.243)   .328(.085)   .648(.395)
           $\mathrm{SIR}_{\mathrm{M}}$     .242(.102)   .236(.154)   .502(.178)
           $\mathrm{SIR}_{\mathrm{RM}}$    .151(.091)   .161(.079)   .495(.180)

  • The meanings of the numbers in each cell follow those in Table 1.

6.3 Adjusted high-dimensional sparse SIR

We now compare the proposed $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$ and $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$ with the sparse SIR proposed in Tan et al. (2018) under the high-dimensional settings, where n is set at 200 and p is set at each of 200 and 300. The ratio p/n under the latter setting is the same as in the simulation studies of Tan et al. (2018). The MAVE-type methods are omitted under these settings due to their theoretical inconsistency caused by localization.

To generate a high-dimensional mixture multivariate normal X, we set q=2 with $\pi_1=\pi_2=.5$, and set both $\Sigma_1$ and $\Sigma_2$ to be $\Sigma=1.5[-.3]_p$. To set $\mu_1$ and $\mu_2$, we follow Cai, Ma and Zhang (2019) and introduce a discriminant vector $\gamma=(2u,\ldots,2u,0,\ldots,0)^{\sf T}$ with the first ten entries being nonzero, and let $\mu_1=(u,\ldots,u)^{\sf T}$ and $\mu_2=\mu_1-\Sigma\gamma$. The magnitude of u controls the degree of separation between the two mixture components; we set it to .5, .8, and 1 sequentially. Note that with u=.5, the mixture pattern is rather weak along any direction of X. Given X, we generate Y from each of

\mbox{Model A:}\quad Y=\{(X_{1}+X_{2}+X_{3})/\sqrt{3}+1.5\}^{2}+.5\varepsilon,
\mbox{Model B:}\quad Y=\exp\{(X_{1}+X_{2}+X_{3})/\sqrt{3}\}+\varepsilon,

where $\varepsilon$ is the same random error as in Subsection 6.1. The two models share the same one-dimensional ${\mathcal S}_{Y|X}$. In Model A, the effect of X on Y is only moderately asymmetric over all the observations, but it becomes more monotone when restricted to each mixture component of X. Both effects are monotone in Model B.
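For concreteness, the high-dimensional predictor design above, with $\Sigma=1.5[-.3]_p$, the discriminant vector $\gamma$, and $\mu_2=\mu_1-\Sigma\gamma$, can be sketched as follows (function names ours):

# A minimal sketch of the high-dimensional mixture design above:
# Sigma = 1.5 [-.3]_p, gamma has first ten entries 2u, mu_2 = mu_1 - Sigma gamma.
import numpy as np
from scipy.linalg import toeplitz

def hd_mixture_design(p, u):
    Sigma = 1.5 * toeplitz((-.3) ** np.arange(p))
    gamma = np.zeros(p)
    gamma[:10] = 2 * u                 # discriminant vector of Cai, Ma and Zhang (2019)
    mu1 = np.full(p, u)
    mu2 = mu1 - Sigma @ gamma
    return mu1, mu2, Sigma

def draw_hd_X(n, p, u, rng):
    mu1, mu2, Sigma = hd_mixture_design(p, u)
    z = rng.random(n) < .5             # pi_1 = pi_2 = .5
    X = np.where(z[:, None],
                 rng.multivariate_normal(mu1, Sigma, size=n),
                 rng.multivariate_normal(mu2, Sigma, size=n))
    return X

X = draw_hd_X(200, 200, .8, np.random.default_rng(0))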

Because Y is continuous in both Model A and Model B, we still set the number of slices H at five in implementing the sparse SIR and the proposed $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$ and $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$. Again, as mentioned in Section 4, q is assumed to be known when fitting the mixture multivariate normal distribution of X in implementing the proposed methods. The accuracy of these SDR methods in estimating ${\mathcal S}_{Y|X}$ is summarized in Table 5, and their variable selection consistency, as measured by the true positive rate and the false positive rate, is summarized in Table 6, both based on 200 independent runs.

Table 5: Performance of sparse SDR methods in estimating ${\mathcal S}_{Y|X}$ under HD settings

                                                      p=200, n=200                         p=300, n=200
Model   Method                                        u=.5         u=.8         u=1        u=.5         u=.8         u=1
A       SIR                                           .536(.169)   .641(.314)   1.16(.039) .712(.085)   .805(.069)   .992(.051)
        $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$      .342(.098)   .431(.087)   .487(.068) .411(.197)   .686(.115)   .709(.102)
        $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$     .236(.083)   .401(.095)   .420(.078) .407(.122)   .569(.091)   .592(.103)
B       SIR                                           .665(.142)   .795(.115)   .973(.427) .692(.086)   .846(.161)   .972(.061)
        $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$      .636(.357)   .570(.125)   .608(.220) .620(.233)   .818(.111)   .832(.139)
        $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$     .554(.407)   .569(.106)   .604(.185) .586(.138)   .713(.104)   .704(.124)

  • The meanings of the numbers in each cell follow those in Table 1.

Table 6: Performance of sparse SDR methods in variable selection under HD settings

                                                      p=200, n=200                         p=300, n=200
Model   Method                                        u=.5         u=.8         u=1        u=.5         u=.8         u=1
A       SIR                                           1.00(.004)   1.00(.017)   1.00(.050) 1.00(.032)   1.00(.023)   1.00(.024)
        $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$      1.00(.002)   1.00(.008)   1.00(.012) 1.00(.006)   1.00(.021)   1.00(.023)
        $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$     1.00(.001)   1.00(.008)   1.00(.008) 1.00(.004)   1.00(.011)   1.00(.013)
B       SIR                                           1.00(.021)   .976(.405)   .987(.031) 1.00(.018)   1.00(.039)   1.00(.043)
        $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$      .933(.006)   .956(.198)   1.00(.020) 1.00(.011)   1.00(.021)   1.00(.024)
        $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$     .940(.009)   .964(.186)   1.00(.021) 1.00(.007)   1.00(.021)   1.00(.021)

  • In each cell of Columns 3-8, a(b) is the true (false) positive rate of the corresponding sparse SDR method, based on 200 replications.

From these two tables, both $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$ and $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$ are clearly more effective than the sparse SIR, except for Model B with u=.5, where they are comparable. In both models, the sparse SIR quickly fails to recover ${\mathcal S}_{Y|X}$ as u increases (although it remains consistent in variable selection), whereas $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$ and $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$ are much more robust against this change. The reason for the slightly compromised performance of $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$ and $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$ as u increases is that a larger u makes the second mixture component of $\beta_0^{\sf T}X$ move towards the origin in Model A and leave the origin from the left in Model B, both leading to a vanishing signal of $Y|\beta_0^{\sf T}X$ in this mixture component. When u is .5, $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$ and $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$ still slightly outperform the sparse SIR, so they are useful even when the distribution of X is only weakly clustered.

We next change X to follow a V-shaped distribution, by first generating $(X_1,X_3,\ldots,X_p)$ from $N(0,[-.3]_{p-1})$ and then generating $X_2$ as in the first case of (34) given $X_1$. As above, we fix q=2 when approximating this distribution by a mixture multivariate normal distribution using Cai, Ma and Zhang's method; a more complex non-clustered distribution such as the W-shaped one cannot be well approximated in this way and thus is omitted. Based on 200 independent runs, the performance of the sparse SDR methods for Model A and Model B is recorded in Table 7 and Table 8, in the same format as Table 5 and Table 6, respectively. Compared with Table 5 and Table 6, a similar but less substantial phenomenon can be observed, suggesting the usefulness of both $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$ and $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$ when X has a curved and non-clustered distribution.

Table 7: Performance of sparse SDR methods in estimating ${\mathcal S}_{Y|X}$ for HD V-shaped X

Model     Method                                      p=200, n=200   p=300, n=200
Model A   SIR                                         .625(.079)     .672(.068)
          $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$    .463(.078)     .502(.068)
          $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$   .432(.089)     .499(.087)
Model B   SIR                                         .614(.110)     .694(.088)
          $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$    .503(.108)     .545(.086)
          $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$   .484(.106)     .510(.102)

  • The meanings of the numbers in each cell follow those in Table 1.

Table 8: Performance of sparse SDR methods in variable selection for HD V-shaped X

Model     Method                                      p=200, n=200   p=300, n=200
Model A   SIR                                         1.00(.001)     1.00(.001)
          $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$    1.00(.001)     1.00(.001)
          $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$   1.00(.001)     1.00(.001)
Model B   SIR                                         .987(.001)     1.00(.001)
          $\mathrm{SIR}_{\mathrm{M}}^{\mathrm{S}}$    1.00(.004)     1.00(.004)
          $\mathrm{SIR}_{\mathrm{RM}}^{\mathrm{S}}$   1.00(.004)     1.00(.004)

  • The meanings of the numbers in each cell follow those in Table 6.

6.4 Adjusted SAVE under various settings

To illustrate the performance of the proposed $\mathrm{SAVE}_{\mathrm{M}}$ and $\mathrm{SAVE}_{\mathrm{RM}}$ in comparison with SAVE and the MAVE-type methods, we generate Y|X by the following two models, where $\varepsilon$ is the same random error as in Subsection 6.1.

\mbox{Model I:}\quad Y=\sin(X_{1}+X_{2}+2)+.5\varepsilon,
\mbox{Model II:}\quad Y=\sin(X_{1})+\sin(X_{2})+.5\varepsilon.

Similar to the studies for the adjusted SIR, we first generate X under the mixture model $.5N(\mu_1,[.3]_p)+.5N(-\mu_1,[r_2]_p)$, where $\mu_1=(-1.5,-1.5,0,\ldots,0)^{\sf T}/p$, and we set $r_2=-.3$ for Model I and $r_2=.3$ for Model II; we then generate X under the V-shaped distribution and under the W-shaped distribution in the same way as in Subsection 6.2. Note that both $\mathrm{SAVE}_{\mathrm{M}}$ and $\mathrm{SAVE}_{\mathrm{RM}}$ are theoretically consistent only in the first case.

Let (n,p)=(500,10). For each of SAVE, $\mathrm{SAVE}_{\mathrm{M}}$, and $\mathrm{SAVE}_{\mathrm{RM}}$, we set H=5. The MAVE-type methods are implemented in the same way as in Subsection 6.1. The results are summarized in Table 9. Clearly, both $\mathrm{SAVE}_{\mathrm{M}}$ and $\mathrm{SAVE}_{\mathrm{RM}}$ are effective even when X does not convey a clustered pattern, and they outperform both SAVE and the MAVE-type methods in most cases.

Table 9: Performance of SDR methods in comparison with SAVE

                                  Model I                                  Model II
Method                            MMN          V-shaped     W-shaped       MMN          V-shaped     W-shaped
SAVE                              1.40(.015)   1.06(.342)   .267(.050)     .515(.114)   .249(.071)   .311(.092)
ADR                               .312(.071)   .398(.072)   .275(.079)     .620(.207)   .916(.314)   .403(.096)
MAVE                              .871(.625)   .129(.031)   .092(.028)     .371(.286)   .944(.344)   .408(.236)
eCVE                              .191(.208)   .353(.088)   .729(.425)     1.14(.414)   1.42(.009)   1.38(.072)
$\mathrm{SAVE}_{\mathrm{M}}$      .347(.099)   .158(.055)   .532(.187)     .581(.166)   .103(.029)   .241(.186)
$\mathrm{SAVE}_{\mathrm{RM}}$     .334(.094)   .143(.028)   .514(.171)     .575(.160)   .106(.032)   .268(.129)

  • "MMN" represents the case that X is generated under the mixture multivariate normal distribution. The meanings of the numbers in each cell follow those in Table 1.

6.5 A real data example

The MASSCOL2 data set, available at MINITAB (in catalogue STUDENT12), was collected in 56 four-year colleges of Massachusetts in 1995, with the research interest of assessing how the academic strength of the incoming students of each college and the features of the college itself jointly affect the percentage of the freshman class that graduate. After removing missing values, 44 observations are left in the data. Based on preliminary exploratory data analysis, we choose seven variables as the working predictor, including the percentage of the freshman class that were in the top 25% of their high school graduating class (denoted by Top 25), the median SAT Math score (MSAT), the median SAT Verbal score (VSAT), the percentage of applicants accepted by each college (Accept), the percentage of enrolled students among all who were accepted (Enroll), the student/faculty ratio (SFRatio), and the out-of-state tuition (Tuition). The same predictor was also selected in Li and Dong (2009), who applied SDR to an earlier version of the data set. The response variable, the percentage of the freshman class that graduate, is labeled WhoGrad in the data set. We next analyze this data set using the proposed $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$.

As suggested by BIC, the distribution of X in this data set is best approximated by a mixture multivariate normal distribution when the number of mixture components q is set to two or three. Here, we use three as a conservative choice. The dimension d of ${\mathcal S}_{Y|X}$ for this data set is determined to be two. To choose H for $\mathrm{SIR}_{\mathrm{M}}$, we measure the performance of $\mathrm{SIR}_{\mathrm{M}}$ by the predictability of the corresponding reduced predictor, for which we calculate the out-of-sample $R^2$ based on leave-one-out cross-validation. In each training set, the regression function is fitted by kernel mean estimation using the function npreg in the R package np, with the bandwidth selected automatically by npregbw. The resulting optimal choice of H is two. Likewise, the optimal choice of H for $\mathrm{SIR}_{\mathrm{RM}}$ is three. We then apply $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ accordingly.
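For readers who wish to reproduce this criterion, below is a minimal sketch of the leave-one-out out-of-sample $R^2$ (in Python, with statsmodels' kernel regression standing in for the R function npreg; all names are ours, and the loop is affordable for the 44 observations here):

# A minimal sketch of the leave-one-out out-of-sample R^2 used to choose H,
# with statsmodels' kernel regression standing in for R's npreg/npregbw.
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

def loo_r2(Z, Y):
    """Z: n x d reduced predictor from an SDR method; Y: n responses."""
    n, d = Z.shape
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        # cross-validated bandwidth, continuous regressors
        kr = KernelReg(Y[mask], Z[mask], var_type="c" * d, bw="cv_ls")
        mean, _ = kr.fit(Z[[i]])
        preds[i] = mean[0]
    return 1 - np.sum((Y - preds) ** 2) / np.sum((Y - Y.mean()) ** 2)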

To illustrate the goodness of the mixture model fit for X in this data set, we draw the scatter plots of the two-dimensional reduced predictors from $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ in the upper-left and upper-right panels of Figure 3, respectively. The point type for each observation x is determined by an independent multinomial random variable with probability $(\widehat\pi_1(x),\widehat\pi_2(x),\widehat\pi_3(x))$ derived from the mixture model fit. The clustered nature of these plots suggests the appropriateness of the mixture model and consequently the effectiveness of $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ for this data set. The latter is further supported by the aforementioned out-of-sample $R^2$ for the reduced predictor from these methods, which is .87 for $\mathrm{SIR}_{\mathrm{M}}$ and .84 for $\mathrm{SIR}_{\mathrm{RM}}$. For reference, we also apply SIR, ADR, and eCVE to the data; the out-of-sample $R^2$ for their reduced predictors is .79, .61, and .62, respectively. Roughly, this means that the variance of the error in the reduced data derived by $\mathrm{SIR}_{\mathrm{M}}$ is 38% less than that by SIR, 67% less than that by ADR, and 66% less than that by eCVE. The out-of-sample $R^2$ for MAVE is .86, so MAVE performs equally well for this data set as the proposed SDR methods.

Figure 3: In the upper panel are the scatter plots of the reduced predictors, the left for $\mathrm{SIR}_{\mathrm{M}}$ and the right for $\mathrm{SIR}_{\mathrm{RM}}$; in the lower panel are the 3D scatter plots of the reduced data with Y on the z-axis, as well as the surfaces fitted by kernel mean estimation, again the left for $\mathrm{SIR}_{\mathrm{M}}$ and the right for $\mathrm{SIR}_{\mathrm{RM}}$.

To visualize the effect of the reduced predictor from $\mathrm{SIR}_{\mathrm{M}}$ on the response, we draw the 3-dimensional scatter plot of the reduced data in the lower-left panel of Figure 3. Clearly, both components of the reduced predictor uniquely affect the response, and their effects interact with each other: the effect of the first component changes from moderately positive to dramatically positive as the second component grows, and the effect of the second component changes from monotone decreasing to monotone increasing as the first component grows. A similar pattern can be observed from the 3-dimensional scatter plot of the reduced data from $\mathrm{SIR}_{\mathrm{RM}}$, depicted in the lower-right panel of Figure 3. To better interpret these findings, we report the coefficients of the working reduced predictor for each SDR method in Table 10, with each variable in the original predictor standardized to have unit variance.

From Table 10, both $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$ suggest that the median SAT Verbal score, the percentage of applicants accepted by each college, and the out-of-state tuition are the dominating variables in predicting the percentage of the freshman class that graduate, among which the out-of-state tuition seems to consistently contribute a positive effect. These findings may help researchers comprehend the mechanism underlying the data set in future investigations.

Table 10: Coefficients of the reduced predictor from SDR for MASSCOL2 data set
X          $\mathrm{SIR}_{\mathrm{M}}$          $\mathrm{SIR}_{\mathrm{RM}}$
direction 1 direction 2 direction 1 direction 2
Top 25 .201 -.014 .193 -.186
MSAT -.197 -.041 -.076 .330
VSAT -.125 -.516 -.213 -.726
Accept -.193 -.846 -.189 -.431
Enroll .069 .016 .133 .299
SFRatio .223 -.035 .324 -.027
Tuition .901 -.118 .866 .226

7 Discussion

In this work, we conduct SDR under the mixture model of X, which relaxes the linearity and constant variance conditions commonly adopted in the inverse regression methods, and meanwhile provides a parametric analogue of the MAVE-type methods such as ADR. Consistent with the wide applicability of mixture models in approximating unknown distributions, our numerical studies suggest the effectiveness of the proposed SDR methods for generally distributed predictors, even when the clustered pattern is not obvious.

One issue that is underrepresented in this article is the difficulty of fitting the mixture model. This includes the possibly high computational cost of implementing the maximum likelihood estimator, which is sensitive to the initial value, as well as the potential failure to recover mixture components with small proportions when the underlying mixture model is severely unbalanced. As mentioned in Section 4, under the high-dimensional settings, only the mixture multivariate normal distribution has been studied in the existing literature, with the number of mixture components q assumed known, and with the homogeneity of the $\Sigma_i$'s and the sparsity of the $\Sigma_i^{-1}(\mu_i-\mu_j)$'s imposed for estimation consistency. These limitations may restrict the applicability of the proposed SDR methods, especially for large-dimensional data.

Finally, we speculate that the proposed strategy can be applied to adjust other SDR methods besides inverse regression. For example, we can assume a mixture model on X|Y instead of on X, which would adjust the SDR methods that assume X|Y to have an elliptically contoured distribution or to fall in the exponential family (Cook, 2007; Cook and Forzani, 2008). These extensions may further facilitate the applicability of SDR for data sets with large-dimensional and non-elliptically distributed predictors.


Appendix A Regularities and other adjusted SDR methods

In this Appendix, we include the adjustments of pHd and directional regression, as well as the regularity conditions (C1)-(C3) needed for Theorem 3 in the main text. The equations in this Appendix are labeled as (A.1), (A.2), etc.

A.1 Adjusted pHd and directional regression

The candidate matrix for pHd is $\Sigma_X^{-1}E[(X-\mu_X)^{\otimes 2}\{Y-E(Y)\}]\Sigma_X^{-1}$. Thus, the first adjustment of pHd is to recover the central mean subspace ${\mathcal S}_{E(Y|X)}$ by the column space of

\Omega_{\mathrm{pHd}}=\left[\Sigma_{i}^{-1}\left[E\{\pi_{i}(X)Y\}\Sigma_{i}-E\{\pi_{i}(X)(X-\mu_{i})^{\otimes 2}Y\}\right]\Sigma_{i}^{-1}\right]_{i\in\{1,\ldots,q\}}.    (A.1)

An $n^{1/2}$-consistent estimator $\widehat\Omega_{\mathrm{pHd}}$ can be constructed by plugging in the mixture model fit and using the sample moments; its leading left singular vectors span an $n^{1/2}$-consistent estimator of ${\mathcal S}_{E(Y|X)}$, denoted by $\mathrm{pHd}_{\mathrm{M}}$. In addition, define

G_{\mathrm{pHd}}({\mathcal S}(\beta))=\sum_{j=1}^{q}\left\|E\left[\left\{\textstyle\sum_{i=1}^{q}\pi_{i}(\beta^{\sf T}X)\Sigma_{i}Q(\Sigma_{i},\beta)-X^{\otimes 2}+\mu_{2}(\beta^{\sf T}X)\right\}Y\pi_{j}(\beta^{\sf T}X)\right]\right\|_{F}^{2},

where $\mu_2(\beta^{\sf T}X)=\sum_{i=1}^{q}\pi_i(\beta^{\sf T}X)\{P^{\sf T}(\Sigma_i,\beta)(X-\mu_i)+\mu_i\}^{\otimes 2}$. The second adjustment of pHd is to minimize a sample-level $\widehat G_{\mathrm{pHd}}(\cdot)$; the resulting minimizer spans an $n^{1/2}$-consistent estimator of ${\mathcal S}_{E(Y|X)}$, denoted by $\mathrm{pHd}_{\mathrm{RM}}$.

Let $(\widetilde X,\widetilde Y)$ be an independent copy of (X,Y). The candidate matrix for directional regression is $\Sigma_X^{-1}E\{2\Sigma_X-\mathrm{var}(X-\widetilde X|Y,\widetilde Y)\}^{2}\Sigma_X^{-1}$. The first adjustment of directional regression is to recover ${\mathcal S}_{Y|X}$ by the column space of

\Omega_{\mathrm{DR}}=E\left[\Sigma_{i}^{-1}\left[2E\{\pi_{i}(X)|Y_{S}=s\}E\{\pi_{i}(\widetilde{X})|\widetilde{Y}_{S}=t\}\Sigma_{i}-E\{\pi_{i}(X)\pi_{i}(\widetilde{X})(X-\widetilde{X})^{\otimes 2}|Y_{S}=s,\widetilde{Y}_{S}=t\}\right]\Sigma_{i}^{-1}\right]_{s,t\in\{1,\ldots,H\},\,i\in\{1,\ldots,q\}}.

An $n^{1/2}$-consistent estimator of $\Omega_{\mathrm{DR}}$ can be constructed by plugging in the mixture model fit and using the sample moments; its leading left singular vectors span an $n^{1/2}$-consistent estimator of ${\mathcal S}_{Y|X}$, denoted by $\mathrm{DR}_{\mathrm{M}}$. In addition, we propose the following objective function

G_{\mathrm{DR}}({\mathcal S}(\beta))=
  \Big\| E\Big[(\pi_{i}(\beta^{\sf T}X)\pi_{j}(\beta^{\sf T}\widetilde{X}))_{i,j\in\{1,\ldots,q\}}
  \Big[\textstyle\sum_{i=1}^{q}E\{\pi_{i}(\beta^{\sf T}X)|Y_{S}=s\}\Sigma_{i}Q(\Sigma_{i},\beta)
  +\textstyle\sum_{j=1}^{q}E\{\pi_{j}(\beta^{\sf T}\widetilde{X})|\widetilde{Y}_{S}=t\}\Sigma_{j}Q(\Sigma_{j},\beta)
  -E\{(X-\widetilde{X})^{\otimes 2}|Y_{S}=s,\widetilde{Y}_{S}=t\}
  +\textstyle\sum_{i=1}^{q}E[\pi_{i}(\beta^{\sf T}X)\{P^{\sf T}(\Sigma_{i},\beta)(X-\mu_{i})+\mu_{i}\}^{\otimes 2}|Y_{S}=s]
  +\textstyle\sum_{j=1}^{q}E[\pi_{j}(\beta^{\sf T}\widetilde{X})\{P^{\sf T}(\Sigma_{j},\beta)(\widetilde{X}-\mu_{j})+\mu_{j}\}^{\otimes 2}|\widetilde{Y}_{S}=t]
  -\big[\textstyle\sum_{i=1}^{q}E[\pi_{i}(\beta^{\sf T}X)\{P^{\sf T}(\Sigma_{i},\beta)(X-\mu_{i})+\mu_{i}\}|Y_{S}=s]\big]
   \big[\textstyle\sum_{j=1}^{q}E[\pi_{j}(\beta^{\sf T}\widetilde{X})\{P^{\sf T}(\Sigma_{j},\beta)(\widetilde{X}-\mu_{j})+\mu_{j}\}|\widetilde{Y}_{S}=t]\big]^{\sf T}
  -\big[\textstyle\sum_{j=1}^{q}E[\pi_{j}(\beta^{\sf T}\widetilde{X})\{P^{\sf T}(\Sigma_{j},\beta)(\widetilde{X}-\mu_{j})+\mu_{j}\}|\widetilde{Y}_{S}=t]\big]
   \big[\textstyle\sum_{i=1}^{q}E[\pi_{i}(\beta^{\sf T}X)\{P^{\sf T}(\Sigma_{i},\beta)(X-\mu_{i})+\mu_{i}\}|Y_{S}=s]\big]^{\sf T}\Big]^{\otimes 2}\Big]\Big\|_{F}^{2}.

The second adjustment of directional regression is to minimize a sample-level $\widehat G_{\mathrm{DR}}(\cdot)$; the resulting minimizer spans an $n^{1/2}$-consistent estimator of ${\mathcal S}_{Y|X}$, denoted by $\mathrm{DR}_{\mathrm{RM}}$. The minimization of both $\widehat G_{\mathrm{pHd}}(\cdot)$ and $\widehat G_{\mathrm{DR}}(\cdot)$ requires iterative algorithms, which resemble those for implementing $\mathrm{SIR}_{\mathrm{RM}}$ and $\mathrm{SAVE}_{\mathrm{RM}}$ discussed in Appendix B and thus are omitted.

A.2 Regularity conditions for Theorem 3

(C1). The signal-to-noise ratio $\Delta$ of the mixture multivariate normal distribution, defined as $\{(\mu_1-\mu_2)^{\sf T}\Sigma^{-1}(\mu_1-\mu_2)\}^{1/2}$, satisfies $\Delta>C(c_0,c_1,M,C_b)$, where $C(c_0,c_1,M,C_b)$ is given in (C.24) of the supplementary material of Cai, Ma and Zhang (2019).

(C2). Y has a bounded support, and E(X|Y) has a total variation of order 0.25, i.e.

\sup_{B>0}\lim_{n\rightarrow\infty}n^{-1/4}\sup_{\Xi(B)}\textstyle\sum_{i=2}^{n}\|E(X|Y=a_{i})-E(X|Y=a_{i-1})\|_{\max}=0,

where $\|v\|_{\max}=\max\{|v_i|:i\in\{1,\ldots,p\}\}$ for any $v=(v_1,\ldots,v_p)^{\sf T}\in\mathbb{R}^p$, and, for any B>0, $\Xi(B)$ is the collection of all the n-point partitions $-B\leq a_1\leq\ldots\leq a_n\leq B$.

(C3). Define the $s_2$-sparse minimal and maximal eigenvalues of $\Sigma$ as

\lambda_{\min}(\Sigma,s_{2})=\min_{\|v\|_{0}\leq s_{2}}(v^{\sf T}\Sigma v/v^{\sf T}v),\qquad \lambda_{\max}(\Sigma,s_{2})=\max_{\|v\|_{0}\leq s_{2}}(v^{\sf T}\Sigma v/v^{\sf T}v),

where $\|v\|_0$ is the number of nonzero elements of $v$ for any $v\in\mathbb{R}^p$. Assume that there exist constants $c_1,c_2>0$ such that $c_1\leq\lambda_{\min}(\Sigma,s_2)\leq\lambda_{\max}(\Sigma,s_2)\leq c_2$.

Appendix B Algorithms for the proposed methods

In this Appendix, we elaborate on the details of implementing $\mathrm{SIR}_{\mathrm{M}}$, $\mathrm{SIR}_{\mathrm{RM}}$, $\mathrm{SAVE}_{\mathrm{M}}$, and $\mathrm{SAVE}_{\mathrm{RM}}$.

B.1 Algorithms for $\mathrm{SIR}_{\mathrm{M}}$ and $\mathrm{SIR}_{\mathrm{RM}}$

Given $\widehat\Omega_1$, $\mathrm{SIR}_{\mathrm{M}}$ is readily derived by singular value decomposition. We list the detailed steps in Algorithm 1.

Algorithm 1 $\mathrm{SIR}_{\mathrm{M}}$
1: Use the EM algorithm to calculate the maximum likelihood estimators $\widehat\mu_i$, $\widehat\Sigma_i$, and $\widehat\pi_i$ for $i\in\{1,\ldots,q\}$, and derive the posterior probability $\widehat\pi_i(X)$ accordingly;
2: Plug the estimators above into $\Omega_1$ to derive
\widehat{\Omega}_{1}=E_{n}[\{\widehat{\Sigma}_{i}^{-1}(X-\widehat{\mu}_{i})\widehat{\pi}_{i}(X)\vec{Y}_{S}\}_{i\in\{1,\ldots,q\}}]
and calculate the linear span of the leading d left singular vectors of $\widehat\Omega_1$ to estimate ${\mathcal S}_{Y|X}$.
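To make the plug-in step concrete, the following is a minimal Python sketch of Algorithm 1, under the illustrative assumptions that $\vec Y_S$ is the one-hot vector of slice memberships and that the EM fit is delegated to scikit-learn's GaussianMixture; the function name sir_m and the equal-size slicing rule are ours:

# A minimal sketch of Algorithm 1 (SIR_M) under the stated assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def sir_m(X, Y, q, d, H=5):
    n, p = X.shape
    # Step 1: EM fit of the q-component mixture normal; posteriors pi_hat_i(X).
    gm = GaussianMixture(n_components=q, covariance_type="full").fit(X)
    post = gm.predict_proba(X)                      # n x q
    mus, Sigmas = gm.means_, gm.covariances_        # q x p, q x p x p
    # Slice Y into H slices of roughly equal size; one-hot encode (our Y_S).
    ranks = np.argsort(np.argsort(Y))
    slices = np.minimum(ranks * H // n, H - 1)
    Ys = np.eye(H)[slices]                          # n x H
    # Step 2: plug the fit into Omega_1 and stack the q blocks column-wise.
    blocks = []
    for i in range(q):
        Xc = np.linalg.solve(Sigmas[i], (X - mus[i]).T).T   # Sigma_i^{-1}(X - mu_i)
        blocks.append((Xc * post[:, [i]]).T @ Ys / n)       # p x H sample mean
    Omega1 = np.hstack(blocks)                      # p x (qH)
    # Leading d left singular vectors span the estimate of S_{Y|X}.
    U, _, _ = np.linalg.svd(Omega1)
    return U[:, :d]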

To implement $\mathrm{SIR}_{\mathrm{RM}}$, we fix $\beta_0$ at $\widehat\beta_1$ in $D(\beta_0^{\sf T}X)$ and in each $\pi_i(\beta_0^{\sf T}X)$ involved in $G_1(\cdot)$, where $\widehat\beta_1$ is an arbitrary orthonormal basis matrix of ${\mathcal S}(\widehat\Omega_1)$ derived from $\mathrm{SIR}_{\mathrm{M}}$. The estimators $\widehat\pi_i(\widehat\beta_1^{\sf T}X)$ are then derived from the mixture model fit mentioned in Section 2, which delivers

\widehat{D}(\widehat{\beta}_{1}^{\sf T}X)=\textstyle\sum_{i=1}^{q}\sum_{j=1}^{q}(\widehat{\beta}_{1}^{\sf T}\widehat{\Sigma}_{j}\widehat{\beta}_{1})^{-1}E_{n}\{\widehat{\beta}_{1}^{\sf T}(X-\widehat{\mu}_{j})\vec{Y}_{S}\,\widehat{\pi}_{i}(\widehat{\beta}_{1}^{\sf T}X)\widehat{\pi}_{j}(\widehat{\beta}_{1}^{\sf T}X)\}\,\widehat{\pi}_{i}(\widehat{\beta}_{1}^{\sf T}X).

Together with $\widehat R_X({\mathcal S}(\beta))=\sum_{i=1}^{q}\widehat\pi_i(\widehat\beta_1^{\sf T}X)\{P^{\sf T}(\widehat\Sigma_i,\beta)(X-\widehat\mu_i)+\widehat\mu_i\}$, we have

\widehat{\mathrm{G}}_{1}({\mathcal S}(\beta))\equiv\left\|E_{n}[\{X-\widehat{R}_{X}({\mathcal S}(\beta))\}D_{s}]\right\|_{F}^{2},    (B.1)

where $D_s$ denotes $\mathrm{vec}^{\sf T}\{\widehat D(\widehat\beta_1^{\sf T}X)\,\mathrm{diag}(\vec Y_S)\}$. To minimize $\widehat{\mathrm{G}}_1(\cdot)$, we propose an iterative algorithm as follows. Start with $\mathrm{SIR}_{\mathrm{M}}$ as the initial estimator of ${\mathcal S}_{Y|X}$. In the kth iteration, let $\widehat\beta_k$ be an orthonormal basis matrix of the most updated estimate of ${\mathcal S}_{Y|X}$. We replace each $P^{\sf T}(\widehat\Sigma_i,\beta)$ in $\widehat{\mathrm{G}}_1({\mathcal S}(\beta))$ with $\widehat\Sigma_i\beta(\widehat\beta_k^{\sf T}\widehat\Sigma_i\widehat\beta_k)^{-1}\widehat\beta_k^{\sf T}$, which turns $\widehat{\mathrm{G}}_1({\mathcal S}(\beta))$ into a quadratic function of $\beta$ that can be readily minimized, with the minimizer giving the update ${\mathcal S}(\widehat\beta_{k+1})$. We repeat the iterations until a prefixed convergence threshold for $\delta(\widehat\beta_k,\widehat\beta_{k+1})$, defined in (28) of the main text, is met. These steps are summarized as Algorithm 2.

Algorithm 2 $\mathrm{SIR}_{\mathrm{RM}}$
1: Use $\mathrm{SIR}_{\mathrm{M}}$ to estimate ${\mathcal S}_{Y|X}$ in each $\pi_i(\beta_0^{\sf T}X)$, and use the mixture model fit to construct $\widehat{\mathrm{G}}_1({\mathcal S}(\beta))$ as in (B.1); let ${\mathcal S}(\widehat\beta_1)$ derived from $\mathrm{SIR}_{\mathrm{M}}$ be the initial value of ${\mathcal S}(\beta)$.
2: In the kth iteration, replace each $P^{\sf T}(\widehat\Sigma_i,\beta)$ in $\widehat{\mathrm{G}}_1({\mathcal S}(\beta))$ with $\widehat\Sigma_i\beta(\widehat\beta_k^{\sf T}\widehat\Sigma_i\widehat\beta_k)^{-1}\widehat\beta_k^{\sf T}$, and solve the corresponding quadratic minimization problem to derive ${\mathcal S}(\widehat\beta_{k+1})$.
3: Repeat the iterations until $\delta(\widehat\beta_k,\widehat\beta_{k+1})<c_0$ for some prefixed small scalar $c_0$, e.g. $n^{-1}$. The final estimator of ${\mathcal S}_{Y|X}$ is the most updated ${\mathcal S}(\widehat\beta_{k+1})$.
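To illustrate Step 2, the sketch below carries out one quadratic update: after the substitution of $P^{\sf T}(\widehat\Sigma_i,\beta)$, the matrix $E_n[\{X-\widehat R_X({\mathcal S}(\beta))\}D_s]$ is affine in $\beta$, so $\mathrm{vec}(\widehat\beta_{k+1})$ solves a linear least-squares problem. The helper builds the linear map numerically from basis matrices; all names are ours, and the inputs are the plug-in quantities fixed at $\widehat\beta_1$:

# A schematic sketch of one Step-2 update of Algorithm 2 (names ours).
import numpy as np

def quadratic_update(X, Ds, post, mus, Sigmas, beta_k):
    """X: n x p data; Ds: n x m rows of vec^T{D_hat diag(Y_S)}; post: n x q
    posteriors pi_hat_i at beta_hat_1; mus: q x p; Sigmas: q x p x p;
    beta_k: p x d current orthonormal basis."""
    n, p = X.shape
    d = beta_k.shape[1]
    q = post.shape[1]
    m = Ds.shape[1]

    def residual(beta):
        # E_n[{X - R_hat_X(beta)} D_s], with P^T(Sigma_i, beta) linearized at beta_k
        R = np.zeros_like(X)
        for i in range(q):
            A = Sigmas[i] @ beta @ np.linalg.inv(beta_k.T @ Sigmas[i] @ beta_k) @ beta_k.T
            R += post[:, [i]] * ((X - mus[i]) @ A.T + mus[i])
        return (X - R).T @ Ds / n            # p x m, affine in beta

    # Affine map: residual(beta) = c0 - L vec(beta); build L column by column.
    c0 = residual(np.zeros((p, d)))
    L = np.empty((p * m, p * d))
    for a in range(p):
        for b in range(d):
            E = np.zeros((p, d)); E[a, b] = 1.0
            L[:, a * d + b] = (c0 - residual(E)).ravel()
    vec_beta, *_ = np.linalg.lstsq(L, c0.ravel(), rcond=None)
    Qmat, _ = np.linalg.qr(vec_beta.reshape(p, d))   # orthonormalize the update
    return Qmat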

B.2 Algorithms for $\mathrm{SAVE}_{\mathrm{M}}$ and $\mathrm{SAVE}_{\mathrm{RM}}$

Given $\widehat\Omega_2$, $\mathrm{SAVE}_{\mathrm{M}}$ is readily derived by singular value decomposition. We list the detailed steps in Algorithm 3.

Algorithm 3 $\mathrm{SAVE}_{\mathrm{M}}$
1: Apply the EM algorithm to estimate $\widehat\mu_i$, $\widehat\Sigma_i$, and $\widehat\pi_i(X)$, $i\in\{1,\ldots,q\}$;
2: Calculate the span of the leading d left singular vectors of $\widehat\Omega_2$ as $\mathrm{SAVE}_{\mathrm{M}}$, denoted by ${\mathcal S}(\widehat\beta_{\mathrm{mSAVE}})$.

To implement $\mathrm{SAVE}_{\mathrm{RM}}$, we fix $\beta_0$ at $\widehat\beta_1$ in each $\pi_i(\beta_0^{\sf T}X)$ involved in $G_2(\cdot)$, where $\widehat\beta_1$ is an arbitrary orthonormal basis matrix of ${\mathcal S}(\widehat\Omega_2)$ derived from $\mathrm{SAVE}_{\mathrm{M}}$. The estimators $\widehat\pi_i(\widehat\beta_1^{\sf T}X)$ are then derived from the mixture model fit mentioned in Section 2. We have

\widehat{\mathrm{G}}_{2}({\mathcal S}(\beta))\equiv\left\|\textstyle\sum_{i=1}^{q}\sum_{h=1}^{H}(A_{i,h}+B_{i,h})^{\otimes 2}\right\|_{F}^{2},    (B.2)

where

A_{i,h} = \mathrm{E}_{n}[X^{\otimes 2}\,\widehat{\pi}_{i}^{2}(\widehat{\beta}_{1}^{\sf T}X)\,Y_{S,h}] - \sum_{j=1}^{q}\mathrm{E}_{n}[\widehat{\pi}_{i}^{2}(\widehat{\beta}_{1}^{\sf T}X)\,\widehat{\pi}_{j}(\widehat{\beta}_{1}^{\sf T}X)\,Y_{S,h}]\,\widehat{\Sigma}_{j} - \mathrm{E}_{n}[\widehat{\pi}_{i}^{2}(\widehat{\beta}_{1}^{\sf T}X)\,\mu_{2}(\beta^{\sf T}X)\,Y_{S,h}],
B_{i,h} = \sum_{j=1}^{q}\mathrm{E}_{n}[\widehat{\pi}_{i}^{2}(\widehat{\beta}_{1}^{\sf T}X)\,\widehat{\pi}_{j}(\widehat{\beta}_{1}^{\sf T}X)\,Y_{S,h}]\,\widehat{\Sigma}_{j}\,P(\widehat{\Sigma}_{j},\beta).
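In sample form, each term above is an average over the observations. The following Python sketch assembles A_{i,h}; it assumes that the squared weights pi_i^2 evaluated at the sample points, the slice variable Y_{S,h}, and the values of mu_2 (whose expression under the mixture model is given in the main text) are all supplied as arrays, so the function name and argument layout are ours.

```python
import numpy as np

def A_ih(X, w_i2, Y_h, pi_hat, Sigma_hat, mu2_vals):
    """Sample version of A_{i,h} in (B.2).
    X        : (n, p)    data matrix
    w_i2     : (n,)      values of pi_i_hat(beta1_hat^T x)^2
    Y_h      : (n,)      values of the slice variable Y_{S,h}
    pi_hat   : (n, q)    posterior weights pi_j_hat(beta1_hat^T x)
    Sigma_hat: (q, p, p) fitted component covariances
    mu2_vals : (n, p, p) values of mu_2(beta^T x) at the sample points
    """
    n = X.shape[0]
    c = w_i2 * Y_h                              # combined scalar weights
    term1 = (X * c[:, None]).T @ X / n          # E_n[X^{otimes 2} w Y]
    term2 = np.einsum("n,nj,jkl->kl", c, pi_hat, Sigma_hat) / n
    term3 = (c[:, None, None] * mu2_vals).mean(axis=0)
    return term1 - term2 - term3
```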

To minimize G^2()\widehat{\mathrm{G}}_{\scriptscriptstyle 2}(\cdot), we propose the following iterative algorithm. Start with SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}} as the initial estimator of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. In the kkth iteration, let β^k\widehat{\beta}_{\scriptscriptstyle k} be an orthonormal basis matrix of the most recent estimate of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X}. We replace each μ2(βTX)\mu_{\scriptscriptstyle 2}(\beta^{\mbox{\tiny{\sf T}}}X) in Ai,hA_{\scriptscriptstyle i,h} with μ2(β^kTX)\mu_{\scriptscriptstyle 2}(\widehat{\beta}_{\scriptscriptstyle k}^{\mbox{\tiny{\sf T}}}X), and replace each P(Σ^i,β)P(\widehat{\Sigma}_{\scriptscriptstyle i},\beta) in Bi,hB_{\scriptscriptstyle i,h} with β(β^kTΣ^iβ^k)1β^kTΣ^i\beta(\widehat{\beta}_{\scriptscriptstyle k}^{\mbox{\tiny{\sf T}}}\widehat{\Sigma}_{\scriptscriptstyle i}\widehat{\beta}_{\scriptscriptstyle k})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\widehat{\beta}_{\scriptscriptstyle k}^{\mbox{\tiny{\sf T}}}\widehat{\Sigma}_{\scriptscriptstyle i}, which turns G^2(𝒮(β))\widehat{\mathrm{G}}_{\scriptscriptstyle 2}({\mathcal{S}}({\beta})) into a quadratic function of β\beta that can be readily minimized, the minimizer giving the update 𝒮(β^k+1){\mathcal{S}}({\widehat{\beta}_{\scriptscriptstyle k+1}}). We repeat the iterations until δ(β^k,β^k+1)\delta(\widehat{\beta}_{\scriptscriptstyle k},\widehat{\beta}_{\scriptscriptstyle k+1}), defined in (28) of the main text, falls below a prefixed convergence threshold. These steps are summarized in Algorithm 4.

Algorithm 4 SAVERM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{RM}}
1:  Fix β0\beta_{\scriptscriptstyle 0} at the SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}} estimate of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} in each πi(β0TX)\pi_{\scriptscriptstyle i}(\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X), and use the mixture model fit to construct G^2(𝒮(β))\widehat{\mathrm{G}}_{\scriptscriptstyle 2}({\mathcal{S}}({\beta})) as in (B.2); let 𝒮(β^1){\mathcal{S}}({\widehat{\beta}_{\scriptscriptstyle 1}}) derived from SAVEM\mathrm{SAVE}_{\scriptscriptstyle\mathrm{M}} be the initial value of 𝒮(β){\mathcal{S}}({\beta}).
2:  In the kkth iteration, replace each μ2(βTX)\mu_{\scriptscriptstyle 2}(\beta^{\mbox{\tiny{\sf T}}}X) in Ai,hA_{\scriptscriptstyle i,h} with μ2(β^kTX)\mu_{\scriptscriptstyle 2}(\widehat{\beta}_{\scriptscriptstyle k}^{\mbox{\tiny{\sf T}}}X), and replace each P(Σ^i,β)P(\widehat{\Sigma}_{\scriptscriptstyle i},\beta) in Bi,hB_{\scriptscriptstyle i,h} with β(β^kTΣ^iβ^k)1β^kTΣ^i\beta(\widehat{\beta}_{\scriptscriptstyle k}^{\mbox{\tiny{\sf T}}}\widehat{\Sigma}_{\scriptscriptstyle i}\widehat{\beta}_{\scriptscriptstyle k})^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}\widehat{\beta}_{\scriptscriptstyle k}^{\mbox{\tiny{\sf T}}}\widehat{\Sigma}_{\scriptscriptstyle i}, and solve the corresponding quadratic minimization problem to derive 𝒮(β^k+1){\mathcal{S}}({\widehat{\beta}_{\scriptscriptstyle k+1}}).
3:  Repeat the iterations until δ(β^k,β^k+1)<c0\delta(\widehat{\beta}_{\scriptscriptstyle k},\widehat{\beta}_{\scriptscriptstyle k+1})<c_{\scriptscriptstyle 0} for some prefixed small scalar c0c_{\scriptscriptstyle 0}, e.g. n1n^{\raisebox{1.2pt}{\mbox{$\scriptscriptstyle-1$}}}. The final estimator of 𝒮Y|X{\mathcal{S}}_{\scriptscriptstyle Y|X} is the most recently updated 𝒮(β^k+1){\mathcal{S}}({\widehat{\beta}_{\scriptscriptstyle k+1}}).
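The iteration itself is a simple fixed-point loop, shared by Algorithm 4 and the analogous SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} iteration earlier in this appendix. A generic Python skeleton is sketched below; `quadratic_update` abstracts step 2 (plugging the current iterate into the μ2\mu_{\scriptscriptstyle 2} and PP terms and minimizing the resulting quadratic function of β\beta), and the projection-based subspace distance is only a stand-in for δ\delta in (28) of the main text, whose exact form we do not reproduce here.

```python
import numpy as np

def subspace_delta(B1, B2):
    """Distance between span(B1) and span(B2) via projection matrices;
    a stand-in for the criterion delta in (28) of the main text."""
    P1 = B1 @ np.linalg.inv(B1.T @ B1) @ B1.T
    P2 = B2 @ np.linalg.inv(B2.T @ B2) @ B2.T
    return np.linalg.norm(P1 - P2, "fro")

def iterate(beta_init, quadratic_update, c0=1e-3):
    """Fixed-point loop: quadratic_update(beta_k) minimizes the quadratic
    surrogate of the objective at the current iterate."""
    beta_k = beta_init
    while True:
        beta_next, _ = np.linalg.qr(quadratic_update(beta_k))  # re-orthonormalize
        if subspace_delta(beta_k, beta_next) < c0:
            return beta_next  # basis of the final estimate of S_{Y|X}
        beta_k = beta_next
```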

Appendix C Complementary simulation studies

In this Appendix, we present complementary simulation studies to evaluate how the proposed SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} are impacted by the estimation error in fitting the mixture model of XX, as well as by a hypothetical misspecification of the number of mixture components qq, under the conventional low-dimensional settings and assuming that the mixture model of XX holds. The misspecification of qq does not occur in the simulation studies of the main text due to the effectiveness of BIC, but it is possible in real data analyses and thus worth investigating.

Generally, for an arbitrarily fixed working number of mixture components kk, the corresponding mixture model misspecifies the distribution of XX if k<qk<q, and it still correctly specifies the distribution if k>qk>q. Thus, the proposed SDR methods theoretically lose consistency if k<qk<q and remain consistent if k>qk>q. As an illustration, we set kk to each of 11, 22, 33, and 44 in turn for all the models in Subsection 6.1 of the main text, and apply SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} in each case. Note that this procedure differs from the implementation of SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} in Subsection 6.1, as qq is forced to be a fixed value rather than estimated by BIC. After repeating the procedure 200200 times, the results are summarized in Table 11. When k=1k=1, both SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} reduce to SIR, so the result of SIR in Table 1 is used for this case.
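For reference, a minimal Python sketch of this forced-kk fitting scheme is given below; `gm.bic(X)` is the quantity that a BIC-based choice of qq would compare across kk, and the seeding is illustrative only.

```python
from sklearn.mixture import GaussianMixture

def fit_forced_k(X, k_values=(1, 2, 3, 4)):
    """Fit the working normal mixture with the number of components
    forced to each k in turn, recording the BIC of every fit."""
    fits = {}
    for k in k_values:
        gm = GaussianMixture(n_components=k, covariance_type="full",
                             n_init=5, random_state=0).fit(X)
        fits[k] = (gm, gm.bic(X))
    return fits

# A BIC-based estimate of q would pick the k with the smallest BIC:
# q_hat = min(fits, key=lambda k: fits[k][1])
```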

Table 11: Performance of SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} under misspecification of qq
q^\widehat{q} Methods Model 1 Model 2 Model 3 Model 4
11 SIR .908(.136) .910(.128) 1.08(.179) .634(.127)
22 SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} .590(.151) .436(.094) .553(.111) .375(.058)
SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} .555(.146) .402(.088) .502(.109) .320(.076)
33 SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} .698(.191) .506(.154) .608(.192) .374(.105)
SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} .642(.156) .416(.128) .545(.174) .338(.098)
44 SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} .801(.233) .549(.188) .636(.217) .388(.104)
SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} .689(.194) .435(.163) .590(.189) .350(.095)
ADR .611(.176) .473(.138) 1.31(.113) .645(.189)
MAVE .745(.229) .223(.056) .895(.278) .326(.089)
eCVE 1.42(.126) .452(.128) .609(.222) 1.29(.226)
  • The meanings of numbers in each cell follow those in Table 1.

Recall that qq is 22 for Models 1–3 and 33 for Model 4. Thus, Table 11 illustrates the impact of both overestimation and underestimation of qq on the proposed methods for each model. Compared with the benchmark and the worst-case scenario, i.e. the proposed methods with correctly specified qq and SIR, respectively, an overestimation of qq brings negligible additional bias to the proposed methods and leaves them comparable to, and sometimes better than, the MAVE-type methods. By contrast, the underestimate q^=1\widehat{q}=1 brings significant bias to the proposed methods (which then reduce to SIR) for all four models. These findings comply with the theoretical anticipation above.

An interesting exception in Table 11 is the case of q^\widehat{q} forced to be 22 for Model 4, which suggests that the proposed methods remain effective even though qq is slightly underestimated. The reason is that the mixture of three multivariate normal distributions in this model can still be approximated well by a mixture of two multivariate normal distributions. To see this, we draw the scatter plot of (X1,X2)(X_{\scriptscriptstyle 1},X_{\scriptscriptstyle 2}), the sub-vector of XX with the clustered pattern, in Figure 4. The point type of each observation xx is determined by an independent Bernoulli random variable whose probability of success is π^1(x)\widehat{\pi}_{\scriptscriptstyle 1}(x), derived from the mixture model fit with q^=2\widehat{q}=2. The two fitted clusters are then formed by the observations sharing the same point type. As Figure 4 shows, each cluster exhibits an approximately elliptical distribution, and, in particular, the loess fit of E(X|β0TX)E(X|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) within each cluster is approximately linear.

Figure 4: In the upper panels are the scatter plots of the fitted clusters of Model 4 with q^\widehat{q} forced at two, the left two plots corresponding to the two clusters respectively; in the lower panels are the scatter plots that illustrate the loess fit of E(X|β0TX)E(X|\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X) within each cluster, with the x-axis changed to β0TX=X1+X2\beta_{\scriptscriptstyle 0}^{\mbox{\tiny{\sf T}}}X=X_{\scriptscriptstyle 1}+X_{\scriptscriptstyle 2} and the y-axis changed to X1X_{\scriptscriptstyle 1}.
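A minimal Python sketch of this point-type assignment is as follows, assuming the first two columns of the data matrix hold (X_1, X_2); the plotting details are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

def plot_fitted_clusters(X, seed=0):
    """Fit a two-component mixture and draw each observation's point type
    from an independent Bernoulli with success probability pi_1_hat(x)."""
    gm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
    pi1_hat = gm.predict_proba(X)[:, 0]   # posterior weight of cluster 1
    types = np.random.default_rng(seed).binomial(1, pi1_hat)
    for t, marker in ((0, "o"), (1, "x")):
        plt.scatter(X[types == t, 0], X[types == t, 1], marker=marker, s=12)
    plt.xlabel("$X_1$")
    plt.ylabel("$X_2$")
    plt.show()
```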

When qq is estimated by a consistent order-determination method, it may still be misspecified in practice. However, unlike the case of arbitrarily fixed q^\widehat{q} above, an underestimation of qq in this case is plausibly a sign of severely overlapping mixture components in the data set; that is, two or more mixture components have similar distributions and can thus be regarded as a single component that approximately falls in the working parametric family {\mathcal{F}} (e.g. multivariate normal). The distribution of XX can then still be approximated by the working mixture model, so the proposed SDR methods remain effective. Together with their robustness to the choice of qq discussed above, this suggests that a misspecification of qq should not be worrisome in practice.

Next, we assess the impact of the estimation error of π^\widehat{\pi}’s, μ^i\widehat{\mu}_{\scriptscriptstyle i}’s, and Σ^i\widehat{\Sigma}_{\scriptscriptstyle i}’s on the proposed SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}. To this end, we implement both SDR methods in the oracle scenario where the mixture model of XX is completely known, for each model in Subsection 6.1 of the main text, and compare their performance (based on 200200 independent runs) with that in Table 1. The results are summarized in Table 12. Generally, the estimation error in fitting the mixture model of XX has a negligible impact on both proposed methods. A related asymptotic study is deferred to future work.

Table 12: Comparison of the proposed methods with their oracle counterparts
Method Model 1 Model 2 Model 3 Model 4
SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} .590(.151) .436(.094) .553(.111) .374(.105)
SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} .555(.146) .402(.085) .502(.109) .338(.098)
oracle-SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} .616(.158) .469(.113) .544(.128) .401(.105)
oracle-SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} .538(.138) .314(.064) .476(.116) .326(.078)
  • The methods “oracle-SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}}” and “oracle-SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}}” refer to SIRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{M}} and SIRRM\mathrm{SIR}_{\scriptscriptstyle\mathrm{RM}} with the mixture model of XX completely known, respectively. The meanings of numbers in each cell of Columns 2–5 follow those in Table 1.
