
Indirect multivariate response linear regression

Aaron J. Molstad
School of Statistics
University of Minnesota
molst029@umn.edu
   Adam J. Rothman
School of Statistics
University of Minnesota
arothman@umn.edu
(July 16, 2015)
Abstract

We propose a new class of estimators of the multivariate response linear regression coefficient matrix that exploits the assumption that the response and predictors have a joint multivariate Normal distribution. This allows us to indirectly estimate the regression coefficient matrix through shrinkage estimation of the parameters of the inverse regression, or the conditional distribution of the predictors given the responses. We establish a convergence rate bound for estimators in our class and we study two examples. The first example estimator exploits an assumption that the inverse regression’s coefficient matrix is sparse. The second example estimator exploits an assumption that the inverse regression’s coefficient matrix is rank deficient. These estimators do not require the popular assumption that the forward regression coefficient matrix is sparse or has small Frobenius norm. Using simulation studies, we show that our example estimators outperform relevant competitors for some data generating models.

1 Introduction

Some statistical applications require the modeling of a multivariate response. Let $y_{i}\in\mathbb{R}^{q}$ be the measurement of the $q$-variate response for the $i$th subject and let $x_{i}\in\mathbb{R}^{p}$ be the nonrandom values of the $p$ predictors for the $i$th subject $(i=1,\dots,n)$. The multivariate response linear regression model assumes that $y_{i}$ is a realization of the random vector

Y_{i}=\mu_{*}+\beta_{*}^{\prime}x_{i}+\varepsilon_{i},\quad i=1,\ldots,n, \qquad (1)

where $\mu_{*}\in\mathbb{R}^{q}$ is the unknown intercept, $\beta_{*}$ is the unknown $p$ by $q$ regression coefficient matrix, and $\varepsilon_{1},\ldots,\varepsilon_{n}$ are independent copies of a mean zero random vector with covariance matrix $\Sigma_{*E}$.

The ordinary least squares estimator of $\beta_{*}$ is

\hat{\beta}^{({\rm OLS})}=\operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}, \qquad (2)

where $\|\cdot\|_{F}$ is the Frobenius norm, $\mathbb{R}^{p\times q}$ is the set of real valued $p$ by $q$ matrices, $\mathbb{Y}$ is the $n$ by $q$ matrix with $i$th row $(Y_{i}-n^{-1}\sum_{i=1}^{n}Y_{i})^{\prime}$, and $\mathbb{X}$ is the $n$ by $p$ matrix with $i$th row $(x_{i}-n^{-1}\sum_{i=1}^{n}x_{i})^{\prime}$ $(i=1,\ldots,n)$. It is well known that $\hat{\beta}^{({\rm OLS})}$ is the maximum likelihood estimator of $\beta_{*}$ when $\varepsilon_{1},\ldots,\varepsilon_{n}$ are independent and identically distributed $N_{q}(0,\Sigma_{*E})$ and the corresponding maximum likelihood estimator of $\Sigma_{*E}^{-1}$ exists.

Many shrinkage estimators of $\beta_{*}$ have been proposed by penalizing the optimization in (2). Some of these estimators simultaneously estimate $\beta_{*}$ and remove irrelevant predictors (Turlach et al., 2005; Obozinski et al., 2010; Peng et al., 2010). Others encourage an estimator of reduced rank (Yuan et al., 2007; Chen and Huang, 2012).

Under the restriction that $\varepsilon_{1},\ldots,\varepsilon_{n}$ are independent and identically distributed $N_{q}(0,\Sigma_{*E})$, shrinkage estimators of $\beta_{*}$ that penalize or constrain the minimization of the negative loglikelihood have been proposed. These methods simultaneously estimate $\beta_{*}$ and $\Sigma_{*E}^{-1}$. Examples include maximum likelihood reduced rank regression (Izenman, 1975; Reinsel and Velu, 1998), envelope models (Cook et al., 2010; Su and Cook, 2011, 2012, 2013), and multivariate regression with covariance estimation (Rothman et al., 2010; Lee and Liu, 2012; Bhadra and Mallick, 2013).

To fit (1) with these shrinkage estimators, one exploits explicit assumptions about $\beta_{*}$, but these may be unreasonable in some applications. As an alternative, we propose an indirect method to fit (1) without making explicit assumptions about $\beta_{*}$. We exploit the assumption that the response and predictors have a joint multivariate Normal distribution and we employ shrinkage estimators of the parameters of the conditional distribution of the predictors given the response. Our method provides an alternative indirect estimator of $\beta_{*}$, which may be suitable when the existing shrinkage estimators are inadequate.

2 A new class of indirect estimators of $\beta_{*}$

2.1 Class definition

We assume that the measured predictor and response pairs $(x_{1},y_{1}),\dots,(x_{n},y_{n})$ are a realization of $n$ independent copies of $(X,Y)$, where $(X^{\prime},Y^{\prime})^{\prime}\sim N_{p+q}(\mu_{*},\Sigma_{*})$. We also assume that $\Sigma_{*}$ is positive definite. Define the marginal parameters through the following partitions:

\mu_{*}=\left(\begin{array}{c}\mu_{*X}\\ \mu_{*Y}\end{array}\right),\quad\Sigma_{*}=\left(\begin{array}{cc}\Sigma_{*XX}&\Sigma_{*XY}\\ \Sigma_{*XY}^{\prime}&\Sigma_{*YY}\end{array}\right).

Our goal is to estimate the multivariate regression coefficient matrix $\beta_{*}=\Sigma_{*XX}^{-1}\Sigma_{*XY}$ in the forward regression model

(Y|X=x)\sim N_{q}(\mu_{*Y}+\beta_{*}^{\prime}(x-\mu_{*X}),\Sigma_{*E}),

without assuming that $\beta_{*}$ is sparse or that $\|\beta_{*}\|_{F}^{2}$ is small. To do this we will estimate the inverse regression's coefficient matrix $\eta_{*}=\Sigma_{*YY}^{-1}\Sigma_{*XY}^{\prime}$ and the inverse regression's error precision matrix $\Delta_{*}^{-1}$ in the inverse regression model

(X|Y=y)\sim N_{p}(\mu_{*X}+\eta_{*}^{\prime}(y-\mu_{*Y}),\Delta_{*}).

We connect the parameters of the inverse regression model to $\beta_{*}$ with the following proposition.

Proposition 1.

If $\Sigma_{*}$ is positive definite, then

\beta_{*}=\Delta_{*}^{-1}\eta_{*}^{\prime}\left(\Sigma_{*YY}^{-1}+\eta_{*}\Delta_{*}^{-1}\eta_{*}^{\prime}\right)^{-1}. \qquad (3)

We prove Proposition 1 in Appendix A.1. This result leads us to propose a class of estimators of $\beta_{*}$ defined by

\hat{\beta}=\hat{\Delta}^{-1}\hat{\eta}^{\prime}(\hat{\Sigma}_{YY}^{-1}+\hat{\eta}\hat{\Delta}^{-1}\hat{\eta}^{\prime})^{-1}, \qquad (4)

where $\hat{\eta}$, $\hat{\Delta}$, and $\hat{\Sigma}_{YY}$ are user-selected estimators of $\eta_{*}$, $\Delta_{*}$, and $\Sigma_{*YY}$. If $n>\max(p,q)$ and the ordinary sample estimators are used for $\hat{\eta}$, $\hat{\Delta}$, and $\hat{\Sigma}_{YY}$, then $\hat{\beta}$ is equivalent to $\hat{\beta}^{({\rm OLS})}$.

We propose to use shrinkage estimators of $\eta_{*}$, $\Delta_{*}^{-1}$, and $\Sigma_{*YY}^{-1}$ in (4). This gives us the potential to indirectly fit an unparsimonious forward regression model by fitting a parsimonious inverse regression model. For example, suppose that $\eta_{*}$ and $\Delta_{*}^{-1}$ are sparse, but $\beta_{*}$ is dense. To fit the inverse regression model, we could use any of the forward regression shrinkage estimators discussed in Section 1.
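
To make the plug-in step in (4) concrete, the following sketch (in Python with numpy; the function and variable names are ours, not part of the paper) computes the indirect estimator from user-supplied estimates of $\eta_{*}$, $\Delta_{*}^{-1}$, and $\Sigma_{*YY}^{-1}$, and numerically checks the claim above that (4) reduces to $\hat{\beta}^{({\rm OLS})}$ when the ordinary sample estimators are plugged in and $n>\max(p,q)$.

```python
import numpy as np

def indirect_beta(eta_hat, delta_inv_hat, sigma_yy_inv_hat):
    """Indirect estimator (4): Delta^{-1} eta' (Sigma_YY^{-1} + eta Delta^{-1} eta')^{-1}."""
    middle = sigma_yy_inv_hat + eta_hat @ delta_inv_hat @ eta_hat.T
    return delta_inv_hat @ eta_hat.T @ np.linalg.inv(middle)

# Check: with the ordinary sample plug-ins and n > max(p, q), (4) reproduces OLS.
rng = np.random.default_rng(0)
n, p, q = 100, 5, 3
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal((p, q)) + rng.standard_normal((n, q))
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)     # centered data matrices
eta_hat = np.linalg.solve(Yc.T @ Yc, Yc.T @ Xc)     # sample inverse regression coefficients (q x p)
resid = Xc - Yc @ eta_hat
delta_inv_hat = np.linalg.inv(resid.T @ resid / n)  # sample inverse regression error precision
sigma_yy_inv_hat = np.linalg.inv(Yc.T @ Yc / n)     # sample precision of the responses
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)    # forward OLS coefficients (p x q)
assert np.allclose(indirect_beta(eta_hat, delta_inv_hat, sigma_yy_inv_hat), beta_ols)
```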

2.2 Related work

Lee and Liu (2012) proposed an estimator of $\beta_{*}$ that also exploits the assumption that $(X^{\prime},Y^{\prime})^{\prime}$ is multivariate Normal; however, unlike our approach that makes no explicit assumptions about $\beta_{*}$, their approach assumes that both $\Sigma_{*}^{-1}$ and $\beta_{*}$ are sparse.

Modeling the inverse regression is a well-known idea in multivariate analysis. For example, when $Y$ is categorical, quadratic discriminant analysis models $(X|Y=y)$ as $p$-variate Normal. There are also many examples of modeling the inverse regression in the sufficient dimension reduction literature (Adragni and Cook, 2009).

The most closely related work to ours is that by Cook et al. (2013). They proposed indirect estimators of $\beta_{*}$ based on modeling the inverse regression in the special case when the response is univariate, i.e. $q=1$. Under the same multivariate Normal assumption on $(X^{\prime},Y^{\prime})^{\prime}$ that we make, Cook et al. (2013) showed that

\beta_{*}=\frac{1}{1+\Sigma_{*XY}^{\prime}\Delta_{*}^{-1}\Sigma_{*XY}/\Sigma_{*YY}}\Delta_{*}^{-1}\Sigma_{*XY}. \qquad (5)

They proposed estimators of $\beta_{*}$ by replacing $\Sigma_{*XY}$ and $\Sigma_{*YY}$ in the right hand side of (5) with their usual sample estimators, and by replacing $\Delta_{*}^{-1}$ with a shrinkage estimator. This class of estimators was designed to exploit an abundant signal rate in the forward univariate response regression when $p>n$.

3 Asymptotic analysis

We present a convergence rate bound for the indirect estimator of $\beta_{*}$ defined by (4). Our bound allows $p$ and $q$ to grow with the sample size $n$. In the following proposition, $\|\cdot\|$ is the spectral norm and $\varphi_{\min}(\cdot)$ is the minimum eigenvalue.

Proposition 2.

Suppose that the following conditions are true: (i) $\Sigma_{*}$ is positive definite for all $p+q$; (ii) the estimator $\hat{\Sigma}_{YY}^{-1}$ is positive definite for all $q$; (iii) the estimator $\hat{\Delta}^{-1}$ is positive definite for all $p$; (iv) there exists a positive constant $K$ such that $\varphi_{\min}(\Sigma_{*YY}^{-1})\geq K$ for all $q$; and (v) there exist sequences $\{a_{n}\}$, $\{b_{n}\}$, and $\{c_{n}\}$ such that $\|\hat{\eta}-\eta_{*}\|=O_{P}(a_{n})$, $\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|=O_{P}(b_{n})$, $\|\hat{\Sigma}_{YY}^{-1}-\Sigma_{*YY}^{-1}\|=O_{P}(c_{n})$, and $a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}+c_{n}\rightarrow 0$ as $n\rightarrow\infty$. Then

\|\hat{\beta}-\beta_{*}\|=O_{P}\left(a_{n}\|\eta_{*}\|^{2}\|\Delta_{*}^{-1}\|^{2}+b_{n}\|\eta_{*}\|^{3}\|\Delta_{*}^{-1}\|+c_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\right).

We prove Proposition 2 in Appendix A.1. We used the spectral norm because it is compatible with the convergence rate bounds established for sparse inverse covariance estimators (Rothman et al., 2008; Lam and Fan, 2009; Ravikumar et al., 2011).

If the inverse regression is parsimonious in the sense that $\|\eta_{*}\|$ and $\|\Delta_{*}^{-1}\|$ are bounded, then the bound in Proposition 2 simplifies to $\|\hat{\beta}-\beta_{*}\|=O_{P}(a_{n}+b_{n}+c_{n})$. From an asymptotic perspective, it is not surprising that the indirect estimator of $\beta_{*}$ is only as good as its worst plug-in estimator. We explore finite sample performance in Section 5.

4 Example estimators in our class

4.1 Sparse inverse regression

We now describe an estimator of the forward regression coefficient matrix $\beta_{*}$ defined by (4) that exploits zeros in the inverse regression's coefficient matrix $\eta_{*}$, zeros in the inverse regression's error precision matrix $\Delta_{*}^{-1}$, and zeros in the precision matrix of the responses $\Sigma_{*YY}^{-1}$. We estimate $\eta_{*}$ with

\hat{\eta}^{{\rm L1}}=\operatorname*{arg\ min}_{\eta\in\mathbb{R}^{q\times p}}\left\{\|\mathbb{X}-\mathbb{Y}\eta\|_{F}^{2}+\sum_{j=1}^{p}\lambda_{j}\sum_{m=1}^{q}|\eta_{mj}|\right\}, \qquad (6)

which separates into $p$ $L_{1}$-penalized least-squares regressions (Tibshirani, 1996): the first predictor regressed on the response through the $p$th predictor regressed on the response. We select $\lambda_{j}$ with 5-fold cross-validation, minimizing squared prediction error totaled over the folds, in the regression of the $j$th predictor on the response $(j=1,\ldots,p)$. This allows us to estimate the columns of $\eta_{*}$ in parallel.
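
As an illustration, a minimal sketch of this step, assuming scikit-learn is available (the function below and the use of LassoCV are our choices, not prescribed by the paper; LassoCV scales the squared error by $1/(2n)$, so its cross-validated penalty plays the role of $\lambda_{j}$ only up to that rescaling):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_eta_lasso(Xc, Yc, cv=5):
    """Estimate the q x p inverse regression coefficient matrix eta as in (6):
    regress each centered predictor column on the centered responses with an
    L1 penalty chosen by cv-fold cross-validation (one penalty per predictor)."""
    p, q = Xc.shape[1], Yc.shape[1]
    eta_hat = np.zeros((q, p))
    for j in range(p):  # the p columns can be fit independently, hence in parallel
        fit = LassoCV(cv=cv, fit_intercept=False).fit(Yc, Xc[:, j])
        eta_hat[:, j] = fit.coef_
    return eta_hat
```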

We estimate $\Delta_{*}^{-1}$ and $\Sigma_{*YY}^{-1}$ with $L_{1}$-penalized Normal likelihood precision matrix estimation (Yuan and Lin, 2007; Banerjee et al., 2008). Let $\hat{\Sigma}_{\gamma,S}^{-1}$ be a generic version of this estimator with tuning parameter $\gamma$ and input $p$ by $p$ sample covariance matrix $S$:

\hat{\Sigma}_{\gamma,S}^{-1}=\operatorname*{arg\ min}_{\Omega\in\mathbb{S}_{+}^{p}}\left\{{\rm tr}(\Omega S)-\log|\Omega|+\gamma\sum_{j\neq k}|\omega_{jk}|\right\}, \qquad (7)

where $\mathbb{S}_{+}^{p}$ is the set of symmetric and positive definite $p$ by $p$ matrices. There are many algorithms that solve (7). Two good choices are the graphical lasso algorithm (Yuan, 2008; Friedman et al., 2008) and the QUIC algorithm (Hsieh et al., 2011). We select $\gamma$ with 5-fold cross-validation maximizing a validation likelihood criterion (Huang et al., 2006):

\hat{\gamma}=\operatorname*{arg\ min}_{\gamma\in\mathcal{G}}\sum_{k=1}^{5}\left\{{\rm tr}\left(\hat{\Sigma}_{\gamma,S_{(-k)}}^{-1}S_{(k)}\right)-\log\left|\hat{\Sigma}_{\gamma,S_{(-k)}}^{-1}\right|\right\}, \qquad (8)

where $\mathcal{G}$ is a user-selected finite subset of the non-negative real line, $S_{(-k)}$ is the sample covariance matrix from the observations outside the $k$th fold, and $S_{(k)}$ is the sample covariance matrix from the observations in the $k$th fold centered by the sample mean of the observations outside the $k$th fold. We estimate $\Delta_{*}^{-1}$ using (7) with its tuning parameter selected by (8) and $S=(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})^{\prime}(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})/n$. Similarly, we estimate $\Sigma_{*YY}^{-1}$ using (7) with its tuning parameter selected by (8) and $S=\mathbb{Y}^{\prime}\mathbb{Y}/n$.
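
A sketch of this tuning step, assuming scikit-learn's graphical_lasso routine as the solver for (7) (one of several possible solvers; the function and names below are ours): it accumulates the validation likelihood criterion (8) over a user-supplied grid of candidate values of $\gamma$ and then refits on the full sample covariance matrix at the selected value.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def precision_by_validation_likelihood(Z, grid, n_folds=5, seed=0):
    """Estimate a precision matrix with (7), selecting gamma by criterion (8).
    Z is the n x d data matrix whose sample covariance plays the role of S
    (e.g. the inverse regression residuals X - Y eta_hat, or the responses Y)."""
    n = Z.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = np.zeros(len(grid))
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        mean_train = Z[train].mean(axis=0)
        Zt = Z[train] - mean_train
        S_train = Zt.T @ Zt / len(train)
        Zv = Z[fold] - mean_train              # validation data centered by the training mean
        S_val = Zv.T @ Zv / len(fold)
        for i, gamma in enumerate(grid):
            _, omega = graphical_lasso(S_train, alpha=gamma)
            scores[i] += np.trace(omega @ S_val) - np.linalg.slogdet(omega)[1]
    best_gamma = grid[int(np.argmin(scores))]
    Zc = Z - Z.mean(axis=0)
    _, omega_hat = graphical_lasso(Zc.T @ Zc / n, alpha=best_gamma)
    return omega_hat, best_gamma
```

In the notation above, calling this function with the residual matrix $\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}}$ gives the estimate of $\Delta_{*}^{-1}$, and calling it with $\mathbb{Y}$ gives the estimate of $\Sigma_{*YY}^{-1}$.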

4.2 Reduced rank inverse regression

We propose indirect estimators of $\beta_{*}$ that exploit the assumption that the inverse regression's coefficient matrix $\eta_{*}$ is rank deficient. We have the following simple proposition that links rank deficiency in $\eta_{*}$ and its estimator to $\beta_{*}$ and its indirect estimator.

Proposition 3.

If $\Sigma_{*}$ is positive definite, then ${\rm rank}(\beta_{*})={\rm rank}(\eta_{*})$. In addition, if $\hat{\Sigma}_{YY}^{-1}$ and $\hat{\Delta}^{-1}$ are positive definite in the indirect estimator $\hat{\beta}$ defined by (4), then ${\rm rank}(\hat{\beta})={\rm rank}(\hat{\eta})$.

The proof of this proposition is simple, so we omit it to save space.

We propose the following two example reduced rank indirect estimators of $\beta_{*}$:

  1.

    Estimate $\Sigma_{*YY}$ with $\mathbb{Y}^{\prime}\mathbb{Y}/n$ and estimate $(\eta_{*},\Delta_{*}^{-1})$ with Normal likelihood reduced rank inverse regression:

    (\hat{\eta}^{(r)},\hat{\Delta}^{-1(r)})=\operatorname*{arg\ min}_{(\eta,\Omega)\in\mathbb{R}^{q\times p}\times\mathbb{S}_{+}^{p}}\left[n^{-1}{\rm tr}\left\{(\mathbb{X}-\mathbb{Y}\eta)^{\prime}(\mathbb{X}-\mathbb{Y}\eta)\Omega\right\}-\log{\rm det}(\Omega)\right]\quad{\rm subject\ to}\ {\rm rank}(\eta)=r, \qquad (9)

    where $r$ is selected from $\{0,\ldots,\min(p,q)\}$. The solution to the optimization in (9) is available in closed form (Reinsel and Velu, 1998).

  2.

    Estimate $\eta_{*}$ with $\hat{\eta}^{(r)}$ defined in (9), estimate $\Sigma_{*YY}^{-1}$ with (7) using $S=\mathbb{Y}^{\prime}\mathbb{Y}/n$, and estimate $\Delta_{*}^{-1}$ with (7) using $S=(\mathbb{X}-\mathbb{Y}\hat{\eta}^{(r)})^{\prime}(\mathbb{X}-\mathbb{Y}\hat{\eta}^{(r)})/n$.

Both example indirect reduced rank estimators of $\beta_{*}$ are formed by plugging the estimators of $\eta_{*}$, $\Delta_{*}^{-1}$, and $\Sigma_{*YY}$ into (4). The first estimator is likelihood-based and the second estimator exploits sparsity in $\Sigma_{*YY}^{-1}$ and $\Delta_{*}^{-1}$. Neither estimator is defined when $\min(p,q)>n$. In this case, which we do not address, a regularized reduced rank estimator of $\eta_{*}$ could be used instead of the estimator defined in (9), e.g. the factor estimation and selection estimator (Yuan et al., 2007) or the reduced rank ridge regression estimator (Mukherjee and Zhu, 2011).
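
For reference, the sketch below implements a simplified version of the first step: the rank-constrained least-squares fit of the inverse regression, i.e. (9) with $\Omega$ held fixed at a multiple of the identity. This simplification is ours; the full likelihood solution of (9) additionally weights by the residual precision and is given in closed form by Reinsel and Velu (1998).

```python
import numpy as np

def reduced_rank_inverse_regression(Xc, Yc, r):
    """Rank-constrained least-squares fit of X ~ Y eta with rank(eta) = r
    (the unweighted analogue of (9)).  Returns (eta_r, delta_inv_hat)."""
    n = Xc.shape[0]
    eta_ols = np.linalg.solve(Yc.T @ Yc, Yc.T @ Xc)   # unrestricted q x p estimate
    fitted = Yc @ eta_ols
    # Best rank-r approximation of the fitted values (Eckart-Young):
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    proj = Vt[:r].T @ Vt[:r]                          # projection onto the top r right singular vectors
    eta_r = eta_ols @ proj                            # rank-r coefficient estimate
    resid = Xc - Yc @ eta_r
    delta_inv_hat = np.linalg.inv(resid.T @ resid / n)  # plug-in residual precision (needs n > p)
    return eta_r, delta_inv_hat
```

The returned estimates can then be combined with an estimate of $\Sigma_{*YY}^{-1}$ in (4); by Proposition 3, the resulting indirect estimator of $\beta_{*}$ also has rank $r$.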

5 Simulations

5.1 Sparse inverse regression simulation

We compared the following indirect estimators of $\beta_{*}$ when the inverse regression's coefficient matrix $\eta_{*}$ is sparse:

  • $I_{L1}$.

    This is the indirect estimator proposed in Section 4.1.

  • $I_{S}$.

    This is an indirect estimator defined by (4) with $\hat{\eta}$ defined by (6), $\hat{\Sigma}_{YY}=\mathbb{Y}^{\prime}\mathbb{Y}/n$, and $\hat{\Delta}=(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})^{\prime}(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})/n$.

  • $O_{\Delta}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (6), $\hat{\Sigma}_{YY}^{-1}$ defined by (7), and $\hat{\Delta}^{-1}=\Delta^{-1}_{*}$.

  • $O$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (6), $\hat{\Sigma}_{YY}^{-1}=\Sigma_{*YY}^{-1}$, and $\hat{\Delta}^{-1}=\Delta^{-1}_{*}$.

  • $O_{Y}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (6), $\hat{\Sigma}_{YY}^{-1}=\Sigma_{*YY}^{-1}$, and $\hat{\Delta}^{-1}$ defined by (7).

We also included the following forward regression estimators of $\beta_{*}$:

  • OLS/MP.

    This is the ordinary least squares estimator defined by $\operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}$. When $n\leq p$, we use the solution $\mathbb{X}^{-}\mathbb{Y}$, where $\mathbb{X}^{-}$ is the Moore-Penrose generalized inverse of $\mathbb{X}$.

  • R.

    This is the ridge penalized least squares estimator defined by

    \operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\left(\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}+\lambda\|\beta\|_{F}^{2}\right).
  • $\ell_{2}$.

    This is an alternative ridge penalized least squares estimator defined by

    \operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\left(\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}+\sum_{m=1}^{q}\lambda_{m}\sum_{j=1}^{p}\beta_{jm}^{2}\right),

    where a separate tuning parameter is used for each response.

We selected the tuning parameters for uses of (6) with 5-fold cross-validation, minimizing validation prediction error on the inverse regression. Tuning parameters for $\ell_{2}$ and R were selected with 5-fold cross-validation, minimizing validation prediction error on the forward regression. We selected tuning parameters for uses of (7) with (8). The candidate set of tuning parameters was $\left\{10^{-8},10^{-7.5},\dots,10^{7.5},10^{8}\right\}$.

For 50 independent replications, we generated a realization of $n$ independent copies of $(X^{\prime},Y^{\prime})^{\prime}$, where $Y\sim N_{q}(0,\Sigma_{*YY})$ and $(X|Y=y)\sim N_{p}(\eta_{*}^{\prime}y,\Delta_{*})$. The $(i,j)$th entry of $\Sigma_{*YY}$ was set to $\rho_{Y}^{|i-j|}$ and the $(i,j)$th entry of $\Delta_{*}$ was set to $\rho_{\Delta}^{|i-j|}$. We set $\eta_{*}=Z\circ A$, where $\circ$ denotes the element-wise product: $Z$ had entries independently drawn from $N(0,1)$ and $A$ had entries independently drawn from the Bernoulli distribution with nonzero probability $s_{*}$. This model is ideal for $I_{L1}$ because $\Delta_{*}^{-1}$ and $\Sigma_{*YY}^{-1}$ are both sparse. Every entry in the corresponding randomly generated $\beta_{*}$ is nonzero with high probability, but the magnitudes of these entries are small. This motivated us to compare our indirect estimators of $\beta_{*}$ to the ridge-penalized least squares forward regression estimators R and $\ell_{2}$.

We evaluated performance with model error (Breiman and Friedman, 1997; Yuan et al., 2007), which is defined by $\|\Sigma_{*XX}^{1/2}(\hat{\beta}-\beta_{*})\|_{F}^{2}$.
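
For concreteness, one replication of this data generating model can be sketched as follows (the helper names are ours). The true $\beta_{*}$ needed for the model error metric follows from Proposition 1, and the marginal covariance of the predictors is $\Sigma_{*XX}=\Delta_{*}+\eta_{*}^{\prime}\Sigma_{*YY}\eta_{*}$.

```python
import numpy as np

def ar1_cov(dim, rho):
    """Matrix with (i, j) entry rho^{|i-j|}."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def generate_sparse_inverse_regression(n, p, q, rho_y, rho_delta, s, rng):
    sigma_yy = ar1_cov(q, rho_y)
    delta = ar1_cov(p, rho_delta)
    eta = rng.standard_normal((q, p)) * rng.binomial(1, s, size=(q, p))  # eta = Z o A
    Y = rng.multivariate_normal(np.zeros(q), sigma_yy, size=n)
    X = Y @ eta + rng.multivariate_normal(np.zeros(p), delta, size=n)    # X | Y = y ~ N_p(eta' y, Delta)
    # True forward coefficient matrix via Proposition 1, for the model error metric.
    delta_inv = np.linalg.inv(delta)
    beta_star = delta_inv @ eta.T @ np.linalg.inv(np.linalg.inv(sigma_yy) + eta @ delta_inv @ eta.T)
    sigma_xx = delta + eta.T @ sigma_yy @ eta
    return X, Y, beta_star, sigma_xx
```

For any estimate beta_hat, the model error above equals np.trace((beta_hat - beta_star).T @ sigma_xx @ (beta_hat - beta_star)).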

Table 1: Averages of model error from 50 replications when $n=100$, $p=20$, and $q=20$. All standard errors were less than or equal to 0.05.
$\rho_{Y}$ $\rho_{\Delta}$ $s_{*}$ $I_{L1}$ $O$ $O_{\Delta}$ $O_{Y}$ $I_{S}$ OLS $\ell_{2}$ R
0.7 0.0 0.1 0.61 0.32 0.53 0.40 1.35 2.10 1.23 1.22
0.7 0.5 0.1 0.72 0.39 0.59 0.51 1.30 1.91 1.29 1.30
0.7 0.7 0.1 0.76 0.45 0.65 0.56 1.27 1.73 1.27 1.29
0.7 0.9 0.1 0.83 0.66 0.85 0.64 1.26 1.35 1.05 1.09
0.0 0.9 0.1 0.81 0.87 0.87 0.79 2.04 2.34 1.26 1.87
0.5 0.9 0.1 0.96 0.76 0.99 0.74 1.63 1.84 1.36 1.49
0.9 0.9 0.1 0.46 0.39 0.47 0.36 0.63 0.62 0.48 0.48
0.7 0.9 0.3 0.60 0.53 0.65 0.46 0.83 0.67 0.64 0.63
0.7 0.9 0.5 0.48 0.37 0.48 0.37 0.65 0.53 0.52 0.51
0.7 0.9 0.7 0.42 0.29 0.39 0.31 0.55 0.46 0.45 0.44

We report the average model errors, based on these 50 replications, in Table 1. When $s_{*}=0.1$, the indirect estimators defined by (4) performed well for all choices of $\rho_{Y}$ and $\rho_{\Delta}$. Our proposed estimator $I_{L1}$ was competitive with other indirect estimators also defined by (4), even those that used some oracle information. As $s_{*}$ increased with $\rho_{Y}=0.7$ and $\rho_{\Delta}=0.9$ fixed, the forward regression estimators performed nearly as well as $I_{L1}$.

Table 2: Averages of model error from 50 replications when $n=50$, $p=60$, and $q=60$. All standard errors were 0.69 or less, except for MP, which had standard errors between 0.77 and 3.16.
$\rho_{Y}$ $\rho_{\Delta}$ $s_{*}$ $I_{L1}$ $O$ $O_{\Delta}$ $O_{Y}$ MP $\ell_{2}$ R
0.7 0.0 0.1 8.59 4.28 5.70 7.40 78.33 13.85 12.44
0.7 0.5 0.1 9.67 5.09 6.37 8.49 73.82 14.79 13.34
0.7 0.7 0.1 10.01 6.37 7.44 8.75 70.30 15.56 14.40
0.7 0.9 0.1 9.92 10.07 11.44 8.88 61.83 16.43 15.94
0.0 0.9 0.1 15.17 17.09 16.93 15.23 119.60 28.63 29.41
0.5 0.9 0.1 14.88 13.59 16.91 12.01 86.88 23.62 22.69
0.9 0.9 0.1 4.71 4.78 5.94 3.99 25.37 6.36 5.91
0.7 0.9 0.3 16.86 17.43 19.66 15.44 43.88 15.30 14.14
0.7 0.9 0.5 26.89 26.81 29.93 24.95 36.87 14.79 13.62
0.7 0.9 0.7 31.86 35.98 38.64 30.36 33.58 14.35 13.65

Similarly, Table 2 shows that when $s_{*}=0.1$, $I_{L1}$ outperforms all three forward regression estimators. However, unlike in the lower dimensional setting illustrated in Table 1, when $\eta_{*}$ is not sparse, i.e. $s_{*}\geq 0.3$, $I_{L1}$ is outperformed by the forward regression approaches. The part oracle method $O_{Y}$ that used the knowledge of $\Sigma_{*YY}^{-1}$ outperformed the other two part oracle indirect estimators $O$ and $O_{\Delta}$ when $\rho_{\Delta}=0.9$. Also, when $\rho_{\Delta}=0.9$, $I_{L1}$ was competitive with the part oracle estimators. Taken together, the results in Tables 1 and 2 suggest that when $\eta_{*}$ is very sparse, our proposed indirect estimator $I_{L1}$ may perform nearly as well as the part oracle indirect estimators and the forward regression estimators.

5.2 Reduced rank inverse regression simulation

We compared the performance of the following indirect reduced rank estimators of $\beta_{*}$:

  • $I_{ML}^{(r)}$.

    This is the likelihood-based indirect example estimator 1 proposed in Section 4.2.

  • $I^{(r)}$.

    This is the indirect example estimator 2 proposed in Section 4.2, which uses sparse estimators of $\Sigma_{*YY}^{-1}$ and $\Delta_{*}^{-1}$ in (4).

  • $O^{(r)}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (9), $\hat{\Delta}^{-1}=\Delta_{*}^{-1}$, and $\hat{\Sigma}_{YY}^{-1}=\Sigma_{*YY}^{-1}$.

  • $O^{(r)}_{\Delta}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (9), $\hat{\Delta}^{-1}$ defined by (7), and $\hat{\Sigma}_{YY}^{-1}=\Sigma_{*YY}^{-1}$.

  • $O^{(r)}_{Y}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (9), $\hat{\Delta}^{-1}=\Delta_{*}^{-1}$, and $\hat{\Sigma}_{YY}^{-1}$ defined by (7).

We compared these indirect estimators to the following forward reduced rank regression estimator:

  • RR.

    This is the likelihood-based reduced rank regression (Izenman, 1975; Reinsel and Velu, 1998). The estimator of $\beta_{*}$ and the estimator of the forward regression's error precision matrix $\Sigma_{*E}^{-1}$ are defined by

    (\hat{\beta}^{(r)},\hat{\Sigma}_{E}^{-1(r)})=\operatorname*{arg\ min}_{(\beta,\Omega)\in\mathbb{R}^{p\times q}\times\mathbb{S}_{+}^{q}}\left[n^{-1}{\rm tr}\left\{(\mathbb{Y}-\mathbb{X}\beta)^{\prime}(\mathbb{Y}-\mathbb{X}\beta)\Omega\right\}-\log{\rm det}(\Omega)\right]\quad{\rm subject\ to}\ {\rm rank}(\beta)=r.

We selected the rank parameter $r$ for uses of (9) with 5-fold cross-validation, minimizing validation prediction error on the inverse regression. The rank parameter for RR was selected with 5-fold cross-validation, minimizing validation prediction error on the forward regression. We selected tuning parameters for uses of (7) with (8). The candidate set of tuning parameters was $\left\{10^{-8},10^{-7.5},\dots,10^{7.5},10^{8}\right\}$.

For 50 independent replications, we generated a realization of $n$ independent copies of $(X^{\prime},Y^{\prime})^{\prime}$, where $Y\sim N_{q}(0,\Sigma_{*YY})$ and $(X|Y=y)\sim N_{p}(\eta_{*}^{\prime}y,\Delta_{*})$. The $(i,j)$th entry of $\Sigma_{*YY}$ was set to $\rho_{Y}^{|i-j|}$ and the $(i,j)$th entry of $\Delta_{*}$ was set to $\rho_{\Delta}^{|i-j|}$. After specifying $r_{*}\leq\min(p,q)$, we set $\eta_{*}=PQ$, where $P\in\mathbb{R}^{q\times r_{*}}$ and $Q\in\mathbb{R}^{r_{*}\times p}$ had entries independently drawn from $N(0,1)$, so that $r_{*}={\rm rank}(\eta_{*})={\rm rank}(\beta_{*})$. As we did in the simulation in Section 5.1, we measured performance with model error.

Table 3: Averages of model error from 50 replications when $n=100$, $p=20$, and $q=20$. All standard errors were less than or equal to 0.05.
$\rho_{Y}$ $\rho_{\Delta}$ $r_{*}$ $I^{(r)}$ $O^{(r)}$ $O^{(r)}_{\Delta}$ $O^{(r)}_{Y}$ $I_{ML}^{(r)}$ OLS RR
0.7 0.0 10 0.33 0.04 0.86 0.75 0.64 1.38 0.64
0.7 0.5 10 0.34 0.04 0.86 0.74 0.60 1.31 0.60
0.7 0.7 10 0.31 0.03 0.86 0.80 0.62 1.32 0.61
0.7 0.9 10 0.31 0.02 0.85 0.88 0.60 1.30 0.61
0.0 0.9 10 0.15 0.03 1.00 1.77 1.22 2.61 1.21
0.5 0.9 10 0.42 0.01 1.11 1.36 0.90 1.97 0.89
0.9 0.9 10 0.12 0.01 0.32 0.30 0.22 0.46 0.22
0.7 0.9 4 0.35 0.02 1.73 2.61 0.49 3.12 0.49
0.7 0.9 8 0.35 0.01 1.15 1.33 0.68 1.73 0.65
0.7 0.9 12 0.31 0.04 0.64 0.59 0.55 0.96 0.53
0.7 0.9 16 0.25 0.08 0.30 0.20 0.44 0.50 0.42

We report the model errors, averaged over the 50 independent replications, in Table 3. Under every setting, $I^{(r)}$ outperformed all non-oracle competitors. When $r_{*}\leq 12$, $I^{(r)}$ outperformed both $O_{\Delta}^{(r)}$ and $O_{Y}^{(r)}$, which suggests that shrinkage estimation of $\Delta_{*}^{-1}$ and $\Sigma_{*YY}^{-1}$ was helpful. In each setting, $I_{ML}^{(r)}$ performed similarly to RR even though they estimate the parameters of different conditional distributions.

5.3 Reduced rank forward regression simulation

Our simulation studies in the previous sections used inverse regression data generating models. In this section, we compare the estimators from Section 5.2 using a forward regression data generating model.

For 50 independent replications, we generated a realization of $n$ independent copies of $(X^{\prime},Y^{\prime})^{\prime}$, where $X\sim N_{p}(0,\Sigma_{*XX})$ and $(Y|X=x)\sim N_{q}(\beta_{*}^{\prime}x,\Sigma_{*E})$. The $(i,j)$th entry of $\Sigma_{*XX}$ was set to $\rho_{X}^{|i-j|}$ and the $(i,j)$th entry of $\Sigma_{*E}$ was set to $\rho_{E}^{|i-j|}$. After specifying $r_{*}\leq\min(p,q)$, we set $\beta_{*}=ZQ$, where $Z\in\mathbb{R}^{p\times r_{*}}$ had entries independently drawn from $N(0,1)$ and $Q\in\mathbb{R}^{r_{*}\times q}$ had entries independently drawn from ${\rm Uniform}(-1/4,1/4)$. In this data generating model, neither $\Delta_{*}^{-1}$ nor $\Sigma_{*YY}^{-1}$ had entries equal to zero.

Table 4: Averages of model error from 50 replications when $n=100$, $p=20$, and $q=20$. All standard errors were less than or equal to 0.21.
$\rho_{X}$ $\rho_{E}$ $r_{*}$ $I^{(r)}$ $O^{(r)}$ $O^{(r)}_{\Delta}$ $O^{(r)}_{Y}$ $I_{ML}^{(r)}$ OLS RR
0.0 0.9 10 2.79 0.54 4.27 5.05 2.48 4.99 2.82
0.5 0.9 10 2.90 0.47 5.36 5.94 2.73 5.00 2.89
0.7 0.9 10 2.97 0.51 4.64 5.03 2.71 4.93 2.76
0.9 0.9 10 2.84 0.73 3.78 4.16 2.67 5.19 2.73
0.7 0.0 10 4.66 1.92 3.59 5.88 4.53 5.11 4.34
0.7 0.5 10 4.27 1.65 3.88 5.51 3.99 5.06 3.97
0.7 0.7 10 3.55 1.26 3.99 5.29 3.43 5.00 3.44
0.7 0.9 4 1.27 0.08 3.84 4.71 0.95 5.00 1.11
0.7 0.9 8 2.39 0.36 4.15 5.15 2.05 4.81 2.22
0.7 0.9 12 3.58 0.79 4.44 5.21 3.20 5.15 3.27
0.7 0.9 16 4.53 1.29 4.62 4.42 4.33 5.11 4.38

The model errors, averaged over the 50 replications, are reported in Table 4. Both $I^{(r)}$ and $I_{ML}^{(r)}$ were competitive with RR in most settings. Although neither $\Delta_{*}^{-1}$ nor $\Sigma_{*YY}^{-1}$ was sparse, we again see that $I^{(r)}$ generally outperforms $O_{Y}^{(r)}$ and $O_{\Delta}^{(r)}$, both of which use some oracle information. These results indicate that shrinkage estimators of $\Delta_{*}^{-1}$ and $\Sigma_{*YY}^{-1}$ in (4) are helpful even when neither is sparse.

6 Tobacco chemical composition data example

As an example application, we use the chemical composition of tobacco leaves data from Anderson and Bancroft (1952) and Izenman (2009). These data have $n=25$ cases, $p=6$ predictors, and $q=3$ responses. The names of the predictors, taken from page 183 of Izenman (2009), are percent nitrogen, percent chlorine, percent potassium, percent phosphorus, percent calcium, and percent magnesium. The names of the response variables, also taken from page 183 of Izenman (2009), are rate of cigarette burn in inches per 1,000 seconds, percent sugar in the leaf, and percent nicotine in the leaf. In these data, it may be inappropriate to assume that $\Delta_{*}^{-1}$ is sparse. For this reason, we consider another example indirect estimator of $\beta_{*}$, called $I_{L2}$, that estimates $\eta_{*}$ with (6), estimates $\Sigma_{*YY}^{-1}$ with (7) using $S=\mathbb{Y}^{\prime}\mathbb{Y}/n$, and estimates $\Delta_{*}^{-1}$ with

\operatorname*{arg\ min}_{\Omega\in\mathbb{S}^{p}_{+}}\left\{{\rm tr}(\Omega S)-\log{\rm det}(\Omega)+\gamma\sum_{j,k}|\omega_{jk}|^{2}\right\}, \qquad (10)

where $S=(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})^{\prime}(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})/n$. We compute (10) with the closed form solution derived by Witten and Tibshirani (2009). As before, we select $\gamma$ from $\{10^{-8},10^{-7.5},\ldots,10^{7.5},10^{8}\}$ using (8). We also consider the forward regression estimators RR, $\ell_{2}$, and OLS defined in Section 5.1 and Section 5.2. We introduce another competitor, $\ell_{1}$, defined as

\operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\left\{\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}+\sum_{m=1}^{q}\lambda_{m}\sum_{j=1}^{p}|\beta_{jm}|\right\},

which is equivalent to performing $q$ separate lasso regressions (Tibshirani, 1996). We randomly split the data into a 40% test set and a 60% training set in each of 500 replications and we measured the squared prediction error on the test set. All tuning parameters were chosen from $\{10^{-8},10^{-7.5},\ldots,10^{7.5},10^{8}\}$ by 5-fold cross-validation.
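
A sketch of the closed form for (10), which is our restatement of the Witten and Tibshirani (2009) result: the stationarity condition $S-\Omega^{-1}+2\gamma\Omega=0$ is solved eigenvalue-wise in the eigenbasis of $S$.

```python
import numpy as np

def ridge_precision(S, gamma):
    """Closed-form minimizer of tr(Omega S) - log det(Omega) + gamma * ||Omega||_F^2, gamma > 0.
    Writing S = V diag(d) V', the solution shares the eigenvectors of S and maps each
    eigenvalue d_i to (-d_i + sqrt(d_i**2 + 8 * gamma)) / (4 * gamma), which is positive."""
    d, V = np.linalg.eigh(S)
    omega_eigs = (-d + np.sqrt(d ** 2 + 8.0 * gamma)) / (4.0 * gamma)
    return V @ np.diag(omega_eigs) @ V.T
```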

Table 5: Averages of squared prediction error, with standard errors in parentheses, for each response variable from 500 replications.
$I^{(r)}$ $I_{L1}$ $I_{L2}$ OLS RR $\ell_{2}$ $\ell_{1}$
Rate of burn 1.19 1.33 0.45 2.96 2.17 0.57 1.55
(0.08) (0.10) (0.03) (0.15) (0.15) (0.07) (0.13)
Percent sugar 442.38 347.76 235.55 799.03 605.30 365.13 583.98
(17.97) (21.31) (6.31) (29.45) (25.52) (20.68) (24.36)
Percent nicotine 2.55 2.54 0.79 5.65 4.59 0.81 2.82
(0.29) (0.30) (0.05) (0.41) (0.31) (0.21) (0.29)

Table 5 shows squared prediction errors, averaged over the 10 predictions and the 500 replications. These results indicate that $I_{L2}$ outperforms all the competitors we considered. Also, $I_{L1}$ was outperformed by $\ell_{2}$, but was competitive with separate lasso regressions. Reduced rank regression was not competitive with the proposed indirect estimators.

Acknowledgment

We thank Liliana Forzani for an important discussion. This research is supported in part by a grant from the U.S. National Science Foundation.

Appendix A Appendix

A.1 Proofs

Proof of Proposition 1.

Since $\Sigma_{*}$ is positive definite, we apply the partitioned inverse formula to obtain that

\Sigma_{*}^{-1}=\left(\begin{array}{cc}\Sigma_{*XX}&\Sigma_{*XY}\\ \Sigma_{*XY}^{\prime}&\Sigma_{*YY}\end{array}\right)^{-1}=\left(\begin{array}{cc}\Delta^{-1}_{*}&-\beta_{*}\Sigma_{*E}^{-1}\\ -\eta_{*}\Delta^{-1}_{*}&\Sigma_{*E}^{-1}\end{array}\right),

where $\Delta_{*}=\Sigma_{*XX}-\Sigma_{*XY}\Sigma_{*YY}^{-1}\Sigma_{*XY}^{\prime}$ and $\Sigma_{*E}=\Sigma_{*YY}-\Sigma_{*XY}^{\prime}\Sigma_{*XX}^{-1}\Sigma_{*XY}$. The symmetry of $\Sigma_{*}^{-1}$ implies that $\beta_{*}\Sigma_{*E}^{-1}=(\eta_{*}\Delta^{-1}_{*})^{\prime}$ so

\beta_{*}=\Delta^{-1}_{*}\eta_{*}^{\prime}\Sigma_{*E}. \qquad (11)

Using the Woodbury identity,

\Sigma_{*E}^{-1}=(\Sigma_{*YY}-\Sigma^{\prime}_{*XY}\Sigma_{*XX}^{-1}\Sigma_{*XY})^{-1}
=\Sigma_{*YY}^{-1}+\Sigma_{*YY}^{-1}\Sigma^{\prime}_{*XY}\left(\Sigma_{*XX}-\Sigma_{*XY}\Sigma_{*YY}^{-1}\Sigma^{\prime}_{*XY}\right)^{-1}\Sigma_{*XY}\Sigma_{*YY}^{-1}
=\Sigma_{*YY}^{-1}+\eta_{*}\Delta_{*}^{-1}\eta_{*}^{\prime}. \qquad (12)

Using the inverse of the expression above in (11) establishes the result. ∎

In our proof of Proposition 2, we use the matrix inequality

\|A^{(1)}A^{(2)}A^{(3)}-B^{(1)}B^{(2)}B^{(3)}\|\leq\sum_{j=1}^{3}\|A^{(j)}-B^{(j)}\|\prod_{k\neq j}\|B^{(k)}\|+\sum_{j=1}^{3}\|B^{(j)}\|\prod_{k\neq j}\|A^{(k)}-B^{(k)}\|+\prod_{j=1}^{3}\|A^{(j)}-B^{(j)}\|. \qquad (A.1)

Bickel and Levina (2008) used (A.1) to prove their Theorem 3.

Proof of Proposition 2.

From (12) in the proof of Proposition 1, $\Sigma_{*E}^{-1}=\Sigma_{*YY}^{-1}+\eta_{*}\Delta_{*}^{-1}\eta_{*}^{\prime}$. Define $\hat{\Sigma}_{E}^{-1}=\hat{\Sigma}_{YY}^{-1}+\hat{\eta}\hat{\Delta}^{-1}\hat{\eta}^{\prime}$. Applying (A.1),

\|\hat{\beta}-\beta_{*}\|=\|\hat{\Delta}^{-1}\hat{\eta}^{\prime}\hat{\Sigma}_{E}-\Delta_{*}^{-1}\eta_{*}^{\prime}\Sigma_{*E}\|
\leq\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\eta_{*}\|\cdot\|\Sigma_{*E}\|+\|\hat{\eta}-\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\cdot\|\Sigma_{*E}\|+\|\hat{\Sigma}_{E}-\Sigma_{*E}\|\cdot\|\Delta_{*}^{-1}\|\cdot\|\eta_{*}\|
+\|\Delta_{*}^{-1}\|\cdot\|\hat{\eta}-\eta_{*}\|\cdot\|\hat{\Sigma}_{E}-\Sigma_{*E}\|+\|\eta_{*}\|\cdot\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\hat{\Sigma}_{E}-\Sigma_{*E}\|
+\|\Sigma_{*E}\|\cdot\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\hat{\eta}-\eta_{*}\|+\|\hat{\eta}-\eta_{*}\|\cdot\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\hat{\Sigma}_{E}-\Sigma_{*E}\|. \qquad (14)

We will show that the third term in (14) dominates the others. We continue by deriving its bound. Employing a matrix identity used by Cai et al. (2010), we write $\hat{\Sigma}_{E}-\Sigma_{*E}=\Sigma_{*E}(\Sigma_{*E}^{-1}-\hat{\Sigma}_{E}^{-1})\hat{\Sigma}_{E}$, so

\|\hat{\Sigma}_{E}-\Sigma_{*E}\|\leq\|\hat{\Sigma}_{E}\|\cdot\|\Sigma_{*E}\|\cdot\|\hat{\Sigma}_{E}^{-1}-\Sigma_{*E}^{-1}\|. \qquad (15)

Using the triangle inequality and (A.1),

\|\hat{\Sigma}_{E}^{-1}-\Sigma_{*E}^{-1}\|\leq\|\hat{\Sigma}_{YY}^{-1}-\Sigma_{*YY}^{-1}\|+\|\hat{\eta}\hat{\Delta}^{-1}\hat{\eta}^{\prime}-\eta_{*}\Delta_{*}^{-1}\eta_{*}^{\prime}\|
\leq\|\hat{\Sigma}_{YY}^{-1}-\Sigma_{*YY}^{-1}\|+2\|\hat{\eta}-\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\cdot\|\eta_{*}\|+\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\eta_{*}\|^{2}
+2\|\eta_{*}\|\cdot\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\hat{\eta}-\eta_{*}\|+\|\Delta_{*}^{-1}\|\cdot\|\hat{\eta}-\eta_{*}\|^{2}+\|\hat{\eta}-\eta_{*}\|^{2}\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|
=O_{P}\left(c_{n}+a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}\right). \qquad (16)

Since $\varphi_{\min}(\Sigma_{*YY}^{-1})\geq K$ and $\Delta_{*}^{-1}$ is positive definite, Weyl's eigenvalue inequality implies that $\varphi_{\min}(\Sigma_{*E}^{-1})\geq K$ so

\|\Sigma_{*E}\|=\varphi_{\min}^{-1}(\Sigma_{*E}^{-1})\leq 1/K. \qquad (17)

Also,

\|\hat{\Sigma}_{E}\|=\varphi_{\min}^{-1}(\hat{\Sigma}_{E}^{-1})=O_{P}(1) \qquad (18)

because $\varphi_{\min}(\Sigma_{*E}^{-1})\geq K$, $\hat{\Sigma}_{E}$ is positive definite, and $a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}+c_{n}=o(1)$ in (16). Using (16), (17), and (18) in (15),

\|\hat{\Sigma}_{E}-\Sigma_{*E}\|=O_{P}\left(a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}+c_{n}\right).

We then see that the third term in (14) dominates and

\|\hat{\beta}-\beta_{*}\|=O_{P}\left\{\left(a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}+c_{n}\right)\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\right\}
=O_{P}\left(a_{n}\|\eta_{*}\|^{2}\|\Delta_{*}^{-1}\|^{2}+b_{n}\|\eta_{*}\|^{3}\|\Delta_{*}^{-1}\|+c_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\right).

References

  • Adragni, K. P. and Cook, R. D. (2009). Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 367(1906):4385–4405.
  • Anderson, R. L. and Bancroft, T. A. (1952). Statistical Theory in Research.
  • Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research, 9:485–516.
  • Bhadra, A. and Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2):447–457.
  • Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Annals of Statistics, 36(1):199–227.
  • Breiman, L. and Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54.
  • Cai, T. T., Zhang, C.-H., and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. Annals of Statistics, 38:2118–2144.
  • Chen, L. and Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500):1533–1545.
  • Cook, R. D., Forzani, L., and Rothman, A. J. (2013). Prediction in abundant high-dimensional linear regression. Electronic Journal of Statistics, 7:3059–3088.
  • Cook, R. D., Li, B., and Chiaromonte, F. (2010). Envelope models for parsimonious and efficient multivariate linear regression (with discussion). Statistica Sinica, 20:927–1010.
  • Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
  • Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. K. (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, volume 24, pages 2330–2338. MIT Press, Cambridge, MA.
  • Huang, J., Liu, N., Pourahmadi, M., and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1):85–98.
  • Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264.
  • Izenman, A. J. (2009). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer.
  • Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrices estimation. Annals of Statistics, 37:4254–4278.
  • Lee, W. and Liu, Y. (2012). Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 111:241–255.
  • Mukherjee, A. and Zhu, J. (2011). Reduced rank ridge regression and its kernel extensions. Statistical Analysis and Data Mining, 4(6):612–622.
  • Obozinski, G., Taskar, B., and Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252.
  • Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R., and Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics, 4(1):53.
  • Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.
  • Reinsel, G. C. and Velu, R. P. (1998). Multivariate Reduced-rank Regression. Springer.
  • Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.
  • Rothman, A. J., Levina, E., and Zhu, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962.
  • Su, Z. and Cook, R. D. (2011). Partial envelopes for efficient estimation in multivariate linear regression. Biometrika, 98:133–146.
  • Su, Z. and Cook, R. D. (2012). Inner envelopes: Efficient estimation in multivariate linear regression. Biometrika, 99:687–702.
  • Su, Z. and Cook, R. D. (2013). Scaled envelopes: Scale invariant and efficient estimation in multivariate linear regression. Biometrika, 100:921–938.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288.
  • Turlach, B. A., Venables, W. N., and Wright, S. J. (2005). Simultaneous variable selection. Technometrics, 47(3):349–363.
  • Witten, D. M. and Tibshirani, R. (2009). Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):615–636.
  • Yuan, M. (2008). Efficient computation of l1 regularized estimates in Gaussian graphical models. Journal of Computational and Graphical Statistics, 17(4):809–826.
  • Yuan, M., Ekici, A., Lu, Z., and Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3):329–346.
  • Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.