
Indirect multivariate response linear regression

Aaron J. Molstad
School of Statistics
University of Minnesota
molst029@umn.edu
   Adam J. Rothman
School of Statistics
University of Minnesota
arothman@umn.edu
(July 16, 2015)
Abstract

We propose a new class of estimators of the multivariate response linear regression coefficient matrix that exploits the assumption that the response and predictors have a joint multivariate Normal distribution. This allows us to indirectly estimate the regression coefficient matrix through shrinkage estimation of the parameters of the inverse regression, or the conditional distribution of the predictors given the responses. We establish a convergence rate bound for estimators in our class and we study two examples. The first example estimator exploits an assumption that the inverse regression’s coefficient matrix is sparse. The second example estimator exploits an assumption that the inverse regression’s coefficient matrix is rank deficient. These estimators do not require the popular assumption that the forward regression coefficient matrix is sparse or has small Frobenius norm. Using simulation studies, we show that our example estimators outperform relevant competitors for some data generating models.

1 Introduction

Some statistical applications require the modeling of a multivariate response. Let $y_{i}\in\mathbb{R}^{q}$ be the measurement of the $q$-variate response for the $i$th subject and let $x_{i}\in\mathbb{R}^{p}$ be the nonrandom values of the $p$ predictors for the $i$th subject $(i=1,\dots,n)$. The multivariate response linear regression model assumes that $y_{i}$ is a realization of the random vector

Y_{i}=\mu_{*}+\beta_{*}^{\prime}x_{i}+\varepsilon_{i},\quad i=1,\ldots,n, \qquad (1)

where $\mu_{*}\in\mathbb{R}^{q}$ is the unknown intercept, $\beta_{*}$ is the unknown $p$ by $q$ regression coefficient matrix, and $\varepsilon_{1},\ldots,\varepsilon_{n}$ are independent copies of a mean zero random vector with covariance matrix $\Sigma_{*E}$.

The ordinary least squares estimator of $\beta_{*}$ is

\hat{\beta}^{({\rm OLS})}=\operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}, \qquad (2)

where $\|\cdot\|_{F}$ is the Frobenius norm, $\mathbb{R}^{p\times q}$ is the set of real valued $p$ by $q$ matrices, $\mathbb{Y}$ is the $n$ by $q$ matrix with $i$th row $(Y_{i}-n^{-1}\sum_{i=1}^{n}Y_{i})^{\prime}$, and $\mathbb{X}$ is the $n$ by $p$ matrix with $i$th row $(x_{i}-n^{-1}\sum_{i=1}^{n}x_{i})^{\prime}$ $(i=1,\ldots,n)$. It is well known that $\hat{\beta}^{({\rm OLS})}$ is the maximum likelihood estimator of $\beta_{*}$ when $\varepsilon_{1},\ldots,\varepsilon_{n}$ are independent and identically distributed $N_{q}(0,\Sigma_{*E})$ and the corresponding maximum likelihood estimator of $\Sigma_{*E}^{-1}$ exists.

Many shrinkage estimators of $\beta_{*}$ have been proposed by penalizing the optimization in (2). Some of these estimators simultaneously estimate $\beta_{*}$ and remove irrelevant predictors (Turlach et al., 2005; Obozinski et al., 2010; Peng et al., 2010). Others encourage an estimator of reduced rank (Yuan et al., 2007; Chen and Huang, 2012).

Under the restriction that $\varepsilon_{1},\ldots,\varepsilon_{n}$ are independent and identically distributed $N_{q}(0,\Sigma_{*E})$, shrinkage estimators of $\beta_{*}$ that penalize or constrain the minimization of the negative loglikelihood have been proposed. These methods simultaneously estimate $\beta_{*}$ and $\Sigma_{*E}^{-1}$. Examples include maximum likelihood reduced rank regression (Izenman, 1975; Reinsel and Velu, 1998), envelope models (Cook et al., 2010; Su and Cook, 2011, 2012, 2013), and multivariate regression with covariance estimation (Rothman et al., 2010; Lee and Liu, 2012; Bhadra and Mallick, 2013).

To fit (1) with these shrinkage estimators, one exploits explicit assumptions about $\beta_{*}$, but these may be unreasonable in some applications. As an alternative, we propose an indirect method to fit (1) without making explicit assumptions about $\beta_{*}$. We exploit the assumption that the response and predictors have a joint multivariate Normal distribution and we employ shrinkage estimators of the parameters of the conditional distribution of the predictors given the response. Our method provides an alternative indirect estimator of $\beta_{*}$, which may be suitable when the existing shrinkage estimators are inadequate.

2 A new class of indirect estimators of $\beta_{*}$

2.1 Class definition

We assume that the measured predictor and response pairs $(x_{1},y_{1}),\dots,(x_{n},y_{n})$ are a realization of $n$ independent copies of $(X,Y)$, where $(X^{\prime},Y^{\prime})^{\prime}\sim N_{p+q}(\mu_{*},\Sigma_{*})$. We also assume that $\Sigma_{*}$ is positive definite. Define the marginal parameters through the following partitions:

\mu_{*}=\left(\begin{array}{c}\mu_{*X}\\ \mu_{*Y}\end{array}\right),\quad\Sigma_{*}=\left(\begin{array}{cc}\Sigma_{*XX}&\Sigma_{*XY}\\ \Sigma_{*XY}^{\prime}&\Sigma_{*YY}\end{array}\right).

Our goal is to estimate the multivariate regression coefficient matrix $\beta_{*}=\Sigma_{*XX}^{-1}\Sigma_{*XY}$ in the forward regression model

(Y|X=x)\sim N_{q}(\mu_{*Y}+\beta_{*}^{\prime}(x-\mu_{*X}),\Sigma_{*E}),

without assuming that $\beta_{*}$ is sparse or that $\|\beta_{*}\|_{F}^{2}$ is small. To do this we will estimate the inverse regression's coefficient matrix $\eta_{*}=\Sigma_{*YY}^{-1}\Sigma_{*XY}^{\prime}$ and the inverse regression's error precision matrix $\Delta_{*}^{-1}$ in the inverse regression model

(X|Y=y)\sim N_{p}(\mu_{*X}+\eta_{*}^{\prime}(y-\mu_{*Y}),\Delta_{*}).

We connect the parameters of the inverse regression model to $\beta_{*}$ with the following proposition.

Proposition 1.

If $\Sigma_{*}$ is positive definite, then

\beta_{*}=\Delta_{*}^{-1}\eta_{*}^{\prime}\left(\Sigma_{*YY}^{-1}+\eta_{*}\Delta_{*}^{-1}\eta_{*}^{\prime}\right)^{-1}. \qquad (3)

We prove Proposition 1 in Appendix A.1. This result leads us to propose a class of estimators of $\beta_{*}$ defined by

\hat{\beta}=\hat{\Delta}^{-1}\hat{\eta}^{\prime}(\hat{\Sigma}_{YY}^{-1}+\hat{\eta}\hat{\Delta}^{-1}\hat{\eta}^{\prime})^{-1}, \qquad (4)

where $\hat{\eta}$, $\hat{\Delta}$, and $\hat{\Sigma}_{YY}$ are user-selected estimators of $\eta_{*}$, $\Delta_{*}$, and $\Sigma_{*YY}$. If $n>\max(p,q)$ and the ordinary sample estimators are used for $\hat{\eta}$, $\hat{\Delta}$, and $\hat{\Sigma}_{YY}$, then $\hat{\beta}$ is equivalent to $\hat{\beta}^{({\rm OLS})}$.

We propose to use shrinkage estimators of $\eta_{*}$, $\Delta_{*}^{-1}$, and $\Sigma_{*YY}^{-1}$ in (4). This gives us the potential to indirectly fit an unparsimonious forward regression model by fitting a parsimonious inverse regression model. For example, suppose that $\eta_{*}$ and $\Delta_{*}^{-1}$ are sparse, but $\beta_{*}$ is dense. To fit the inverse regression model, we could use any of the forward regression shrinkage estimators discussed in Section 1.
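
To make the plug-in step in (4) concrete, the following sketch (in Python with numpy; the function and variable names are ours, not part of the paper) computes the indirect estimator from user-supplied estimates of $\eta_{*}$, $\Delta_{*}^{-1}$, and $\Sigma_{*YY}^{-1}$, and numerically checks the claim above that (4) reduces to $\hat{\beta}^{({\rm OLS})}$ when the ordinary sample estimators are plugged in and $n>\max(p,q)$.

```python
import numpy as np

def indirect_beta(eta_hat, delta_inv_hat, sigma_yy_inv_hat):
    """Indirect estimator (4): Delta^{-1} eta' (Sigma_YY^{-1} + eta Delta^{-1} eta')^{-1}."""
    middle = sigma_yy_inv_hat + eta_hat @ delta_inv_hat @ eta_hat.T
    return delta_inv_hat @ eta_hat.T @ np.linalg.inv(middle)

# Check: with the ordinary sample plug-ins and n > max(p, q), (4) reproduces OLS.
rng = np.random.default_rng(0)
n, p, q = 100, 5, 3
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal((p, q)) + rng.standard_normal((n, q))
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)     # centered data matrices
eta_hat = np.linalg.solve(Yc.T @ Yc, Yc.T @ Xc)     # sample inverse regression coefficients (q x p)
resid = Xc - Yc @ eta_hat
delta_inv_hat = np.linalg.inv(resid.T @ resid / n)  # sample inverse regression error precision
sigma_yy_inv_hat = np.linalg.inv(Yc.T @ Yc / n)     # sample precision of the responses
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)    # forward OLS coefficients (p x q)
assert np.allclose(indirect_beta(eta_hat, delta_inv_hat, sigma_yy_inv_hat), beta_ols)
```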

2.2 Related work

Lee and Liu (2012) proposed an estimator of $\beta_{*}$ that also exploits the assumption that $(X^{\prime},Y^{\prime})^{\prime}$ is multivariate Normal; however, unlike our approach that makes no explicit assumptions about $\beta_{*}$, their approach assumes that both $\Sigma_{*}^{-1}$ and $\beta_{*}$ are sparse.

Modeling the inverse regression is a well-known idea in multivariate analysis. For example, when $Y$ is categorical, quadratic discriminant analysis models $(X|Y=y)$ as $p$-variate Normal. There are also many examples of modeling the inverse regression in the sufficient dimension reduction literature (Adragni and Cook, 2009).

The most closely related work to ours is that by Cook et al. (2013). They proposed indirect estimators of $\beta_{*}$ based on modeling the inverse regression in the special case when the response is univariate, i.e. $q=1$. Under the same multivariate Normal assumption on $(X^{\prime},Y^{\prime})^{\prime}$ that we make, Cook et al. (2013) showed that

\beta_{*}=\frac{1}{1+\Sigma_{*XY}^{\prime}\Delta_{*}^{-1}\Sigma_{*XY}/\Sigma_{*YY}}\Delta_{*}^{-1}\Sigma_{*XY}. \qquad (5)

They proposed estimators of $\beta_{*}$ by replacing $\Sigma_{*XY}$ and $\Sigma_{*YY}$ in the right hand side of (5) with their usual sample estimators, and by replacing $\Delta_{*}^{-1}$ with a shrinkage estimator. This class of estimators was designed to exploit an abundant signal rate in the forward univariate response regression when $p>n$.

3 Asymptotic analysis

We present a convergence rate bound for the indirect estimator of $\beta_{*}$ defined by (4). Our bound allows $p$ and $q$ to grow with the sample size $n$. In the following proposition, $\|\cdot\|$ is the spectral norm and $\varphi_{\min}(\cdot)$ is the minimum eigenvalue.

Proposition 2.

Suppose that the following conditions are true: (i) $\Sigma_{*}$ is positive definite for all $p+q$; (ii) the estimator $\hat{\Sigma}_{YY}^{-1}$ is positive definite for all $q$; (iii) the estimator $\hat{\Delta}^{-1}$ is positive definite for all $p$; (iv) there exists a positive constant $K$ such that $\varphi_{\min}(\Sigma_{*YY}^{-1})\geq K$ for all $q$; and (v) there exist sequences $\{a_{n}\}$, $\{b_{n}\}$, and $\{c_{n}\}$ such that $\|\hat{\eta}-\eta_{*}\|=O_{P}(a_{n})$, $\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|=O_{P}(b_{n})$, $\|\hat{\Sigma}_{YY}^{-1}-\Sigma_{*YY}^{-1}\|=O_{P}(c_{n})$, and $a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}+c_{n}\rightarrow 0$ as $n\rightarrow\infty$. Then

\|\hat{\beta}-\beta_{*}\|=O_{P}\left(a_{n}\|\eta_{*}\|^{2}\|\Delta_{*}^{-1}\|^{2}+b_{n}\|\eta_{*}\|^{3}\|\Delta_{*}^{-1}\|+c_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\right).

We prove Proposition 2 in Appendix A.1. We used the spectral norm because it is compatible with the convergence rate bounds established for sparse inverse covariance estimators (Rothman et al., 2008; Lam and Fan, 2009; Ravikumar et al., 2011).

If the inverse regression is parsimonious in the sense that $\|\eta_{*}\|$ and $\|\Delta_{*}^{-1}\|$ are bounded, then the bound in Proposition 2 simplifies to $\|\hat{\beta}-\beta_{*}\|=O_{P}(a_{n}+b_{n}+c_{n})$. From an asymptotic perspective, it is not surprising that the indirect estimator of $\beta_{*}$ is only as good as its worst plug-in estimator. We explore finite sample performance in Section 5.

4 Example estimators in our class

4.1 Sparse inverse regression

We now describe an estimator of the forward regression coefficient matrix $\beta_{*}$ defined by (4) that exploits zeros in the inverse regression's coefficient matrix $\eta_{*}$, zeros in the inverse regression's error precision matrix $\Delta_{*}^{-1}$, and zeros in the precision matrix of the responses $\Sigma_{*YY}^{-1}$. We estimate $\eta_{*}$ with

\hat{\eta}^{{\rm L1}}=\operatorname*{arg\ min}_{\eta\in\mathbb{R}^{q\times p}}\left\{\|\mathbb{X}-\mathbb{Y}\eta\|_{F}^{2}+\sum_{j=1}^{p}\lambda_{j}\sum_{m=1}^{q}|\eta_{mj}|\right\}, \qquad (6)

which separates into $p$ $L_{1}$-penalized least-squares regressions (Tibshirani, 1996): the first predictor regressed on the response through the $p$th predictor regressed on the response. We select $\lambda_{j}$ with 5-fold cross-validation, minimizing squared prediction error totaled over the folds, in the regression of the $j$th predictor on the response $(j=1,\ldots,p)$. This allows us to estimate the columns of $\eta_{*}$ in parallel.
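
As an illustration, a minimal sketch of this step, assuming scikit-learn is available (the function below and the use of LassoCV are our choices, not prescribed by the paper; LassoCV scales the squared error by $1/(2n)$, so its cross-validated penalty plays the role of $\lambda_{j}$ only up to that rescaling):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_eta_lasso(Xc, Yc, cv=5):
    """Estimate the q x p inverse regression coefficient matrix eta as in (6):
    regress each centered predictor column on the centered responses with an
    L1 penalty chosen by cv-fold cross-validation (one penalty per predictor)."""
    p, q = Xc.shape[1], Yc.shape[1]
    eta_hat = np.zeros((q, p))
    for j in range(p):  # the p columns can be fit independently, hence in parallel
        fit = LassoCV(cv=cv, fit_intercept=False).fit(Yc, Xc[:, j])
        eta_hat[:, j] = fit.coef_
    return eta_hat
```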

We estimate $\Delta_{*}^{-1}$ and $\Sigma_{*YY}^{-1}$ with $L_{1}$-penalized Normal likelihood precision matrix estimation (Yuan and Lin, 2007; Banerjee et al., 2008). Let $\hat{\Sigma}_{\gamma,S}^{-1}$ be a generic version of this estimator with tuning parameter $\gamma$ and input $p$ by $p$ sample covariance matrix $S$:

\hat{\Sigma}_{\gamma,S}^{-1}=\operatorname*{arg\ min}_{\Omega\in\mathbb{S}_{+}^{p}}\left\{{\rm tr}(\Omega S)-\log|\Omega|+\gamma\sum_{j\neq k}|\omega_{jk}|\right\}, \qquad (7)

where $\mathbb{S}_{+}^{p}$ is the set of symmetric and positive definite $p$ by $p$ matrices. There are many algorithms that solve (7). Two good choices are the graphical lasso algorithm (Yuan, 2008; Friedman et al., 2008) and the QUIC algorithm (Hsieh et al., 2011). We select $\gamma$ with 5-fold cross-validation maximizing a validation likelihood criterion (Huang et al., 2006):

\hat{\gamma}=\operatorname*{arg\ min}_{\gamma\in\mathcal{G}}\sum_{k=1}^{5}\left\{{\rm tr}\left(\hat{\Sigma}_{\gamma,S_{(-k)}}^{-1}S_{(k)}\right)-\log\left|\hat{\Sigma}_{\gamma,S_{(-k)}}^{-1}\right|\right\}, \qquad (8)

where $\mathcal{G}$ is a user-selected finite subset of the non-negative real line, $S_{(-k)}$ is the sample covariance matrix from the observations outside the $k$th fold, and $S_{(k)}$ is the sample covariance matrix from the observations in the $k$th fold centered by the sample mean of the observations outside the $k$th fold. We estimate $\Delta_{*}^{-1}$ using (7) with its tuning parameter selected by (8) and $S=(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})^{\prime}(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})/n$. Similarly, we estimate $\Sigma_{*YY}^{-1}$ using (7) with its tuning parameter selected by (8) and $S=\mathbb{Y}^{\prime}\mathbb{Y}/n$.
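
A sketch of this tuning step, assuming scikit-learn's graphical_lasso routine as the solver for (7) (one of several possible solvers; the function and names below are ours): it accumulates the validation likelihood criterion (8) over a user-supplied grid of candidate values of $\gamma$ and then refits on the full sample covariance matrix at the selected value.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def precision_by_validation_likelihood(Z, grid, n_folds=5, seed=0):
    """Estimate a precision matrix with (7), selecting gamma by criterion (8).
    Z is the n x d data matrix whose sample covariance plays the role of S
    (e.g. the inverse regression residuals X - Y eta_hat, or the responses Y)."""
    n = Z.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = np.zeros(len(grid))
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        mean_train = Z[train].mean(axis=0)
        Zt = Z[train] - mean_train
        S_train = Zt.T @ Zt / len(train)
        Zv = Z[fold] - mean_train              # validation data centered by the training mean
        S_val = Zv.T @ Zv / len(fold)
        for i, gamma in enumerate(grid):
            _, omega = graphical_lasso(S_train, alpha=gamma)
            scores[i] += np.trace(omega @ S_val) - np.linalg.slogdet(omega)[1]
    best_gamma = grid[int(np.argmin(scores))]
    Zc = Z - Z.mean(axis=0)
    _, omega_hat = graphical_lasso(Zc.T @ Zc / n, alpha=best_gamma)
    return omega_hat, best_gamma
```

In the notation above, calling this function with the residual matrix $\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}}$ gives the estimate of $\Delta_{*}^{-1}$, and calling it with $\mathbb{Y}$ gives the estimate of $\Sigma_{*YY}^{-1}$.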

4.2 Reduced rank inverse regression

We propose indirect estimators of $\beta_{*}$ that exploit the assumption that the inverse regression's coefficient matrix $\eta_{*}$ is rank deficient. We have the following simple proposition that links rank deficiency in $\eta_{*}$ and its estimator to $\beta_{*}$ and its indirect estimator.

Proposition 3.

If $\Sigma_{*}$ is positive definite, then ${\rm rank}(\beta_{*})={\rm rank}(\eta_{*})$. In addition, if $\hat{\Sigma}_{YY}^{-1}$ and $\hat{\Delta}^{-1}$ are positive definite in the indirect estimator $\hat{\beta}$ defined by (4), then ${\rm rank}(\hat{\beta})={\rm rank}(\hat{\eta})$.

The proof of this proposition is simple, so we omit it to save space.

We propose the following two example reduced rank indirect estimators of $\beta_{*}$:

  1.

    Estimate $\Sigma_{*YY}$ with $\mathbb{Y}^{\prime}\mathbb{Y}/n$ and estimate $(\eta_{*},\Delta_{*}^{-1})$ with Normal likelihood reduced rank inverse regression:

    (\hat{\eta}^{(r)},\hat{\Delta}^{-1(r)})=\operatorname*{arg\ min}_{(\eta,\Omega)\in\mathbb{R}^{q\times p}\times\mathbb{S}_{+}^{p}}\left[n^{-1}{\rm tr}\left\{(\mathbb{X}-\mathbb{Y}\eta)^{\prime}(\mathbb{X}-\mathbb{Y}\eta)\Omega\right\}-\log{\rm det}(\Omega)\right]\quad{\rm subject\ to}\ {\rm rank}(\eta)=r, \qquad (9)

    where $r$ is selected from $\{0,\ldots,\min(p,q)\}$. The solution to the optimization in (9) is available in closed form (Reinsel and Velu, 1998).

  2.

    Estimate $\eta_{*}$ with $\hat{\eta}^{(r)}$ defined in (9), estimate $\Sigma_{*YY}^{-1}$ with (7) using $S=\mathbb{Y}^{\prime}\mathbb{Y}/n$, and estimate $\Delta_{*}^{-1}$ with (7) using $S=(\mathbb{X}-\mathbb{Y}\hat{\eta}^{(r)})^{\prime}(\mathbb{X}-\mathbb{Y}\hat{\eta}^{(r)})/n$.

Both example indirect reduced rank estimators of $\beta_{*}$ are formed by plugging the estimators of $\eta_{*}$, $\Delta_{*}^{-1}$, and $\Sigma_{*YY}$ into (4). The first estimator is likelihood-based and the second estimator exploits sparsity in $\Sigma_{*YY}^{-1}$ and $\Delta_{*}^{-1}$. Neither estimator is defined when $\min(p,q)>n$. In this case, which we do not address, a regularized reduced rank estimator of $\eta_{*}$ could be used instead of the estimator defined in (9), e.g. the factor estimation and selection estimator (Yuan et al., 2007) or the reduced rank ridge regression estimator (Mukherjee and Zhu, 2011).
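
For reference, the sketch below implements a simplified version of the first step: the rank-constrained least-squares fit of the inverse regression, i.e. (9) with $\Omega$ held fixed at a multiple of the identity. This simplification is ours; the full likelihood solution of (9) additionally weights by the residual precision and is given in closed form by Reinsel and Velu (1998).

```python
import numpy as np

def reduced_rank_inverse_regression(Xc, Yc, r):
    """Rank-constrained least-squares fit of X ~ Y eta with rank(eta) = r
    (the unweighted analogue of (9)).  Returns (eta_r, delta_inv_hat)."""
    n = Xc.shape[0]
    eta_ols = np.linalg.solve(Yc.T @ Yc, Yc.T @ Xc)   # unrestricted q x p estimate
    fitted = Yc @ eta_ols
    # Best rank-r approximation of the fitted values (Eckart-Young):
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    proj = Vt[:r].T @ Vt[:r]                          # projection onto the top r right singular vectors
    eta_r = eta_ols @ proj                            # rank-r coefficient estimate
    resid = Xc - Yc @ eta_r
    delta_inv_hat = np.linalg.inv(resid.T @ resid / n)  # plug-in residual precision (needs n > p)
    return eta_r, delta_inv_hat
```

The returned estimates can then be combined with an estimate of $\Sigma_{*YY}^{-1}$ in (4); by Proposition 3, the resulting indirect estimator of $\beta_{*}$ also has rank $r$.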

5 Simulations

5.1 Sparse inverse regression simulation

We compared the following indirect estimators of $\beta_{*}$ when the inverse regression's coefficient matrix $\eta_{*}$ is sparse:

  • $I_{L1}$.

    This is the indirect estimator proposed in Section 4.1.

  • $I_{S}$.

    This is an indirect estimator defined by (4) with $\hat{\eta}$ defined by (6), $\hat{\Sigma}_{YY}=\mathbb{Y}^{\prime}\mathbb{Y}/n$, and $\hat{\Delta}=(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})^{\prime}(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})/n$.

  • $O_{\Delta}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (6), $\hat{\Sigma}_{YY}^{-1}$ defined by (7), and $\hat{\Delta}^{-1}=\Delta^{-1}_{*}$.

  • $O$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (6), $\hat{\Sigma}_{YY}^{-1}=\Sigma_{*YY}^{-1}$, and $\hat{\Delta}^{-1}=\Delta^{-1}_{*}$.

  • $O_{Y}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (6), $\hat{\Sigma}_{YY}^{-1}=\Sigma_{*YY}^{-1}$, and $\hat{\Delta}^{-1}$ defined by (7).

We also included the following forward regression estimators of $\beta_{*}$:

  • OLS/MP.

    This is the ordinary least squares estimator defined by $\operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}$. When $n\leq p$, we use the solution $\mathbb{X}^{-}\mathbb{Y}$, where $\mathbb{X}^{-}$ is the Moore-Penrose generalized inverse of $\mathbb{X}$.

  • R.

    This is the ridge penalized least squares estimator defined by

    \operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\left(\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}+\lambda\|\beta\|_{F}^{2}\right).
  • $\ell_{2}$.

    This is an alternative ridge penalized least squares estimator defined by

    \operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\left(\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}+\sum_{m=1}^{q}\lambda_{m}\sum_{j=1}^{p}\beta_{jm}^{2}\right),

    where a separate tuning parameter is used for each response.

We selected the tuning parameters for uses of (6) with 5-fold cross-validation, minimizing validation prediction error on the inverse regression. Tuning parameters for $\ell_{2}$ and R were selected with 5-fold cross-validation, minimizing validation prediction error on the forward regression. We selected tuning parameters for uses of (7) with (8). The candidate set of tuning parameters was $\left\{10^{-8},10^{-7.5},\dots,10^{7.5},10^{8}\right\}$.

For 50 independent replications, we generated a realization of $n$ independent copies of $(X^{\prime},Y^{\prime})^{\prime}$, where $Y\sim N_{q}(0,\Sigma_{*YY})$ and $(X|Y=y)\sim N_{p}(\eta_{*}^{\prime}y,\Delta_{*})$. The $(i,j)$th entry of $\Sigma_{*YY}$ was set to $\rho_{Y}^{|i-j|}$ and the $(i,j)$th entry of $\Delta_{*}$ was set to $\rho_{\Delta}^{|i-j|}$. We set $\eta_{*}=Z\circ A$, where $\circ$ denotes the element-wise product: $Z$ had entries independently drawn from $N(0,1)$ and $A$ had entries independently drawn from the Bernoulli distribution with nonzero probability $s_{*}$. This model is ideal for $I_{L1}$ because $\Delta_{*}^{-1}$ and $\Sigma_{*YY}^{-1}$ are both sparse. Every entry in the corresponding randomly generated $\beta_{*}$ is nonzero with high probability, but the magnitudes of these entries are small. This motivated us to compare our indirect estimators of $\beta_{*}$ to the ridge-penalized least squares forward regression estimators R and $\ell_{2}$.

We evaluated performance with model error (Breiman and Friedman, 1997; Yuan et al., 2007), which is defined by $\|\Sigma_{*XX}^{1/2}(\hat{\beta}-\beta_{*})\|_{F}^{2}$.
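
For concreteness, one replication of this data generating model can be sketched as follows (the helper names are ours). The true $\beta_{*}$ needed for the model error metric follows from Proposition 1, and the marginal covariance of the predictors is $\Sigma_{*XX}=\Delta_{*}+\eta_{*}^{\prime}\Sigma_{*YY}\eta_{*}$.

```python
import numpy as np

def ar1_cov(dim, rho):
    """Matrix with (i, j) entry rho^{|i-j|}."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def generate_sparse_inverse_regression(n, p, q, rho_y, rho_delta, s, rng):
    sigma_yy = ar1_cov(q, rho_y)
    delta = ar1_cov(p, rho_delta)
    eta = rng.standard_normal((q, p)) * rng.binomial(1, s, size=(q, p))  # eta = Z o A
    Y = rng.multivariate_normal(np.zeros(q), sigma_yy, size=n)
    X = Y @ eta + rng.multivariate_normal(np.zeros(p), delta, size=n)    # X | Y = y ~ N_p(eta' y, Delta)
    # True forward coefficient matrix via Proposition 1, for the model error metric.
    delta_inv = np.linalg.inv(delta)
    beta_star = delta_inv @ eta.T @ np.linalg.inv(np.linalg.inv(sigma_yy) + eta @ delta_inv @ eta.T)
    sigma_xx = delta + eta.T @ sigma_yy @ eta
    return X, Y, beta_star, sigma_xx
```

For any estimate beta_hat, the model error above equals np.trace((beta_hat - beta_star).T @ sigma_xx @ (beta_hat - beta_star)).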

Table 1: Averages of model error from 50 replications when $n=100$, $p=20$, and $q=20$. All standard errors were less than or equal to 0.05.
$\rho_{Y}$ $\rho_{\Delta}$ $s_{*}$ $I_{L1}$ $O$ $O_{\Delta}$ $O_{Y}$ $I_{S}$ OLS $\ell_{2}$ R
0.7 0.0 0.1 0.61 0.32 0.53 0.40 1.35 2.10 1.23 1.22
0.7 0.5 0.1 0.72 0.39 0.59 0.51 1.30 1.91 1.29 1.30
0.7 0.7 0.1 0.76 0.45 0.65 0.56 1.27 1.73 1.27 1.29
0.7 0.9 0.1 0.83 0.66 0.85 0.64 1.26 1.35 1.05 1.09
0.0 0.9 0.1 0.81 0.87 0.87 0.79 2.04 2.34 1.26 1.87
0.5 0.9 0.1 0.96 0.76 0.99 0.74 1.63 1.84 1.36 1.49
0.9 0.9 0.1 0.46 0.39 0.47 0.36 0.63 0.62 0.48 0.48
0.7 0.9 0.3 0.60 0.53 0.65 0.46 0.83 0.67 0.64 0.63
0.7 0.9 0.5 0.48 0.37 0.48 0.37 0.65 0.53 0.52 0.51
0.7 0.9 0.7 0.42 0.29 0.39 0.31 0.55 0.46 0.45 0.44

We report the average model errors, based on these 50 replications, in Table 1. When $s_{*}=0.1$, the indirect estimators defined by (4) performed well for all choices of $\rho_{Y}$ and $\rho_{\Delta}$. Our proposed estimator $I_{L1}$ was competitive with other indirect estimators also defined by (4), even those that used some oracle information. As $s_{*}$ increased with $\rho_{Y}=0.7$ and $\rho_{\Delta}=0.9$ fixed, the forward regression estimators performed nearly as well as $I_{L1}$.

Table 2: Averages of model error from 50 replications when $n=50$, $p=60$, and $q=60$. All standard errors were 0.69 or less, except for MP, which had standard errors between 0.77 and 3.16.
$\rho_{Y}$ $\rho_{\Delta}$ $s_{*}$ $I_{L1}$ $O$ $O_{\Delta}$ $O_{Y}$ MP $\ell_{2}$ R
0.7 0.0 0.1 8.59 4.28 5.70 7.40 78.33 13.85 12.44
0.7 0.5 0.1 9.67 5.09 6.37 8.49 73.82 14.79 13.34
0.7 0.7 0.1 10.01 6.37 7.44 8.75 70.30 15.56 14.40
0.7 0.9 0.1 9.92 10.07 11.44 8.88 61.83 16.43 15.94
0.0 0.9 0.1 15.17 17.09 16.93 15.23 119.60 28.63 29.41
0.5 0.9 0.1 14.88 13.59 16.91 12.01 86.88 23.62 22.69
0.9 0.9 0.1 4.71 4.78 5.94 3.99 25.37 6.36 5.91
0.7 0.9 0.3 16.86 17.43 19.66 15.44 43.88 15.30 14.14
0.7 0.9 0.5 26.89 26.81 29.93 24.95 36.87 14.79 13.62
0.7 0.9 0.7 31.86 35.98 38.64 30.36 33.58 14.35 13.65

Similarly, Table 2 shows that when $s_{*}=0.1$, $I_{L1}$ outperforms all three forward regression estimators. However, unlike in the lower dimensional setting illustrated in Table 1, when $\eta_{*}$ is not sparse, i.e. $s_{*}\geq 0.3$, $I_{L1}$ is outperformed by the forward regression approaches. The part oracle method $O_{Y}$ that used the knowledge of $\Sigma_{*YY}^{-1}$ outperformed the other two part oracle indirect estimators $O$ and $O_{\Delta}$ when $\rho_{\Delta}=0.9$. Also, when $\rho_{\Delta}=0.9$, $I_{L1}$ was competitive with the part oracle estimators. Taken together, the results in Tables 1 and 2 suggest that when $\eta_{*}$ is very sparse, our proposed indirect estimator $I_{L1}$ may perform nearly as well as the part oracle indirect estimators and the forward regression estimators.

5.2 Reduced rank inverse regression simulation

We compared the performance of the following indirect reduced rank estimators of $\beta_{*}$:

  • $I_{ML}^{(r)}$.

    This is the likelihood-based indirect example estimator 1 proposed in Section 4.2.

  • $I^{(r)}$.

    This is the indirect example estimator 2 proposed in Section 4.2, which uses sparse estimators of $\Sigma_{*YY}^{-1}$ and $\Delta_{*}^{-1}$ in (4).

  • $O^{(r)}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (9), $\hat{\Delta}^{-1}=\Delta_{*}^{-1}$, and $\hat{\Sigma}_{YY}^{-1}=\Sigma_{*YY}^{-1}$.

  • $O^{(r)}_{\Delta}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (9), $\hat{\Delta}^{-1}$ defined by (7), and $\hat{\Sigma}_{YY}^{-1}=\Sigma_{*YY}^{-1}$.

  • $O^{(r)}_{Y}$.

    This is a part oracle indirect estimator defined by (4) with $\hat{\eta}$ defined by (9), $\hat{\Delta}^{-1}=\Delta_{*}^{-1}$, and $\hat{\Sigma}_{YY}^{-1}$ defined by (7).

We compared these indirect estimators to the following forward reduced rank regression estimator:

  • RR.

    This is the likelihood-based reduced rank regression (Izenman, 1975; Reinsel and Velu, 1998). The estimator of $\beta_{*}$ and the estimator of the forward regression's error precision matrix $\Sigma_{*E}^{-1}$ are defined by

    (\hat{\beta}^{(r)},\hat{\Sigma}_{E}^{-1(r)})=\operatorname*{arg\ min}_{(\beta,\Omega)\in\mathbb{R}^{p\times q}\times\mathbb{S}_{+}^{q}}\left[n^{-1}{\rm tr}\left\{(\mathbb{Y}-\mathbb{X}\beta)^{\prime}(\mathbb{Y}-\mathbb{X}\beta)\Omega\right\}-\log{\rm det}(\Omega)\right]\quad{\rm subject\ to}\ {\rm rank}(\beta)=r.

We selected the rank parameter $r$ for uses of (9) with 5-fold cross-validation, minimizing validation prediction error on the inverse regression. The rank parameter for RR was selected with 5-fold cross-validation, minimizing validation prediction error on the forward regression. We selected tuning parameters for uses of (7) with (8). The candidate set of tuning parameters was $\left\{10^{-8},10^{-7.5},\dots,10^{7.5},10^{8}\right\}$.

For 50 independent replications, we generated a realization of $n$ independent copies of $(X^{\prime},Y^{\prime})^{\prime}$, where $Y\sim N_{q}(0,\Sigma_{*YY})$ and $(X|Y=y)\sim N_{p}(\eta_{*}^{\prime}y,\Delta_{*})$. The $(i,j)$th entry of $\Sigma_{*YY}$ was set to $\rho_{Y}^{|i-j|}$ and the $(i,j)$th entry of $\Delta_{*}$ was set to $\rho_{\Delta}^{|i-j|}$. After specifying $r_{*}\leq\min(p,q)$, we set $\eta_{*}=PQ$, where $P\in\mathbb{R}^{q\times r_{*}}$ and $Q\in\mathbb{R}^{r_{*}\times p}$ had entries independently drawn from $N(0,1)$, so that $r_{*}={\rm rank}(\eta_{*})={\rm rank}(\beta_{*})$. As we did in the simulation in Section 5.1, we measured performance with model error.

Table 3: Averages of model error from 50 replications when $n=100$, $p=20$, and $q=20$. All standard errors were less than or equal to 0.05.
$\rho_{Y}$ $\rho_{\Delta}$ $r_{*}$ $I^{(r)}$ $O^{(r)}$ $O^{(r)}_{\Delta}$ $O^{(r)}_{Y}$ $I_{ML}^{(r)}$ OLS RR
0.7 0.0 10 0.33 0.04 0.86 0.75 0.64 1.38 0.64
0.7 0.5 10 0.34 0.04 0.86 0.74 0.60 1.31 0.60
0.7 0.7 10 0.31 0.03 0.86 0.80 0.62 1.32 0.61
0.7 0.9 10 0.31 0.02 0.85 0.88 0.60 1.30 0.61
0.0 0.9 10 0.15 0.03 1.00 1.77 1.22 2.61 1.21
0.5 0.9 10 0.42 0.01 1.11 1.36 0.90 1.97 0.89
0.9 0.9 10 0.12 0.01 0.32 0.30 0.22 0.46 0.22
0.7 0.9 4 0.35 0.02 1.73 2.61 0.49 3.12 0.49
0.7 0.9 8 0.35 0.01 1.15 1.33 0.68 1.73 0.65
0.7 0.9 12 0.31 0.04 0.64 0.59 0.55 0.96 0.53
0.7 0.9 16 0.25 0.08 0.30 0.20 0.44 0.50 0.42

We report the model errors, averaged over the 50 independent replications, in Table 3. Under every setting, $I^{(r)}$ outperformed all non-oracle competitors. When $r_{*}\leq 12$, $I^{(r)}$ outperformed both $O_{\Delta}^{(r)}$ and $O_{Y}^{(r)}$, which suggests that shrinkage estimation of $\Delta_{*}^{-1}$ and $\Sigma_{*YY}^{-1}$ was helpful. In each setting, $I_{ML}^{(r)}$ performed similarly to RR even though they estimate the parameters of different conditional distributions.

5.3 Reduced rank forward regression simulation

Our simulation studies in the previous sections used inverse regression data generating models. In this section, we compare the estimators from Section 5.2 using a forward regression data generating model.

For 50 independent replications, we generated a realization of $n$ independent copies of $(X^{\prime},Y^{\prime})^{\prime}$, where $X\sim N_{p}(0,\Sigma_{*XX})$ and $(Y|X=x)\sim N_{q}(\beta_{*}^{\prime}x,\Sigma_{*E})$. The $(i,j)$th entry of $\Sigma_{*XX}$ was set to $\rho_{X}^{|i-j|}$ and the $(i,j)$th entry of $\Sigma_{*E}$ was set to $\rho_{E}^{|i-j|}$. After specifying $r_{*}\leq\min(p,q)$, we set $\beta_{*}=ZQ$, where $Z\in\mathbb{R}^{p\times r_{*}}$ had entries independently drawn from $N(0,1)$ and $Q\in\mathbb{R}^{r_{*}\times q}$ had entries independently drawn from ${\rm Uniform}(-1/4,1/4)$. In this data generating model, neither $\Delta_{*}^{-1}$ nor $\Sigma_{*YY}^{-1}$ had entries equal to zero.

Table 4: Averages of model error from 50 replications when $n=100$, $p=20$, and $q=20$. All standard errors were less than or equal to 0.21.
$\rho_{X}$ $\rho_{E}$ $r_{*}$ $I^{(r)}$ $O^{(r)}$ $O^{(r)}_{\Delta}$ $O^{(r)}_{Y}$ $I_{ML}^{(r)}$ OLS RR
0.0 0.9 10 2.79 0.54 4.27 5.05 2.48 4.99 2.82
0.5 0.9 10 2.90 0.47 5.36 5.94 2.73 5.00 2.89
0.7 0.9 10 2.97 0.51 4.64 5.03 2.71 4.93 2.76
0.9 0.9 10 2.84 0.73 3.78 4.16 2.67 5.19 2.73
0.7 0.0 10 4.66 1.92 3.59 5.88 4.53 5.11 4.34
0.7 0.5 10 4.27 1.65 3.88 5.51 3.99 5.06 3.97
0.7 0.7 10 3.55 1.26 3.99 5.29 3.43 5.00 3.44
0.7 0.9 4 1.27 0.08 3.84 4.71 0.95 5.00 1.11
0.7 0.9 8 2.39 0.36 4.15 5.15 2.05 4.81 2.22
0.7 0.9 12 3.58 0.79 4.44 5.21 3.20 5.15 3.27
0.7 0.9 16 4.53 1.29 4.62 4.42 4.33 5.11 4.38

The model errors, averaged over the 50 replications, are reported in Table 4. Both $I^{(r)}$ and $I_{ML}^{(r)}$ were competitive with RR in most settings. Although neither $\Delta_{*}^{-1}$ nor $\Sigma_{*YY}^{-1}$ was sparse, we again see that $I^{(r)}$ generally outperforms $O_{Y}^{(r)}$ and $O_{\Delta}^{(r)}$, both of which use some oracle information. These results indicate that shrinkage estimators of $\Delta_{*}^{-1}$ and $\Sigma_{*YY}^{-1}$ in (4) are helpful even when neither is sparse.

6 Tobacco chemical composition data example

As an example application, we use the chemical composition of tobacco leaves data from Anderson and Bancroft (1952) and Izenman (2009). These data have $n=25$ cases, $p=6$ predictors, and $q=3$ responses. The names of the predictors, taken from page 183 of Izenman (2009), are percent nitrogen, percent chlorine, percent potassium, percent phosphorus, percent calcium, and percent magnesium. The names of the response variables, also taken from page 183 of Izenman (2009), are rate of cigarette burn in inches per 1,000 seconds, percent sugar in the leaf, and percent nicotine in the leaf. In these data, it may be inappropriate to assume that $\Delta_{*}^{-1}$ is sparse. For this reason, we consider another example indirect estimator of $\beta_{*}$, called $I_{L2}$, that estimates $\eta_{*}$ with (6), estimates $\Sigma_{*YY}^{-1}$ with (7) using $S=\mathbb{Y}^{\prime}\mathbb{Y}/n$, and estimates $\Delta_{*}^{-1}$ with

\operatorname*{arg\ min}_{\Omega\in\mathbb{S}^{p}_{+}}\left\{{\rm tr}(\Omega S)-\log{\rm det}(\Omega)+\gamma\sum_{j,k}|\omega_{jk}|^{2}\right\}, \qquad (10)

where $S=(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})^{\prime}(\mathbb{X}-\mathbb{Y}\hat{\eta}^{{\rm L1}})/n$. We compute (10) with the closed form solution derived by Witten and Tibshirani (2009). As before, we select $\gamma$ from $\{10^{-8},10^{-7.5},\ldots,10^{7.5},10^{8}\}$ using (8). We also consider the forward regression estimators RR, $\ell_{2}$, and OLS defined in Section 5.1 and Section 5.2. We introduce another competitor, $\ell_{1}$, defined as

\operatorname*{arg\ min}_{\beta\in\mathbb{R}^{p\times q}}\left\{\|\mathbb{Y}-\mathbb{X}\beta\|_{F}^{2}+\sum_{m=1}^{q}\lambda_{m}\sum_{j=1}^{p}|\beta_{jm}|\right\},

which is equivalent to performing $q$ separate lasso regressions (Tibshirani, 1996). We randomly split the data into a 40% test set and a 60% training set in each of 500 replications and we measured the squared prediction error on the test set. All tuning parameters were chosen from $\{10^{-8},10^{-7.5},\ldots,10^{7.5},10^{8}\}$ by 5-fold cross-validation.
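
A sketch of the closed form for (10), which is our restatement of the Witten and Tibshirani (2009) result: the stationarity condition $S-\Omega^{-1}+2\gamma\Omega=0$ is solved eigenvalue-wise in the eigenbasis of $S$.

```python
import numpy as np

def ridge_precision(S, gamma):
    """Closed-form minimizer of tr(Omega S) - log det(Omega) + gamma * ||Omega||_F^2, gamma > 0.
    Writing S = V diag(d) V', the solution shares the eigenvectors of S and maps each
    eigenvalue d_i to (-d_i + sqrt(d_i**2 + 8 * gamma)) / (4 * gamma), which is positive."""
    d, V = np.linalg.eigh(S)
    omega_eigs = (-d + np.sqrt(d ** 2 + 8.0 * gamma)) / (4.0 * gamma)
    return V @ np.diag(omega_eigs) @ V.T
```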

Table 5: Averages of squared prediction error, with standard errors in parentheses, for each response variable from 500 replications.
$I^{(r)}$ $I_{L1}$ $I_{L2}$ OLS RR $\ell_{2}$ $\ell_{1}$
Rate of burn 1.19 1.33 0.45 2.96 2.17 0.57 1.55
(0.08) (0.10) (0.03) (0.15) (0.15) (0.07) (0.13)
Percent sugar 442.38 347.76 235.55 799.03 605.30 365.13 583.98
(17.97) (21.31) (6.31) (29.45) (25.52) (20.68) (24.36)
Percent nicotine 2.55 2.54 0.79 5.65 4.59 0.81 2.82
(0.29) (0.30) (0.05) (0.41) (0.31) (0.21) (0.29)

Table 5 shows squared prediction errors, averaged over the 10 predictions and the 500 replications. These results indicate that $I_{L2}$ outperforms all the competitors we considered. Also, $I_{L1}$ was outperformed by $\ell_{2}$, but was competitive with separate lasso regressions. Reduced rank regression was not competitive with the proposed indirect estimators.

Acknowledgment

We thank Liliana Forzani for an important discussion. This research is supported in part by a grant from the U.S. National Science Foundation.

Appendix A Appendix

A.1 Proofs

Proof of Proposition 1.

Since $\Sigma_{*}$ is positive definite, we apply the partitioned inverse formula to obtain that

\Sigma_{*}^{-1}=\left(\begin{array}{cc}\Sigma_{*XX}&\Sigma_{*XY}\\ \Sigma_{*XY}^{\prime}&\Sigma_{*YY}\end{array}\right)^{-1}=\left(\begin{array}{cc}\Delta^{-1}_{*}&-\beta_{*}\Sigma_{*E}^{-1}\\ -\eta_{*}\Delta^{-1}_{*}&\Sigma_{*E}^{-1}\end{array}\right),

where $\Delta_{*}=\Sigma_{*XX}-\Sigma_{*XY}\Sigma_{*YY}^{-1}\Sigma_{*XY}^{\prime}$ and $\Sigma_{*E}=\Sigma_{*YY}-\Sigma_{*XY}^{\prime}\Sigma_{*XX}^{-1}\Sigma_{*XY}$. The symmetry of $\Sigma_{*}^{-1}$ implies that $\beta_{*}\Sigma_{*E}^{-1}=(\eta_{*}\Delta^{-1}_{*})^{\prime}$ so

\beta_{*}=\Delta^{-1}_{*}\eta_{*}^{\prime}\Sigma_{*E}. \qquad (11)

Using the Woodbury identity,

\Sigma_{*E}^{-1}=(\Sigma_{*YY}-\Sigma^{\prime}_{*XY}\Sigma_{*XX}^{-1}\Sigma_{*XY})^{-1}
=\Sigma_{*YY}^{-1}+\Sigma_{*YY}^{-1}\Sigma^{\prime}_{*XY}\left(\Sigma_{*XX}-\Sigma_{*XY}\Sigma_{*YY}^{-1}\Sigma^{\prime}_{*XY}\right)^{-1}\Sigma_{*XY}\Sigma_{*YY}^{-1}
=\Sigma_{*YY}^{-1}+\eta_{*}\Delta_{*}^{-1}\eta_{*}^{\prime}. \qquad (12)

Using the inverse of the expression above in (11) establishes the result. ∎

In our proof of Proposition 2, we use the matrix inequality

\|A^{(1)}A^{(2)}A^{(3)}-B^{(1)}B^{(2)}B^{(3)}\|\leq\sum_{j=1}^{3}\|A^{(j)}-B^{(j)}\|\prod_{k\neq j}\|B^{(k)}\|+\sum_{j=1}^{3}\|B^{(j)}\|\prod_{k\neq j}\|A^{(k)}-B^{(k)}\|+\prod_{j=1}^{3}\|A^{(j)}-B^{(j)}\|. \qquad (A.1)

Bickel and Levina (2008) used (A.1) to prove their Theorem 3.

Proof of Proposition 2.

From (12) in the proof of Proposition 1, $\Sigma_{*E}^{-1}=\Sigma_{*YY}^{-1}+\eta_{*}\Delta_{*}^{-1}\eta_{*}^{\prime}$. Define $\hat{\Sigma}_{E}^{-1}=\hat{\Sigma}_{YY}^{-1}+\hat{\eta}\hat{\Delta}^{-1}\hat{\eta}^{\prime}$. Applying (A.1),

\|\hat{\beta}-\beta_{*}\|=\|\hat{\Delta}^{-1}\hat{\eta}^{\prime}\hat{\Sigma}_{E}-\Delta_{*}^{-1}\eta_{*}^{\prime}\Sigma_{*E}\|
\leq\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\eta_{*}\|\cdot\|\Sigma_{*E}\|+\|\hat{\eta}-\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\cdot\|\Sigma_{*E}\|+\|\hat{\Sigma}_{E}-\Sigma_{*E}\|\cdot\|\Delta_{*}^{-1}\|\cdot\|\eta_{*}\|
+\|\Delta_{*}^{-1}\|\cdot\|\hat{\eta}-\eta_{*}\|\cdot\|\hat{\Sigma}_{E}-\Sigma_{*E}\|+\|\eta_{*}\|\cdot\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\hat{\Sigma}_{E}-\Sigma_{*E}\|
+\|\Sigma_{*E}\|\cdot\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\hat{\eta}-\eta_{*}\|+\|\hat{\eta}-\eta_{*}\|\cdot\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\hat{\Sigma}_{E}-\Sigma_{*E}\|. \qquad (14)

We will show that the third term in (14) dominates the others. We continue by deriving its bound. Employing a matrix identity used by Cai et al. (2010), we write $\hat{\Sigma}_{E}-\Sigma_{*E}=\Sigma_{*E}(\Sigma_{*E}^{-1}-\hat{\Sigma}_{E}^{-1})\hat{\Sigma}_{E}$, so

\|\hat{\Sigma}_{E}-\Sigma_{*E}\|\leq\|\hat{\Sigma}_{E}\|\cdot\|\Sigma_{*E}\|\cdot\|\hat{\Sigma}_{E}^{-1}-\Sigma_{*E}^{-1}\|. \qquad (15)

Using the triangle inequality and (A.1),

\|\hat{\Sigma}_{E}^{-1}-\Sigma_{*E}^{-1}\|\leq\|\hat{\Sigma}_{YY}^{-1}-\Sigma_{*YY}^{-1}\|+\|\hat{\eta}\hat{\Delta}^{-1}\hat{\eta}^{\prime}-\eta_{*}\Delta_{*}^{-1}\eta_{*}^{\prime}\|
\leq\|\hat{\Sigma}_{YY}^{-1}-\Sigma_{*YY}^{-1}\|+2\|\hat{\eta}-\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\cdot\|\eta_{*}\|+\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\eta_{*}\|^{2}
+2\|\eta_{*}\|\cdot\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|\cdot\|\hat{\eta}-\eta_{*}\|+\|\Delta_{*}^{-1}\|\cdot\|\hat{\eta}-\eta_{*}\|^{2}+\|\hat{\eta}-\eta_{*}\|^{2}\|\hat{\Delta}^{-1}-\Delta_{*}^{-1}\|
=O_{P}\left(c_{n}+a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}\right). \qquad (16)

Since $\varphi_{\min}(\Sigma_{*YY}^{-1})\geq K$ and $\Delta_{*}^{-1}$ is positive definite, Weyl's eigenvalue inequality implies that $\varphi_{\min}(\Sigma_{*E}^{-1})\geq K$ so

\|\Sigma_{*E}\|=\varphi_{\min}^{-1}(\Sigma_{*E}^{-1})\leq 1/K. \qquad (17)

Also,

\|\hat{\Sigma}_{E}\|=\varphi_{\min}^{-1}(\hat{\Sigma}_{E}^{-1})=O_{P}(1) \qquad (18)

because $\varphi_{\min}(\Sigma_{*E}^{-1})\geq K$, $\hat{\Sigma}_{E}$ is positive definite, and $a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}+c_{n}=o(1)$ in (16). Using (16), (17), and (18) in (15),

\|\hat{\Sigma}_{E}-\Sigma_{*E}\|=O_{P}\left(a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}+c_{n}\right).

We then see that the third term in (14) dominates and

\|\hat{\beta}-\beta_{*}\|=O_{P}\left\{\left(a_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|+b_{n}\|\eta_{*}\|^{2}+c_{n}\right)\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\right\}
=O_{P}\left(a_{n}\|\eta_{*}\|^{2}\|\Delta_{*}^{-1}\|^{2}+b_{n}\|\eta_{*}\|^{3}\|\Delta_{*}^{-1}\|+c_{n}\|\eta_{*}\|\cdot\|\Delta_{*}^{-1}\|\right).

References

  • Adragni, K. P. and Cook, R. D. (2009). Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 367(1906):4385–4405.
  • Anderson, R. L. and Bancroft, T. A. (1952). Statistical Theory in Research.
  • Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research, 9:485–516.
  • Bhadra, A. and Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2):447–457.
  • Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Annals of Statistics, 36(1):199–227.
  • Breiman, L. and Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54.
  • Cai, T. T., Zhang, C.-H., and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. Annals of Statistics, 38:2118–2144.
  • Chen, L. and Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500):1533–1545.
  • Cook, R. D., Forzani, L., and Rothman, A. J. (2013). Prediction in abundant high-dimensional linear regression. Electronic Journal of Statistics, 7:3059–3088.
  • Cook, R. D., Li, B., and Chiaromonte, F. (2010). Envelope models for parsimonious and efficient multivariate linear regression (with discussion). Statistica Sinica, 20:927–1010.
  • Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
  • Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. K. (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, volume 24, pages 2330–2338. MIT Press, Cambridge, MA.
  • Huang, J., Liu, N., Pourahmadi, M., and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1):85–98.
  • Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264.
  • Izenman, A. J. (2009). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer.
  • Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrices estimation. Annals of Statistics, 37:4254–4278.
  • Lee, W. and Liu, Y. (2012). Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 111:241–255.
  • Mukherjee, A. and Zhu, J. (2011). Reduced rank ridge regression and its kernel extensions. Statistical Analysis and Data Mining, 4(6):612–622.
  • Obozinski, G., Taskar, B., and Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252.
  • Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R., and Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics, 4(1):53.
  • Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.
  • Reinsel, G. C. and Velu, R. P. (1998). Multivariate Reduced-rank Regression. Springer.
  • Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515.
  • Rothman, A. J., Levina, E., and Zhu, J. (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962.
  • Su, Z. and Cook, R. D. (2011). Partial envelopes for efficient estimation in multivariate linear regression. Biometrika, 98:133–146.
  • Su, Z. and Cook, R. D. (2012). Inner envelopes: Efficient estimation in multivariate linear regression. Biometrika, 99:687–702.
  • Su, Z. and Cook, R. D. (2013). Scaled envelopes: Scale invariant and efficient estimation in multivariate linear regression. Biometrika, 100:921–938.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288.
  • Turlach, B. A., Venables, W. N., and Wright, S. J. (2005). Simultaneous variable selection. Technometrics, 47(3):349–363.
  • Witten, D. M. and Tibshirani, R. (2009). Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):615–636.
  • Yuan, M. (2008). Efficient computation of l1 regularized estimates in Gaussian graphical models. Journal of Computational and Graphical Statistics, 17(4):809–826.
  • Yuan, M., Ekici, A., Lu, Z., and Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3):329–346.
  • Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.