On the testing of multiple hypothesis in sliced inverse regression

Zhigen Zhaolabel=e1]zhaozhg@temple.edu [ Xin Xinglabel=e2]xinxing@vt.edu [ Department of Statistics, Operations, and Data Science, Temple University, Philadelphia, PA, 19122, USA
Department of Statistics, Virginia Tech University, Temple University, Blacksburg, VA 24061

Abstract

We consider the multiple testing of the general regression framework aiming at studying the relationship between a univariate response and a p-dimensional predictor. To test the hypothesis of the effect of each predictor, we construct an Angular Balanced Statistic (ABS) based on the estimator of the sliced inverse regression without assuming a model of the conditional distribution of the response. According to the developed limiting distribution results in this paper, we have shown that ABS is asymptotically symmetric with respect to zero under the null hypothesis. We then propose a Model-free multiple Testing procedure using Angular balanced statistics (MTA) and show theoretically that the false discovery rate of this method is less than or equal to a designated level asymptotically. Numerical evidence has shown that the MTA method is much more powerful than its alternatives, subject to the control of the false discovery rate.

\setattribute

journalname \setattributeinfolinetext \stackMath

and

t1 Zhigen Zhao is Associate Professor of Department of Statistics, Operations, and Data Science, Temple University. t2 Xin Xing is Assistant Professor of the Department of Statistics, Virginia Tech University.

Keywords: model free, FDR, sufficient dimension reduction

1 Introduction

In the general framework of the regression analysis, the goal is to infer the relation between a univariate response variable $y$ and a $p\times 1$ vector $\bm{x}$ . One would like to know $y|\bm{x}$ , namely, how the distribution of $y$ depends on the value of $\bm{x}$ . Among the literature of sufficient dimension reduction [25, 8, 9, 28, 29], the fundamental idea is to replace the predictor by its projection to a subspace without loss of information. In other words, we seek for a subspace $\mathcal{S}_{y|\bm{x}}$ of the predictor space such that

y\rotatebox[origin={c}]{90.0}{$\models$}\bm{x}|\mathbf{P}_{\mathcal{S}_{y|\bm{x}}}\bm{x}.

(1)

Here $\models$ indicates independence, and $P_{(\cdot)}$ stands for a projection operator. The subspace $\mathcal{S}_{y|\bm{x}}$ is called the central subspace. Let $d$ be the dimension of this central subspace. Let $\mathbf{B}$ , a $p\times d$ matrix, be a basis of the central subspace $\mathcal{S}_{y|\bm{x}}$ . Then the equation (1) is equivalent to

y\rotatebox[origin={c}]{90.0}{$\models$}\bm{x}|\mathbf{B}\bm{x}.

(2)

To further reduce the dimensionality especially when the number of predictors $p$ diverges with respect to $n$ , it is commonly assumed that $y$ depends on $\bm{x}$ through a subset of $\bm{x}$ , known as the Markov blanket and denoted as $\mathcal{MB}(y,\bm{x})$ [33, 36, 5], such that

y\rotatebox[origin={c}]{90.0}{$\models$}\bm{x}|\mathcal{MB}(y,\bm{x}).

For each predictor, one would like to know whether $x_{j}\in\mathcal{MB}(y,\bm{x})$ , which could be formulated as a multiple testing problem. The null hypothesis stating that $x_{j}\notin\mathcal{MB}(y,\bm{x})$ is equivalent to

\mathcal{H}_{j}:\mathbf{P}_{span(x_{j})}(\mathcal{S}_{y|\bm{x}})=\mathcal{O}_{p},

(3)

where $\mathcal{O}_{p}$ is the origin point in the $p$ -dimensional space [10]. In other words, it is equivalent to saying that the $j$ -th row of the matrix $\mathbf{B}$ consists of all zeros.

There are many attempts targeting estimating the central subspace in the existing literature on sufficient dimension reduction. The most widely used method is the sliced inverse regression (SIR) which was first introduced in [25]. Later, there are many attempts to extend SIR, including, but not limited to, sliced average variance estimation [14, 8], directional regression [24], and constructive estimation [41]. Nevertheless, most of the existing methods and theories in sufficient dimension reduction focus on the estimation of the central subspace $\mathcal{S}_{y|\bm{x}}$ . The result of statistical inference is very limited when $p$ diverges, not to mention the procedure of controlling the false discovery rate (FDR) when testing these hypotheses simultaneously.

The challenge arises from two perspectives. First, among the literature on sufficient dimension reduction, the result on the limiting distribution of the estimator of the central subspace is very limited when $p$ diverges. When $p$ is fixed, [20, 46, 30, 23] have derived the asymptotic distribution of the sliced inverse regression. To the best of the authors’ knowledge, there are no results on the limiting distribution when $p$ diverges unless assuming the signal is strong and the total number of false hypotheses is fixed [40].

Second, after the test statistic is determined for each hypothesis, how to combine these test statistics to derive a method that controls the false discovery rate is challenging. Many existing procedures, such as [2, 3, 42, 35], work on the (generalized) linear regression models. In [5], the authors considered an arbitrary joint distribution of $y$ and $\bm{x}$ and proposed the model-X knockoff to control FDR. However, this method requires that the distribution of the design matrix is known, which is not feasible in many practice.

The study conducted by [21] explored variable selection for the linear regression model, assuming the condition of weak and rare signals. It is noted in this paper that the selection consistency is not possible and allowing for false positives is necessary. While several existing penalization-based methods on sufficient dimension reduction require imposing uniform signal strength condition to achieve consistent results ([27, 29, 38, 44]), this work tackles a more challenging scenario. Specifically, we develop the central limit theorem of SIR, utilizing recent theories on Gaussian approximation [6, 7] without relying on the uniform signal strength conditions. This theoretical result is the first of its kind in the literature on sufficient dimension reduction and is a necessity for simultaneous inference when the effects of some relevant predictors are either moderate or weak.

We proceed by constructing a statistic for each hypothesis based on sliced inverse regression. Applying Gaussian approximation theory, we demonstrate that this statistic is asymptotically symmetric about zero when the null hypothesis holds. We refer to this as a Angular Balanced Statistic (ABS). We then develop a single-step procedure that rejects a hypothesis when its ABS exceeds a certain threshold. Additionally, we provide an estimator for the false discovery proportion. For a designated FDR level $q$ , we adaptively select a threshold such that the estimated false discovery proportion is no greater than $q$ . This method is referred to as the Model-free multiple Testing procedure using Angular balanced statistics (MTA). Theoretical analysis confirms that MTA asymptotically controls the FDR at the $q$ -level under regularity conditions. Simulation results and data analysis demonstrate that MTA significantly outperforms its competitors in terms of power while controlling the FDR.

The paper is organized as follows. In Section 2, we derive the central limit theorem of SIR when both the dimension $p$ and the number of important predictors $s$ diverge with $n$ using the recently developed Gaussian approximation theory. In Section 3, we construct ABS based on the estimator of SIR and propose the MTA method. It is shown that the FDR of the MTA method is less than or equal to a designated level asymptotically. In Sections 4 and 5, we provide numerical evidence including extensive simulations and a real data analysis to demonstrate advantages of MTA. We conclude the paper in Section 6, and include all technical details in the appendix.

Notation

We adopt the following notations throughout this paper. For a matrix $\bm{A}$ , we call the space generated by its column vectors the column space and denote it by $col(\bm{A})$ . The element at the $i$ -th row and $j$ -th column of a matrix $\bm{A}$ is denoted as $A_{ij}$ or $a_{ij}$ . The $i$ -th row and $j$ -th column of the matrix are denoted by $\bm{A}_{i\cdot}$ and $\bm{A}_{\cdot j}$ , respectively. The minimum and maximum eigenvalues of $\bm{A}$ are denoted as $\lambda_{min}(\bm{A})$ and $\lambda_{\max}(\bm{A})$ respectively. For two positive numbers $a,b$ , we use $a\vee b$ and $a\wedge b$ to denote $\max\{a,b\}$ and $\min\{a,b\}$ respectively. We use $c$ , $C$ , $C^{\prime}$ , $C_{1}$ and $C_{2}$ to denote generic absolute constants, though the actual value may vary from case to case.

2 Gauss Approximaton of SIR

Recall that $y$ is the response and $\bm{x}$ is a p-dimensional vector. In the literature of sufficient dimension reduction, people aim to find the central subspace $\mathcal{S}_{y|\bm{x}}$ defined in (1). The sliced inverse regression (SIR) introduced in [25] is the first and the most popular method among many existing ones.

Assume that the covariance matrix of $\bm{x}$ is $\bm{\Sigma}$ . Let $\bm{\Omega}=\bm{\Sigma}^{-1}$ be the precision matrix of $\bm{x}$ . Let $s_{j}=\#\{k:\bm{\Omega}_{jk}\neq 0\}$ and $s=\max_{j}s_{j}$ . When the distribution of $\bm{x}$ is elliptically symmetric, it is shown in [25] that

\displaystyle\bm{\Sigma}\mathcal{S}_{y|\bm{x}}=col(\bm{\Lambda}),

(4)

where $\bm{\Lambda}=var(\mathbb{E}[\bm{x}|y])$ and $col(\bm{\Lambda})$ is the column space spanned by $\bm{\Lambda}$ .

Given $n$ $i.i.d.$ samples $(y_{i},\bm{x}_{i})$ , $i=1,\cdots,n$ . To estimate $\bm{\Lambda}$ , divide the data into $H$ slices according to the order statistics $y_{(i)}$ , $i=1,\ldots,n$ . Let $\bm{x}_{(i)}$ be the concomitant associated with $y_{(i)}$ . Note that slicing the data naturally forms a partition of the support of the response variable, denoted as $\mathcal{H}$ . Let $\mathcal{P}_{h}$ be the $h$ -th slice in the partition $\mathcal{H}$ . Here we let $\mathcal{P}_{1}=(-\infty,y_{(\lceil n/H\rceil)}]$ and $\mathcal{P}_{H}=(y_{(\lceil n/H\rceil*(H-1)+1)},+\infty)$ . Let $\bar{\bm{x}}$ be the mean of all the $\bm{x}$ ’s and $\bar{\bm{x}}_{h,\cdot}$ be the sample mean of the vectors $\bm{x}_{(j)}$ ’s such that its concomitant $y_{(j)}\in\mathcal{P}_{h}$ and estimate $\bm{\Lambda}\triangleq var(\mathbb{E}[\bm{x}|y])$ by

\widehat{\bm{\Lambda}}_{\mathcal{H}}=\frac{1}{H}\sum_{h=1}^{H}(\bar{\bm{x}}_{h,\cdot}-\bar{\bm{x}})(\bar{\bm{x}}_{h,\cdot}-\bar{\bm{x}})^{\tau}.

(5)

The $\widehat{\bm{\Lambda}}_{\mathcal{H}}$ was shown to be a consistent estimator of $\bm{\Lambda}$ under some technical conditions [18, 20, 46, 25, 28].

Alternatively, we could view SIR through a sequence of ordinary least squares regressions. Let $f_{h}(y),h=1,2,\cdots,H$ be a sequence of transformations of $Y$ . Following the proof of [43, 40], one knows that under the linearity condition [25],

Ef(y)\phi(y)\in\mathcal{S}_{y|\bm{x}},

where $\phi(y)=\bm{\Sigma}^{-1}E(\bm{x}|y)$ . Let $\bm{\beta}_{h}\in R^{p\times 1}$ be defined as

\bm{\beta}_{h}=argmin_{\bm{\beta}_{h}}E(f_{h}(y)-\bm{x}\bm{\beta}_{h})^{2}.

Assuming the coverage condition [32, 13, 40], then

Span(\mathbf{B})=\mathcal{S}_{y|\bm{x}},

where $\mathbf{B}=(\bm{\beta}_{1},\cdots,\bm{\beta}_{H})\in R^{p\times H}$ .

Note that different choices of $f_{h}(y)$ lead to different methods [40, 17]. To name a few here, [43] suggested $f_{h}(y)=y^{h}$ where $h\leq H$ . After slicing the data into $H$ slices according to the value of the response variable $y$ , [13] suggested $f_{h}(y)=y$ if $y$ is in the $h$ -th slice and 0 otherwise. If we choose $f_{h}(y)=\mathbbm{1}(y\in\mathcal{P}_{h})$ , this will lead to SIR, which is the main focus in this paper [43, 40].

After obtaining data $(\bm{x}_{i},y_{i})$ based on a sample of $n$ subjects, let

f_{h}(\bm{y})=\mathbbm{1}(\bm{y}\in\mathcal{P}_{h})=\left(\mathbbm{1}(y_{1}\in\mathcal{P}_{h}),\mathbbm{1}(y_{2}\in\mathcal{P}_{h}),\cdots,\mathbbm{1}(y_{n}\in\mathcal{P}_{h})\right)^{T}.

Let $\hat{\bm{\beta}}_{h}$ be defined as

\hat{\bm{\beta}}_{h}=argmin_{\bm{\beta}_{h}}||f_{h}(\bm{y})-\bm{x}^{T}\bm{\beta}_{h}||^{2}=(\bm{x}\bm{x}^{T})^{-1}\bm{x}f_{h}(\bm{y}),

or a general form

\hat{\bm{\beta}}_{h}=argmin_{\bm{\beta}_{h}}||f_{h}(\bm{y})-\bm{x}^{T}\bm{\beta}_{h}||^{2}=\frac{1}{n}\hat{\bm{\Omega}}\bm{x}f_{h}(\bm{y}),

(6)

where $\hat{\bm{\Omega}}$ is a suitable approximation of the inverse of the Gram matrix $\hat{\bm{\Sigma}}=\bm{x}^{T}\bm{x}/n$ . Let

\hat{\bm{B}}=(\hat{\bm{\beta}}_{1},\cdots,\hat{\bm{\beta}}_{H}).

(7)

There are many methods to estimate the precision matrix. As an example, we could consider the one given by the Lasso for the node-wide regression on the design matrix $\bm{x}$ [31]. Next, we will derive the central limit theorem of SIR when $p$ converges to the infinity.

Our derivation is built upon the Gaussian approximation (GAR) theory recently developed in [6, 7]. Let $\mathcal{P}_{H}=\cup_{h}\mathcal{P}_{h}$ be a partition of the sample space of $y$ and $p_{h}=P(y\in\mathcal{P}_{h})$ . Define

\tilde{\bm{\beta}}_{h}=\frac{1}{n}\bm{\Omega}\bm{x}f_{h}(\bm{y}).

(8)

Let $\tilde{\bm{B}}=(\tilde{\bm{\beta}}_{1},\cdots,\tilde{\bm{\beta}}_{H})$ . For $i=1,2,\cdots,n,j=1,2,\cdots,p,h=1,2,\cdots,H$ , let $z_{ijh}$ ’s be normal random variables such that

•

$Ez_{ijh}=p_{h}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}|y\in\mathcal{P}_{h})$ ;
•

$V(z_{ijh})=p_{h}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}\bm{x}^{T}|y\in\mathcal{P}_{h})\bm{\Omega}_{\cdot j}-p_{h}^{2}(\bm{\Omega}_{\cdot j}^{T}E(\bm{x}|y\in\mathcal{P}_{h}))^{2}$ ;
•

$Cov(z_{ijh},z_{ikh})=p_{h}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}\bm{x}^{T}|y\in\mathcal{P}_{h})\bm{\Omega}_{\cdot k}-p_{h}^{2}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}|y\in\mathcal{P}_{h})\bm{\Omega}_{\cdot k}^{T}\\ E(\bm{x}|y\in\mathcal{P}_{h})$ ;
•

$Cov(z_{ijh_{1}},z_{ikh_{2}})=-p_{h_{1}}p_{h_{2}}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}|y\in\mathcal{P}_{h_{1}})\bm{\Omega}_{\cdot k}^{T}E(\bm{x}|y\in\mathcal{P}_{h_{2}})$ .

Note that when $\mathcal{H}_{j}$ is true and the coverage condition holds, then $Ez_{ijh}=0$ for any $i$ and $h$ (see Remark LABEL:remark:z in the appendix).

Let $\bm{z}_{i\cdot\cdot}$ be a $p$ by $H$ matrix consisting of $z_{ijh}$ where $j=1,2,\cdots,p$ and $h=1,2,\cdots,H$ . Let $\mathcal{A}^{re}$ consist of all sets $A$ of the form

A=\{\omega\in R^{pH}:a_{j}\leq\omega_{j}\leq b_{j},j=1,2,\cdots,pH\}

(9)

for some $-\infty\leq a_{j}\leq b_{j}\leq\infty$ . We will develop a bound of the quantity

\rho_{n}(\mathcal{A}^{re})=\sup_{A\in\mathcal{A}^{re}}\left|\mathbb{P}(\sqrt{n}\hat{\mathbf{B}}\in A)-\mathbb{P}(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{z}_{i\cdot\cdot}\in A)\right|.

(10)

The theoretical investigation requires the following assumptions

Assumption 1.

The design matrix $\bm{x}$ has iid sub-Gaussian rows. In other words, $sup_{||a||_{2}\leq 1}\mathbf{E}exp\{|\sum_{j=1}^{p}a_{j}x_{ij}|^{2}/C\}\leq 1$ for some large enough positive constant $C$ .

Assumption 2.

There exists two positive constants $c$ and $C$ such that the smallest eigenvalue $\lambda_{min}$ of $\bm{\Sigma}$ is greater than $c$ and the largest eigenvalue $\lambda_{max}$ of $\bm{\Sigma}$ is less than $C$ .

Assumption 1 is assumed in [45], describing the tail property of the predictors. The bounds of the eigenvalues of $\bm{\Sigma}$ are commonly assumed when considering the estimation of the covariance matrix and the precision matrix. In the sliced inverse regression, the data are sliced according to the response variable $y$ . This creates a partition on the support of the response variable. We assume the following condition on the partition.

Definition 1 (Distinguishable Partition).

For a given $H$ , we call $\mathcal{DP}_{H}(b)$ , a collection of all partitions $-\infty=a_{0}<a_{1}<\cdots<a_{H-1}<a_{H}=\infty$ of $\mathbb{R}$ , distinguishable if for any partition in $\mathcal{P}_{H}$ ,

(D1)

there exists two constants $\gamma_{1}$ and $\gamma_{2}$ such that

$\frac{\gamma_{1}}{H}\leq p_{h}\leq\frac{\gamma_{2}}{H},\textrm{where $p_{h}=\mathbb{P}(a_{h-1}\leq Y<a_{h})$};$
(D2)

$\lambda_{min}\left(Cov\left(\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h}),\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h})\right)\right)>b,\forall h=1,2,\cdots,H,$ where $b$ is a constant which does not depend on n and p.

The condition (D1) on the partition requires that the probability data falls in each slice is inversely proportional to the number of slices. This condition is assumed in [28] when establishing the phase transition of SIR. The condition (D2) requires a sufficient variation of the $\bm{x}$ , which falls in each slice. Otherwise, if the smallest eigenvalue of that covariance matrix in a given slice converges to zero, the corresponding estimator becomes unstable. In Remark 2, we give examples that (D2) holds.

Theorem 1.

Assume Assumptions (1-2) and a partition $\mathcal{P}_{H}\in\mathcal{DP}_{H}(b)$ . If $s^{3}p^{3}\log(pHn)^{7}/n\leq C_{1}n^{-c_{1}}$ for some constants $c_{1},C_{1}>0$ , then with probability at least $1-2\exp(-cn^{\alpha})$ ,

\rho(\mathcal{A}^{re})\leq CD_{n,p},

(11)

where $C$ is a constant and $D_{n,p}=\left(\frac{\log^{7}(pHn)}{n}\right)^{1/6}$ .

The proof is put in the appendix. From this theorem, it is seen that the estimator $\hat{\mathbf{B}}$ could be approximated by Gaussian random vectors which preserve the mean and covariance structure.

Remark 1.

When the dimension $p$ is fixed, it is known that the limiting distribution of the estimator based on SIR is normally distributed ([20, 30, 47]. In [40], they have used the penalization method to estimate the central space. They have further established the central limit theorem. However, this theoretical development requires a uniform signal strength condition that the minimum of the magnitude of nonzero parameters is greater than some bound such that the nulls and alternatives could be well separated ([44]). On the contrary, Theorem 1 does not rely on the uniform signal strength condition.

Additionally, this theorem provides a result on the uniform convergence which is essential for simultaneous inference when the effects of some relevant predictors are either moderate or weak.

Remark 2.

Comments on the distinguishable condition. The condition (D2) guarantees that the ordinary least estimation in each slice is stable. This condition holds in examples such as the inverse regression model suggested by Cook [12, 11]. In this model, it is assumed that the covariate $\bm{x}$ is given as

\bm{x}_{y}=\mu+\bm{\Gamma}\bm{\nu}_{y}+\sigma\bm{\epsilon},

where $\mu\in\mathbb{R}^{p}$ , $\bm{\Gamma}\in\mathbb{R}^{p\times d}$ , $\bm{\Gamma}^{T}\bm{\Gamma}=I_{d}$ . Then

\displaystyle Cov(\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h},\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h}))\succcurlyeq p_{h}\sigma^{2}\text{Cov}(\bm{\epsilon},\bm{\epsilon}).

In [22], the authors considered the following model

\bm{x}|y\in\mathcal{S}_{h}\sim MVN(\mu_{h},\bm{\Sigma}),\forall h.

It is easily seen that

\displaystyle Cov(\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h},\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h}))\succcurlyeq p_{h}\bm{\Sigma}.

3 Model Free Multiple Testing using Angular Balanced Statistics (MTA)

In this section, we consider testing the hypothesis defined in (3) simultaneously such that the false discovery rate is controlled at a designated level $q$ . We start with the construction of the test statistic.

First, we randomly split the sample as two parts, denoted as $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ [15]. Let $\hat{\mathbf{B}}^{1}$ and $\hat{\mathbf{B}}^{2}$ be the estimators of the sliced inverse regression based on these two subsets respectively. We define the following mirror statistics

M_{j}=\sum_{h=1}^{H}\hat{\mathbf{B}}^{1}_{jh}\hat{\mathbf{B}}^{2}_{jh}.

(12)

Note that when the j-th hypothesis is true, namely, $\mathbf{P}_{span(x_{j})}(\mathcal{S}_{y|\bm{x}})=\mathcal{O}_{p}$ , one would expect that both vectors $\hat{\mathbf{B}}_{j\cdot}^{1}$ and $\hat{\mathbf{B}}_{j\cdot}^{2}$ would center around zero. On the other hand, when $H_{j}$ is false, both vectors center around a common non-zero vector $\bm{B}_{j\cdot}$ . The angles between $\hat{\bm{B}}_{j\cdot}^{k},k=1,2,$ and $\bm{B}_{j\cdot}$ are approximately the same. Consequently, we call this statistic $M_{j}$ a Angular Balanced Statistic (ABS). Intuitively, the ABS statistic would be symmetric with respect to zero under the null and it tends to have a positive value under the alternative. This is depicted in Figure 1. In the left panel, when considering the predictor $x_{1}\in\mathcal{MB}(y,\bm{x})$ , the estimated coefficient vectors $\hat{\mathbf{B}}_{1\cdot}^{1}$ and $\hat{\mathbf{B}}_{1\cdot}^{2}$ center around a non-zero vector. On the other hand when considering the estimated coefficients of $x_{2}$ which does not belong to $\mathcal{MB}(y,\bm{x})$ , the estimated coefficients $\hat{\mathbf{B}}_{2\cdot}^{1}$ and $\hat{\mathbf{B}}_{2\cdot}^{2}$ center around the origin. In the right panel, we plot the histogram of all the mirror statistics when generating the data according to Setting 1 in Section 4, it is seen that $M_{j}$ ’s are roughly symmetric with respect to zero when the corresponding null hypotheses are true. When the corresponding null hypothesis is false, this statistic tends to have a large positive value.

Refer to caption — Figure 1: In the left panel, we show an illustration of the coefficient vectors for three variables $X_{1},X_{2}$ and $X_{3}$ where $\mathcal{H}_{1}$ is false and $\mathcal{H}_{2},\mathcal{H}_{3}$ are true. The right panel shows the histogram of mirror statistics for all variables when the data is generated according to Setting 1 with $\rho=0.4$ in Section 4.

The following theorem provides a rigorous argument for the above intuition. Namely, we show that the ABS is symmetric with respect to zero under the null hypothesis asymptotically.

Theorem 2.

Assume the same conditions of Theorem 1, if the null hypothesis $\mathcal{H}_{j}$ is true, then for any $t>0$ , with probability at least $1-2\exp(-cn^{\alpha})$ ,

\sup_{t}\left|\mathbb{P}\left(M_{j}<-t\right)-\mathbb{P}\left(M_{j}>t\right)\right|\leq CD_{n,p}.

(13)

We now turn to the false discovery rate control. To simplify our notation, let $\theta_{j}$ be the indicator of whether the $j$ -th hypothesis defined in (3) is false. Let $\mathcal{S}_{0}=\{j:\theta_{j}=0\}$ be the set consisting of all the true hypotheses. Consider the single-step procedure where the $j$ -th hypothesis is rejected if the ABS $M_{j}$ is greater than some threshold $t$ . Namely, we choose a decision $\bm{\delta}=(\delta_{1},\cdots,\delta_{p})$ as $\delta_{j}=1(M_{j}>t)$ . Then the false discovery proportion is

FDP(t)=\frac{\sum_{j:\theta_{j}\in\mathcal{S}_{0}}\delta_{j}}{(\sum_{j}\delta_{j})\vee 1}.

According to Theorem 2, the probabilities $\mathbb{P}(M_{j}>t)$ and $\mathbb{P}(M_{j}<-t)$ are approximately the same when the hypothesis $\mathcal{H}_{j}$ is true. Namely,

	$\displaystyle\mathbb{E}\sum_{j:\theta_{j}\in\mathcal{S}_{0}}\delta_{j}=\sum_{j:\theta_{j}\in\mathcal{S}_{0}}\mathbb{P}(M_{j}>t)$
	$\displaystyle\approx\sum_{j:\theta_{j}\in\mathcal{S}_{0}}\mathbb{P}(M_{j}<-t)\leq\sum_{j}\mathbb{P}(M_{j}<-t).$

Consequently, we estimate the false discovery proportion by

\widehat{FDP(t)}=\frac{\sum_{j}1(M_{j}<-t)}{\sum_{j}1(M_{j}>t)}.

(14)

Note that $\widehat{FDP(t)}$ tends to overestimate the true false discovery proportion. However, this is not a serious issue due to two reasons: (i) the over-estimation protects us from large false positives and (ii) the probability that $\mathbb{P}(M_{j}<-t)$ is negligible when $\mathcal{H}_{j}$ is false.

For any designated level, say $q$ , we could choose a data-driven threshold $\tau_{q}$ as

\tau_{q}=\min\left\{t:\widehat{FDP(t)}\leq q\right\}.

We call this method Model free multiple Testing procedure based on ABS (MTA). The steps of MTA are summarized in Algorithm 1.

1.

Split the sample as two parts, denoted as $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ ;
2.

For each sample, estimate $\hat{\mathbf{B}}^{1}$ and $\hat{\mathbf{B}}^{2}$ according to (7);
3.

For $j=1,2,\cdots,p$ , calculate the ABS $M_{j}$ as $M_{j}=\sum_{h=1}^{H}\hat{\mathbf{B}}^{1}_{jh}\hat{\mathbf{B}}^{2}_{jh}$ ;
4.

Find the threshold $\tau_{q}$ such that

$\tau_{q}=\min\left\{t:\widehat{FDP(t)}\leq q\right\},$

where

$\widehat{FDP(t)}=\frac{\sum_{j=1}^{p}1(M_{j}\leq-t)}{\sum_{j=1}^{p}1(M_{j}>t)};$
5.

For $j=1,2,\cdots,p$ , reject $\mathcal{H}_{j}$ if $M_{j}>\tau_{q}$ and fail-to-reject $\mathcal{H}_{j}$ otherwise.

Algorithm 1 Model free multiple Testing procedure based on ABS (MTA)

We now turn our attention to the asymptotic property of MTA. We assume the following assumption:

Assumption 3.

(a) Let $\bm{\Sigma}^{jj}$ be the covariance matrix of $\bm{z}_{ij\cdot}$ , $\bm{\Sigma}^{jj^{\prime}}$ be the covariance between $\bm{z}_{ij\cdot}$ and $\bm{z}_{ij^{\prime}\cdot}$ . We assume that

\sum_{j,j^{\prime}\in\mathcal{S}_{k}}r_{1}\left((\bm{\Sigma}^{jj})^{-1/2}\bm{\Sigma}^{jj^{\prime}}(\bm{\Sigma}^{j^{\prime}j^{\prime}})^{-1}\bm{\Sigma}^{j^{\prime}j}(\bm{\Sigma}^{jj})^{-1/2}\right)<C_{1}p_{k}^{\alpha_{1}},

holds for $k=1,2$ , a constant $C_{1}>0$ , and $\alpha_{1}\in(0,2)$ where $r_{1}(\cdot)$ denotes the largest singular value. (b) Let $\rho(z_{ijh},z_{ij^{\prime}h})$ be the correlation coefficient of $z_{ijh}$ and $z_{ij^{\prime}h}$ . We assume that

\sum_{j,j^{\prime}\in\mathcal{S}_{0}}\sum_{h=1}^{H}|\rho(z_{ijh},z_{ij^{\prime}h})|<C_{2}p_{0}^{\alpha_{2}},

holds for a constant $C_{2}>0$ and $\alpha_{2}\in(0,2)$ .

For linear model, a similar assumption stating that the sum of partial correlation is bounded by $C_{1}p_{0}^{\alpha}$ , where $\alpha\in(0,2)$ is imposed in [42]. When dealing with nonlinear model, Assumption 3 requires that the sum of pair-wise partial correlation in each slice being bounded by $Cp_{0}^{\alpha}$ for a constant $C>0$ any $\alpha\in(0,2)$ . However, we consider a more general non-linear model where the contribution of the $j$ -th variable is related to vectors with length $H$ . Assumption 3(a) is equivalent to bound the pairwise canonical correlations which is a general procedure for investigating the relationships between two sets of variables. In Assumption 3(b), we consider the absolute sum of the correlation coefficients of corresponding variables in two sets. Since the correlation measure is not unique, we consider two different types to enlarge the suitable cases for our theory. The following two lemmas characterize the sum of pairwise covariances of mirror statistics, which is the key to establishing the asymptotic property of MTA.

Lemma 1.

Let $M_{j}$ be the angular balanced statistic defined in (12). If Assumption 3 holds, then with probability at least $1-2\exp(-cn^{\alpha})$ ,

\displaystyle\sum_{j,k\in\mathcal{S}_{0}}\text{Cov}\left(\mathbbm{1}(M_{j}\geq t),\mathbbm{1}(M_{k}\geq t)\right)\leq C_{1}^{\prime}|\mathcal{S}_{0}|^{\alpha_{1}},\;\forall\;t,

where $\alpha_{1}\in(0,2)$ , $C_{1}^{\prime}>0$ is a constant.

In Lemma 1, we build a relationship between the correlations among mirror statistics and the correlations among covariates. Next, we derive asymptotic properties of the MTA method. We define a few quantities:

	$\displaystyle V(t)=\frac{\#\{j:j\in\mathcal{S}_{0},M_{j}\leq-t\}}{p_{0}},\,V^{\prime}(t)=\frac{\#\{j:j\in\mathcal{S}_{0},M_{j}\geq t\}}{p_{0}},\,$
	$\displaystyle V^{1}(t)=\frac{\#\{j:j\in\mathcal{S}_{1},M_{j}\geq t\}}{p_{1}},$
	$\displaystyle G_{0}(t)=\lim_{p}\frac{\sum_{j\in\mathcal{S}_{0}}\mathbb{E}1(M_{j}\leq-t)}{p_{0}},\,G_{1}(t)=\lim_{p}\frac{\sum_{j\in\mathcal{S}_{1}}\mathbb{E}1(M_{j}\geq t)}{p_{1}},$

and

	$\displaystyle FDP(t):=\frac{V^{\prime}(t)}{(V^{\prime}(t)+r_{p}V^{1}(t))\vee 1/p},$
	$\displaystyle FDP^{\infty}(t):=\lim_{p}\frac{V^{\prime}(t)}{(V^{\prime}(t)+r_{p}V^{1}(t))\vee 1/p},$

where $r_{p}=p_{0}/p_{1}$ and $FDP^{\infty}(t)$ is the pointwise limit of $FDP(t)$ . We have the following results.

Lemma 2.

Suppose Assumption 3(a) or (b) holds and $G_{0}(t)$ is a continuous function. Then, we have

	$\displaystyle\sup_{t}\left\|{V(t)}-G_{0}(t)\right\|\xrightarrow[]{p}0,\quad\sup_{t}\left\|{V^{\prime}(t)}-G_{0}(t)\right\|\xrightarrow[]{p}0,$
	$\displaystyle\sup_{t}\left\|{V^{1}(t)}-G_{1}(t)\right\|\xrightarrow[]{p}0.$

Lemma 2 is based on the weak dependence assumption 3(a) or 3(b) under a model-free assumption. In literature, similar week dependence conditions on p-values [37] and linear models [15, 42, 16] are commonly used to study the asymptotical property of FDR. With the aid of this lemma, we show that the FDR of the proposed MTA method is less than or equal to $q$ asymptotically.

Theorem 3.

For any given level $q\in(0,1)$ , assume that the pointwise limit $\text{FDP}^{\infty}(t)$ of $\text{FDP}(t)$ exists for all $t>0$ , and there is a constant $t_{q}>0$ such that $\text{FDP}^{\infty}(t_{q})\leq q.$ When Assumption (1-3) holds and $\mathcal{P}_{H}\in\mathcal{DP}_{H}(b)$ , we have

\text{FDP}\leq q+o_{p}(1)\ \ \ \text{and}\ \ \ \limsup_{n,p\to\infty}\text{FDR}\leq q.

The proof of Theorem 3 is included in the appendix.

4 Numerical Studies

In this section, we use simulations to evaluate the finite sample performance of MTA with state-of-the-art competitors. First, we consider the model-X knockoff, which constructs knockoffs $\tilde{\bm{x}}$ by assuming the distribution of $\bm{x}$ is Gaussian with a known covariance matrix. We construct the feature statistic as

W_{j}=\sum_{h=1}^{H}\hat{\mathbf{B}}_{jh}\hat{\mathbf{B}}_{\tilde{j}h},

where $\hat{\mathbf{B}}$ is the solution of (7) when replacing $\bm{x}$ by $(\bm{x},\tilde{\bm{x}})$ . Second, we consider the marginal non-parametric test to test the dependence between the response variable and each predictor and calculate the Hilbert-Schmidt independence criterion (HSIC) test statistic proposed in [19]. After obtaining the p-values, we apply the Benjamini-Hochberg (BH) procedure (HSIC+BH). Third, we apply the dimension reduction coordinate test (DR-coor) in Cook [10] to test the contribution of each predictor and calculate the p-value followed by the BH method (DR-coor+BH). In the following sections, we consider three different multi-index models with a wide range of nonlinear dependencies between the response and predictors. For predictors, we consider two different designs including Gaussian design and a real-data design.

4.1 Simulation with Gaussian Design

We set $n=1000$ , $p=200$ . The covariates are generated from multivariate Gaussian distribution $N(0,\Sigma)$ . The covariance matrix $\Sigma$ is autoregressive, i.e., $\sigma_{ij}$ , the element at the i-th row and j-th column is $\rho^{|i-j|}$ , where $\rho$ is taken among $0,0.2,0.4,0.6,$ and $0.8$ , respectively. We randomly set $20$ nonzero coefficients for each indices and generate the non-zero coefficients from $N(0,20/\sqrt{n})$ .

Setting 1:

y=f_{1}(\bm{x})+\sigma\epsilon=\sin(\bm{a}_{1}^{T}\bm{x})+(\bm{a}_{2}^{T}\bm{x})^{3}+\sigma\epsilon,

Setting 2:

y=f_{2}(\bm{x})+\sigma\epsilon=\frac{(3\bm{a}_{1}^{T}\bm{x})}{0.5+(1.5+\bm{a}_{2}^{T}\bm{x})^{2}}+\sigma\epsilon,

Setting 3:

\displaystyle y=f_{3}(x,\epsilon)=(a_{1}^{T}x)^{3}+(a_{2}^{T}x)\times\sigma\epsilon,

Setting 4:

y=f_{4}(\bm{x},\epsilon)=\frac{1}{1+\exp{[\sum_{\ell=1}^{5}\rm{Relu}(\bm{a}_{\ell}^{T}\bm{x})}]}+\sigma\epsilon,

where $\rm{Relu}(\cdot)=\max(0,\cdot)$ is a popular choice of activation function in the context of artificial neural networks. This setting is equivalent to a fully connected neural network with one hidden layer.

In the settings 1 to 4, $\epsilon$ is generated using a standard Gaussian distribution. The value of $\sigma_{1}$ is kept at 0.5 for the first experiment. For $j=1,2,3$ , we adjust $\sigma_{j}$ so that the ratio $\text{Var}(f_{j}(x))/\sigma_{j}$ is equal to $\text{Var}(f_{1}(x))/\sigma_{1}$ . This means that the signal-to-noise ratios are kept constant across all the experiments. For the first three experiments, we use two indexes, while in the fourth experiment, we increase the number of indexes to five. Moreover, we treat the error as homoscedastic in experiments 1, 2, and 4. However, in the third experiment, we account for the error as heteroscedastic

Note that the Markov blanket is $\mathcal{MB}(y,\bm{x})=\{x_{j}|a_{1j}\neq 0\quad\mbox{or}\quad a_{2j}\neq 0\}$ . To evaluate the performance, we calculate the number of true positives (TPs) as the number of variables in $\mathcal{MB}(y,\bm{x})$ which is selected by a particular procedure, and the number of false positives (FPs) as the difference between the total number of selected variables and TPs. The empirical power is defined as $\#\{TPs\}/\#\{\mathcal{MB}(y,x)\}$ and the empirical FDP is defined as $\#\{FPs\}/\#\{\mbox{selected variables}\}$ . In all the settings, we set the number of slices $H$ as $20$ for our proposed method.

In Figure 2, we plot the boxplot of simulated FDP and power of the considered methods under setting 1-4. It is seen that FDR of MTA is controlled at the predefined level under all settings and MTA has larger power than its competitors. We observed that FDPs obtained by Model-X knockoff tend to be either conservative or exceed the predefined levels by a large margin when the correlation is high. This phenomenon has also been observed in [42, 5]. HSIC-BH methods control the FDR well in settings 1, 2 and 4. But it has a significantly inflated FDR in setting 3 where the noise is heteroscedastic. The power of HSIC-BH methods is lower than our proposed method. The DR-Coor-BH test achieves high power when $\rho=0$ . However, its power decreases significantly when $\rho$ increases. The FDR is inflated when $\rho$ increases.

4.2 Simulation studies based on brain connectome

In this section, we consider using the real design matrix extracted from brain connectome data in the human connectome project (HCP) dataset. HCP dataset aims at characterizing human brain connectivity in about 1,200 healthy adults to enable detailed comparisons between brain circuits, behavior, and genetics at the level of individual subjects. Customized scanners were used to produce high-quality and consistent data to measure brain connectivity. The data containing various traits and MRI data can be easily accessed through (db.humanconnectome.org). The real design matrix includes variables with a more complex correlation structure. Moreover, the marginal distributions of variables include skewed, heavy-tailed, and multi-modal distributions. To be consistent with our Gaussian setting, we randomly select $p=200$ brain connections as our predictors and synthetic $y$ using $f_{1},f_{2},f_{3}$ listed in our settings 1-3. In this case, we consider different signal to noise level by generating the noise with $\sigma=\psi/\sqrt{n}$ where $\psi=10,15,20,25$ .

It is seen that the proposed MTA method is robust under complex correlation structures. As shown in Figure 3, for settings 1-4, the MTA method maintains high power subject to the control of FDR under different signal-to-noise levels. Since the distribution of brain connectomes is highly non-Gaussian, the power of the Model-X knockoff method becomes conservative with diminishing power. In contrast, HIC+BH and DR-Coor+BH can not control the FDR well partially due to the high correlation among the covariance in brain connectome data.

5 Real Data Analysis on Brain Connectome

Neural imaging studies hold great promise for predicting and reducing psychiatric disease burden and advancing our understanding of the cognitive abilities that underlie humanity’s intellectual feats. Functional magnetic resonance imaging (fMRI) data are increasingly being used for the ambitious task of relating individual differences in studying brain function to typical variations in complex psychological phenotypes. In functional brain imaging research, e.g., via fMRI, functional network analysis, which focuses on understanding the connections between different brain areas using some quantitative methods, has become a booming area popular approach. The human connectome project (HCP) dataset aims at characterizing human brain connectivity in about 1,200 healthy adults to enable detailed studies of the association between brain functions and behaviors at the level of individual subjects. Customized scanners were used to produce high-quality and consistent data to measure brain connectivity. The data containing various traits and fMRI data can be easily accessed through (db.humanconnectome.org)

To obtain functional connectomes, we used a state-of-the-art fMRI data preprocessing framework – population-based functional connectome (PSC) mapping. The dataset used in this work comes from the HCP Q1 release [39] consisting of fMRI scans from 68 subjects during resting state (rfMRI). The acquisition parameters are as follows: 90×104 matrix, 220mm FOV, 72 slices, TR=0.72s, TE=33.1ms, flip angle=52°. Preprocessing steps of the rfMRI data have been conducted within the Q1 release, including motion correction, spatial smoothing, temporal pre-whitening, slice time correction, and global drift removal. The fMRI data from each individual in each task was then linearly registered to the MNI152 template.

Since we are particularly interested in cognition, we extract four cognition-related measures as y from HCP, including:

•

Oral reading recognition test: participants on this test are asked to read and pronounce letters
•

Crystallized composite score: crystallized cognition composite can be interpreted as a global assessment of verbal reasoning. We use the age-corrected standard score.

Our predictors are brain functional connections between each pair of ROIs. Based on the brain parcellation, we have $68*68/2$ connections. We first applied a screening procedure to select top $1/5$ of the connections as our candidates. We applied the proposed MTA method to select important connections associated with the cognitive traits with controlled false discovery rate at $10\%$ .

As shown in the left panel of Figure 4, we observe dense connections within both the left and right frontal lobes. In particular, we see that brain regions such as r23 (rostral middle frontal), and l27 (caudal middle frontal) are densely involved (nodes with high degrees) in the selected connections. We also observe that the four nodes in the temporal lobe (regions 8, 14, 29), all appear in the selected connections to the frontal lobe. The frontal lobe processes how you interpret language. The temporal lobes are also believed to play an important role in processing auditory information and with the encoding of memory [34]. Hence, our results show that richer connections between the visual and auditory are strongly associated with higher learning ability in vocabulary. As shown in the right panel of Figure 4, we observe more connections between the front lobe and the temporal lobe. Temporal lobe regions are linked to memory, language, and visual perception [1]. Additionally, we observe a high level of communication between the left and right brain hemispheres, which contributes to higher cognition [4].

6 Conclusion

In this paper, we consider the multiple hypothesis testing procedure which aims at controlling FDR without assuming a model of the conditional distribution of the response. We combine the idea of data-splitting, Gaussian mirror, and sliced inverse regression (SIR) to construct mirror statistics. With the aid of the developed central limit theorem of SIR, it is shown that the mirror statistic is approximately symmetric with respect to zero when the null hypothesis is true. We then provide an estimator of the false discovery proportion for the single-step testing method and suggest a data-driven threshold based on such an estimator. It is theoretically shown that the FDR of the proposed method is less than or equal to a designated level asymptotically.

The theoretical investigation of the SIR using the newly developed Gaussian approximation theory paves the road for the statistical inference for sufficient dimension reduction with a diverging dimension. It has the potential to be used for other sufficient dimension reduction methods such as sliced average variance estimator (SAVE, [8]), principle Hessian directions (PHD, [26]), and many others. We will leave them in future discussions.

The proposed approach is flexible due to the lean assumptions on the link function. The popular deep neural network could be viewed as the multiple index model. In an ongoing project, it would be interesting to see how the proposed method could be used to reduce the dimension under this framework.

7 Acknowledgements

Data used in the preparation of this article were obtained from the Human Connectome Project (www.humanconnectome.org). The HCP WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) were funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. We would like to thank Bing Li, Wen Zhou, Asaf Weinstein for their comments and suggestions. Xing’s research is supported by NSF (DMS-2124535).

References

Allone et al. [2017] Cettina Allone, Viviana Lo Buono, Francesco Corallo, Laura Rosa Pisani, Patrizia Pollicino, Placido Bramanti, and Silvia Marino. Neuroimaging and cognitive functions in temporal lobe epilepsy: a review of the literature. Journal of the neurological sciences, 381:7–15, 2017.
Barber and Candès [2015] Rina Foygel Barber and Emmanuel J Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
Barber and Candès [2019] Rina Foygel Barber and Emmanuel J Candès. A knockoff filter for high-dimensional selective inference. The Annals of Statistics, 47(5):2504–2537, 2019.
Boisgueheneuc et al. [2006] Foucaud du Boisgueheneuc, Richard Levy, Emmanuelle Volle, Magali Seassau, Hughes Duffau, Serge Kinkingnehun, Yves Samson, Sandy Zhang, and Bruno Dubois. Functions of the left superior frontal gyrus in humans: a lesion study. Brain, 129(12):3315–3328, 2006.
Candes et al. [2018] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: model-X knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
Chernozhukov et al. [2013] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.
Chernozhukov et al. [2017] Victor Chernozhukov, Denis Chetverikov, Kengo Kato, et al. Central limit theorems and bootstrap in high dimensions. Annals of Probability, 45(4):2309–2352, 2017.
Cook [2000] D. R. Cook. SAVE: a method for dimension reduction and graphics in regression. Communications in statistics-Theory and methods, 29(9-10):2109–2121, 2000.
Cook et al. [2007] D. R. Cook, B. Li, and F. Chiaromonte. Dimension reduction in regression without matrix inversion. Biometrika, 94(3):569–584, 2007.
Cook [2004] R Dennis Cook. Testing predictor contributions in sufficient dimension reduction. The Annals of Statistics, 32(3):1062–1092, 2004.
Cook [2007] R Dennis Cook. Fisher lecture: Dimension reduction in regression. Statistical Science, pages 1–26, 2007.
Cook and Ni [2005] R Dennis Cook and Liqiang Ni. Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. Journal of the American Statistical Association, 100(470):410–428, 2005.
Cook and Ni [2006] R Dennis Cook and Liqiang Ni. Using intraslice covariances for improved estimation of the central subspace in regression. Biometrika, 93(1):65–74, 2006.
Cook and Weisberg [1991] R Dennis Cook and Sanford Weisberg. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
Dai et al. [2020] Chenguang Dai, Buyu Lin, Xin Xing, and Jun S Liu. False discovery rate control via data splitting. arXiv preprint arXiv:2002.08542, 2020.
Dai et al. [2022] Chenguang Dai, Buyu Lin, Xin Xing, and Jun S Liu. False discovery rate control via data splitting. Journal of the American Statistical Association, (just-accepted):1–38, 2022.
Dong [2021] Yuexiao Dong. A brief review of linear sufficient dimension reduction through optimization. Journal of Statistical Planning and Inference, 211:154–161, 2021.
Duan and Li [1991] N. Duan and K. C. Li. Slicing regression: a link-free regression method. The Annals of Statistics, 19(2):505–530, 1991.
Gretton et al. [2007] Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A kernel statistical test of independence. Advances in neural information processing systems, 20, 2007.
Hsing and Carroll [1992] T. Hsing and R. J. Carroll. An asymptotic theory for sliced inverse regression. The Annals of Statistics, 20(2):1040–1061, 1992.
Ji and Zhao [2014] Pengsheng Ji and Zhigen Zhao. Rate optimal multiple testing procedure in high-dimensional regression. arXiv preprint arXiv:1404.2961, 2014.
Jiang and Liu [2014] Bo Jiang and Jun S Liu. Variable selection for general index models via sliced inverse regression. The Annals of Statistics, 42(5):1751–1786, 2014.
Li and Dong [2009] Bing Li and Yuexiao Dong. Dimension reduction for nonelliptically distributed predictors. The Annals of Statistics, 37(3):1272–1298, 2009.
Li and Wang [2007] Bing Li and Shaoli Wang. On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479):997–1008, 2007.
Li [1991] K. C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
Li [1992] Ker-Chau Li. On principal hessian directions for data visualization and dimension reduction: Another application of stein’s lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.
Li [2007] L. Li. Sparse sufficient dimension reduction. Biometrika, 94(3):603–613, 2007.
Lin et al. [2018] Qian Lin, Zhigen Zhao, and Jun S Liu. On consistency and sparsity for sliced inverse regression in high dimensions. The Annals of Statistics, 46(2):580–610, 2018.
Lin et al. [2019] Qian Lin, Zhigen Zhao, and Jun S Liu. Sparse sliced inverse regression via lasso. Journal of the American Statistical Association, 114(528):1726–1739, 2019.
Me Saracco [1997] JÉRÔ Me Saracco. An asymptotic theory for sliced inverse regression. Communications in statistics-Theory and methods, 26(9):2141–2171, 1997.
Meinshausen and Bühlmann [2006] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
Ni et al. [2005] L. Ni, D. R. Cook, and C. L. Tsai. A note on shrinkage sliced inverse regression. Biometrika, 92(1):242–247, 2005.
Pearl [1988] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan kaufmann, 1988.
Perrodin et al. [2014] Catherine Perrodin, Christoph Kayser, Nikos K Logothetis, and Christopher I Petkov. Auditory and visual modulation of temporal lobe neurons in voice-sensitive and association cortices. Journal of Neuroscience, 34(7):2524–2537, 2014.
Sarkar and Tang [2021] Sanat K Sarkar and Cheng Yong Tang. Adjusting the Benjamini-Hochberg method for controlling the false discovery rate in knockoff assisted variable selection. arXiv preprint arXiv:2102.09080, 2021.
Statnikov et al. [2013] Alexander Statnikov, Nikita I Lytkin, Jan Lemeire, and Constantin F Aliferis. Algorithms for discovery of multiple Markov boundaries. Journal of machine learning research: JMLR, 14:499, 2013.
Storey et al. [2004] John D Storey, Jonathan E Taylor, and David Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.
Tan et al. [2020] Kai Tan, Lei Shi, and Zhou Yu. Sparse SIR: Optimal rates and adaptive estimation. The Annals of Statistics, 48(1):64 – 85, 2020. 10.1214/18-AOS1791. URL https://doi.org/10.1214/18-AOS1791.
Van Essen et al. [2013] David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The WU-Minn human connectome project: an overview. Neuroimage, 80:62–79, 2013.
Wu and Li [2011] Yichao Wu and Lexin Li. Asymptotic properties of sufficient dimension reduction with a diverging number of predictors. Statistica Sinica, 2011(21):707, 2011.
Xia [2007] Yingcun Xia. A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 35(6):2654–2690, 2007.
Xing et al. [2021] Xin Xing, Zhigen Zhao, and Jun S Liu. Controlling false discovery rate using gaussian mirrors. Journal of the American Statistical Association, (just-accepted):1–45, 2021.
Yin and Cook [2002] Xiangrong Yin and R Dennis Cook. Dimension reduction for the conditional k-th moment in regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(2):159–175, 2002.
Zhang and Zhang [2014] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.
Zhang and Cheng [2017] Xianyang Zhang and Guang Cheng. Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768, 2017.
Zhu et al. [2006] L. Zhu, B. Miao, and H. Peng. On sliced inverse regression with high-dimensional covariates. Journal of the American Statistical Association, 101(474):640–643, 2006.
Zhu and Fang [1996] Li-Xing Zhu and Kai-Tai Fang. Asymptotics for kernel estimate of sliced inverse regression. The Annals of Statistics, 24(3):1053–1068, 1996.

(a1) Power for setting 1	(a2) FDP for setting 1

(b1) Power for setting 2	(b2) FDP for setting 2

(c1) Power for setting 3	(c2) FDP for setting 3

(d1) Power for setting 4	(d2) FDP for setting 4

(a1) Power for setting 1	(a2) FDP for setting 1

(b1) Power for setting 2	(b2) FDP for setting 2

(c1) Power for setting 3	(c2) FDP for setting 3

(d1) Power for setting 4	(d2) FDP for setting 4