
On the testing of multiple hypotheses in sliced inverse regression

Zhigen Zhao (zhaozhg@temple.edu), Department of Statistics, Operations, and Data Science, Temple University, Philadelphia, PA 19122, USA
Xin Xing (xinxing@vt.edu), Department of Statistics, Virginia Tech, Blacksburg, VA 24061, USA
Abstract

We consider multiple testing within a general regression framework for studying the relationship between a univariate response and a $p$-dimensional predictor. To test the hypothesis on the effect of each predictor, we construct an Angular Balanced Statistic (ABS) based on the sliced inverse regression estimator, without assuming a model for the conditional distribution of the response. Using the limiting distribution results developed in this paper, we show that ABS is asymptotically symmetric about zero under the null hypothesis. We then propose a Model-free multiple Testing procedure using Angular balanced statistics (MTA) and show theoretically that its false discovery rate is asymptotically no greater than a designated level. Numerical evidence shows that the MTA method is considerably more powerful than its alternatives, subject to the control of the false discovery rate.


t1 Zhigen Zhao is Associate Professor, Department of Statistics, Operations, and Data Science, Temple University. t2 Xin Xing is Assistant Professor, Department of Statistics, Virginia Tech.

Keywords: model free, FDR, sufficient dimension reduction

1 Introduction

In the general framework of regression analysis, the goal is to infer the relation between a univariate response variable $y$ and a $p\times 1$ predictor vector $\bm{x}$. One would like to know $y|\bm{x}$, namely, how the distribution of $y$ depends on the value of $\bm{x}$. In the literature on sufficient dimension reduction [25, 8, 9, 28, 29], the fundamental idea is to replace the predictor by its projection onto a subspace without loss of information. In other words, we seek a subspace $\mathcal{S}_{y|\bm{x}}$ of the predictor space such that

y\models\bm{x}\,|\,\mathbf{P}_{\mathcal{S}_{y|\bm{x}}}\bm{x}. (1)

Here $\models$ indicates independence, and $\mathbf{P}_{(\cdot)}$ stands for a projection operator. The subspace $\mathcal{S}_{y|\bm{x}}$ is called the central subspace. Let $d$ be the dimension of this central subspace, and let $\mathbf{B}$, a $p\times d$ matrix, have columns forming a basis of $\mathcal{S}_{y|\bm{x}}$. Then equation (1) is equivalent to

y\models\bm{x}\,|\,\mathbf{B}^{T}\bm{x}. (2)

To further reduce the dimensionality, especially when the number of predictors $p$ diverges with $n$, it is commonly assumed that $y$ depends on $\bm{x}$ only through a subset of $\bm{x}$, known as the Markov blanket and denoted by $\mathcal{MB}(y,\bm{x})$ [33, 36, 5], such that

y\models\bm{x}\,|\,\mathcal{MB}(y,\bm{x}).

For each predictor, one would like to know whether $x_{j}\in\mathcal{MB}(y,\bm{x})$, which can be formulated as a multiple testing problem. The null hypothesis that $x_{j}\notin\mathcal{MB}(y,\bm{x})$ is equivalent to

\mathcal{H}_{j}:\ \mathbf{P}_{span(x_{j})}(\mathcal{S}_{y|\bm{x}})=\mathcal{O}_{p}, (3)

where $\mathcal{O}_{p}$ is the origin of the $p$-dimensional space [10]. In other words, it is equivalent to saying that the $j$-th row of the matrix $\mathbf{B}$ consists of all zeros.

Many attempts have been made to estimate the central subspace in the literature on sufficient dimension reduction. The most widely used method is sliced inverse regression (SIR), first introduced in [25]. Later extensions of SIR include, but are not limited to, sliced average variance estimation [14, 8], directional regression [24], and constructive estimation [41]. Nevertheless, most existing methods and theories in sufficient dimension reduction focus on the estimation of the central subspace $\mathcal{S}_{y|\bm{x}}$. Results on statistical inference are very limited when $p$ diverges, let alone procedures controlling the false discovery rate (FDR) when testing these hypotheses simultaneously.

The challenge arises from two perspectives. First, in the literature on sufficient dimension reduction, results on the limiting distribution of the estimator of the central subspace are very limited when $p$ diverges. When $p$ is fixed, [20, 46, 30, 23] derived the asymptotic distribution of sliced inverse regression. To the best of the authors' knowledge, there are no results on the limiting distribution when $p$ diverges, unless the signal is assumed to be strong and the total number of false hypotheses is fixed [40].

Second, after a test statistic is determined for each hypothesis, combining these statistics into a method that controls the false discovery rate is challenging. Many existing procedures, such as [2, 3, 42, 35], work with (generalized) linear regression models. In [5], the authors considered an arbitrary joint distribution of $y$ and $\bm{x}$ and proposed the model-X knockoff to control the FDR. However, this method requires that the distribution of the design matrix be known, which is infeasible in many practical settings.

The study in [21] explored variable selection for the linear regression model under weak and rare signals. It notes that selection consistency is not achievable in this regime and that allowing for false positives is necessary. While several existing penalization-based methods in sufficient dimension reduction require a uniform signal strength condition to achieve consistent results ([27, 29, 38, 44]), this work tackles a more challenging scenario. Specifically, we develop a central limit theorem for SIR, utilizing recent theories on Gaussian approximation [6, 7], without relying on uniform signal strength conditions. This theoretical result is the first of its kind in the literature on sufficient dimension reduction and is a necessity for simultaneous inference when the effects of some relevant predictors are either moderate or weak.

We proceed by constructing a statistic for each hypothesis based on sliced inverse regression. Applying Gaussian approximation theory, we demonstrate that this statistic is asymptotically symmetric about zero when the null hypothesis holds; we refer to it as an Angular Balanced Statistic (ABS). We then develop a single-step procedure that rejects a hypothesis when its ABS exceeds a certain threshold, and we provide an estimator of the false discovery proportion. For a designated FDR level $q$, we adaptively select a threshold such that the estimated false discovery proportion is no greater than $q$. This method is referred to as the Model-free multiple Testing procedure using Angular balanced statistics (MTA). Theoretical analysis confirms that MTA asymptotically controls the FDR at level $q$ under regularity conditions. Simulation results and data analysis demonstrate that MTA significantly outperforms its competitors in terms of power while controlling the FDR.

The paper is organized as follows. In Section 2, we derive the central limit theorem of SIR when both the dimension $p$ and the number of important predictors $s$ diverge with $n$, using the recently developed Gaussian approximation theory. In Section 3, we construct the ABS based on the SIR estimator and propose the MTA method; we show that the FDR of MTA is asymptotically no greater than a designated level. In Sections 4 and 5, we provide numerical evidence, including extensive simulations and a real data analysis, to demonstrate the advantages of MTA. We conclude the paper in Section 6 and include all technical details in the appendix.

Notation

We adopt the following notation throughout this paper. For a matrix $\bm{A}$, we call the space generated by its column vectors the column space and denote it by $col(\bm{A})$. The element at the $i$-th row and $j$-th column of a matrix $\bm{A}$ is denoted by $A_{ij}$ or $a_{ij}$. The $i$-th row and $j$-th column of the matrix are denoted by $\bm{A}_{i\cdot}$ and $\bm{A}_{\cdot j}$, respectively. The minimum and maximum eigenvalues of $\bm{A}$ are denoted by $\lambda_{min}(\bm{A})$ and $\lambda_{max}(\bm{A})$, respectively. For two positive numbers $a,b$, we use $a\vee b$ and $a\wedge b$ to denote $\max\{a,b\}$ and $\min\{a,b\}$, respectively. We use $c$, $C$, $C^{\prime}$, $C_{1}$, and $C_{2}$ to denote generic absolute constants, though their actual values may vary from case to case.

2 Gaussian Approximation of SIR

Recall that $y$ is the response and $\bm{x}$ is a $p$-dimensional vector. In the literature on sufficient dimension reduction, one aims to find the central subspace $\mathcal{S}_{y|\bm{x}}$ defined in (1). Sliced inverse regression (SIR), introduced in [25], is the first and the most popular among many existing methods.

Assume that the covariance matrix of $\bm{x}$ is $\bm{\Sigma}$, and let $\bm{\Omega}=\bm{\Sigma}^{-1}$ be the precision matrix of $\bm{x}$. Let $s_{j}=\#\{k:\bm{\Omega}_{jk}\neq 0\}$ and $s=\max_{j}s_{j}$. When the distribution of $\bm{x}$ is elliptically symmetric, it is shown in [25] that

\bm{\Sigma}\,\mathcal{S}_{y|\bm{x}}=col(\bm{\Lambda}), (4)

where $\bm{\Lambda}=var(\mathbb{E}[\bm{x}|y])$ and $col(\bm{\Lambda})$ is the column space spanned by $\bm{\Lambda}$.

Suppose we are given $n$ i.i.d. samples $(y_{i},\bm{x}_{i})$, $i=1,\cdots,n$. To estimate $\bm{\Lambda}$, divide the data into $H$ slices according to the order statistics $y_{(i)}$, $i=1,\ldots,n$, and let $\bm{x}_{(i)}$ be the concomitant associated with $y_{(i)}$. Slicing the data naturally forms a partition of the support of the response variable, denoted $\mathcal{H}$. Let $\mathcal{P}_{h}$ be the $h$-th slice in the partition $\mathcal{H}$, with $\mathcal{P}_{1}=(-\infty,y_{(\lceil n/H\rceil)}]$ and $\mathcal{P}_{H}=(y_{(\lceil n/H\rceil(H-1)+1)},+\infty)$. Let $\bar{\bm{x}}$ be the mean of all the $\bm{x}_{i}$'s and $\bar{\bm{x}}_{h,\cdot}$ the sample mean of the vectors $\bm{x}_{(j)}$ whose concomitants satisfy $y_{(j)}\in\mathcal{P}_{h}$. We estimate $\bm{\Lambda}\triangleq var(\mathbb{E}[\bm{x}|y])$ by

\widehat{\bm{\Lambda}}_{\mathcal{H}}=\frac{1}{H}\sum_{h=1}^{H}(\bar{\bm{x}}_{h,\cdot}-\bar{\bm{x}})(\bar{\bm{x}}_{h,\cdot}-\bar{\bm{x}})^{\tau}. (5)

The estimator $\widehat{\bm{\Lambda}}_{\mathcal{H}}$ has been shown to be consistent for $\bm{\Lambda}$ under some technical conditions [18, 20, 46, 25, 28].
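As a concrete illustration, the slicing estimator in (5) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `sir_lambda_hat`, the equal-size slicing via `argsort`, and the toy data are our own assumptions.

```python
import numpy as np

def sir_lambda_hat(x, y, H=10):
    """Estimate Lambda = var(E[x|y]) as in (5) by slicing on the order of y."""
    n, p = x.shape
    order = np.argsort(y)                 # concomitants x_(i) follow the order statistics y_(i)
    xbar = x.mean(axis=0)
    lam = np.zeros((p, p))
    for idx in np.array_split(order, H):  # H slices of roughly n/H observations each
        d = x[idx].mean(axis=0) - xbar
        lam += np.outer(d, d)
    return lam / H

# toy data: only the first coordinate drives y
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 5))
y = x[:, 0] + 0.1 * rng.standard_normal(500)
lam = sir_lambda_hat(x, y)
```

With this toy data the estimated $\widehat{\bm{\Lambda}}$ concentrates its mass on the first coordinate, as expected from (4).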

Alternatively, we can view SIR through a sequence of ordinary least squares regressions. Let $f_{h}(y)$, $h=1,2,\cdots,H$, be a sequence of transformations of $y$. Following the proofs in [43, 40], one knows that under the linearity condition [25],

E[f_{h}(y)\phi(y)]\in\mathcal{S}_{y|\bm{x}},

where $\phi(y)=\bm{\Sigma}^{-1}E(\bm{x}|y)$. Let $\bm{\beta}_{h}\in R^{p\times 1}$ be defined as

\bm{\beta}_{h}=\operatorname{argmin}_{\bm{\beta}}E(f_{h}(y)-\bm{x}^{T}\bm{\beta})^{2}.

Assuming the coverage condition [32, 13, 40], we then have

Span(\mathbf{B})=\mathcal{S}_{y|\bm{x}},

where $\mathbf{B}=(\bm{\beta}_{1},\cdots,\bm{\beta}_{H})\in R^{p\times H}$.

Note that different choices of $f_{h}(y)$ lead to different methods [40, 17]. To name a few, [43] suggested $f_{h}(y)=y^{h}$ with $h\leq H$. After slicing the data into $H$ slices according to the value of the response variable, [13] suggested $f_{h}(y)=y$ if $y$ is in the $h$-th slice and $0$ otherwise. Choosing $f_{h}(y)=\mathbbm{1}(y\in\mathcal{P}_{h})$ leads to SIR, which is the main focus of this paper [43, 40].

After obtaining the data $(\bm{x}_{i},y_{i})$ for a sample of $n$ subjects, let

f_{h}(\bm{y})=\mathbbm{1}(\bm{y}\in\mathcal{P}_{h})=\left(\mathbbm{1}(y_{1}\in\mathcal{P}_{h}),\mathbbm{1}(y_{2}\in\mathcal{P}_{h}),\cdots,\mathbbm{1}(y_{n}\in\mathcal{P}_{h})\right)^{T}.

Let $\hat{\bm{\beta}}_{h}$ be defined as

\hat{\bm{\beta}}_{h}=\operatorname{argmin}_{\bm{\beta}}||f_{h}(\bm{y})-\bm{x}^{T}\bm{\beta}||^{2}=(\bm{x}\bm{x}^{T})^{-1}\bm{x}f_{h}(\bm{y}),

or, in a more general form,

\hat{\bm{\beta}}_{h}=\frac{1}{n}\hat{\bm{\Omega}}\bm{x}f_{h}(\bm{y}), (6)

where $\hat{\bm{\Omega}}$ is a suitable approximation of the inverse of the Gram matrix $\hat{\bm{\Sigma}}=\bm{x}\bm{x}^{T}/n$. Let

\hat{\bm{B}}=(\hat{\bm{\beta}}_{1},\cdots,\hat{\bm{\beta}}_{H}). (7)

There are many methods to estimate the precision matrix; as an example, one can use the estimator given by the lasso for the node-wise regression on the design matrix $\bm{x}$ [31]. Next, we derive the central limit theorem of SIR when $p$ diverges to infinity.
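The estimator in (6)-(7) can be sketched as follows; this is a minimal sketch under our own assumptions, using the exact inverse Gram matrix as $\hat{\bm{\Omega}}$ (valid only when $n>p$), whereas in high dimensions one would substitute a regularized precision estimate such as the node-wise lasso. The function name and toy data are hypothetical.

```python
import numpy as np

def sir_B_hat(x, y, H=10):
    """Columns are beta_hat_h from (6)-(7): regress each slice indicator on x."""
    n, p = x.shape
    F = np.zeros((n, H))
    for h, idx in enumerate(np.array_split(np.argsort(y), H)):
        F[idx, h] = 1.0                       # f_h(y_i) = 1(y_i in slice h)
    omega_hat = np.linalg.inv(x.T @ x / n)    # exact inverse Gram matrix (n > p here)
    return omega_hat @ (x.T @ F) / n          # p x H matrix B_hat

rng = np.random.default_rng(1)
x = rng.standard_normal((400, 6))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(400)
B = sir_B_hat(x, y)
```

Only the first row of the resulting $p\times H$ matrix should carry an appreciable signal here, since $y$ depends on $\bm{x}$ only through its first coordinate.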

Our derivation is built upon the Gaussian approximation (GAR) theory recently developed in [6, 7]. Let $\{\mathcal{P}_{h}\}_{h=1}^{H}$ be the partition of the sample space of $y$ and $p_{h}=P(y\in\mathcal{P}_{h})$. Define

\tilde{\bm{\beta}}_{h}=\frac{1}{n}\bm{\Omega}\bm{x}f_{h}(\bm{y}). (8)

Let $\tilde{\bm{B}}=(\tilde{\bm{\beta}}_{1},\cdots,\tilde{\bm{\beta}}_{H})$. For $i=1,2,\cdots,n$, $j=1,2,\cdots,p$, and $h=1,2,\cdots,H$, let the $z_{ijh}$'s be normal random variables such that

  • $Ez_{ijh}=p_{h}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}|y\in\mathcal{P}_{h})$;

  • $V(z_{ijh})=p_{h}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}\bm{x}^{T}|y\in\mathcal{P}_{h})\bm{\Omega}_{\cdot j}-p_{h}^{2}(\bm{\Omega}_{\cdot j}^{T}E(\bm{x}|y\in\mathcal{P}_{h}))^{2}$;

  • $Cov(z_{ijh},z_{ikh})=p_{h}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}\bm{x}^{T}|y\in\mathcal{P}_{h})\bm{\Omega}_{\cdot k}-p_{h}^{2}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}|y\in\mathcal{P}_{h})\,\bm{\Omega}_{\cdot k}^{T}E(\bm{x}|y\in\mathcal{P}_{h})$;

  • $Cov(z_{ijh_{1}},z_{ikh_{2}})=-p_{h_{1}}p_{h_{2}}\bm{\Omega}_{\cdot j}^{T}E(\bm{x}|y\in\mathcal{P}_{h_{1}})\,\bm{\Omega}_{\cdot k}^{T}E(\bm{x}|y\in\mathcal{P}_{h_{2}})$.

Note that when $\mathcal{H}_{j}$ is true and the coverage condition holds, $Ez_{ijh}=0$ for any $i$ and $h$ (see Remark LABEL:remark:z in the appendix).

Let $\bm{z}_{i\cdot\cdot}$ be a $p$ by $H$ matrix consisting of the $z_{ijh}$'s, $j=1,2,\cdots,p$ and $h=1,2,\cdots,H$. Let $\mathcal{A}^{re}$ consist of all sets $A$ of the form

A=\{\omega\in R^{pH}:a_{j}\leq\omega_{j}\leq b_{j},\ j=1,2,\cdots,pH\} (9)

for some $-\infty\leq a_{j}\leq b_{j}\leq\infty$. We will develop a bound on the quantity

\rho_{n}(\mathcal{A}^{re})=\sup_{A\in\mathcal{A}^{re}}\left|\mathbb{P}(\sqrt{n}\hat{\mathbf{B}}\in A)-\mathbb{P}\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{z}_{i\cdot\cdot}\in A\Big)\right|. (10)

The theoretical investigation requires the following assumptions.

Assumption 1.

The design matrix $\bm{x}$ has i.i.d. sub-Gaussian rows; in other words, $\sup_{||a||_{2}\leq 1}\mathbf{E}\exp\{|\sum_{j=1}^{p}a_{j}x_{ij}|^{2}/C\}\leq 1$ for some large enough positive constant $C$.

Assumption 2.

There exist two positive constants $c$ and $C$ such that the smallest eigenvalue $\lambda_{min}$ of $\bm{\Sigma}$ is greater than $c$ and the largest eigenvalue $\lambda_{max}$ of $\bm{\Sigma}$ is less than $C$.

Assumption 1, also assumed in [45], describes the tail behavior of the predictors. Bounds on the eigenvalues of $\bm{\Sigma}$ are commonly assumed when considering the estimation of the covariance matrix and the precision matrix. In sliced inverse regression, the data are sliced according to the response variable $y$, creating a partition of the support of the response. We assume the following condition on the partition.

Definition 1 (Distinguishable Partition).

For a given $H$, we call $\mathcal{DP}_{H}(b)$, a collection of partitions $-\infty=a_{0}<a_{1}<\cdots<a_{H-1}<a_{H}=\infty$ of $\mathbb{R}$, distinguishable if every partition in the collection satisfies the following:

  • (D1) there exist two constants $\gamma_{1}$ and $\gamma_{2}$ such that

    \frac{\gamma_{1}}{H}\leq p_{h}\leq\frac{\gamma_{2}}{H},\quad\textrm{where } p_{h}=\mathbb{P}(a_{h-1}\leq Y<a_{h});

  • (D2) $\lambda_{min}\left(Cov\left(\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h}),\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h})\right)\right)>b$ for all $h=1,2,\cdots,H$, where $b$ is a constant that does not depend on $n$ and $p$.

The condition (D1) requires that the probability that the data fall in each slice be inversely proportional to the number of slices. This condition is assumed in [28] when establishing the phase transition of SIR. The condition (D2) requires sufficient variation of the part of $\bm{x}$ that falls in each slice; otherwise, if the smallest eigenvalue of the covariance matrix in a given slice converges to zero, the corresponding estimator becomes unstable. In Remark 2, we give examples in which (D2) holds.

Theorem 1.

Assume Assumptions 1-2 and a partition $\mathcal{P}_{H}\in\mathcal{DP}_{H}(b)$. If $s^{3}p^{3}\log(pHn)^{7}/n\leq C_{1}n^{-c_{1}}$ for some constants $c_{1},C_{1}>0$, then with probability at least $1-2\exp(-cn^{\alpha})$,

\rho_{n}(\mathcal{A}^{re})\leq CD_{n,p}, (11)

where $C$ is a constant and $D_{n,p}=\left(\frac{\log^{7}(pHn)}{n}\right)^{1/6}$.

The proof is given in the appendix. This theorem shows that the estimator $\hat{\mathbf{B}}$ can be approximated by Gaussian random vectors that preserve the mean and covariance structure.

Remark 1.

When the dimension $p$ is fixed, it is known that the limiting distribution of the SIR estimator is normal ([20, 30, 47]). In [40], the authors used a penalization method to estimate the central space and further established a central limit theorem. However, that development requires a uniform signal strength condition: the minimum magnitude of the nonzero parameters must exceed some bound so that the nulls and the alternatives are well separated ([44]). On the contrary, Theorem 1 does not rely on the uniform signal strength condition.

Additionally, this theorem provides a uniform convergence result, which is essential for simultaneous inference when the effects of some relevant predictors are either moderate or weak.

Remark 2.

Comments on the distinguishable condition. The condition (D2) guarantees that the ordinary least squares estimation in each slice is stable. This condition holds in examples such as the inverse regression model suggested by Cook [12, 11]. In this model, it is assumed that the covariate $\bm{x}$ is given as

\bm{x}_{y}=\mu+\bm{\Gamma}\bm{\nu}_{y}+\sigma\bm{\epsilon},

where $\mu\in\mathbb{R}^{p}$, $\bm{\Gamma}\in\mathbb{R}^{p\times d}$, and $\bm{\Gamma}^{T}\bm{\Gamma}=I_{d}$. Then

Cov(\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h}),\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h}))\succcurlyeq p_{h}\sigma^{2}\,Cov(\bm{\epsilon},\bm{\epsilon}).

In [22], the authors considered the following model:

\bm{x}\,|\,y\in\mathcal{P}_{h}\sim MVN(\mu_{h},\bm{\Sigma}),\ \forall h.

It is easily seen that

Cov(\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h}),\bm{x}\mathbbm{1}(y\in\mathcal{P}_{h}))\succcurlyeq p_{h}\bm{\Sigma}.

3 Model Free Multiple Testing using Angular Balanced Statistics (MTA)

In this section, we consider testing the hypotheses defined in (3) simultaneously such that the false discovery rate is controlled at a designated level $q$. We start with the construction of the test statistic.

First, we randomly split the sample into two parts, denoted $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ [15]. Let $\hat{\mathbf{B}}^{1}$ and $\hat{\mathbf{B}}^{2}$ be the sliced inverse regression estimators based on these two subsets, respectively. We define the following mirror statistics:

M_{j}=\sum_{h=1}^{H}\hat{\mathbf{B}}^{1}_{jh}\hat{\mathbf{B}}^{2}_{jh}. (12)

Note that when the $j$-th hypothesis is true, namely, $\mathbf{P}_{span(x_{j})}(\mathcal{S}_{y|\bm{x}})=\mathcal{O}_{p}$, one would expect both vectors $\hat{\mathbf{B}}_{j\cdot}^{1}$ and $\hat{\mathbf{B}}_{j\cdot}^{2}$ to center around zero. On the other hand, when $\mathcal{H}_{j}$ is false, both vectors center around a common non-zero vector $\bm{B}_{j\cdot}$, and the angles between $\hat{\bm{B}}_{j\cdot}^{k}$, $k=1,2$, and $\bm{B}_{j\cdot}$ are approximately the same. Consequently, we call the statistic $M_{j}$ an Angular Balanced Statistic (ABS). Intuitively, the ABS is symmetric with respect to zero under the null and tends to take a positive value under the alternative, as depicted in Figure 1. In the left panel, for the predictor $x_{1}\in\mathcal{MB}(y,\bm{x})$, the estimated coefficient vectors $\hat{\mathbf{B}}_{1\cdot}^{1}$ and $\hat{\mathbf{B}}_{1\cdot}^{2}$ center around a non-zero vector. On the other hand, for $x_{2}$, which does not belong to $\mathcal{MB}(y,\bm{x})$, the estimated coefficients $\hat{\mathbf{B}}_{2\cdot}^{1}$ and $\hat{\mathbf{B}}_{2\cdot}^{2}$ center around the origin. In the right panel, we plot the histogram of all the mirror statistics when generating the data according to Setting 1 in Section 4. It is seen that the $M_{j}$'s are roughly symmetric with respect to zero when the corresponding null hypotheses are true; when the null hypothesis is false, the statistic tends to take a large positive value.
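The sample-splitting construction of (12) can be sketched as follows. This is a minimal sketch under our own assumptions: `sir_B_hat` is a plain-OLS stand-in for (7), and the split, seed, and toy data are hypothetical choices.

```python
import numpy as np

def sir_B_hat(x, y, H=10):
    # minimal stand-in for (7): OLS of the slice indicators on x
    n = len(y)
    F = np.zeros((n, H))
    for h, idx in enumerate(np.array_split(np.argsort(y), H)):
        F[idx, h] = 1.0
    return np.linalg.inv(x.T @ x / n) @ (x.T @ F) / n

def abs_statistics(x, y, H=10, seed=0):
    """M_j = sum_h B1[j, h] * B2[j, h] as in (12), from a random sample split."""
    perm = np.random.default_rng(seed).permutation(len(y))
    i1, i2 = perm[: len(y) // 2], perm[len(y) // 2:]
    B1 = sir_B_hat(x[i1], y[i1], H)
    B2 = sir_B_hat(x[i2], y[i2], H)
    return (B1 * B2).sum(axis=1)

rng = np.random.default_rng(2)
x = rng.standard_normal((600, 8))
y = x[:, 0] ** 3 + 0.2 * rng.standard_normal(600)
M = abs_statistics(x, y)
```

For this toy model, only the first predictor is relevant, so $M_{1}$ should be clearly positive while the remaining mirror statistics scatter around zero with either sign.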

Figure 1: In the left panel, an illustration of the coefficient vectors for three variables $X_{1},X_{2}$, and $X_{3}$, where $\mathcal{H}_{1}$ is false and $\mathcal{H}_{2},\mathcal{H}_{3}$ are true. The right panel shows the histogram of the mirror statistics for all variables when the data are generated according to Setting 1 with $\rho=0.4$ in Section 4.

The following theorem provides a rigorous argument for the above intuition: the ABS is asymptotically symmetric with respect to zero under the null hypothesis.

Theorem 2.

Assume the same conditions as in Theorem 1. If the null hypothesis $\mathcal{H}_{j}$ is true, then with probability at least $1-2\exp(-cn^{\alpha})$,

\sup_{t}\left|\mathbb{P}\left(M_{j}<-t\right)-\mathbb{P}\left(M_{j}>t\right)\right|\leq CD_{n,p}. (13)

We now turn to false discovery rate control. To simplify notation, let $\theta_{j}$ be the indicator of whether the $j$-th hypothesis defined in (3) is false, and let $\mathcal{S}_{0}=\{j:\theta_{j}=0\}$ be the set consisting of all the true hypotheses. Consider the single-step procedure in which the $j$-th hypothesis is rejected if the ABS $M_{j}$ is greater than some threshold $t$; namely, we choose a decision $\bm{\delta}=(\delta_{1},\cdots,\delta_{p})$ with $\delta_{j}=1(M_{j}>t)$. Then the false discovery proportion is

FDP(t)=\frac{\sum_{j\in\mathcal{S}_{0}}\delta_{j}}{(\sum_{j}\delta_{j})\vee 1}.

According to Theorem 2, the probabilities $\mathbb{P}(M_{j}>t)$ and $\mathbb{P}(M_{j}<-t)$ are approximately the same when the hypothesis $\mathcal{H}_{j}$ is true. Namely,

\mathbb{E}\sum_{j\in\mathcal{S}_{0}}\delta_{j}=\sum_{j\in\mathcal{S}_{0}}\mathbb{P}(M_{j}>t)\approx\sum_{j\in\mathcal{S}_{0}}\mathbb{P}(M_{j}<-t)\leq\sum_{j}\mathbb{P}(M_{j}<-t).

Consequently, we estimate the false discovery proportion by

\widehat{FDP(t)}=\frac{\sum_{j}1(M_{j}<-t)}{\sum_{j}1(M_{j}>t)}. (14)

Note that $\widehat{FDP(t)}$ tends to overestimate the true false discovery proportion. However, this is not a serious issue for two reasons: (i) the overestimation protects us from an excess of false positives, and (ii) the probability $\mathbb{P}(M_{j}<-t)$ is negligible when $\mathcal{H}_{j}$ is false.

For any designated level $q$, we choose a data-driven threshold $\tau_{q}$ as

\tau_{q}=\min\left\{t:\widehat{FDP(t)}\leq q\right\}.

We call this method the Model-free multiple Testing procedure based on ABS (MTA). The steps of MTA are summarized in Algorithm 1.

  1. Split the sample into two parts, denoted $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$;

  2. For each part, estimate $\hat{\mathbf{B}}^{1}$ and $\hat{\mathbf{B}}^{2}$ according to (7);

  3. For $j=1,2,\cdots,p$, calculate the ABS $M_{j}=\sum_{h=1}^{H}\hat{\mathbf{B}}^{1}_{jh}\hat{\mathbf{B}}^{2}_{jh}$;

  4. Find the threshold $\tau_{q}=\min\{t:\widehat{FDP(t)}\leq q\}$, where $\widehat{FDP(t)}=\frac{\sum_{j=1}^{p}1(M_{j}\leq-t)}{\sum_{j=1}^{p}1(M_{j}>t)}$;

  5. For $j=1,2,\cdots,p$, reject $\mathcal{H}_{j}$ if $M_{j}>\tau_{q}$ and fail to reject $\mathcal{H}_{j}$ otherwise.

Algorithm 1 Model free multiple Testing procedure based on ABS (MTA)
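Steps 4 and 5 of Algorithm 1 can be sketched as below. This is a minimal sketch, not the authors' code: the search over the observed $|M_{j}|$ values and the function names are our own assumptions, and the numerator uses $M_{j}\leq -t$ as in step 4 of Algorithm 1.

```python
import numpy as np

def mta_threshold(M, q=0.1):
    """Step 4: smallest t (searched over the observed |M_j|) with
    FDP_hat(t) = #{M_j <= -t} / #{M_j > t} <= q."""
    for t in np.sort(np.abs(M[M != 0])):
        if (M <= -t).sum() / max((M > t).sum(), 1) <= q:
            return t
    return np.inf          # no feasible threshold: reject nothing

def mta_reject(M, q=0.1):
    """Step 5: indices j with M_j > tau_q."""
    return np.where(M > mta_threshold(M, q))[0]

# two clearly positive signals among small, roughly symmetric nulls
M = np.array([5.0, 4.0, 0.1, -0.1, 0.05, -0.2, 3.5])
rejected = mta_reject(M, q=0.1)
```

On this toy vector the threshold lands at 3.5, so only the first two coordinates are rejected; the small negative values act as the "mirror" count that calibrates the cutoff.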

We now turn our attention to the asymptotic properties of MTA, for which we impose the following assumption.

Assumption 3.

(a) Let $\bm{\Sigma}^{jj}$ be the covariance matrix of $\bm{z}_{ij\cdot}$ and $\bm{\Sigma}^{jj^{\prime}}$ the covariance between $\bm{z}_{ij\cdot}$ and $\bm{z}_{ij^{\prime}\cdot}$. We assume that

\sum_{j,j^{\prime}\in\mathcal{S}_{k}}r_{1}\left((\bm{\Sigma}^{jj})^{-1/2}\bm{\Sigma}^{jj^{\prime}}(\bm{\Sigma}^{j^{\prime}j^{\prime}})^{-1}\bm{\Sigma}^{j^{\prime}j}(\bm{\Sigma}^{jj})^{-1/2}\right)<C_{1}p_{k}^{\alpha_{1}}

holds for $k=1,2$, a constant $C_{1}>0$, and $\alpha_{1}\in(0,2)$, where $r_{1}(\cdot)$ denotes the largest singular value. (b) Let $\rho(z_{ijh},z_{ij^{\prime}h})$ be the correlation coefficient of $z_{ijh}$ and $z_{ij^{\prime}h}$. We assume that

\sum_{j,j^{\prime}\in\mathcal{S}_{0}}\sum_{h=1}^{H}|\rho(z_{ijh},z_{ij^{\prime}h})|<C_{2}p_{0}^{\alpha_{2}}

holds for a constant $C_{2}>0$ and $\alpha_{2}\in(0,2)$.

For linear models, a similar assumption, stating that the sum of partial correlations is bounded by $C_{1}p_{0}^{\alpha}$ with $\alpha\in(0,2)$, is imposed in [42]. When dealing with nonlinear models, Assumption 3 requires that the sum of pairwise partial correlations in each slice be bounded by $Cp_{0}^{\alpha}$ for a constant $C>0$ and $\alpha\in(0,2)$. We consider, however, a more general nonlinear model in which the contribution of the $j$-th variable is related to vectors of length $H$. Assumption 3(a) amounts to bounding the pairwise canonical correlations, a general procedure for investigating the relationships between two sets of variables. In Assumption 3(b), we consider the absolute sum of the correlation coefficients of corresponding variables in the two sets. Since the correlation measure is not unique, we consider two different types to enlarge the set of cases covered by our theory. The following two lemmas characterize the sum of pairwise covariances of the mirror statistics, which is the key to establishing the asymptotic properties of MTA.

Lemma 1.

Let $M_{j}$ be the angular balanced statistic defined in (12). If Assumption 3 holds, then with probability at least $1-2\exp(-cn^{\alpha})$,

\sum_{j,k\in\mathcal{S}_{0}}Cov\left(\mathbbm{1}(M_{j}\geq t),\mathbbm{1}(M_{k}\geq t)\right)\leq C_{1}^{\prime}|\mathcal{S}_{0}|^{\alpha_{1}},\ \forall\ t,

where $\alpha_{1}\in(0,2)$ and $C_{1}^{\prime}>0$ is a constant.

In Lemma 1, we build a relationship between the correlations among the mirror statistics and the correlations among the covariates. Next, we derive the asymptotic properties of the MTA method. We define a few quantities:

V(t)=\frac{\#\{j:j\in\mathcal{S}_{0},M_{j}\leq-t\}}{p_{0}},\quad V^{\prime}(t)=\frac{\#\{j:j\in\mathcal{S}_{0},M_{j}\geq t\}}{p_{0}},\quad V^{1}(t)=\frac{\#\{j:j\in\mathcal{S}_{1},M_{j}\geq t\}}{p_{1}},

G_{0}(t)=\lim_{p}\frac{\sum_{j\in\mathcal{S}_{0}}\mathbb{E}1(M_{j}\leq-t)}{p_{0}},\quad G_{1}(t)=\lim_{p}\frac{\sum_{j\in\mathcal{S}_{1}}\mathbb{E}1(M_{j}\geq t)}{p_{1}},

and

FDP(t):=\frac{V^{\prime}(t)}{(V^{\prime}(t)+r_{p}V^{1}(t))\vee 1/p},\quad FDP^{\infty}(t):=\lim_{p}\frac{V^{\prime}(t)}{(V^{\prime}(t)+r_{p}V^{1}(t))\vee 1/p},

where $r_{p}=p_{0}/p_{1}$ and $FDP^{\infty}(t)$ is the pointwise limit of $FDP(t)$. We have the following results.

Lemma 2.

Suppose Assumption 3(a) or (b) holds and $G_{0}(t)$ is a continuous function. Then we have

\sup_{t}\left|V(t)-G_{0}(t)\right|\xrightarrow{p}0,\quad\sup_{t}\left|V^{\prime}(t)-G_{0}(t)\right|\xrightarrow{p}0,\quad\sup_{t}\left|V^{1}(t)-G_{1}(t)\right|\xrightarrow{p}0.

Lemma 2 is based on the weak dependence assumption 3(a) or 3(b) under a model-free framework. In the literature, similar weak dependence conditions on p-values [37] and in linear models [15, 42, 16] are commonly used to study the asymptotic properties of the FDR. With the aid of this lemma, we show that the FDR of the proposed MTA method is asymptotically no greater than $q$.

Theorem 3.

For any given level $q\in(0,1)$, assume that the pointwise limit $\text{FDP}^{\infty}(t)$ of $\text{FDP}(t)$ exists for all $t>0$ and that there is a constant $t_{q}>0$ such that $\text{FDP}^{\infty}(t_{q})\leq q$. When Assumptions 1-3 hold and $\mathcal{P}_{H}\in\mathcal{DP}_{H}(b)$, we have

\text{FDP}\leq q+o_{p}(1)\quad\text{and}\quad\limsup_{n,p\to\infty}\text{FDR}\leq q.

The proof of Theorem 3 is included in the appendix.

4 Numerical Studies

In this section, we use simulations to compare the finite sample performance of MTA with state-of-the-art competitors. First, we consider the model-X knockoff, which constructs knockoffs $\tilde{\bm{x}}$ by assuming the distribution of $\bm{x}$ is Gaussian with a known covariance matrix. We construct the feature statistic as

W_{j}=\sum_{h=1}^{H}\hat{\mathbf{B}}_{jh}\hat{\mathbf{B}}_{\tilde{j}h},

where $\hat{\mathbf{B}}$ is the solution of (7) when replacing $\bm{x}$ by $(\bm{x},\tilde{\bm{x}})$. Second, we consider a marginal non-parametric test of the dependence between the response and each predictor, using the Hilbert-Schmidt independence criterion (HSIC) test statistic proposed in [19]; after obtaining the p-values, we apply the Benjamini-Hochberg (BH) procedure (HSIC+BH). Third, we apply the dimension reduction coordinate test (DR-coor) of Cook [10] to test the contribution of each predictor and calculate the p-values, followed by the BH method (DR-coor+BH). In the following sections, we consider four different multi-index models with a wide range of nonlinear dependencies between the response and the predictors. For the predictors, we consider two designs: a Gaussian design and a real-data design.

4.1 Simulation with Gaussian Design

We set n = 1000 and p = 200. The covariates are generated from a multivariate Gaussian distribution N(0, Σ). The covariance matrix Σ is autoregressive; that is, σ_ij, the element in the i-th row and j-th column, equals ρ^|i−j|, where ρ is taken to be 0, 0.2, 0.4, 0.6, or 0.8. For each index, we randomly choose 20 coefficients to be nonzero and generate the nonzero coefficients from N(0, 20/√n).
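The Gaussian design can be reproduced in a few lines. The sketch below draws one sparse index vector and reads N(0, 20/√n) as specifying the variance (the paper's parameterization could also be read as a standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, k = 1000, 200, 0.4, 20  # rho ranges over 0, 0.2, ..., 0.8

# AR(1) covariance: sigma_ij = rho ** |i - j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# one index vector with k randomly placed nonzero coefficients
a1 = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
a1[support] = rng.normal(0.0, np.sqrt(20 / np.sqrt(n)), size=k)
```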

Setting 1:

y=f_{1}(\bm{x})+\sigma\epsilon=\sin(\bm{a}_{1}^{T}\bm{x})+(\bm{a}_{2}^{T}\bm{x})^{3}+\sigma\epsilon,

Setting 2:

y=f_{2}(\bm{x})+\sigma\epsilon=\frac{3\bm{a}_{1}^{T}\bm{x}}{0.5+(1.5+\bm{a}_{2}^{T}\bm{x})^{2}}+\sigma\epsilon,

Setting 3:

y=f_{3}(\bm{x},\epsilon)=(\bm{a}_{1}^{T}\bm{x})^{3}+(\bm{a}_{2}^{T}\bm{x})\times\sigma\epsilon,

Setting 4:

y=f_{4}(\bm{x},\epsilon)=\frac{1}{1+\exp\left[\sum_{\ell=1}^{5}\mathrm{Relu}(\bm{a}_{\ell}^{T}\bm{x})\right]}+\sigma\epsilon,

where $\mathrm{Relu}(\cdot)=\max(0,\cdot)$ is a popular choice of activation function in the context of artificial neural networks. This setting is equivalent to a fully connected neural network with one hidden layer.

In settings 1 to 4, ε is generated from a standard Gaussian distribution. The value of σ_1 is kept at 0.5 in the first setting. For j = 2, 3, 4, we adjust σ_j so that the ratio Var(f_j(x))/σ_j equals Var(f_1(x))/σ_1, so the signal-to-noise ratios are kept constant across all settings. The first three settings use two indices, while the fourth uses five. The errors are homoscedastic in settings 1, 2, and 4, and heteroscedastic in setting 3.
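As an illustration of the data-generating step and the noise calibration, a sketch for Setting 1 (function names are ours; the calibration follows the ratio Var(f_j(x))/σ_j as described above):

```python
import numpy as np

def setting1(X, a1, a2, sigma, rng):
    """Setting 1: y = sin(a1'x) + (a2'x)^3 + sigma * eps, Gaussian noise."""
    f = np.sin(X @ a1) + (X @ a2) ** 3
    return f + sigma * rng.standard_normal(X.shape[0])

def calibrate_sigma(f_values, target_ratio):
    """Choose sigma so that Var(f)/sigma matches the target ratio,
    keeping the signal-to-noise level constant across settings."""
    return np.var(f_values) / target_ratio
```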

Note that the Markov blanket is $\mathcal{MB}(y,\bm{x})=\{x_{j}:a_{\ell j}\neq 0\ \text{for some}\ \ell\}$. To evaluate performance, we count the number of true positives (TPs), the number of variables in $\mathcal{MB}(y,\bm{x})$ selected by a given procedure, and the number of false positives (FPs), the difference between the total number of selected variables and the TPs. The empirical power is defined as $\#\{\text{TPs}\}/\#\mathcal{MB}(y,\bm{x})$ and the empirical FDP as $\#\{\text{FPs}\}/\#\{\text{selected variables}\}$. In all settings, we set the number of slices H to 20 for our proposed method.
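The empirical power and FDP computations reduce to simple set arithmetic; a sketch (the helper name is ours):

```python
def power_fdp(selected, markov_blanket):
    """Empirical power and FDP from the index sets of selected and
    truly relevant variables."""
    sel, mb = set(selected), set(markov_blanket)
    tp = len(sel & mb)  # true positives
    fp = len(sel) - tp  # false positives
    power = tp / len(mb) if mb else 0.0
    fdp = fp / len(sel) if sel else 0.0
    return power, fdp
```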

Figure 2: Empirical power and FDP for settings 1-4 (panels a1-d1: power; panels a2-d2: FDP). Here the design matrix is generated from the multivariate normal distribution.

In Figure 2, we plot boxplots of the simulated FDP and power of the considered methods under settings 1-4. The FDR of MTA is controlled at the predefined level under all settings, and MTA has larger power than its competitors. The FDPs obtained by the Model-X knockoff tend to be either conservative or to exceed the predefined level by a large margin when the correlation is high; this phenomenon has also been observed in [42, 5]. The HSIC+BH method controls the FDR well in settings 1, 2, and 4, but its FDR is significantly inflated in setting 3, where the noise is heteroscedastic, and its power is lower than that of our proposed method. The DR-coor+BH test achieves high power when ρ = 0; however, its power decreases significantly and its FDR becomes inflated as ρ increases.

4.2 Simulation studies based on brain connectome

In this section, we use a real design matrix extracted from brain connectome data in the Human Connectome Project (HCP) dataset. The HCP aims at characterizing human brain connectivity in about 1,200 healthy adults to enable detailed comparisons between brain circuits, behavior, and genetics at the level of individual subjects. Customized scanners were used to produce high-quality and consistent data for measuring brain connectivity. The data, containing various traits and MRI data, can be accessed through db.humanconnectome.org. The real design matrix includes variables with a more complex correlation structure; moreover, the marginal distributions of the variables include skewed, heavy-tailed, and multi-modal distributions. To be consistent with our Gaussian setting, we randomly select p = 200 brain connections as predictors and synthesize y using f_1, ..., f_4 from settings 1-4. We consider different signal-to-noise levels by generating the noise with σ = ψ/√n, where ψ = 10, 15, 20, 25.

Figure 3: Empirical power and FDP for settings 1-4 (panels a1-d1: power; panels a2-d2: FDP). Here the design matrix is generated from the brain connectome data.

It is seen that the proposed MTA method is robust under complex correlation structures. As shown in Figure 3, for settings 1-4 the MTA method maintains high power subject to the control of the FDR across different signal-to-noise levels. Since the distribution of the brain connectome predictors is highly non-Gaussian, the Model-X knockoff method becomes conservative, with diminishing power. In contrast, HSIC+BH and DR-coor+BH cannot control the FDR well, partially due to the high correlation among the covariates in the brain connectome data.

5 Real Data Analysis on Brain Connectome

Neuroimaging studies hold great promise for predicting and reducing psychiatric disease burden and for advancing our understanding of the cognitive abilities that underlie humanity's intellectual feats. Functional magnetic resonance imaging (fMRI) data are increasingly being used for the ambitious task of relating individual differences in brain function to typical variations in complex psychological phenotypes. In functional brain imaging research, functional network analysis, which uses quantitative methods to study the connections between different brain areas, has become a popular approach. The Human Connectome Project (HCP) dataset aims at characterizing human brain connectivity in about 1,200 healthy adults to enable detailed studies of the association between brain function and behavior at the level of individual subjects. Customized scanners were used to produce high-quality and consistent data for measuring brain connectivity. The data, containing various traits and fMRI data, can be accessed through db.humanconnectome.org.

To obtain functional connectomes, we used a state-of-the-art fMRI data preprocessing framework, population-based functional connectome (PSC) mapping. The dataset used in this work comes from the HCP Q1 release [39], consisting of fMRI scans from 68 subjects during the resting state (rfMRI). The acquisition parameters are as follows: 90×104 matrix, 220 mm FOV, 72 slices, TR = 0.72 s, TE = 33.1 ms, flip angle = 52°. Preprocessing of the rfMRI data was conducted within the Q1 release, including motion correction, spatial smoothing, temporal pre-whitening, slice-time correction, and global drift removal. The fMRI data from each individual were then linearly registered to the MNI152 template.

Since we are particularly interested in cognition, we extract four cognition-related measures as the response y from the HCP, including:

  • Oral reading recognition test: participants are asked to read and pronounce letters.

  • Crystallized composite score: the crystallized cognition composite can be interpreted as a global assessment of verbal reasoning; we use the age-corrected standard score.

Our predictors are the brain functional connections between each pair of ROIs. Based on the brain parcellation, we have 68 × 68/2 connections. We first applied a screening procedure to select the top 1/5 of the connections as candidates, and then applied the proposed MTA method to select important connections associated with the cognitive traits, with the false discovery rate controlled at 10%.

Figure 4: Left: selected functional connections associated with the oral reading recognition test. Right: selected functional connections associated with the crystallized composite score.

As shown in the left panel of Figure 4, we observe dense connections within both the left and right frontal lobes. In particular, brain regions such as r23 (rostral middle frontal) and l27 (caudal middle frontal) are densely involved (nodes with high degree) in the selected connections. We also observe that the nodes in the temporal lobe (regions 8, 14, 29) all appear in the selected connections to the frontal lobe. The frontal lobe is involved in interpreting language, and the temporal lobes are believed to play an important role in processing auditory information and in the encoding of memory [34]. Hence, our results suggest that richer connections between the visual and auditory regions are strongly associated with higher learning ability in vocabulary. As shown in the right panel of Figure 4, we observe more connections between the frontal lobe and the temporal lobe. Temporal lobe regions are linked to memory, language, and visual perception [1]. Additionally, we observe a high level of communication between the left and right brain hemispheres, which contributes to higher cognition [4].

6 Conclusion

In this paper, we consider a multiple hypothesis testing procedure that aims at controlling the FDR without assuming a model for the conditional distribution of the response. We combine the ideas of data splitting, the Gaussian mirror, and sliced inverse regression (SIR) to construct mirror statistics. With the aid of the central limit theorem developed for SIR, the mirror statistic is shown to be approximately symmetric about zero when the null hypothesis is true. We then provide an estimator of the false discovery proportion for the single-step testing method and suggest a data-driven threshold based on this estimator. It is shown theoretically that the FDR of the proposed method is asymptotically less than or equal to a designated level.

The theoretical investigation of SIR using the newly developed Gaussian approximation theory paves the road for statistical inference in sufficient dimension reduction with a diverging dimension. It has the potential to be applied to other sufficient dimension reduction methods, such as the sliced average variance estimator (SAVE, [8]), principal Hessian directions (PHD, [26]), and many others. We leave these for future work.

The proposed approach is flexible owing to its lean assumptions on the link function. The popular deep neural network can be viewed as a multiple-index model; in an ongoing project, we investigate how the proposed method can be used to reduce the dimension under this framework.

7 Acknowledgements

Data used in the preparation of this article were obtained from the Human Connectome Project (www.humanconnectome.org). The HCP WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) was funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research and by the McDonnell Center for Systems Neuroscience at Washington University. We would like to thank Bing Li, Wen Zhou, and Asaf Weinstein for their comments and suggestions. Xin Xing's research is supported by NSF grant DMS-2124535.

References

  • Allone et al. [2017] Cettina Allone, Viviana Lo Buono, Francesco Corallo, Laura Rosa Pisani, Patrizia Pollicino, Placido Bramanti, and Silvia Marino. Neuroimaging and cognitive functions in temporal lobe epilepsy: a review of the literature. Journal of the neurological sciences, 381:7–15, 2017.
  • Barber and Candès [2015] Rina Foygel Barber and Emmanuel J Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
  • Barber and Candès [2019] Rina Foygel Barber and Emmanuel J Candès. A knockoff filter for high-dimensional selective inference. The Annals of Statistics, 47(5):2504–2537, 2019.
  • Boisgueheneuc et al. [2006] Foucaud du Boisgueheneuc, Richard Levy, Emmanuelle Volle, Magali Seassau, Hughes Duffau, Serge Kinkingnehun, Yves Samson, Sandy Zhang, and Bruno Dubois. Functions of the left superior frontal gyrus in humans: a lesion study. Brain, 129(12):3315–3328, 2006.
  • Candes et al. [2018] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: model-X knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
  • Chernozhukov et al. [2013] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.
  • Chernozhukov et al. [2017] Victor Chernozhukov, Denis Chetverikov, Kengo Kato, et al. Central limit theorems and bootstrap in high dimensions. Annals of Probability, 45(4):2309–2352, 2017.
  • Cook [2000] D. R. Cook. SAVE: a method for dimension reduction and graphics in regression. Communications in statistics-Theory and methods, 29(9-10):2109–2121, 2000.
  • Cook et al. [2007] D. R. Cook, B. Li, and F. Chiaromonte. Dimension reduction in regression without matrix inversion. Biometrika, 94(3):569–584, 2007.
  • Cook [2004] R Dennis Cook. Testing predictor contributions in sufficient dimension reduction. The Annals of Statistics, 32(3):1062–1092, 2004.
  • Cook [2007] R Dennis Cook. Fisher lecture: Dimension reduction in regression. Statistical Science, pages 1–26, 2007.
  • Cook and Ni [2005] R Dennis Cook and Liqiang Ni. Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. Journal of the American Statistical Association, 100(470):410–428, 2005.
  • Cook and Ni [2006] R Dennis Cook and Liqiang Ni. Using intraslice covariances for improved estimation of the central subspace in regression. Biometrika, 93(1):65–74, 2006.
  • Cook and Weisberg [1991] R Dennis Cook and Sanford Weisberg. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
  • Dai et al. [2020] Chenguang Dai, Buyu Lin, Xin Xing, and Jun S Liu. False discovery rate control via data splitting. arXiv preprint arXiv:2002.08542, 2020.
  • Dai et al. [2022] Chenguang Dai, Buyu Lin, Xin Xing, and Jun S Liu. False discovery rate control via data splitting. Journal of the American Statistical Association, (just-accepted):1–38, 2022.
  • Dong [2021] Yuexiao Dong. A brief review of linear sufficient dimension reduction through optimization. Journal of Statistical Planning and Inference, 211:154–161, 2021.
  • Duan and Li [1991] N. Duan and K. C. Li. Slicing regression: a link-free regression method. The Annals of Statistics, 19(2):505–530, 1991.
  • Gretton et al. [2007] Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A kernel statistical test of independence. Advances in neural information processing systems, 20, 2007.
  • Hsing and Carroll [1992] T. Hsing and R. J. Carroll. An asymptotic theory for sliced inverse regression. The Annals of Statistics, 20(2):1040–1061, 1992.
  • Ji and Zhao [2014] Pengsheng Ji and Zhigen Zhao. Rate optimal multiple testing procedure in high-dimensional regression. arXiv preprint arXiv:1404.2961, 2014.
  • Jiang and Liu [2014] Bo Jiang and Jun S Liu. Variable selection for general index models via sliced inverse regression. The Annals of Statistics, 42(5):1751–1786, 2014.
  • Li and Dong [2009] Bing Li and Yuexiao Dong. Dimension reduction for nonelliptically distributed predictors. The Annals of Statistics, 37(3):1272–1298, 2009.
  • Li and Wang [2007] Bing Li and Shaoli Wang. On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479):997–1008, 2007.
  • Li [1991] K. C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
  • Li [1992] Ker-Chau Li. On principal hessian directions for data visualization and dimension reduction: Another application of stein’s lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.
  • Li [2007] L. Li. Sparse sufficient dimension reduction. Biometrika, 94(3):603–613, 2007.
  • Lin et al. [2018] Qian Lin, Zhigen Zhao, and Jun S Liu. On consistency and sparsity for sliced inverse regression in high dimensions. The Annals of Statistics, 46(2):580–610, 2018.
  • Lin et al. [2019] Qian Lin, Zhigen Zhao, and Jun S Liu. Sparse sliced inverse regression via lasso. Journal of the American Statistical Association, 114(528):1726–1739, 2019.
  • Saracco [1997] Jérôme Saracco. An asymptotic theory for sliced inverse regression. Communications in Statistics - Theory and Methods, 26(9):2141–2171, 1997.
  • Meinshausen and Bühlmann [2006] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
  • Ni et al. [2005] L. Ni, D. R. Cook, and C. L. Tsai. A note on shrinkage sliced inverse regression. Biometrika, 92(1):242–247, 2005.
  • Pearl [1988] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
  • Perrodin et al. [2014] Catherine Perrodin, Christoph Kayser, Nikos K Logothetis, and Christopher I Petkov. Auditory and visual modulation of temporal lobe neurons in voice-sensitive and association cortices. Journal of Neuroscience, 34(7):2524–2537, 2014.
  • Sarkar and Tang [2021] Sanat K Sarkar and Cheng Yong Tang. Adjusting the Benjamini-Hochberg method for controlling the false discovery rate in knockoff assisted variable selection. arXiv preprint arXiv:2102.09080, 2021.
  • Statnikov et al. [2013] Alexander Statnikov, Nikita I Lytkin, Jan Lemeire, and Constantin F Aliferis. Algorithms for discovery of multiple Markov boundaries. Journal of machine learning research: JMLR, 14:499, 2013.
  • Storey et al. [2004] John D Storey, Jonathan E Taylor, and David Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.
  • Tan et al. [2020] Kai Tan, Lei Shi, and Zhou Yu. Sparse SIR: Optimal rates and adaptive estimation. The Annals of Statistics, 48(1):64 – 85, 2020. 10.1214/18-AOS1791. URL https://doi.org/10.1214/18-AOS1791.
  • Van Essen et al. [2013] David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The WU-Minn human connectome project: an overview. Neuroimage, 80:62–79, 2013.
  • Wu and Li [2011] Yichao Wu and Lexin Li. Asymptotic properties of sufficient dimension reduction with a diverging number of predictors. Statistica Sinica, 2011(21):707, 2011.
  • Xia [2007] Yingcun Xia. A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 35(6):2654–2690, 2007.
  • Xing et al. [2021] Xin Xing, Zhigen Zhao, and Jun S Liu. Controlling false discovery rate using gaussian mirrors. Journal of the American Statistical Association, (just-accepted):1–45, 2021.
  • Yin and Cook [2002] Xiangrong Yin and R Dennis Cook. Dimension reduction for the conditional k-th moment in regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(2):159–175, 2002.
  • Zhang and Zhang [2014] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.
  • Zhang and Cheng [2017] Xianyang Zhang and Guang Cheng. Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768, 2017.
  • Zhu et al. [2006] L. Zhu, B. Miao, and H. Peng. On sliced inverse regression with high-dimensional covariates. Journal of the American Statistical Association, 101(474):640–643, 2006.
  • Zhu and Fang [1996] Li-Xing Zhu and Kai-Tai Fang. Asymptotics for kernel estimate of sliced inverse regression. The Annals of Statistics, 24(3):1053–1068, 1996.