
Mapping the Genetic-Imaging-Clinical Pathway with Applications to Alzheimer’s Disease

Dengdeng Yu
Department of Mathematics, University of Texas at Arlington
Linbo Wang
Department of Statistical Sciences, University of Toronto
Dehan Kong
Department of Statistical Sciences, University of Toronto
Hongtu Zhu
Department of Biostatistics, University of North Carolina, Chapel Hill
for the Alzheimer’s Disease Neuroimaging Initiative*

*Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Abstract

Alzheimer’s disease is a progressive form of dementia that results in problems with memory, thinking, and behavior. It often starts with abnormal aggregation and deposition of $\beta$ amyloid and tau, followed by neuronal damage such as atrophy of the hippocampi, leading to Alzheimer’s disease (AD). The aim of this paper is to map the genetic-imaging-clinical pathway for AD in order to delineate the genetically regulated brain changes that drive disease progression, based on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. We develop a novel two-step approach to delineate the association between high-dimensional 2D hippocampal surface exposures and the Alzheimer’s Disease Assessment Scale (ADAS) cognitive score, while taking into account the ultra-high dimensional clinical and genetic covariates at baseline. Analysis results suggest that the radial distance of each pixel of both hippocampi is negatively associated with the severity of behavioral deficits conditional on observed clinical and genetic covariates. These associations are stronger in the Cornu Ammonis region 1 (CA1) and subiculum subregions than in the Cornu Ammonis region 2 (CA2) and Cornu Ammonis region 3 (CA3) subregions. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

Keywords: 2D surface, behavioral deficits, confounders, hippocampus, variable selection.

1 Introduction

Alzheimer’s disease (AD) is an irreversible brain disorder that slowly destroys memory and thinking skills. According to World Alzheimer Reports (Gaugler et al., 2019), there are around 55 million people worldwide living with Alzheimer’s disease and related dementia. The total global cost of Alzheimer’s disease and related dementia was estimated to be a trillion US dollars, equivalent to 1.1% of global gross domestic product. Alzheimer’s patients often suffer from behavioral deficits, including memory loss and difficulty with thinking, reasoning, and decision making.

In the current model of AD pathogenesis, it is well established that deposition of amyloid plaques is an early event that, in conjunction with tau pathology, causes neuronal damage. Scientists have identified risk genes that may cause the abnormal aggregation and deposition of the amyloid plaques (e.g. Morishima-Kawashima and Ihara, 2002). The neuronal damage typically starts from the hippocampus and results in the first clinical manifestations of the disease in the form of episodic memory deficits (Weiner et al., 2013). Specifically, Jack Jr et al. (2010) presented a hypothetical model for biomarker dynamics in AD pathogenesis, which has been empirically supported by many works in the literature. The model begins with the abnormal deposition of $\beta$ amyloid (A$\beta$) fibrils, as evidenced by a corresponding drop in the levels of soluble A$\beta$-42 in cerebrospinal fluid (CSF) (Aizenstein et al., 2008). After that, neuronal damage begins to occur, as evidenced by increased levels of CSF tau protein (Hesse et al., 2001). Numerous studies have investigated how A$\beta$ and tau impact the hippocampus (e.g. Ferreira and Klein, 2011), known to be fundamentally involved in acquisition, consolidation, and recollection of new episodic memories (Frozza et al., 2018). In particular, as neuronal degeneration progresses, brain atrophy, which starts with hippocampal atrophy (Fox et al., 1996), becomes detectable by magnetic resonance imaging (MRI). Studies from recent years have also found other important CSF proteins that may be related to hippocampal atrophy. For instance, low levels of chromogranin A (CgA) and trefoil factor 3 (TFF3), and high levels of cystatin C (CysC), have been associated with hippocampal atrophy (Khan et al., 2015; Paterson et al., 2014). Indeed, the impact of protein concentrations on behavior can also act through atrophy of other brain regions; for example, entorhinal tau pathology may contribute to episodic memory decline (Maass et al., 2018). As sufficient brain atrophy accumulates, it results in cognitive symptoms and impairment. This process of AD pathogenesis is summarized by the flow chart in Figure 1. Note that it is still debatable how A$\beta$ and tau interact with each other, as mentioned by Jack Jr et al. (2013) and Majdi et al. (2020); however, it is evident that A$\beta$ may still hit a biomarker detection threshold earlier than tau (Jack Jr et al., 2013). In addition, as noted by Hampel et al. (2018), it is likely that highly complex interactions exist among A$\beta$, tau, and the cholinergic system. For instance, associations have been found between CSF biomarkers of amyloid and tau pathology in AD (Remnestål et al., 2021). It has also been found that other factors, such as dysregulation and dysfunction of the Wnt signaling pathway, may contribute to A$\beta$ and tau pathologies (Ferrari et al., 2014). In addition, the M1 and M3 subtypes of muscarinic receptors increase amyloid precursor protein production via the induction of the phospholipase C/protein kinase C pathway and increase BACE expression in AD brains (Nitsch et al., 1992).

Figure 1: A hypothetical model of AD pathogenesis based on Selkoe and Hardy (2016). The double arrows represent the possible interactions among A$\beta$, tau, and the cholinergic system. The red arrow denotes the conditional association we are interested in estimating.

The aim of this paper is to map the genetic-imaging-clinical (GIC) pathway for AD, which is the most important part of the hypothetical model of AD pathogenesis in Figure 1. Histological studies have shown that the hippocampus is particularly vulnerable to Alzheimer’s disease pathology and has already been considerably damaged at the first occurrence of clinical symptoms (Braak and Braak, 1998). Therefore, the hippocampus has become a major focus in Alzheimer’s studies (De Leon et al., 1989). Some neuroscientists even conjecture that the association between hippocampal atrophy and behavioral deficits may be causal, because the former destroys the connections that help neurons communicate and results in a loss of function (Watson, 2019). We are interested in delineating the genetically regulated hippocampal shape that drives AD-related behavioral deficits and disease progression.

To map the GIC pathway, we extract clinical, imaging, and genetic variables from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study as follows. First, we use the Alzheimer’s Disease Assessment Scale (ADAS) cognitive score to quantify behavioral deficits, for which a higher score indicates more severe behavioral deficits. Second, we characterize the exposure of interest, hippocampal shape, by the hippocampal morphometry surface measure, summarized as two $100\times 150$ matrices corresponding to the left/right hippocampi. Each element of the matrices is a continuous-valued variable, representing the radial distance from the corresponding coordinate on the hippocampal surface to the medial core of the hippocampus. Compared with the conventional scalar measure of hippocampal shape (Jack et al., 2003), recent studies show that the additional information contained in the hippocampal morphometry surface measure is valuable for Alzheimer’s diagnosis (Thompson et al., 2004). For example, Li et al. (2007) showed that surface measures of the hippocampus provide more subtle indexes than volume differences in discriminating between patients with Alzheimer’s and healthy control subjects. In our case, with the 2D matrix radial distance measure, one may investigate how local shapes of hippocampal subfields are associated with behavioral deficits. Third, the ADNI study measures ultra-high dimensional genetic covariates and other demographic covariates at baseline. There are more than 6 million genetic variants per subject.

The special data structure of the ADNI data application presents new challenges for statistically mapping the GIC pathway. First, unlike conventional statistical analyses that deal with a scalar exposure, our exposure of interest is represented by high-dimensional 2D hippocampal imaging measures. Second, the dimension of the baseline covariates, which are also potential confounders, is much larger than the sample size. Recently there have been many developments in confounder selection, most of them in the causal inference literature. Studies show that including variables associated only with the exposure and not directly with the outcome except through the exposure (known as instrumental variables) may result in a loss of efficiency in the causal effect estimate (e.g. Schisterman et al., 2009), while including variables related only to the outcome and not the exposure (known as precision variables) may provide efficiency gains (e.g. Brookhart et al., 2006); see Shortreed and Ertefaie (2017), Richardson et al. (2018), Tang et al. (2021), and references therein for an overview.

When a large number of covariates are available, the primary difficulty in mapping the GIC pathway is how to include all the confounders and precision variables, while excluding all the instrumental variables and irrelevant variables (those related to neither outcome nor exposure). We develop a novel two-step approach to estimate the conditional association between the high-dimensional 2D hippocampal surface exposure and the Alzheimer’s behavioral score, while taking into account the ultra-high dimensional baseline covariates. The first step is a fast screening procedure based on both the outcome and exposure models to rule out most of the irrelevant variables. The use of both models in screening is crucial for both computational efficiency and selection accuracy, as we will show in detail in Sections 3.3 and 3.4. The second step is a penalized regression procedure for the outcome generating model to further exclude instrumental and irrelevant variables, and simultaneously estimate the conditional association. Our simulations and ADNI data application demonstrate the effectiveness of the proposed procedure.

Our analysis represents a novel inferential target compared to recent developments in imaging genetics mediation analysis (Bi et al., 2017). Although we consider a similar set of variables, and a similar structure among these variables as illustrated later in Figure 2, our goal is to estimate the conditional association of hippocampal shape with behavioral deficits. In contrast, in mediation analysis, researchers are often interested in the effects of genetic factors on behavioral deficits, and how those effects are mediated through the hippocampus. Direct application of methods developed for imaging genetics mediation analysis to our problem may select genetic factors that are confounders affecting both hippocampal shape and behavioral deficits. In comparison, we aim to include precision variables in the adjustment set, as they may improve efficiency.

The rest of the article is organized as follows. Section 2 includes a detailed data and problem description. We introduce our models and a two-step variable selection procedure in Section 3. We analyze the ADNI data and estimate the association between hippocampal shape and behavioral deficits conditional on observed clinical and genetic covariates in Section 4. Simulations are conducted in Section 5 to evaluate the finite-sample performance of the proposed method. We finish with a discussion in Section 6. The theoretical properties of our procedure are included in Section 15 in the supplementary material.

2 Data and problem description

Understanding how human brains work and how they connect to human behavior is a central goal in medical studies. In this paper, we are interested in studying whether and how hippocampal shape is associated with behavioral deficits in Alzheimer’s studies. We consider the clinical, genetic, imaging, and behavioral measures in the ADNI dataset. The outcome of interest is the Alzheimer’s Disease Assessment Scale cognitive score observed at the 24th month after the baseline measurements. The Alzheimer’s Disease Assessment Scale cognitive 13-item score (ADAS-13) (Mohs et al., 1997) includes 13 items: word recall, naming objects and fingers, following commands, constructional praxis, ideational praxis, orientation, word recognition, remembering test directions, spoken language, comprehension, word-finding difficulty, delayed word recall, and a number cancellation or maze task. A higher ADAS score indicates more severe behavioral deficits.

The exposure of interest is the baseline 2D surface data obtained from the left/right hippocampi. The hippocampal surface data were preprocessed from the raw MRI data, with detailed preprocessing steps included in Section 9.1 of the supplementary material. After preprocessing, we obtained left and right hippocampal shape representations as two $100\times 150$ matrices. The imaging measurement at each pixel is an absolute metric, representing the radial distance from the pixel to the medial core of the hippocampus, measured in millimeters.

In the ADNI data, there are millions of observed covariates that one may need to adjust for, including the whole genome sequencing data from all of the 22 autosomes. We have included detailed genetic preprocessing techniques in Section 9.2 of the supplementary material. After preprocessing, 6,087,205 bi-allelic markers (including SNPs and indels) of 756 subjects were retained in the data analysis.

We excluded those subjects with missing hippocampal shape representations, baseline intracranial volume (ICV) information, or ADAS-13 score observed at Month 24, after which there are 566 subjects left. Our aim is to estimate the association between the hippocampal surface exposure and the ADAS-13 score conditional on clinical measures including age, gender, length of education, ICV, diagnosis status, and 6,087,205 bi-allelic markers.

3 Methodology

3.1 Basic set-up

Suppose we observe independent and identically distributed samples $\{L_i=(X_i,\bm{Z}_i,Y_i),\,1\leq i\leq n\}$ generated from $L$, where $L=(X,\bm{Z},Y)$ has support $\mathcal{L}=(\mathcal{X}\times\mathcal{Z}\times\mathcal{Y})$. Here $\bm{Z}\in\mathcal{Z}\subseteq\mathbb{R}^{p\times q}$ is a 2D-image continuous exposure, $Y\in\mathcal{Y}$ is a continuous outcome of interest, and $X\in\mathcal{X}\subseteq\mathbb{R}^{s}$ denotes a vector of ultra-high dimensional genetic (and clinical) covariates, where we assume $s\gg n$. We are interested in characterizing the association between the 2D exposure $\bm{Z}$ and the outcome $Y$ conditional on the observed covariates $X$.

Figure 2: Directed acyclic graph showing the potential high-dimensional confounders and precision variables $X$ (gold), the possible unmeasured confounders $U$ (light yellow), the 2D imaging exposure $\bm{Z}$ (green), the instrumental variables $X$ (purple), and the outcome of interest $Y$ (blue). The red arrow denotes the association of interest.

Denote $X_i=(X_{i1},\ldots,X_{is})^{\mathrm{T}}$. Without loss of generality, we assume that $X_{il}$ has been standardized for every $1\leq l\leq s$, and that $\bm{Z}_i$ and $Y_i$ have been centered. To map the GIC pathway, we assume the following linear equation models:

$$Y_i = \sum_{l=1}^{s} X_{il}\beta_l + \langle\bm{Z}_i,\bm{B}\rangle + \epsilon_i \quad\textrm{(outcome model)}; \qquad (1)$$
$$\bm{Z}_i = \sum_{l=1}^{s} X_{il}*\bm{C}_l + \bm{E}_i \quad\textrm{(exposure model)}. \qquad (2)$$

In (1), the matrix $\bm{B}\in\mathbb{R}^{p\times q}$ is the main parameter of interest, representing the association between the 2D imaging exposure $\bm{Z}_i$ and the behavioral outcome $Y_i$; $\beta_l$ represents the association between the $l$-th observed covariate $X_{il}$ and the behavioral outcome $Y_i$; and $\epsilon_i$ and $\bm{E}_i$ are random errors that may be correlated. The inner product between two matrices is defined as $\langle\bm{Z}_i,\bm{B}\rangle=\langle\mathrm{vec}(\bm{Z}_i),\mathrm{vec}(\bm{B})\rangle$, where $\mathrm{vec}(\cdot)$ is a vectorization operator that stacks the columns of a matrix into a vector. Model (2), previously introduced in Kong et al. (2020), specifies the relationship between the 2D imaging exposure and the observed covariates. Here $\bm{C}_l$ is a $p\times q$ coefficient matrix characterizing the association between the $l$-th covariate $X_{il}$ and the 2D imaging exposure $\bm{Z}_i$, and $\bm{E}_i$ is a $p\times q$ matrix of random errors with mean $\bm{0}$. The symbol “$*$” denotes element-wise multiplication. Define $\mathcal{M}_1=\{1\leq l\leq s:\beta_l\neq 0\}$ and $\mathcal{M}_2=\{1\leq l\leq s:\bm{C}_l\neq\bm{0}\}$, where we assume $|\mathcal{M}_1|\ll n$ and $|\mathcal{M}_2|\ll n$; here $|\mathcal{M}_1|$ and $|\mathcal{M}_2|$ denote the numbers of elements in $\mathcal{M}_1$ and $\mathcal{M}_2$, respectively.
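The following toy sketch makes models (1) and (2) concrete by generating a single observation; all dimensions, coefficient values, and variable names here are illustrative choices of ours, not the ADNI settings.

```python
# Minimal sketch of models (1)-(2); dimensions and values are toy choices.
import numpy as np

rng = np.random.default_rng(0)
s, p, q = 100, 64, 64                      # far below the ADNI scale
x = rng.standard_normal(s)                 # standardized covariates X_i
C = 0.01 * rng.standard_normal((s, p, q))  # exposure coefficients C_l
B = 0.01 * rng.standard_normal((p, q))     # coefficient image of interest
beta = np.zeros(s)
beta[:3] = [3.0, 1.0, 1.0 / 3.0]

# Exposure model (2): Z_i = sum_l X_il * C_l + E_i
Z = np.tensordot(x, C, axes=1) + 0.2 * rng.standard_normal((p, q))
# Outcome model (1): Y_i = sum_l X_il beta_l + <Z_i, B> + eps_i,
# where <Z_i, B> = <vec(Z_i), vec(B)> is the elementwise-product sum.
Y = x @ beta + np.sum(Z * B) + rng.standard_normal()
```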

To estimate $\bm{B}$, the first step is to perform variable selection in models (1) and (2). We group the covariates $X_l$ into four categories. Let $\mathcal{A}=\{1,\ldots,s\}$; denote by $\mathcal{C}$ the indices of confounders, i.e., variables associated with both the outcome and the exposure; by $\mathcal{P}$ the indices of precision variables, i.e., predictors of the outcome but not the exposure; by $\mathcal{I}$ the indices of instrumental variables, i.e., covariates that are associated only with the exposure and not directly with the outcome except through the exposure; and by $\mathcal{S}$ the indices of irrelevant variables, i.e., covariates related to neither the outcome nor the exposure. Mathematically, $\mathcal{C}=\{l\in\mathcal{A}\,|\,\beta_l\neq 0\textrm{ and }\bm{C}_l\neq 0\}$, $\mathcal{P}=\{l\in\mathcal{A}\,|\,\beta_l\neq 0\textrm{ and }\bm{C}_l=0\}$, $\mathcal{I}=\{l\in\mathcal{A}\,|\,\beta_l=0\textrm{ and }\bm{C}_l\neq 0\}$, and $\mathcal{S}=\{l\in\mathcal{A}\,|\,\beta_l=0\textrm{ and }\bm{C}_l=0\}$. The relationships among the different types of $X$, $\bm{Z}$, and $Y$ are shown in Figure 2, where $U$ denotes possible unmeasured confounders. Since we are interested in characterizing the association between $\bm{Z}$ and $Y$ conditional on $X$, further discussion of $U$ is omitted for the remainder of the paper.

When there are no unobserved confounders $U$, the estimate of $\bm{B}$ has a causal interpretation. In this case, the ideal adjustment set includes all confounders to avoid bias and all precision variables to increase statistical efficiency, while excluding instrumental and irrelevant variables (Brookhart et al., 2006; Shortreed and Ertefaie, 2017). Although we study the conditional association rather than the causal relationship due to possible unobserved confounding, our target adjustment set remains the same. In other words, we aim to retain all covariates in $\mathcal{M}_1=\mathcal{C}\cup\mathcal{P}=\{l\in\mathcal{A}\,|\,\beta_l\neq 0\}$, while excluding covariates in $\mathcal{I}\cup\mathcal{S}=\{l\in\mathcal{A}\,|\,\beta_l=0\}$.
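As a concrete illustration, the sketch below classifies covariate indices into $\mathcal{C}$, $\mathcal{P}$, $\mathcal{I}$, and $\mathcal{S}$ from the (true, in simulation) coefficients; the function name and interface are hypothetical.

```python
import numpy as np

def categorize(beta, C, tol=0.0):
    """Split indices by whether beta_l and C_l are (non)zero, as in the text.
    beta: (s,) outcome coefficients; C: (s, p, q) exposure coefficient images."""
    nz_beta = np.abs(beta) > tol
    nz_C = np.array([np.any(np.abs(Cl) > tol) for Cl in C])
    return {"confounders": np.where(nz_beta & nz_C)[0],    # set C
            "precision":   np.where(nz_beta & ~nz_C)[0],   # set P
            "instruments": np.where(~nz_beta & nz_C)[0],   # set I
            "irrelevant":  np.where(~nz_beta & ~nz_C)[0]}  # set S
```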

3.2 Naive screening methods

To find the nonzero $\beta_l$'s, a straightforward idea is to consider a penalized estimator obtained from the outcome generating model (1), imposing, say, Lasso penalties on the $\beta_l$'s. However, this is computationally infeasible in our ADNI data application, as the number of baseline covariates $s$ is over 6 million. Consequently, it is important to employ a screening procedure (e.g. Fan and Lv, 2008) to reduce the model size. To find covariates $X_l$ associated with the outcome $Y$ conditional on the exposure $\bm{Z}$, one might consider a conditional screening procedure for model (1) (Barut et al., 2016). Specifically, one can fit the model $Y_i=X_{il}\beta_l+\langle\bm{Z}_i,\bm{B}\rangle+\epsilon_i$ for each $1\leq l\leq s$, obtain the marginal estimates $\widehat{\beta}^{MZ}_l$, and then sort the $|\widehat{\beta}^{MZ}_l|$'s for screening. This procedure works well if the exposure $\bm{Z}$ is low dimensional, as one only needs to fit a low-dimensional ordinary least squares (OLS) regression $s$ times. However, in our ADNI data application, the imaging exposure $\bm{Z}$ is of dimension $pq=15{,}000$. As a result, one cannot obtain an OLS estimate since $n<pq$. Thus, to apply the conditional sure independence screening procedure to our application, one would need to solve a penalized regression problem for each $1\leq l\leq s$, such as $\arg\min_{\bm{B},\beta_l}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i-\langle\bm{Z}_i,\bm{B}\rangle-X_{il}\beta_l\right)^2+P_\lambda(\bm{B})\right]$, where $P_\lambda(\bm{B})$ is a penalty on $\bm{B}$. In theory, for each $l\in\{1,\ldots,s\}$, one can obtain the estimate $\widehat{\beta}^{MZ}_{l,\lambda}$ and then rank the $|\widehat{\beta}^{MZ}_{l,\lambda}|$'s. However, this is computationally prohibitive in the ADNI data with $s>6{,}000{,}000$. First, the penalized regression problem is much slower to solve than OLS. Second, selecting the tuning parameter $\lambda$ by grid search substantially increases the computational burden.

Alternatively, one may apply the marginal screening procedure of Fan and Lv (2008) to model (1). Specifically, one may solve the following marginal OLS problem for each $X_{il}$, ignoring the exposure $\bm{Z}_i$: $\arg\min_{\beta_l}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i-X_{il}\beta_l\right)^2\right]$. The marginal OLS estimate has the closed form $\widehat{\beta}^M_l=n^{-1}\sum_{i=1}^{n}X_{il}Y_i$, and one can rank the $|\widehat{\beta}^M_l|$'s for screening. Specifically, the selected sub-model is defined as $\widehat{\mathcal{M}}^*_1=\{1\leq l\leq s:|\widehat{\beta}^M_l|\geq\gamma_{1,n}\}$, where $\gamma_{1,n}$ is a threshold. Computationally, this is much faster than conditional screening for model (1), as we only need to fit a one-dimensional OLS regression $s>6{,}000{,}000$ times. However, this procedure is likely to miss some important confounders. To see this, plugging model (2) into (1) yields

$$Y_i = \sum_{l=1}^{s} X_{il}\left(\beta_l+\langle\bm{C}_l,\bm{B}\rangle\right)+\langle\bm{E}_i,\bm{B}\rangle+\epsilon_i.$$

Even in the ideal case where the $X_{il}$'s are orthogonal for $1\leq l\leq s$, $\widehat{\beta}^M_l$ is not a good estimate of $\beta_l$ because of the bias term $\langle\bm{C}_l,\bm{B}\rangle$. Thus, we may miss some nonzero $\beta_l$'s in the screening step if $\beta_l$ and $\langle\bm{C}_l,\bm{B}\rangle$ are of similar magnitude but opposite sign. We illustrate this point in Figure 5 in Section 5, where conventional marginal screening on (1) fails to capture some of the important confounders.
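For reference, the marginal screening step above reduces to a single matrix product; the sketch below (function name ours) computes all $s$ marginal OLS estimates at once and keeps the top $k$.

```python
import numpy as np

def marginal_outcome_screen(X, Y, k):
    """Return the indices of the k largest |beta_hat_l^M| (the set M_hat_1^*).
    X: (n, s) standardized covariates; Y: (n,) centered outcome."""
    beta_m = X.T @ Y / X.shape[0]            # beta_hat_l^M = n^{-1} sum_i X_il Y_i
    return np.argsort(-np.abs(beta_m))[:k]   # rank by |beta_hat_l^M|
```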

3.3 Joint screening

To overcome the drawbacks of the screening methods discussed in Section 3.2, we develop a joint screening procedure tailored to our ADNI data application. The procedure is not only computationally efficient, but can also select all the important confounders and precision variables with high probability. The key insight is that although we are interested in selecting important variables in the outcome generating model, this can be done much more efficiently by incorporating information from the exposure model. Specifically, let $\widehat{\bm{C}}^M_l=n^{-1}\sum_{i=1}^{n}X_{il}*\bm{Z}_i\in\mathbb{R}^{p\times q}$ be the marginal OLS estimate in model (2) for $l=1,\ldots,s$. Following Kong et al. (2020), the important covariates in model (2) can be selected by $\widehat{\mathcal{M}}_2=\{1\leq l\leq s:\|\widehat{\bm{C}}^M_l\|_{op}\geq\gamma_{2,n}\}$, where $\|\cdot\|_{op}$ is the operator norm of a matrix and $\gamma_{2,n}$ is a threshold.

We define our joint screening set as $\widehat{\mathcal{M}}=\widehat{\mathcal{M}}^*_1\cup\widehat{\mathcal{M}}_2$. Intuitively, most important confounders and precision variables are contained in the set $\widehat{\mathcal{M}}^*_1$. The only exceptions are covariates $X_l$ for which $\beta_l$ and $\langle\bm{C}_l,\bm{B}\rangle$ are of similar magnitude but opposite sign. These $X_l$, however, will be included in $\widehat{\mathcal{M}}_2$, and hence in $\widehat{\mathcal{M}}$, along with instrumental variables with large $\|\bm{C}_l\|_{op}$. In Section 15 of the supplementary material, we show that with properly chosen $\gamma_{1,n}$ and $\gamma_{2,n}$, the joint screening set includes the confounders and precision variables with high probability: $P(\mathcal{M}_1\subset\widehat{\mathcal{M}})\rightarrow 1$ as $n\rightarrow\infty$. In practice, we recommend choosing $\gamma_{1,n}$ and $\gamma_{2,n}$ such that $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|=k$, where $k$ is the smallest integer such that $|\widehat{\mathcal{M}}|\geq\lfloor n/\log(n)\rfloor$. We set the two sets to be of equal size following the convention that the size of a screening set is determined only by the sample size (Fan and Lv, 2008), which is the same for $\widehat{\mathcal{M}}^*_1$ and $\widehat{\mathcal{M}}_2$. Depending on prior knowledge about the sizes and signal strengths of the confounding, precision, and instrumental variables, the sizes of $|\widehat{\mathcal{M}}^*_1|$ and $|\widehat{\mathcal{M}}_2|$ may be chosen differently. In the simulations and real data analyses, we conduct sensitivity analyses by varying the relative sizes of $\widehat{\mathcal{M}}^*_1$ and $\widehat{\mathcal{M}}_2$.

In general, the set $\widehat{\mathcal{M}}$ includes not only the confounders and precision variables in $\mathcal{M}_1=\mathcal{C}\cup\mathcal{P}$, but also instrumental variables in $\mathcal{I}$ and a small subset of the irrelevant variables in $\mathcal{S}$. Nevertheless, the size of $\widehat{\mathcal{M}}$ is greatly reduced compared to the full set of observed covariates. This makes it feasible to perform the second-step procedure, a refined penalized estimation of $\bm{B}$ based on the covariates $\{X_l:l\in\widehat{\mathcal{M}}\}$.
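A minimal sketch of the joint screening rule, assuming the data fit in memory (in the ADNI application the marginal estimates would be computed SNP by SNP); it grows $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|=k$ until the union reaches the target size $\lfloor n/\log(n)\rfloor$.

```python
import numpy as np

def joint_screen(X, Y, Z, target=None):
    """Union of the outcome set M_hat_1^* and the exposure set M_hat_2.
    X: (n, s) covariates; Y: (n,) outcome; Z: (n, p, q) imaging exposure."""
    n, s = X.shape
    target = target or int(n / np.log(n))
    beta_m = np.abs(X.T @ Y) / n                  # |beta_hat_l^M|
    C_m = np.tensordot(X.T, Z, axes=1) / n        # (s, p, q) stack of C_hat_l^M
    C_op = np.array([np.linalg.norm(C_m[l], 2)    # operator norm = top singular value
                     for l in range(s)])
    r1, r2 = np.argsort(-beta_m), np.argsort(-C_op)
    k = 1
    while len(set(r1[:k]) | set(r2[:k])) < target:  # keep |M_1^*| = |M_2| = k
        k += 1
    return sorted(set(r1[:k]) | set(r2[:k]))
```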

3.4 Blockwise joint screening

Linkage disequilibrium (LD) is a ubiquitous biological phenomenon in which genetic variants exhibit a strong blockwise correlation structure (Wall and Pritchard, 2003). If all the SNPs of a particular LD block are important but have relatively weak signals, they may be missed by the screening procedure described in Section 3.3. To utilize the structural information of LD blocks to recover those missed SNPs, we develop the modified screening procedure described below.

Suppose that $X=(X_1,\ldots,X_s)^T$ can be divided into $b$ discrete haplotype blocks: regions of high LD that are separated from other haplotype blocks by many historical recombination events (Wall and Pritchard, 2003). Let the index sets of the $b$ non-overlapping blocks be $\mathcal{B}_1,\ldots,\mathcal{B}_b$ with $\cup_{j=1}^{b}\mathcal{B}_j=\{1,\ldots,s\}$. For $l=1,\ldots,s$, we define

$$\beta^{block,M}_l=\sum_{j=1}^{b}\frac{1(l\in\mathcal{B}_j)}{|\mathcal{B}_j|}\sum_{i\in\mathcal{B}_j}|\beta^M_i|\quad\textrm{and}\quad C^{block,M}_l=\sum_{j=1}^{b}\frac{1(l\in\mathcal{B}_j)}{|\mathcal{B}_j|}\sum_{i\in\mathcal{B}_j}\|\bm{C}^M_i\|_{op},$$
$$\widehat{\beta}^{block,M}_l=\sum_{j=1}^{b}\frac{1(l\in\mathcal{B}_j)}{|\mathcal{B}_j|}\sum_{i\in\mathcal{B}_j}|\widehat{\beta}^M_i|\quad\textrm{and}\quad \widehat{C}^{block,M}_l=\sum_{j=1}^{b}\frac{1(l\in\mathcal{B}_j)}{|\mathcal{B}_j|}\sum_{i\in\mathcal{B}_j}\|\widehat{\bm{C}}^M_i\|_{op},$$

where $1(\cdot)$ is the indicator function of an event. We also define

$$\widehat{\mathcal{M}}^{block,*}_1=\{1\leq l\leq s:\widehat{\beta}^{block,M}_l\geq\gamma_{3,n}\}\quad\textrm{and}\quad\widehat{\mathcal{M}}^{block}_2=\{1\leq l\leq s:\widehat{C}^{block,M}_l\geq\gamma_{4,n}\}.$$

We propose to use the new set $\widehat{\mathcal{M}}^{block}=\widehat{\mathcal{M}}^*_1\cup\widehat{\mathcal{M}}_2\cup\widehat{\mathcal{M}}^{block,*}_1\cup\widehat{\mathcal{M}}^{block}_2$, rather than $\widehat{\mathcal{M}}=\widehat{\mathcal{M}}^*_1\cup\widehat{\mathcal{M}}_2$, to select important covariates. Intuitively, when $|\beta_{l_1}|>|\beta_{l_2}|$, $X_{l_1}$ is much more easily selected than $X_{l_2}$ by $\widehat{\mathcal{M}}^*_1$. Suppose, however, that $l_1\in\mathcal{B}_1$ and $l_2\in\mathcal{B}_2$, with only a small proportion of the $X_l$ in $\mathcal{B}_1$ having $|\beta_l|>0$, whereas a large proportion of the $X_l$ in $\mathcal{B}_2$ have $|\beta_l|>0$. It may well be the case that $\beta^{block,M}_{l_1}<\beta^{block,M}_{l_2}$, meaning that $X_{l_2}$ can be selected more easily than $X_{l_1}$ by $\widehat{\mathcal{M}}^{block,*}_1$. In addition, since $\widehat{\beta}^M_l$ is not a good estimate of $\beta_l$ due to the bias term $\langle\bm{C}_l,\bm{B}\rangle$, $\widehat{\beta}^{block,M}_l$ is not a good estimate of $\beta^{block,M}_l$ either. Therefore, some $X_l$ with nonzero $\beta^{block,M}_l$ may not be included in $\widehat{\mathcal{M}}^{block,*}_1$. Nevertheless, they will be included in $\widehat{\mathcal{M}}^{block}_2$, and hence in $\widehat{\mathcal{M}}^{block}$.

Theoretically, when $\gamma_{1,n}$, $\gamma_{2,n}$, $\gamma_{3,n}$, and $\gamma_{4,n}$ are chosen properly, $P(\mathcal{M}_1\subset\widehat{\mathcal{M}}^{block})\rightarrow 1$ as $n\rightarrow\infty$; see Theorem 3 in Section 15 of the supplementary material. In practice, we recommend choosing $\gamma_{1,n}$, $\gamma_{2,n}$, $\gamma_{3,n}$, and $\gamma_{4,n}$ such that $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|=|\widehat{\mathcal{M}}^{block,*}_1|=|\widehat{\mathcal{M}}^{block}_2|=k$, where $k$ is the smallest integer such that $|\widehat{\mathcal{M}}^{block}|\geq 2\lfloor n/\log(n)\rfloor$. The target size here is twice that suggested in Section 3.3, since we take the union of two additional sets.
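The block-averaged statistics admit a direct implementation; the sketch below (names ours) maps per-SNP marginal statistics to their within-block means, after which the blockwise sets are obtained by thresholding exactly as above.

```python
import numpy as np

def blockwise_stats(beta_m_abs, C_op, blocks):
    """Replace each |beta_hat_l^M| and ||C_hat_l^M||_op by its average over
    the LD block containing l. blocks: list of index arrays B_1, ..., B_b
    partitioning {0, ..., s-1}."""
    beta_block = np.empty_like(beta_m_abs)
    C_block = np.empty_like(C_op)
    for idx in blocks:
        beta_block[idx] = beta_m_abs[idx].mean()
        C_block[idx] = C_op[idx].mean()
    return beta_block, C_block
```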

3.5 Second-step estimation

In this step, we aim to estimate $\bm{B}$ by excluding the instrumental variables in $\mathcal{I}$ and the irrelevant variables in $\mathcal{S}$ from $\widehat{\mathcal{M}}$ (or $\widehat{\mathcal{M}}^{block}$) while keeping the other covariates. This can be done by solving the following optimization problem:

$$\operatorname*{arg\,min}_{\bm{B},\,\{\beta_l,\,l\in\widehat{\mathcal{M}}\}}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i-\langle\bm{Z}_i,\bm{B}\rangle-\sum_{l\in\widehat{\mathcal{M}}}X_{il}\beta_l\right)^{2}+\lambda_1\sum_{l\in\widehat{\mathcal{M}}}|\beta_l|+\lambda_2\|\bm{B}\|_*\right]. \qquad (3)$$

Denote by $(\widehat{\bm{B}},\widehat{\bm{\beta}})$ the solution to the above optimization problem. The Lasso penalty on the $\beta_l$'s is used to exclude the instrumental and irrelevant variables in $\widehat{\mathcal{M}}$, whose corresponding coefficients $\beta_l$ are zero. The nuclear norm penalty $\|\cdot\|_*$, defined as the sum of all singular values of a matrix, is used to obtain a low-rank estimate of $\bm{B}$; the low-rank assumption in estimating 2D structural coefficients is commonly used in the literature (Zhou and Li, 2014; Kong et al., 2020). For the tuning parameters, we use five-fold cross-validation based on a two-dimensional grid search, and select $\lambda_1$ and $\lambda_2$ using the one-standard-error rule (Hastie et al., 2009).
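Problem (3) can be attacked with alternating proximal gradient steps: soft-thresholding for the Lasso part and singular value thresholding for the nuclear norm. The sketch below is ours, not the authors' solver, and assumes a fixed step size small enough for convergence.

```python
import numpy as np

def second_step(Z, Xs, Y, lam1, lam2, step=1e-3, iters=500):
    """Proximal-gradient sketch of (3). Z: (n, p, q) exposures; Xs: (n, m)
    screened covariates; Y: (n,) outcome."""
    n, p, q = Z.shape
    B, beta = np.zeros((p, q)), np.zeros(Xs.shape[1])
    for _ in range(iters):
        resid = Y - np.tensordot(Z, B, axes=([1, 2], [0, 1])) - Xs @ beta
        g_B = -np.tensordot(resid, Z, axes=1) / n      # gradient w.r.t. B
        g_beta = -Xs.T @ resid / n                     # gradient w.r.t. beta
        b = beta - step * g_beta                       # Lasso prox: soft-threshold
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam1, 0.0)
        U, d, Vt = np.linalg.svd(B - step * g_B, full_matrices=False)
        B = U @ np.diag(np.maximum(d - step * lam2, 0.0)) @ Vt  # nuclear prox: SVT
    return B, beta
```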

4 ADNI data applications

We use the data obtained from the ADNI study (adni.loni.usc.edu). The data usage acknowledgement is included in Section 8 of the supplementary material. As described in Section 2, we include 566 subjects from the ADNI1 study. The exposure of interest is the baseline 2D hippocampal surface radial distance measure, which can be represented as a $100\times 150$ matrix for each side of the hippocampus. The outcome of interest is the ADAS-13 score observed at Month 24. The average ADAS-13 score is 20.8 with standard deviation 14.1. The covariates to adjust for include 6,087,205 bi-allelic markers as well as clinical covariates: age, gender, education length, baseline intracranial volume (ICV), and baseline diagnosis status. The average age is 75.5 years with standard deviation 6.6 years, and the average education length is 15.6 years with standard deviation 2.9 years. Among the 566 subjects, 58.1% were female. The average ICV was $1.28\times 10^6$ mm$^3$ with standard deviation $1.35\times 10^5$ mm$^3$. At baseline, there were 175 cognitively normal subjects (184 at Month 24), 268 subjects with mild cognitive impairment (MCI; 157 at Month 24), and 123 subjects with AD (225 at Month 24). Studies have shown that age and gender are main risk factors for Alzheimer’s disease (Vina and Lloret, 2010; Guerreiro and Bras, 2015), with older people and females more likely to develop Alzheimer’s disease. Multiple studies have also shown that the prevalence of dementia is greater among those with little or no education (Zhang et al., 1990). On the other hand, age, gender, and length of education have been found to be strongly associated with the hippocampus (Van de Pol et al., 2006; Jack et al., 2000; Noble et al., 2012). Previous studies (Sargolzaei et al., 2015) suggest that ICV is an important measure that needs to be adjusted for in studies of brain change and AD. In addition, the baseline diagnosis status may help explain the baseline hippocampal shape and the AD status at Month 24. Therefore, we consider age, gender, education length, baseline ICV, and baseline diagnosis status as part of the confounders, and adjust for them in our analysis. We also adjust for population stratification, using the top five principal components of the whole genome data. As both the left and right hippocampi have 2D radial distance measures and the two hippocampi have been found to be asymmetric (Pedraza et al., 2004), we apply our method to the left and right hippocampi separately.

We use the default method (Gabriel et al., 2002) of Haploview (Barrett et al., 2005) and PLINK (Purcell et al., 2007) to form linkage disequilibrium (LD) blocks. Previous studies report that about 50 genetic variants are associated with AD; see the review in Sims et al. (2020) for details. This supports our assumption that $|\mathcal{M}_1|<n$ ($n=566$). On the other hand, a genome-wide association analysis of 19,629 individuals by Zhao et al. (2019) shows that 57 genetic variants are associated with the left hippocampal volume and 54 with the right hippocampal volume. This supports our assumption that $|\mathcal{M}_2|<n$. We therefore apply our blockwise joint screening procedure to the SNPs marginally for each side of the hippocampus, using the outcome $Y_i$ and the exposure $\bm{Z}_i$. We choose the thresholds $\gamma_{1,n}$, $\gamma_{2,n}$, $\gamma_{3,n}$, and $\gamma_{4,n}$ such that $|\widehat{\mathcal{M}}^{block}|=2\lfloor n/\log(n)\rfloor=178$. In Table 3 of the supplementary material, we list the top 20 SNPs corresponding to the left and right hippocampi, respectively. As suggested by one referee, we plot Manhattan-style figures for $\widehat{\mathcal{M}}^*_1$, $\widehat{\mathcal{M}}^{block,*}_1$, $\widehat{\mathcal{M}}_2$, and $\widehat{\mathcal{M}}^{block}_2$ in Figure 7 of the supplementary material, where genomic coordinates are displayed along the x-axis; the y-axis represents the magnitudes of $|\widehat{\beta}^M_l|$, $\widehat{\beta}^{block,M}_l$, $\|\widehat{\bm{C}}^M_l\|_{op}$, and $\widehat{C}^{block,M}_l$; and the horizontal dashed lines represent the threshold values $\gamma_{1,n}$, $\gamma_{2,n}$, $\gamma_{3,n}$, and $\gamma_{4,n}$, respectively.

From Table 3 and Figure 7, one can see that there are quite a few important SNPs for both hippocampi. For example, the top SNP is rs429358 from chromosome 19. This SNP is a C/T single-nucleotide variant (SNV) in the APOE gene. It is also one of the two SNPs that define the well-known APOE alleles, the major genetic risk factor for Alzheimer’s disease (Kim et al., 2009). In addition, a large portion of the SNPs in Table 3 have been found to be strongly associated with Alzheimer’s. These include rs10414043 (Du et al., 2018), an A/G SNV in the APOC1 gene; rs7256200 (Takei et al., 2009), an A/G SNV in the APOC1 gene; rs73052335 (Zhou et al., 2018), an A/C SNV in the APOC1 gene; rs769449 (Chung et al., 2014), an A/G SNV in the APOE gene; rs157594 (Hao et al., 2017), a G/T SNV; rs56131196 (Gao et al., 2016), an A/G SNV in the APOC1 gene; rs111789331 (Gao et al., 2016), an A/T SNV; and rs4420638 (Coon et al., 2007), an A/G SNV in the APOC1 gene.

Among the SNPs found to be associated with Alzheimer’s, some are also directly associated with the hippocampi. For example, Zhou et al. (2020) revealed that the SNPs rs10414043, rs73052335, and rs769449 are among the top SNPs that have significant genetic effects on the volumes of both the left and right hippocampi. Guo et al. (2019) identified the SNP rs56131196 as associated with hippocampal shape.

We then perform our second-step estimation procedure for each side of the hippocampus. Here $X_{\widehat{\mathcal{M}}}$ denotes the SNPs selected in the screening step, the population stratification variables (top five principal components of the whole genome data), and the five clinical measures (age, gender, education length, baseline ICV, and baseline diagnosis status), and $\bm{Z}$ denotes the left/right hippocampal surface image matrix. To visualize the results, we map the estimates $\widehat{\bm{B}}$ corresponding to each side of the hippocampus onto a representative hippocampal surface and plot them in Figure 3(a). We also plot the hippocampal subfields (Apostolova et al., 2006) in Figure 3(b). Here Cornu Ammonis regions 1 (CA1), 2 (CA2), and 3 (CA3) form a strip of pyramidal neurons within the hippocampus proper. CA1, the top portion, known as the “regio superior of Cajal” (Blackstad et al., 1970), consists of small pyramidal neurons. Within the lower portion (the regio inferior of Cajal), which consists of larger pyramidal neurons, there are a smaller area called CA2 and a larger area called CA3, which differ in cytoarchitecture and connectivity. The subiculum is a pivotal structure of the hippocampal formation, positioned between the entorhinal cortex and the CA1 subfield of the hippocampus proper (for a complete review, see Dudek et al., 2016).

From the plots, we can see that all 15,000 entries of $\widehat{\bm{B}}$ for both hippocampi are negative. This implies that the radial distance of each pixel of both hippocampi is negatively associated with the ADAS-13 score, which quantifies the severity of behavioral deficits. The subfields with the strongest associations are mostly CA1 and the subiculum. Existing literature (Apostolova et al., 2010) has found that as Alzheimer’s disease progresses, it first affects the CA1 and subiculum subregions and later the CA2 and CA3 subregions. This can partially explain why the shapes of CA1 and the subiculum may have stronger associations with ADAS-13 scores than the CA2 and CA3 subregions.

Figure 3: Real data results: Panel (a) plots the estimate $\widehat{\bm{B}}$ corresponding to the left hippocampus (left part) and the right hippocampus (right part). Panel (b) plots the hippocampal subfields.

We examine the effect size of the whole hippocampal shape by evaluating the term $\langle\bm{Z}_i,\widehat{\bm{B}}\rangle$. Specifically, we calculate the proportion of variance explained by the imaging exposure as follows:

$$R^2=\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2-\sum_{i=1}^{n}\left(y_i-\bar{y}-\langle\bm{Z}_i,\widehat{\bm{B}}\rangle\right)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}.$$

Our results show that the shapes of the left and right hippocampi account for 5.83% and 4.71% of the total variation in behavioral deficits, respectively. Such effect sizes are quite large compared with those of polygenic risk scores in genetics. In addition, we perform a permutation test to assess whether the $R^2$ statistic is significant. In particular, we randomly permute $\{Y_1,\ldots,Y_n\}$ to obtain $\{Y^*_1,\ldots,Y^*_n\}$, apply our estimation procedure to $(X_i,\bm{Z}_i,Y^*_i)$, obtain $\widehat{\bm{B}}^*$, and calculate $(R^2)^*$. We repeat this 1,000 times to obtain $\{(R^2_{(k)})^*,1\leq k\leq 1000\}$, which mimics the null distribution. Finally, the p-value is calculated as $\frac{1}{1000}\sum_{k=1}^{1000}1\{(R^2_{(k)})^*\geq R^2\}$. The p-values for both hippocampi are less than 0.001, suggesting that the $R^2$'s are significantly different from zero.
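The permutation test has a simple generic form; in the sketch below, `fit_and_r2` is a hypothetical wrapper assumed to run the full two-step procedure and return the $R^2$ statistic.

```python
import numpy as np

def permutation_pvalue(fit_and_r2, X, Z, Y, n_perm=1000, seed=0):
    """Build the null distribution of R^2 by refitting on permuted outcomes."""
    rng = np.random.default_rng(seed)
    r2_obs = fit_and_r2(X, Z, Y)
    r2_null = np.array([fit_and_r2(X, Z, rng.permutation(Y))
                        for _ in range(n_perm)])
    return np.mean(r2_null >= r2_obs)   # fraction of null R^2 >= observed R^2
```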

We also conduct sensitivity analyses by varying the relative sizes of $\widehat{\mathcal{M}}^*_1$ and $\widehat{\mathcal{M}}_2$ in the joint screening procedure. The estimates $\widehat{\bm{B}}$ are similar across different choices of $|\widehat{\mathcal{M}}^*_1|$ and $|\widehat{\mathcal{M}}_2|$; see Section 11 of the supplementary material for details. In addition, we repeated our analysis on the 391 MCI and AD subjects. We have similar findings: the radial distances of each pixel of both hippocampi are mostly negatively associated with the ADAS-13 score, and the subfields with the strongest associations are mostly CA1 and the subiculum; see Section 12 of the supplementary material for details. As suggested by one referee, we also performed the SNP-imaging-outcome mediation analysis proposed by Bi et al. (2017); see Section 13 of the supplementary material for the detailed procedure. Our analysis provides no evidence of a SNP-imaging-outcome mediating relationship.

5 Simulation studies

In this section, we perform simulation studies to evaluate the finite-sample performance of the proposed method. The dimension of the covariates is set to $s=5000$, and the exposure is a $64\times 64$ matrix. Each $X_i\in\mathbb{R}^s$ is independently generated from $N(\bm{0},\bm{\Sigma}_x)$, where $\bm{\Sigma}_x=(\sigma_{x,ll'})$ has an autoregressive structure with $\sigma_{x,ll'}=\rho_1^{|l-l'|}$ for $1\leq l,l'\leq s$ and $\rho_1=0.5$. Define $\bm{B}$ as the $64\times 64$ image shown in Figure 4(a), and $\bm{C}$ as the $64\times 64$ image shown in Figure 4(b). For $\bm{B}$, the black regions of interest (ROIs) are assigned the value 0.0408 and the white ROIs the value 0. For $\bm{C}$, the black ROIs are assigned the value 0.0335 and the white ROIs the value 0. Further, we set $\bm{C}_l=v_l*\bm{C}$, where $v_1=-1/3$, $v_2=-1$, $v_3=-3$, $v_{207}=-3$, $v_{208}=-1$, $v_{209}=-1/3$, and $v_l=0$ for $4\leq l\leq 206$ and $210\leq l\leq s$. We set the parameters $\beta_1=3$, $\beta_2=1$, $\beta_3=1/3$, $\beta_{104}=3$, $\beta_{105}=1$, $\beta_{106}=1/3$, and $\beta_l=0$ for $4\leq l\leq 103$ and $107\leq l\leq s$. In this setting, we have $\mathcal{C}=\{1,2,3\}$, $\mathcal{P}=\{104,105,106\}$, $\mathcal{I}=\{207,208,209\}$, and $\mathcal{S}=\{1,\ldots,5000\}\backslash\{1,2,3,104,105,106,207,208,209\}$.

Figure 4: Panels (a) and (b) plot $\bm{B}$ and $\bm{C}$, respectively. In Panel (a), the value at each pixel is either 0 (white) or 0.0408 (black). In Panel (b), the value at each pixel is either 0 (white) or 0.0335 (black).

The random errors $\mathrm{vec}(\bm{E}_i)$ are independently generated from $N(\bm{0},\bm{\Sigma}_e)$, where we set the standard deviation of every element of $\bm{E}_i$ to $\sigma_e=0.2$ and the correlation between $\bm{E}_{i,jk}$ and $\bm{E}_{i,j'k'}$ to $\rho_2^{|j-j'|+|k-k'|}$ for $1\leq j,k,j',k'\leq 64$ with $\rho_2=0.5$. The random error $\epsilon_i$ is generated independently from $N(0,\sigma^2)$, where we consider $\sigma=1$ or $0.5$. The $Y_i$'s and $\bm{Z}_i$'s are generated from models (1) and (2). We consider three sample sizes: $n=200$, $500$, and $1000$.
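Since the error correlation factorizes as $\rho_2^{|j-j'|+|k-k'|}=\rho_2^{|j-j'|}\cdot\rho_2^{|k-k'|}$, the images $\bm{E}_i$ can be drawn by a row/column (Kronecker) construction; the sketch below is one way to do this, with helper names of our choosing.

```python
import numpy as np

def ar1_chol(m, rho):
    """Cholesky factor of the AR(1) correlation matrix rho^|i-j|."""
    idx = np.arange(m)
    return np.linalg.cholesky(rho ** np.abs(np.subtract.outer(idx, idx)))

def simulate_errors(n, p=64, q=64, rho2=0.5, sigma_e=0.2, seed=0):
    """Draw n error images with corr(E_jk, E_j'k') = rho2^(|j-j'|+|k-k'|):
    E_i = sigma_e * L_r G_i L_c^T with G_i a standard normal matrix."""
    rng = np.random.default_rng(seed)
    Lr, Lc = ar1_chol(p, rho2), ar1_chol(q, rho2)
    G = rng.standard_normal((n, p, q))
    return sigma_e * np.einsum('ij,njk,lk->nil', Lr, G, Lc)
```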

5.1 Simulation for screening

We perform our screening procedure (denoted by “joint”) and report the coverage proportion of $\mathcal{M}_1$, defined as $|\widehat{\mathcal{M}}\cap\mathcal{M}_1|/|\mathcal{M}_1|$, as the size of the selected set $|\widehat{\mathcal{M}}|$ ranges from 1 to 100. In addition, we report the coverage proportion for each of the confounding and precision variables, i.e., each $j$ in the set $\mathcal{M}_1=\{1,2,3,104,105,106\}$. All coverage proportions are averaged over 100 Monte Carlo runs.

To control the size of $\widehat{\mathcal{M}}$, we first set $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|=1$ by specifying appropriate $\widehat{\gamma}_{1,n}$ and $\widehat{\gamma}_{2,n}$. Then we sequentially add two variables, one to $\widehat{\mathcal{M}}^*_1$ by decreasing $\widehat{\gamma}_{1,n}$ and one to $\widehat{\mathcal{M}}_2$ by decreasing $\widehat{\gamma}_{2,n}$, until $|\widehat{\mathcal{M}}|$ reaches 100. Note that we always keep $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|$ in this procedure. We may not obtain all sizes between 1 and 100, because $|\widehat{\mathcal{M}}|$ may increase by up to 2 at each step. Therefore, for the sizes that cannot be reached, we estimate the coverage proportion of $\mathcal{M}_1$ by linear interpolation between the two closest attained sizes.
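A sketch of this bookkeeping, with hypothetical names: grow $k$, record the attained union sizes and coverages, and interpolate the skipped sizes.

```python
import numpy as np

def coverage_curve(r1, r2, M1, max_size=100):
    """r1, r2: covariate indices ranked by the two marginal statistics;
    M1: true confounder/precision index set. Returns the coverage of M1
    as a function of |M_hat|, with skipped sizes linearly interpolated."""
    sizes, covs = [], []
    for k in range(1, max_size + 1):
        sel = set(r1[:k]) | set(r2[:k])     # keep |M_1^*| = |M_2| = k
        if len(sel) > max_size:
            break
        sizes.append(len(sel))
        covs.append(len(sel & set(M1)) / len(M1))
    curve = dict(zip(sizes, covs))          # later k overwrites equal sizes
    xs = sorted(curve)
    grid = np.arange(xs[0], xs[-1] + 1)
    return grid, np.interp(grid, xs, [curve[x] for x in xs])
```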

We compare the proposed joint screening procedure to two competing procedures. The first is an outcome screening procedure that selects the set $\widehat{\mathcal{M}}^*_1$. For a fair comparison, we let $|\widehat{\mathcal{M}}^*_1|$ range from 1 to 100. The second is an intersection screening procedure that selects the set $\widehat{\mathcal{M}}_\cap=\widehat{\mathcal{M}}^*_1\cap\widehat{\mathcal{M}}_2$. We let $|\widehat{\mathcal{M}}_\cap|$ range from 1 to 100, while keeping $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|$. Similarly, for the sizes that $|\widehat{\mathcal{M}}_\cap|$ cannot reach, we use linear interpolation to estimate the coverage proportions. We plot the results for $(n,s,\sigma)=(200,5000,1)$ in Figure 5. The remaining results, for $(n,s,\sigma)=(200,5000,0.5)$, $(500,5000,1)$, $(500,5000,0.5)$, $(1000,5000,1)$, and $(1000,5000,0.5)$, can be found in Figures 10–14 of the supplementary material.

Figure 5: Simulation results for the case $(n,s,\sigma)=(200,5000,1)$: Panels (a)–(f) plot the average coverage proportion for $X_l$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong outcome and weak exposure predictor, a moderate outcome and moderate exposure predictor, and a weak outcome and strong exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_1=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

From the plots, one can see that both the “intersection” and “outcome” screening methods miss the confounder $X_3$ with very high probability, even as the size of the selected set approaches 100. In contrast, our method selects $X_3$ with high probability when $|\widehat{\mathcal{M}}|$ is still relatively small. For the confounders $X_1$ and $X_2$, all three methods perform similarly. For the precision variables, the “outcome” method and our “joint” method perform similarly, while the “intersection” method performs poorly. Combining these results, our method performs best, as it selects all the confounders and precision variables with high probability. In addition, we find that the coverage proportion of our method increases with the sample size, consistent with the sure independence screening property developed in Section 15 of the supplementary material.

5.2 Simulation for estimation

In this part, we evaluate the performance of our estimation procedure after the first-step screening. For the size of $\widehat{\mathcal{M}}$ in the screening step, we set $|\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor$. We compare the proposed estimate with the oracle estimate, which adjusts for the ideal adjustment set containing only the confounders and precision variables as $X$ and then estimates $\bm{B}$ using the optimization (3) without the $l_1$-regularization. We report the mean squared errors (MSEs) for $\bm{\beta}$ and $\bm{B}$, defined as $\|\bm{\beta}-\widehat{\bm{\beta}}\|_2^2$ and $\|\bm{B}-\widehat{\bm{B}}\|_F^2$, respectively. Table 1 summarizes the average MSEs of the proposed and oracle estimates for $\bm{\beta}$ and $\bm{B}$ over 100 Monte Carlo runs when $n=200$, $500$, and $1000$. The MSE decreases as the sample size increases, and in terms of the primary parameter of interest $\bm{B}$, the proposed estimate is close to the oracle estimate.

Table 1: Simulation results of the proposed joint screening method and the oracle estimates for $\sigma=1$ and $\sigma=0.5$, when $n=200$, $500$, and $1000$: the average MSEs for $\bm{\beta}$ and $\bm{B}$, with their associated standard errors in parentheses. The results are based on 100 Monte Carlo repetitions.

               $\sigma=1.0$                        $\sigma=0.5$
               MSE $\bm{\beta}$   MSE $\bm{B}$     MSE $\bm{\beta}$   MSE $\bm{B}$
n = 200
  Proposed     0.496 (0.021)      0.667 (0.005)    0.276 (0.009)      0.528 (0.005)
  Oracle       0.086 (0.005)      0.624 (0.004)    0.021 (0.001)      0.501 (0.004)
n = 500
  Proposed     0.303 (0.008)      0.574 (0.006)    0.191 (0.005)      0.345 (0.004)
  Oracle       0.036 (0.002)      0.553 (0.005)    0.006 (0.000)      0.340 (0.004)
n = 1000
  Proposed     0.217 (0.004)      0.449 (0.004)    0.128 (0.006)      0.234 (0.002)
  Oracle       0.013 (0.001)      0.460 (0.005)    0.003 (0.000)      0.233 (0.002)

We plot the 2D map of $\widehat{\bm{B}}$, averaged over 100 Monte Carlo runs, in Figure 6(c). For comparison, we also plot the corresponding average oracle estimate $\widehat{\bm{B}}_{\mathrm{oracle}}$ in Figure 6(b) and the true $\bm{B}$ in Figure 6(a). One can see that the proposed method recovers the signal pattern reasonably well.

Figure 6: Panel (a) plots the true $\bm{B}$ (Truth), Panel (b) plots the average of $\widehat{\bm{B}}_{\mathrm{oracle}}$ (Oracle), and Panel (c) plots the average of $\widehat{\bm{B}}$ (Proposed). Here $(n,s,\sigma)=(1000,5000,0.5)$. The value at each pixel is on a gray scale, with 0 as white and 0.0408 as black.

We also report the sensitivity and specificity of the estimates in Section 14.1 of the supplementary material. We find that although the proposed method may not remove all of the instrumental variables, eliminating even just some of the instruments greatly reduces the MSEs of both $\bm{\beta}$ and $\bm{B}$, compared to the method in which we do not impose the $l_1$-regularization on $\bm{\beta}$ in the second-step estimation.

5.3 Screening and estimation using blockwise joint screening

As noted in Section 3.4, genetic variants present a strong blockwise correlation structure due to linkage disequilibrium (LD) (Wall and Pritchard, 2003), and we proposed the blockwise joint screening procedure to utilize the structural information of LD blocks. In this section, we illustrate the performance of this procedure using an adapted simulation based on the settings of Dehman et al. (2015).

For $i=1,\ldots,n$, $X_i\in\mathbb{R}^s$ is independently generated from an $s$-dimensional multivariate normal distribution $N(\bm{0},\bm{\Sigma}_x)$, where $\bm{\Sigma}_x=(\sigma_{x,ll'})$ is block-diagonal: if $l\neq l'$ are in the same block, the covariance $\sigma_{x,ll'}=0.4$; otherwise $\sigma_{x,ll'}=0$; and the diagonal elements $\sigma_{x,ll}$ are all set to 1. We then set $X_{ij}$ to 0, 1, or 2 according to whether $X_{ij}<d_1$, $d_1\leq X_{ij}\leq d_2$, or $X_{ij}>d_2$, where $d_1$ and $d_2$ are thresholds chosen to produce a given minor allele frequency (MAF). For instance, choosing $d_1=\Phi^{-1}(1-6\mathrm{MAF}/5)$ and $d_2=\Phi^{-1}(1-2\mathrm{MAF}/5)$, where $\Phi$ is the c.d.f. of the standard normal distribution, corresponds to a given fixed MAF. To generate more realistic MAF distributions, we simulate genotypes $X_{ij}$ in which the MAF for each $j$ is uniformly sampled between 0.05 and 0.5 (Dehman et al., 2015).
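A sketch of this genotype generator under the stated design (helper name ours; `block_sizes` is assumed to sum to $s$):

```python
import numpy as np
from scipy.stats import norm

def simulate_genotypes(n, s, block_sizes, seed=0):
    """Latent Gaussians with within-block correlation 0.4, cut at d1 < d2
    to give 0/1/2 genotypes with MAF ~ Uniform(0.05, 0.5) per SNP."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, s))
    start = 0
    for m in block_sizes:
        S = np.full((m, m), 0.4)
        np.fill_diagonal(S, 1.0)
        X[:, start:start + m] = rng.multivariate_normal(np.zeros(m), S, size=n)
        start += m
    maf = rng.uniform(0.05, 0.5, size=s)
    d1 = norm.ppf(1 - 6 * maf / 5)          # genotype 0 below d1
    d2 = norm.ppf(1 - 2 * maf / 5)          # genotype 2 above d2
    return (X > d1).astype(int) + (X > d2).astype(int)
```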

Adapting the simulation setting of Section 5.1 to Dehman et al. (2015), we set $s=5000$, with 300 blocks of covariates of sizes 2, 4, 6, 12, 24, and 52, each replicated 50 times. We perform 100 Monte Carlo runs, and the ordering of the blocks is drawn at random for each run. The settings for $\bm{B}$ and $\bm{C}$ remain the same as before: $\bm{B}$ is as in Figure 4(a), and $\bm{C}$ is as in Figure 4(b). Further, we set $\bm{C}_l=v_l*\bm{C}$, where $v_1=-1/3$, $v_2=-1$, $v_3=-3$, $v_{207}=-3$, $v_{208}=-1$, $v_{209}=-1/3$, and $v_l=0$ for $4\leq l\leq 206$ and $210\leq l\leq s$. We set $\beta_1=3$, $\beta_2=1$, $\beta_3=1/3$, $\beta_{104}=3$, $\beta_{105}=1$, $\beta_{106}=1/3$, $\beta_j=1/4$ for $j\in\mathcal{P}_{LD}$, and $\beta_l=0$ otherwise. Here $\mathcal{P}_{LD}$ is a randomly selected block consisting of $K$ consecutive indices from $\{210,211,\ldots,s\}$, where $K\in\{2,4,6,12,24,52\}$. We have $\mathcal{C}=\{1,2,3\}$, $\mathcal{P}=\{104,105,106\}\cup\mathcal{P}_{LD}$, $\mathcal{I}=\{207,208,209\}$, and $\mathcal{S}=\{1,\ldots,5000\}\backslash(\mathcal{C}\cup\mathcal{P}\cup\mathcal{I})$. The other settings are the same as in Section 5.1.

We consider three different sample sizes n=200, 500 and 1000. We first perform the screening procedure and report the coverage proportion of \mathcal{M}_1=\mathcal{C}\cup\mathcal{P}. We also report the coverage proportion for each of the confounding and precision variables. The complete screening results, for s=5000, \sigma=1, n=200,500,1000, and K=2,4,6,12,24,52, can be found in Figures 29 – 46 of the supplementary material.

From the plots, one can see that the blockwise joint screening method (blue dotted line) selects \mathcal{P}_{LD} and \mathcal{M}_1 with higher probability than the original joint screening method (green solid line). Based on these results, the blockwise joint screening method better exploits the block structure of the precision variables when selecting \mathcal{M}_1.

In addition, we evaluate the performance of the two proposed estimation procedures after the first-step screening. For the sizes of \widehat{\mathcal{M}} and \widehat{\mathcal{M}}^{block} in the screening step, we set |\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor for the original joint screening procedure and |\widehat{\mathcal{M}}^{block}|=2\lfloor n/\log(n)\rfloor for the blockwise joint screening procedure. We report the average MSEs for \bm{\beta} and \bm{B} when (n,s,\sigma)=(200,5000,1) in Table 2. The complete results, for s=5000, \sigma=1, n=200,500,1000, and K=2,4,6,12,24,52, can be found in Table 9 of the supplementary material.

In summary, the blockwise joint screening estimate outperforms the original joint screening estimate when the sample size n is small or the block size K of the precision variables is large. In the remaining scenarios, there are no significant differences between the two methods.

Table 2: Simulation results for (n,s,σ)=(200,5000,1)(n,s,\sigma)=(200,5000,1): the average MSEs for 𝜷\bm{\beta} and 𝑩{\bm{B}}, and their associated standard errors in the parentheses are reported. The left panel summarizes the results from the joint screening method; the right panel summarizes the results from the blockwise joint screening method. The results are based on 100 Monte Carlo repetitions.
Proposed MSE 𝜷\bm{\beta} MSE 𝑩{\bm{B}} Proposed (block) MSE 𝜷\bm{\beta} MSE 𝑩{\bm{B}}
K=2 1.423(0.096) 0.785(0.009) K=2 1.390(0.090) 0.793(0.010)
K=4 1.667(0.096) 0.815(0.011) K=4 1.548(0.088) 0.805(0.010)
K=6 1.955(0.101) 0.826(0.010) K=6 1.701(0.084) 0.816(0.009)
K=12 2.466(0.096) 0.890(0.039) K=12 2.223(0.129) 0.838(0.011)
K=24 2.533(0.164) 0.847(0.014) K=24 2.136(0.138) 0.821(0.010)
K=52 14.650(0.815) 2.034(0.487) K=52 13.693(0.728) 1.870(0.459)

In the supplementary material, we assess the variable screening results for various sparsity levels of instrumental variables in Section 14.2, evaluate the performance of our estimation procedure for different covariances of exposure errors in Section 14.3, and assess the sensitivity to the choices of the sizes of \widehat{\mathcal{M}}_1^* and \widehat{\mathcal{M}}_2 in Section 14.4.

6 Discussion

This paper aims at mapping the complex GIC pathway for AD. The unique features of the hippocampal morphometry surface measure data motivate us to develop a computationally efficient two-step screening and estimation procedure, which can select biomarkers among more than 6 million observed covariates and estimate the conditional association simultaneously. If there were no unmeasured confounding, the conditional association we estimate would correspond to the causal effect. This is, however, not the case in the ADNI study because we have unmeasured confounders such as A\beta and tau protein levels. To control for unmeasured confounding and estimate the causal effect, one possible approach is to use the genetic variants as potential instrumental variables (e.g. Lin et al., 2015).

There are a number of other important directions for future work. Firstly, the vast majority of AD, known as “polygenic AD”, is influenced by the actions and interactions of multiple genetic variants simultaneously, likely in concert with non-genetic factors such as environmental exposures and lifestyle choices, among many others (Bertram and Tanzi, 2020). Therefore, various types of interaction effects, such as genetic-genetic and imaging-genetic interactions, could be incorporated into the outcome generating model (1). However, this may significantly increase the computational burden, as the dimension of the genetics-related covariates would increase from 6,087,205 to more than 90 billion if we added all possible imaging-genetics interaction terms. One may consider interaction screening procedures (Hao and Zhang, 2014) as the first step. Secondly, this study simply removes observations with missingness. Accommodating missing exposures, confounders and outcomes under the proposed model framework is of great practical value and worth further investigation. Thirdly, baseline diagnosis status is an important effect modifier, as the effect of hippocampus shape on behavioral measures can differ across the CN/MCI/AD groups. However, the relatively small sample size in the ADNI study does not allow us to conduct a reliable subgroup analysis; such subgroup analyses are pertinent for further exploration when a larger sample size is available. Fourthly, the ADNI dataset contains longitudinal ADAS-13 scores observed at different months, as well as other longitudinal behavioral scores obtained from the Mini-Mental State Examination and the Rey Auditory Verbal Learning Test, which can provide a more comprehensive characterization of the behavioral deficits. Integrating these different scores as a multivariate longitudinal outcome to improve the estimation of the conditional association requires substantial effort for future research. Lastly, one could consider incorporating information from other brain regions. For instance, entorhinal tau pathology may influence episodic memory decline through other brain regions, such as the medial temporal lobe (Maass et al., 2018).

Supplementary Material

Supplementary material available online contains detailed derivations and explanations of the main algorithm, ADNI data usage acknowledgement, image and genetic data preprocessing steps, screening results and sensitivity analyses of the ADNI data application with a subgroup analysis including only MCI and AD patients, detailed procedure and results for the SNP-imaging-outcome mediation analyses, additional simulation results, theoretical properties of the proposed procedure including the main theorems, assumptions needed for our main theorems, and proofs of auxiliary lemmas and main theorems.

Acknowledgement

The authors thank the editor, associate editor and referees for their constructive comments, which have substantially improved the paper. Yu was partially supported by the Canadian Statistical Sciences Institute (CANSSI) postdoctoral fellowship and the start-up fund of University of Texas at Arlington. Wang and Kong were partially supported by the Natural Science and Engineering Research Council of Canada and the CANSSI Collaborative Research Team Grant. Zhu was partially supported by the NIH-R01-MH116527.

Supplementary Material for
“Mapping the Genetic-Imaging-Clinical Pathway with Applications to Alzheimer’s Disease”

Dengdeng Yu
Department of Mathematics, University of Texas at Arlington
Linbo Wang
Department of Statistical Sciences, University of Toronto
Dehan Kong
Department of Statistical Sciences, University of Toronto
Hongtu Zhu
Department of Biostatistics, University of North Carolina, Chapel Hill
for the Alzheimer’s Disease Neuroimaging Initiative

The supplementary file is organized as follows. The detailed description of the main algorithm is included in Section 7. We include the ADNI data usage acknowledgement in Section 8 and the image and genetic data preprocessing steps in Section 9. In Section 10, we list the screening results of the ADNI data application. Section 11 examines the sensitivity of the estimate \widehat{\bm{B}} from the ADNI data application by varying the relative sizes of \widehat{\mathcal{M}}_1^* and \widehat{\mathcal{M}}_2. Section 12 presents a subgroup analysis restricted to the mild cognitive impairment (MCI) and Alzheimer’s disease (AD) patients. We include the detailed SNP-imaging-outcome mediation analysis procedure and results in Section 13. In Section 14, we list additional simulation results. The theoretical properties, including the main theorems of our procedure, are included in Section 15. We state the assumptions needed for the main theorems in Section 16. In Section 17, we include the auxiliary lemmas needed for the theorems and their proofs. We give the detailed proofs of our main theorems in Section 18.

7 Description and derivation of Algorithm 1

To solve the minimization problem (3) of the main paper, we utilize the Nesterov optimal gradient method (Nesterov, 1998), which has been widely used in solving optimization problems for non-smooth and non-convex objective functions (Beck and Teboulle, 2009; Zhou and Li, 2014).

Before we introduce Nesterov’s gradient algorithm, we first state two propositions on shrinkage thresholding formulas (Beck and Teboulle, 2009; Cai et al., 2010).

Proposition 1.

For a given matrix \bm{A} with singular value decomposition \bm{A}=\bm{U}\mathrm{diag}(\bm{a})\bm{V}^{T}, where \bm{a}=(a_1,\ldots,a_r)^{T} with \{a_k\}_{1\leq k\leq r} being \bm{A}'s singular values, the optimal solution to

min𝑩{12𝑩𝑨F2+λ𝑩}\min_{\bm{B}}\left\{\frac{1}{2}\|\bm{B}-\bm{A}\|_{F}^{2}+\lambda\|\bm{B}\|_{*}\right\}

shares the same singular vectors as \bm{A}, and its singular values are b_{k}=(a_{k}-\lambda)_{+} for k=1,\ldots,r, where (a_{k})_{+}=\max(0,a_{k}).

Proposition 2.

For vectors 𝐚=(a1,,ar)T\bm{a}=(a_{1},\ldots,a_{r})^{\mathrm{\scriptscriptstyle T}} and 𝐛=(b1,,br)T\bm{b}=(b_{1},\ldots,b_{r})^{\mathrm{\scriptscriptstyle T}}, the optimal solution to

min𝒃{12𝒃𝒂22+λ𝒃1}\min_{\bm{b}}\left\{\frac{1}{2}\|\bm{b}-\bm{a}\|_{2}^{2}+\lambda\|\bm{b}\|_{1}\right\}

is b_{k}=\textrm{sgn}(a_{k})(|a_{k}|-\lambda)_{+} for k=1,\ldots,r, where \textrm{sgn}(\cdot) denotes the sign function.
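For concreteness, the two propositions correspond to the following proximal (shrinkage thresholding) operators; this is a minimal Python sketch with function names of our own choosing.

```python
import numpy as np

def soft_threshold(a, lam):
    """Proposition 2: elementwise soft thresholding, b_k = sgn(a_k)(|a_k| - lam)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def singular_value_threshold(A, lam):
    """Proposition 1: soft-threshold the singular values of A, keeping its singular vectors."""
    U, svals, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(svals - lam, 0.0)) @ Vt
```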

Nesterov's gradient method utilizes the first-order gradient of the objective function to produce the next iterate from the current search point. Unlike the standard gradient descent algorithm, Nesterov's algorithm generates the next search point by extrapolating from the two previous iterates, which can dramatically improve the convergence rate. The Nesterov gradient algorithm for problem (3) is presented as follows. Denote l(\bm{\beta},\bm{B})=\frac{1}{2n}\sum_{i=1}^{n}\left(Y_{i}-\langle\bm{\beta},X_{i}\rangle-\langle\bm{Z}_{i},\bm{B}\rangle\right)^{2} and P(\bm{\beta},\bm{B})=P_{1}(\bm{\beta})+P_{2}(\bm{B}), where P_{1}(\bm{\beta})=\lambda_{1}\sum_{l}|\beta_{l}| and P_{2}(\bm{B})=\lambda_{2}\|\bm{B}\|_{*}. We also define

g(\bm{\beta},\bm{B}\,|\,\bm{s}^{(t)},\bm{S}^{(t)},\delta)
= l(\bm{s}^{(t)},\bm{S}^{(t)})+\left\langle\nabla l(\bm{s}^{(t)},\bm{S}^{(t)}),\left[(\bm{\beta}-\bm{s}^{(t)})^{T},\{\mathrm{vec}(\bm{B}-\bm{S}^{(t)})\}^{T}\right]^{T}\right\rangle
+(2\delta)^{-1}\left(\left\|\bm{\beta}-\bm{s}^{(t)}\right\|_{2}^{2}+\left\|\bm{B}-\bm{S}^{(t)}\right\|_{F}^{2}\right)+P(\bm{\beta},\bm{B})
= (2\delta)^{-1}\left[\left\|\bm{\beta}-\left\{\bm{s}^{(t)}-\delta\,\partial_{\bm{\beta}}l(\bm{s}^{(t)},\bm{S}^{(t)})\right\}\right\|_{2}^{2}
+\left\|\mathrm{vec}(\bm{B})-\left\{\mathrm{vec}(\bm{S}^{(t)})-\delta\,\partial_{\mathrm{vec}(\bm{B})}l(\bm{s}^{(t)},\bm{S}^{(t)})\right\}\right\|_{2}^{2}\right]
+P(\bm{\beta},\bm{B})+c^{(t)},

where l(𝜷,𝑩)=[(𝜷l)T,{vec(𝑩)l}T]T|^|+pq\nabla l(\bm{\beta},\bm{B})=[(\partial_{\bm{\beta}}l)^{\mathrm{\scriptscriptstyle T}},\{\partial_{\mathrm{vec}(\bm{B})}l\}^{\mathrm{\scriptscriptstyle T}}]^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|+pq} denotes the first-order gradient of l(𝜷,𝑩)l(\bm{\beta},\bm{B}) with respect to [𝜷T,{vec(𝑩)}T]T|^|+pq\left[\bm{\beta}^{\mathrm{\scriptscriptstyle T}},\{\mathrm{vec}(\bm{B})\}^{\mathrm{\scriptscriptstyle T}}\right]^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|+pq}. We define

\partial_{\bm{\beta}}l(\bm{\beta},\bm{B}) = n^{-1}\sum_{i=1}^{n}X_{i}\left(\langle\bm{\beta},X_{i}\rangle+\langle\bm{B},\bm{Z}_{i}\rangle-Y_{i}\right),
\partial_{\mathrm{vec}(\bm{B})}l(\bm{\beta},\bm{B}) = \mathrm{vec}\left\{n^{-1}\sum_{i=1}^{n}\bm{Z}_{i}\left(\langle\bm{\beta},X_{i}\rangle+\langle\bm{B},\bm{Z}_{i}\rangle-Y_{i}\right)\right\},

with \partial_{\bm{\beta}}l(\bm{\beta},\bm{B})\in\mathbb{R}^{|\widehat{\mathcal{M}}|} and \partial_{\mathrm{vec}(\bm{B})}l(\bm{\beta},\bm{B})\in\mathbb{R}^{pq}. Here \bm{s}^{(t)} and \bm{S}^{(t)} are interpolations between \bm{\beta}^{(t-1)} and \bm{\beta}^{(t)}, and between \bm{B}^{(t-1)} and \bm{B}^{(t)}, respectively, which will be defined below; c^{(t)} collects all terms that are irrelevant to \bm{\beta} and \bm{B}, and \delta>0 is a suitable step size. Given the previous search points \bm{s}^{(t)} and \bm{S}^{(t)}, the next iterates minimize g(\bm{\beta},\bm{B}\,|\,\bm{s}^{(t)},\bm{S}^{(t)},\delta); the search points themselves are generated by linearly extrapolating the two previous algorithmic iterates. A key advantage of the Nesterov gradient method is that each iteration has an explicit solution. In fact, minimizing g(\bm{\beta},\bm{B}\,|\,\bm{s}^{(t)},\bm{S}^{(t)},\delta) can be split into two sub-problems: minimizing (2\delta)^{-1}\left\|\bm{\beta}-\left\{\bm{s}^{(t)}-\delta\,\partial_{\bm{\beta}}l(\bm{s}^{(t)},\bm{S}^{(t)})\right\}\right\|_{2}^{2}+\lambda_{1}\sum_{l}|\beta_{l}| and minimizing (2\delta)^{-1}\left\|\mathrm{vec}(\bm{B})-\left\{\mathrm{vec}(\bm{S}^{(t)})-\delta\,\partial_{\mathrm{vec}(\bm{B})}l(\bm{s}^{(t)},\bm{S}^{(t)})\right\}\right\|_{2}^{2}+\lambda_{2}\|\bm{B}\|_{*}. These sub-problems can be solved by the shrinkage thresholding formulas in Propositions 2 and 1, respectively.

Let \bm{X}^{\widehat{\mathcal{M}}}=(X^{\widehat{\mathcal{M}}}_{1},\ldots,X^{\widehat{\mathcal{M}}}_{n})^{T}\in\mathbb{R}^{n\times|\widehat{\mathcal{M}}|}, where X^{\widehat{\mathcal{M}}}_{i}=\{X_{ij}\}^{T}_{j\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|} for i=1,\ldots,n. Define \bm{Z}_{new}=(\mathrm{vec}(\bm{Z}_{1}),\ldots,\mathrm{vec}(\bm{Z}_{n}))^{T}\in\mathbb{R}^{n\times pq} and \bm{X}_{new}=(\bm{X}^{\widehat{\mathcal{M}}},\bm{Z}_{new})\in\mathbb{R}^{n\times(|\widehat{\mathcal{M}}|+pq)}. For a given vector \bm{a}=(a_{1},\ldots,a_{r})^{T}\in\mathbb{R}^{r}, (\bm{a})_{+} is defined as \{(a_{1})_{+},\ldots,(a_{r})_{+}\}^{T}\in\mathbb{R}^{r}, where (a)_{+}=\max(0,a). Similarly, \textrm{sgn}(\bm{a}) is obtained by taking the sign of \bm{a} componentwise. For a given pair of tuning parameters \lambda_{1} and \lambda_{2}, (3) can be solved by Algorithm 1.

Algorithm 1 Shrinkage thresholding algorithm to solve (3)

1. Initialize: \bm{\beta}^{(0)}=\bm{\beta}^{(1)}, \bm{B}^{(0)}=\bm{B}^{(1)}, \alpha^{(0)}=0, \alpha^{(1)}=1, and \delta=n/\lambda_{\textrm{max}}(\bm{X}_{new}^{T}\bm{X}_{new}).

2. Repeat (a) to (f) until the objective function Q(\bm{\beta},\bm{B}) converges:

(a) \bm{s}^{(t)}=\bm{\beta}^{(t)}+\frac{\alpha^{(t-1)}-1}{\alpha^{(t)}}(\bm{\beta}^{(t)}-\bm{\beta}^{(t-1)}),
\bm{S}^{(t)}=\bm{B}^{(t)}+\frac{\alpha^{(t-1)}-1}{\alpha^{(t)}}(\bm{B}^{(t)}-\bm{B}^{(t-1)});

(b) \bm{\beta}_{\textrm{temp}}=\bm{s}^{(t)}-\delta\,\partial l(\bm{s}^{(t)},\bm{S}^{(t)})/\partial\bm{\beta};
\mathrm{vec}(\bm{B}_{\textrm{temp}})=\mathrm{vec}(\bm{S}^{(t)})-\delta\,\partial l(\bm{s}^{(t)},\bm{S}^{(t)})/\partial\,\mathrm{vec}(\bm{B});

(c) Singular value decomposition: \bm{B}_{\textrm{temp}}=\bm{U}\mathrm{diag}(\bm{b})\bm{V}^{T};

(d) \bm{a}_{new}=\textrm{sgn}(\bm{\beta}_{\textrm{temp}})\cdot(|\bm{\beta}_{\textrm{temp}}|-\lambda_{1}\delta\cdot\bm{1})_{+},
\bm{b}_{new}=(\bm{b}-\lambda_{2}\delta\cdot\bm{1})_{+};

(e) \bm{\beta}^{(t+1)}=\bm{a}_{new},
\bm{B}^{(t+1)}=\bm{U}\mathrm{diag}(\bm{b}_{new})\bm{V}^{T};

(f) \alpha^{(t+1)}=\left[1+\sqrt{1+(2\alpha^{(t)})^{2}}\right]/2.

In particular, step 2(a) predicts the search points \bm{s}^{(t)} and \bm{S}^{(t)} by linear extrapolation from the solutions of the previous two iterations, where \alpha^{(t)} is a scalar sequence that plays a critical role in the extrapolation; this sequence is updated in step 2(f) as in the original Nesterov method. Next, steps 2(b) – 2(d) perform a gradient descent step from the current search points to obtain the solution at the current iteration. Specifically, the gradient step minimizes g(\bm{\beta},\bm{B}\,|\,\bm{s}^{(t)},\bm{S}^{(t)},\delta), the regularized first-order approximation to the loss function at the current search points, which is handled as the two sub-problems solved by the shrinkage thresholding formulas in Propositions 2 and 1, respectively, as mentioned above. Finally, step 2(e) enforces the descent property of the next iterate.

A sufficient condition for the convergence of {𝜷(t)}t1\{\bm{\beta}^{(t)}\}_{t\geq 1} and {𝑩(t)}t1\{\bm{B}^{(t)}\}_{t\geq 1} is that the step size δ\delta should be smaller than or equal to 1/Lf1/{L_{f}}, where LfL_{f} is the smallest Lipschitz constant of the function l(𝜷,𝑩)l(\bm{\beta},\bm{B}) (Beck and Teboulle, 2009). In our case, LfL_{f} is equal to λmax(𝑿newT𝑿new)/n\lambda_{\textrm{max}}(\bm{X}_{new}^{\mathrm{\scriptscriptstyle T}}\bm{X}_{new})/n, where λmax()\lambda_{\textrm{max}}(\cdot) denotes the largest eigenvalue of a matrix.
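Putting the pieces together, a minimal Python sketch of Algorithm 1 might look as follows. It reuses soft_threshold and singular_value_threshold from the sketch after Proposition 2, assumes dense arrays X of shape (n, |\widehat{\mathcal{M}}|), Z of shape (n, p, q) and Y of shape (n,), and uses variable names and a stopping rule of our own choosing.

```python
import numpy as np

def nesterov_solve(X, Z, Y, lam1, lam2, max_iter=500, tol=1e-8):
    n, d = X.shape
    _, p, q = Z.shape
    Znew = Z.reshape(n, p * q)        # rows are vec(Z_i)
    Xnew = np.hstack([X, Znew])
    # Step size delta = n / lambda_max(Xnew^T Xnew) = 1 / L_f.
    delta = n / np.linalg.eigvalsh(Xnew.T @ Xnew).max()

    beta = beta_prev = np.zeros(d)
    B = B_prev = np.zeros((p, q))
    alpha, alpha_prev = 1.0, 0.0      # alpha^(1) = 1, alpha^(0) = 0
    obj_prev = np.inf

    def objective(beta, B):
        r = Y - X @ beta - Znew @ B.ravel()
        return (r @ r) / (2 * n) + lam1 * np.abs(beta).sum() \
               + lam2 * np.linalg.svd(B, compute_uv=False).sum()

    for _ in range(max_iter):
        # Step 2(a): extrapolated search points s^(t), S^(t).
        w = (alpha_prev - 1.0) / alpha
        s = beta + w * (beta - beta_prev)
        S = B + w * (B - B_prev)
        # Step 2(b): gradient step from (s, S).
        r = X @ s + Znew @ S.ravel() - Y
        beta_tmp = s - delta * (X.T @ r) / n
        B_tmp = S - delta * (Znew.T @ r).reshape(p, q) / n
        # Steps 2(c)-(e): shrinkage thresholding (Propositions 2 and 1).
        beta_prev, B_prev = beta, B
        beta = soft_threshold(beta_tmp, lam1 * delta)
        B = singular_value_threshold(B_tmp, lam2 * delta)
        # Step 2(f): update the extrapolation sequence.
        alpha_prev, alpha = alpha, (1.0 + np.sqrt(1.0 + (2.0 * alpha) ** 2)) / 2.0
        # Stop when the objective Q(beta, B) stabilizes.
        obj = objective(beta, B)
        if abs(obj_prev - obj) < tol * max(1.0, abs(obj_prev)):
            break
        obj_prev = obj
    return beta, B
```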

8 Data usage acknowledgement

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease. For up-to-date information, see www.adni-info.org.

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

9 Image and genetic data preprocessing

9.1 Image data preprocessing

The hippocampus surface data were preprocessed from the raw MRI data, which were collected across a variety of 1.5 Tesla MRI scanners with protocols individualized for each scanner. Standard T1-weighted images were obtained using volumetric 3-dimensional sagittal MPRAGE or equivalent protocols with varying resolutions. The typical protocol includes: inversion time (TI) = 1000 ms, flip angle = 8°, repetition time (TR) = 2400 ms, and field of view (FOV) = 24 cm, with a 256\times 256\times 170 acquisition matrix in the x-, y-, and z-dimensions, yielding a voxel size of 1.25\times 1.26\times 1.2 mm^3. We adopted a surface fluid registration based hippocampal subregional analysis package (Shi et al., 2013), which uses isothermal coordinates and fluid registration to generate one-to-one hippocampal surface registrations for surface statistics computation. The method introduces two cuts, at the front and back of the hippocampal surface, to convert it into a genus-zero surface with two open boundaries. By using conformal parameterization, it essentially converts a 3D surface registration problem into a 2D image registration problem; the flow induced in the parameter domain establishes high-order correspondences between 3D surfaces. Finally, the radial distance was computed on the registered surfaces. This software package and the associated image processing methods have been adopted and described in Wang et al. (2011).

9.2 Genetic data preprocessing

For the genetic data, we applied the following preprocessing steps to the 756 subjects in the ADNI1 study. The first-line quality control steps include (i) call rate checks per subject and per single nucleotide polymorphism (SNP) marker, (ii) gender check, (iii) sibling pair identification, (iv) the Hardy-Weinberg equilibrium test, (v) marker removal by minor allele frequency, and (vi) population stratification. The second-line preprocessing steps include removal of SNPs with (i) more than 5% missing values, (ii) minor allele frequency smaller than 10%, and (iii) Hardy-Weinberg equilibrium p-value <10^{-6}. The 503,892 SNPs obtained from the 22 autosomes were included for further processing. The MACH-Admix software (http://www.unc.edu/~yunmli/MaCH-Admix/) (Liu et al., 2013) was applied to all subjects to perform genotype imputation, using the 1000G Phase I Integrated Release Version 3 haplotypes (http://www.1000genomes.org) (Consortium et al., 2012) as the reference panel. Quality control was also conducted after imputation, excluding markers with (i) low imputation accuracy (based on the imputation output R^2), (ii) Hardy-Weinberg equilibrium p-value <10^{-6}, and (iii) minor allele frequency <5%.
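For illustration only, the post-imputation marker filters can be sketched as below for a genotype matrix G (rows subjects, columns SNPs, entries 0/1/2 with NaN for missing). We approximate the Hardy-Weinberg equilibrium test by a one-degree-of-freedom chi-square goodness-of-fit statistic; the actual pipeline also uses the tool-specific imputation R^2, which we omit, so this is not a reproduction of the real software's computation.

```python
import numpy as np
from scipy.stats import chi2

def marker_filters(G, maf_min=0.05, hwe_p_min=1e-6):
    n_obs = np.sum(~np.isnan(G), axis=0)
    p = np.nansum(G, axis=0) / (2.0 * n_obs)   # coded-allele frequency per SNP
    maf = np.minimum(p, 1.0 - p)
    # Observed genotype counts and Hardy-Weinberg expected counts.
    n0, n1, n2 = (np.sum(G == k, axis=0) for k in (0, 1, 2))
    e0, e1, e2 = n_obs * (1 - p) ** 2, 2 * n_obs * p * (1 - p), n_obs * p ** 2
    stat = (n0 - e0) ** 2 / e0 + (n1 - e1) ** 2 / e1 + (n2 - e2) ** 2 / e2
    hwe_p = chi2.sf(stat, df=1)
    return (maf >= maf_min) & (hwe_p >= hwe_p_min)   # True = marker kept
```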

10 Screening results of ADNI data applications

In Table 3, we list the top 20 SNPs selected through the blockwise joint screening procedure corresponding to the left and right hippocampi, respectively.

Left hippocampi Right hippocampi
Chromosome number SNP name Chromosome number SNP name
19 rs429358 19 rs429358
7 rs1016394 19 rs10414043
19 rs10414043 14 14:25618120:G_GC
7 rs1181947 19 rs7256200
19 rs7256200 14 rs41470748
22 rs134828 19 rs73052335
19 rs73052335 14 14:25613747:G_GT
7 7:101403195:C_CA 19 rs157594
19 rs157594 14 rs72684825
13 rs12864178 19 rs769449
19 rs769449 6 rs9386934
2 rs13030626 6 rs9374191
19 rs56131196 19 rs56131196
2 rs13030634 6 rs9372261
19 rs4420638 19 rs4420638
2 rs11694935 6 rs73526504
19 rs111789331 19 rs111789331
2 rs11696076 14 rs187421061
19 rs66626994 19 rs66626994
2 rs11692218 13 rs342709
Table 3: The top 20 SNPs selected through the blockwise joint screening procedure. The left two columns correspond to results from the left hippocampi, and the right two columns correspond to results from the right hippocampi.

We plot figures similar to Manhattan plots for \widehat{\mathcal{M}}^{*}_{1}, \widehat{\mathcal{M}}^{block,*}_{1}, \widehat{\mathcal{M}}_{2} and \widehat{\mathcal{M}}^{block}_{2} in Figure 7. Unlike conventional Manhattan plots, where genomic coordinates are displayed along the x-axis and the negative logarithm of the association p-value for each SNP on the y-axis, our analysis does not produce p-values. In these figures, the y-axis therefore represents the magnitudes of |\widehat{\beta}^{M}_{l}|, \widehat{\beta}^{block,M}_{l}, \|\widehat{\bm{C}}_{l}^{M}\|_{op} and \widehat{C}_{l}^{block,M}, and the horizontal dashed line represents the threshold values \gamma_{1,n} (Panel (a)), \gamma_{2,n} (Panel (b)), \gamma_{3,n} (Panel (c)) and \gamma_{4,n} (Panel (d)). In Panels (c) and (d), the left and right figures represent the left and right hippocampi, respectively. The SNPs with |\widehat{\beta}^{M}_{l}|, \widehat{\beta}^{block,M}_{l}, \|\widehat{\bm{C}}_{l}^{M}\|_{op} and \widehat{C}_{l}^{block,M} greater than or equal to \gamma_{1,n}, \gamma_{2,n}, \gamma_{3,n} and \gamma_{4,n}, and hence selected into \widehat{\mathcal{M}}^{*}_{1}, \widehat{\mathcal{M}}^{block,*}_{1}, \widehat{\mathcal{M}}_{2} and \widehat{\mathcal{M}}^{block}_{2} respectively, are highlighted with red diamond symbols.

Figure 7: Real data results: Panels (a) – (d) present |\widehat{\beta}^{M}_{l}|, \widehat{\beta}^{block,M}_{l}, \|\widehat{\bm{C}}_{l}^{M}\|_{op} and \widehat{C}_{l}^{block,M}, with genomic coordinates displayed along the x-axis and the corresponding magnitudes on the y-axis; the horizontal dashed lines represent the threshold values \gamma_{1,n} (Panel (a)), \gamma_{2,n} (Panel (b)), \gamma_{3,n} (Panel (c)) and \gamma_{4,n} (Panel (d)). In Panels (c) and (d), the left and right figures represent the left and right hippocampi, respectively. The SNPs at or above the thresholds, and hence selected into \widehat{\mathcal{M}}^{*}_{1}, \widehat{\mathcal{M}}^{block,*}_{1}, \widehat{\mathcal{M}}_{2} and \widehat{\mathcal{M}}^{block}_{2} respectively, are highlighted with red diamond symbols.

11 Sensitivity analysis of ADNI data applications

In our analysis, we set \widehat{\mathcal{M}}_{1}^{*} and \widehat{\mathcal{M}}_{2} to the same size, following the convention that the size of a screening set is determined only by the sample size (Fan and Lv, 2008), which is common to \widehat{\mathcal{M}}_{1}^{*} and \widehat{\mathcal{M}}_{2}. To assess the sensitivity of our results to this choice, we conduct sensitivity analyses varying the relative sizes of \widehat{\mathcal{M}}_{1}^{*} and \widehat{\mathcal{M}}_{2} in the joint screening procedure. For simplicity, we only consider the joint screening procedure proposed in Section 3.3. Figure 8 shows the estimates \widehat{\bm{B}} corresponding to the left hippocampi (left part) and the right hippocampi (right part) using \widehat{\mathcal{M}}=\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}. Denote the estimates corresponding to |\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=1/2, 1, 2 by \widehat{\bm{B}}^{(0.5)}, \widehat{\bm{B}}^{(1)} and \widehat{\bm{B}}^{(2)}, respectively, where we set |\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor=89. These estimates are plotted in Figure 8 (a), (b) and (c), respectively. In addition, we consider |\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=1 with |\widehat{\mathcal{M}}|=2\lfloor n/\log(n)\rfloor=178; the corresponding estimate, denoted \widetilde{\bm{B}}^{(1)}, is plotted in Figure 8 (d).

Figure 8: Real data results: Panels (a), (b), (c) and (d) plot the estimates \widehat{\bm{B}}^{(0.5)}, \widehat{\bm{B}}^{(1)}, \widehat{\bm{B}}^{(2)} and \widetilde{\bm{B}}^{(1)} corresponding to the left hippocampi (left part) and the right hippocampi (right part).

Furthermore, by defining the relative risk of an estimate 𝑩^\widehat{\bm{B}} as RR(𝑩^)=𝑩^𝑩^(1)F2𝑩^(1)F2\mathrm{RR}(\widehat{\bm{B}})=\frac{\|\widehat{\bm{B}}-\widehat{\bm{B}}^{(1)}\|_{F}^{2}}{\|\widehat{\bm{B}}^{(1)}\|_{F}^{2}}, we report the relative risks of three estimates 𝑩^(0.5)\widehat{\bm{B}}^{(0.5)}, 𝑩^(2)\widehat{\bm{B}}^{(2)} and 𝑩~(1)\widetilde{\bm{B}}^{(1)} in Table 4.

Left hippocampi Right hippocampi
RR(𝑩^(0.5))\mathrm{RR}(\widehat{\bm{B}}^{(0.5)}) 0.0022 0.1074
RR(𝑩^(2))\mathrm{RR}(\widehat{\bm{B}}^{(2)}) 0.2938 0.0907
RR(𝑩~(1))\mathrm{RR}(\widetilde{\bm{B}}^{(1)}) 0.0611 0.0927
Table 4: The relative risks of 𝑩^(0.5)\widehat{\bm{B}}^{(0.5)}, 𝑩^(2)\widehat{\bm{B}}^{(2)} and 𝑩~(1)\widetilde{\bm{B}}^{(1)} for left and right hippocampi.
Number of negative entries Left hippocampi Right hippocampi
𝑩^(0.5)\widehat{\bm{B}}^{(0.5)} 15,000 15,000
𝑩^(1)\widehat{\bm{B}}^{(1)} 15,000 15,000
𝑩^(2)\widehat{\bm{B}}^{(2)} 14,600 15,000
𝑩~(1)\widetilde{\bm{B}}^{(1)} 15,000 15,000
Table 5: Number of negative entries of 𝑩^\widehat{\bm{B}} for left and right hippocampi.
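For concreteness, the two summaries reported in Tables 4 and 5 can be computed as in the following sketch, with \widehat{\bm{B}}^{(1)} as the reference estimate; the function names are ours.

```python
import numpy as np

def relative_risk(B_hat, B_hat_ref):
    """RR(B_hat) = ||B_hat - B_hat_ref||_F^2 / ||B_hat_ref||_F^2 (Table 4)."""
    return np.linalg.norm(B_hat - B_hat_ref, "fro") ** 2 / np.linalg.norm(B_hat_ref, "fro") ** 2

def n_negative(B_hat):
    """Number of negative entries of B_hat (Table 5)."""
    return int((B_hat < 0).sum())
```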

To summarize, the estimate \widehat{\bm{B}} is not very sensitive to the choices of |\widehat{\mathcal{M}}_{1}^{*}| and |\widehat{\mathcal{M}}_{2}|, except for the left hippocampi when |\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=2 and |\widehat{\mathcal{M}}|=89. In fact, as shown in Table 5, when |\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=2, there are 400 entries of the left-hippocampi estimate that are non-negative. We believe this may be due to some confounder variables being missed in the screening step. For instance, we find that rs157582, a previously identified risk locus for Alzheimer's disease (Guo et al., 2019), is adjusted for in estimating \widehat{\bm{B}}^{(0.5)}, \widehat{\bm{B}}^{(1)} and \widetilde{\bm{B}}^{(1)}, but not \widehat{\bm{B}}^{(2)}. In general, however, as demonstrated in Figure 8, the estimates \widehat{\bm{B}} are similar across the different choices of |\widehat{\mathcal{M}}_{1}^{*}| and |\widehat{\mathcal{M}}_{2}|.

12 Subgroup analysis of the ADNI data application

We repeat the analysis on the 391 MCI and AD subjects. The estimates \widehat{\bm{B}}, mapped onto a representative hippocampal surface for each part of the hippocampus, are plotted in Figure 9(a); the hippocampal subfields (Apostolova et al., 2006) are plotted in Figure 9(b). The results are similar to the complete data analysis including all 566 subjects. For example, from these plots, 13,700 entries of \widehat{\bm{B}} for the left hippocampi and all 15,000 entries of \widehat{\bm{B}} for the right hippocampi are negative. This implies that the radial distances at each pixel of both hippocampi are mostly negatively associated with the ADAS-13 score, which depicts the severity of behavioral deficits. Furthermore, the subfields with the strongest associations are still mostly CA1 and subiculum.

Figure 9: Real data results for the MCI and AD subgroup: Panel (a) plots the estimate \widehat{\bm{B}} corresponding to the left hippocampi (left part) and the right hippocampi (right part). Panel (b) plots the hippocampal subfields.

13 Results for mediation analyses

We perform the SNP-imaging-outcome mediation analyses following the same procedure as in Bi et al. (2017). Specifically, in the first step we regress the 30,000 imaging measures against the 6,087,205 SNPs to search for pairs of intermediate imaging measures and genetic variants. The behavioral outcome is then fitted against each candidate genetic variant to identify a direct and significant influence. In the last step, the behavioral outcome is fitted against the identified genetic variant and its associated intermediate imaging measure simultaneously. A mediation relationship is established if (a) the genetic variant is significant in both the first and second steps, (b) the intermediate imaging measure is significant in the last step, and (c) the genetic variant has a smaller coefficient in the last step than in the second step. The total effect of the genetic variant in the second step should be the sum of the direct and indirect effects, which motivates criterion (c) of coefficient comparison. Note that the total effect may not always be greater than the direct effect in the last step when the direct and indirect effects have opposite signs, whereas the causal inference tool proposed in this paper does not have this problem. Similar to Bi et al. (2017), we try to identify pairs of SNP and imaging measure for which the direct effect of the SNP on the behavioral outcome, the effect of the SNP on the imaging measure, and the effect of the imaging measure on the behavioral outcome are all significant. However, no SNP has at least one paired imaging measure (i.e. hippocampal imaging pixel) that is significant. Therefore, our analysis provides no evidence for an SNP-imaging-outcome mediating relationship.
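A minimal sketch of this three-step screen for a single (SNP, imaging measure) pair is given below, assuming statsmodels is available; alpha is a nominal significance level and all names are ours. In the actual analysis, each step is run over all candidate pairs.

```python
import numpy as np
import statsmodels.api as sm

def mediation_screen(g, m, y, alpha=0.05):
    # Step 1: regress the imaging measure on the SNP.
    fit1 = sm.OLS(m, sm.add_constant(g)).fit()
    # Step 2: regress the outcome on the SNP (total effect).
    fit2 = sm.OLS(y, sm.add_constant(g)).fit()
    # Step 3: regress the outcome on the SNP and the imaging measure jointly.
    fit3 = sm.OLS(y, sm.add_constant(np.column_stack([g, m]))).fit()
    snp_significant = fit1.pvalues[1] < alpha and fit2.pvalues[1] < alpha  # criterion (a)
    mediator_significant = fit3.pvalues[2] < alpha                         # criterion (b)
    attenuated = abs(fit3.params[1]) < abs(fit2.params[1])                 # criterion (c)
    return snp_significant and mediator_significant and attenuated
```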

14 Additional results for simulation studies

In this section, we list additional simulation results. In particular, Figures 10 – 14 present the screening results for Section 5.1 with (n,s,\sigma)=(200,5000,0.5), (500,5000,1), (500,5000,0.5), (1000,5000,1) and (1000,5000,0.5), respectively. Section 14.1 presents the sensitivity and specificity analyses for Section 5.2, with the detailed definitions of sensitivity and specificity given there. Section 14.2 presents an additional simulation study considering various sparsity levels of instrumental variables. Section 14.3 presents an additional simulation study considering different covariances of exposure errors. Section 14.4 conducts an additional simulation study varying the relative sizes of \widehat{\mathcal{M}}_{1}^{*} and \widehat{\mathcal{M}}_{2}. Section 14.5 lists additional screening and estimation results for Section 5.3 of the main article.

Figure 10: Simulation results for the case (n,s,\sigma)=(200,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 11: Simulation results for the case (n,s,\sigma)=(500,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 12: Simulation results for the case (n,s,\sigma)=(500,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 13: Simulation results for the case (n,s,\sigma)=(1000,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 14: Simulation results for the case (n,s,\sigma)=(1000,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

14.1 Sensitivity and specificity analyses of simulation

In this subsection, we report the sensitivity and specificity of the estimates. The sensitivity (true positive rate) is defined as |{j:β^j0}1||1|\frac{|\{j:\widehat{\beta}_{j}\neq 0\}\cap\mathcal{M}_{1}|}{|\mathcal{M}_{1}|}, i.e. the proportion of variables in the oracle adjustment set 1\mathcal{M}_{1} that are selected by our estimation procedure. The specificity (true negative rate) is defined as |{j:β^j=0}(𝒮)||𝒮|\frac{|\{j:\widehat{\beta}_{j}=0\}\cap(\mathcal{I}\cup\mathcal{S})|}{|\mathcal{I}\cup\mathcal{S}|}, i.e. the proportion of variables not in the oracle adjustment set 1\mathcal{M}_{1} that are not selected by our estimation procedure. Furthermore, we define the instrumental specificity as |{j:β^j=0}|||\frac{|\{j:\widehat{\beta}_{j}=0\}\cap\mathcal{I}|}{|\mathcal{I}|}, i.e. the proportion of variables in the instrumental set \mathcal{I} that are not selected by our estimation procedure.
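These three metrics can be computed directly from the fitted \widehat{\bm{\beta}}, as in the following sketch, where M1, I and S are integer index arrays for \mathcal{M}_1, \mathcal{I} and \mathcal{S}; names are ours.

```python
import numpy as np

def selection_metrics(beta_hat, M1, I, S):
    selected = beta_hat != 0
    sensitivity = selected[M1].mean()                 # fraction of M1 selected
    specificity = (~selected[np.concatenate([I, S])]).mean()
    instrumental_specificity = (~selected[I]).mean()  # fraction of I excluded
    return sensitivity, specificity, instrumental_specificity
```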

Table 6: Simulation results for \sigma=1 and \sigma=0.5, when n=500. The average \bm{\beta} sensitivity, \bm{\beta} instrumental specificity and \bm{\beta} specificity, the MSE for \bm{\beta}, and the MSE for \bm{B}, with their associated standard errors in parentheses, are reported. The results are based on 100 Monte Carlo repetitions. The “No Lasso” estimate is calculated by including all the selected variables from the screening step and then estimating \bm{B} using the optimization (3) without the l_{1}-regularization. The “Oracle” estimate is calculated by pretending to know the correct set of confounders and precision variables as X and then estimating \bm{B} using the optimization (3) without the l_{1}-regularization.
n = 500 Sensitivity Instrumental specificity Specificity MSE 𝜷\bm{\beta} MSE 𝑩{\bm{B}}
σ\sigma = 1.0
Oracle 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.036(0.002) 0.553(0.005)
Proposed 0.833(0.000) 0.293(0.020) 0.998(0.000) 0.303(0.008) 0.574(0.006)
No Lasso 1.000(0.000) 0.000(0.000) 0.985(0.000) 1.740(0.078) 0.693(0.013)
σ\sigma = 0.5
Oracle 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.006(0.000) 0.340(0.004)
Proposed 0.897(0.008) 0.217(0.017) 0.999(0.000) 0.191(0.005) 0.345(0.004)
No Lasso 1.000(0.000) 0.000(0.000) 0.985(0.000) 0.372(0.017) 0.371(0.004)

In the simulation studies, we report the results for the case n=500, which is close to the sample size of 566 in the real data. From Table 6, one can see that the second step only regularizes out some of the instrumental variables. We conjecture that a better tuning method may exist that regularizes out more instrumental variables while still keeping the confounders and precision variables in the model; we leave this as future research. Nevertheless, one can also see from Table 6 that although the proposed method may not remove all of the instrumental variables, eliminating even just some of the instruments greatly reduces the MSEs of both \bm{\beta} and \bm{B}, compared to the method where we do not impose l_{1}-regularization on \bm{\beta} in the second-step estimation (denoted the “No Lasso” method). In addition, the estimation of \bm{B} is reasonably good compared to the oracle estimates, as shown in Table 1 of the main article.

14.2 Screening under different sparsity levels

We also consider different sparsity levels in the simulation. This is of particular interest for our study since, when there are more instrumental variables than confounders and precision variables, as could well be the case in an imaging-genetics study, the robustness of the proposed method may be undermined. As discussed before, to reduce bias and increase the statistical efficiency of the estimated \bm{B}, the ideal adjustment set should include all confounders and precision variables while excluding instrumental and irrelevant variables. In particular, we consider three scenarios in which the size of the instrumental variable set \mathcal{I} is one, two, and eight times the size of the set \mathcal{M}_{1} of confounders and precision variables.

We set s=5000 and the settings for \bm{B} and \bm{C} remain the same as before: \bm{B} is as in Figure 4(a), and \bm{C} is as in Figure 4(b). Further, we set \bm{C}_{l}=v_{l}*\bm{C}, where v_{1}=-1/3, v_{2}=-1, v_{3}=-3, v_{207}=v_{210}=\ldots=v_{204+6L}=-3, v_{208}=v_{211}=\ldots=v_{205+6L}=-1, v_{209}=v_{212}=\ldots=v_{206+6L}=-1/3, and v_{l}=0 for 4\leq l\leq 206 and 207+6L\leq l\leq s. Here L is a positive integer. We set \beta_{1}=3, \beta_{2}=1, \beta_{3}=1/3, \beta_{104}=3, \beta_{105}=1, \beta_{106}=1/3, and \beta_{l}=0 for 4\leq l\leq 103 and 107\leq l\leq s. In this setting, we have \mathcal{C}=\{1,2,3\}, \mathcal{P}=\{104,105,106\}, \mathcal{I}=\{207,208,209,\ldots,206+6L\} and \mathcal{S}=\{1,\ldots,5000\}\backslash\{1,2,3,104,105,106,207,208,209,\ldots,206+6L\}. Note that \frac{|\mathcal{I}|}{|\mathcal{C}\cup\mathcal{P}|}=\frac{|\mathcal{I}|}{|\mathcal{M}_{1}|}=L. For n=200, we let \sigma=1 and 0.5, and consider three different sparsity levels L=1,2,8. The complete screening results can be found in Figures 16 – 21.

Specifically, as summarized in Figure 15, when the number of instrumental variables is much larger than that of confounders and precision variables, the size of \mathcal{M}_{1}\cup\mathcal{M}_{2} exceeds the number of covariates kept in the first screening step. In this case, our results show that the screening step may include many instrumental variables while missing some confounders and precision variables, which may deteriorate the accuracy and efficiency of the second-step estimation.

Figure 15: Simulation results where the size of the instrumental variable set \mathcal{I} is one, two, and eight times that of \mathcal{M}_{1}, for the case (n,s)=(200,5000): Panels (a), (c) and (e) plot the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\} when \sigma=1, for |\mathcal{I}|/|\mathcal{M}_{1}|=1, 2 and 8, respectively; Panels (b), (d) and (f) plot the same when \sigma=0.5. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 16: Simulation results where the number of instrumental variables equals the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 17: Simulation results where the number of instrumental variables is twice the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 18: Simulation results where the number of instrumental variables is eight times the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 19: Simulation results where the number of instrumental variables equals the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 20: Simulation results where the number of instrumental variables is twice the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 21: Simulation results where the number of instrumental variables is eight times the size of $\mathcal{M}_{1}$ for the case $(n,s,\sigma)=(200,5000,0.5)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

14.3 Screening and estimation under different covariances of exposure errors

We also consider different exposure-error distributions in the simulation. We use the same setting as Section 5.1 of the main paper but take three different covariance structures for the exposure errors $\bm{E}_{i}$. In particular, the random error $\mathrm{vec}(\bm{E}_{i})$ is independently generated from $N(\bm{0},\bm{\Sigma}_{e})$, where we set the standard deviation of every element of $\bm{E}_{i}$ to $\sigma_{e}=0.2$ and the correlation between $\bm{E}_{i,jk}$ and $\bm{E}_{i,j^{\prime}k^{\prime}}$ to $\rho_{2}^{|j-j^{\prime}|+|k-k^{\prime}|}$ for $1\leq j,k,j^{\prime},k^{\prime}\leq 64$. We consider three scenarios, $\rho_{2}=0.2$, $0.5$, and $0.8$, and report the covariates selected in the screening step. We take $\sigma=1$ or $0.5$ and fix the sample size at $n=200$. The complete screening results can be found in Figure 5 as well as Figures 10 and 23–26 here ($\rho_{2}$ is set to $0.5$ in Section 5.1 of the main paper).

Specifically, as summarized in Figure 22, the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$ changes little as $\rho_{2}$ increases.
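To make this error model concrete, the following is a minimal sampling sketch (written in Python/numpy, which is an assumption of ours since the paper ships no code). It uses the fact that the correlation $\rho_{2}^{|j-j^{\prime}|+|k-k^{\prime}|}$ factorizes into a Kronecker product of two AR(1) correlation matrices, so the full $4096\times 4096$ covariance matrix never needs to be formed.

```python
import numpy as np

def ar1_cholesky(dim, rho):
    """Cholesky factor of an AR(1) correlation matrix with entries rho^{|j-k|}."""
    corr = rho ** np.abs(np.subtract.outer(np.arange(dim), np.arange(dim)))
    return np.linalg.cholesky(corr)

def sample_exposure_error(p=64, q=64, rho2=0.5, sigma_e=0.2, rng=None):
    """Draw one p x q error matrix E_i with sd sigma_e and
    corr(E_{jk}, E_{j'k'}) = rho2^{|j-j'| + |k-k'|}.
    Because the correlation factorizes, E = sigma_e * L_p G L_q^T with
    G iid standard normal has exactly this covariance."""
    rng = np.random.default_rng(rng)
    L_p, L_q = ar1_cholesky(p, rho2), ar1_cholesky(q, rho2)
    G = rng.standard_normal((p, q))
    return sigma_e * L_p @ G @ L_q.T

E_i = sample_exposure_error(rho2=0.8, rng=0)  # one draw under the rho2 = 0.8 scenario
```

The Kronecker route is a design choice: $\mathrm{cov}\{\mathrm{vec}(E)\}=\sigma_{e}^{2}(A_{q}\otimes A_{p})$ with $A_{jk}=\rho_{2}^{|j-k|}$, so sampling costs two small Cholesky factorizations instead of one $pq\times pq$ factorization.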

[Panels (a)–(f): $(\rho_{2},\sigma)$ = (0.2, 1), (0.2, 0.5), (0.5, 1), (0.5, 0.5), (0.8, 1), (0.8, 0.5).]
Figure 22: Simulation results where the size of the instrumental variable set $\mathcal{I}$ is the same as, twice, and eight times the size of $\mathcal{M}_{1}$ for the case $(n,s)=(200,5000)$: Panels (a), (c), and (e) plot the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$ when $\sigma=1$ (with $\rho_{2}=0.2$, $0.5$, and $0.8$, respectively); Panels (b), (d), and (f) plot the same when $\sigma=0.5$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

In addition, we evaluate the performance of our estimation procedure after the first-step screening. For the size of $\widehat{\mathcal{M}}$ in the screening step, we set $|\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor$, so that $|\widehat{\mathcal{M}}|=37$ for sample size $n=200$. We report the mean squared errors (MSEs) for $\bm{\beta}$ and $\bm{B}$, defined as $\|\bm{\beta}-\widehat{\bm{\beta}}\|_{2}^{2}$ and $\|\bm{B}-\widehat{\bm{B}}\|_{F}^{2}$, respectively.
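As a small computational aside, the screening-set size and both error metrics are one-liners; the sketch below is ours (Python/numpy assumed, with illustrative function names), not code from the paper.

```python
import numpy as np

def screening_size(n):
    # |M_hat| = floor(n / log(n)); equals 37 for n = 200
    return int(np.floor(n / np.log(n)))

def mse_beta(beta_hat, beta_true):
    # squared l2 error: ||beta - beta_hat||_2^2
    return float(np.sum((beta_hat - beta_true) ** 2))

def mse_B(B_hat, B_true):
    # squared Frobenius error: ||B - B_hat||_F^2
    return float(np.linalg.norm(B_hat - B_true, "fro") ** 2)

print(screening_size(200))  # 37
```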

Table 7 summarizes the average MSEs for $\bm{\beta}$ and $\bm{B}$ over 100 Monte Carlo runs. We can see that the MSEs decrease as $\rho_{2}$ increases. As the nuclear-norm-penalized estimation procedure can be regarded as a form of spatial smoothing, the large correlations among the elements of $\bm{E}_{i}$ actually help with the spatial smoothing, and thus the estimation accuracy improves as $\rho_{2}$ increases.

Table 7: Simulation results for $\sigma=1$ and $\sigma=0.5$: the average MSEs for $\bm{\beta}$ and $\bm{B}$, with their associated standard errors in parentheses, are reported. The results are based on 100 Monte Carlo repetitions.
$\sigma=1.0$: MSE $\bm{\beta}$, MSE $\bm{B}$ | $\sigma=0.5$: MSE $\bm{\beta}$, MSE $\bm{B}$
$\rho_{2}=0.2$: 0.986 (0.099), 0.802 (0.011) | $\rho_{2}=0.2$: 0.397 (0.042), 0.701 (0.006)
$\rho_{2}=0.5$: 0.496 (0.021), 0.667 (0.005) | $\rho_{2}=0.5$: 0.276 (0.009), 0.528 (0.005)
$\rho_{2}=0.8$: 0.252 (0.010), 0.439 (0.007) | $\rho_{2}=0.8$: 0.097 (0.006), 0.305 (0.004)
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 23: Simulation results for the case $(n,s,\sigma,\rho_{2})=(200,5000,1,0.2)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 24: Simulation results for the case $(n,s,\sigma,\rho_{2})=(200,5000,1,0.8)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 25: Simulation results for the case $(n,s,\sigma,\rho_{2})=(200,5000,0.5,0.2)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 26: Simulation results for the case $(n,s,\sigma,\rho_{2})=(200,5000,0.5,0.8)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

14.4 Screening and estimation under different sizes of $\widehat{\mathcal{M}}_{1}^{*}$ and $\widehat{\mathcal{M}}_{2}$

In addition, we conduct a similar study following the same setting as described in Section 5.1 of the main paper, with $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=2$ or $1/2$. We use $n=500$ here since it is close to the number of observations ($n=566$) in the real data analysis. Specifically, as summarized in Figures 27 and 28, when the ratio $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|$ is taken as $1/2$ or $2$, the performance of the proposed joint screening method is quite similar in the two cases; comparing them with Figure 11, where $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=1$, the screening-step results are likewise quite similar.

[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 27: Simulation results for the case $(n,s,\sigma)=(500,5000,1)$ when $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=1/2$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 28: Simulation results for the case $(n,s,\sigma)=(500,5000,1)$ when $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=2$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

Furthermore, we evaluate the performance of our estimation procedure after the first-step screening. For the size of $\widehat{\mathcal{M}}$ in the screening step, we set $|\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor$, so that $|\widehat{\mathcal{M}}|=89$ for sample size $n=500$. We report the mean squared errors (MSEs) for $\bm{\beta}$ and $\bm{B}$, defined as $\|\bm{\beta}-\widehat{\bm{\beta}}\|_{2}^{2}$ and $\|\bm{B}-\widehat{\bm{B}}\|_{F}^{2}$, respectively. As summarized in Table 8, the average MSEs for $\bm{\beta}$ and $\bm{B}$ over 100 Monte Carlo runs are similar across the different choices of $|\widehat{\mathcal{M}}_{1}^{*}|$ and $|\widehat{\mathcal{M}}_{2}|$. Therefore, depending on prior knowledge about the sizes and signal strengths of the confounding, precision, and instrumental variables, one may choose $\widehat{\mathcal{M}}_{1}^{*}$ and $\widehat{\mathcal{M}}_{2}$ differently; the resulting estimates of $\bm{B}$ appear to be similar across these choices.

Table 8: Simulation results of the proposed estimates for $(n,s,\sigma)=(500,5000,1)$, when $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|$ is taken as 0.5, 1.0, and 2.0: the average MSEs for $\bm{\beta}$ and $\bm{B}$, with their associated standard errors in parentheses, are reported. The results are based on 100 Monte Carlo repetitions.
$|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|$ MSE $\bm{\beta}$ MSE $\bm{B}$
0.5 0.301(0.008) 0.567(0.005)
1.0 0.303(0.008) 0.574(0.006)
2.0 0.302(0.008) 0.574(0.006)

14.5 Screening and estimation using blockwise joint screening

In this section, we list additional screening and estimation results for Section 5.3 of the main article. In particular, the results for the screening step, in which $s=5000$, $\sigma=1$, $n=200,500,1000$, and $K=2,4,6,12,24,52$, can be found in Figures 29–46. The complete results for the second-step estimation under the same settings can be found in Table 9.

[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 29: Simulation results for the case $(n,s,K,\sigma)=(200,5000,2,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 30: Simulation results for the case $(n,s,K,\sigma)=(200,5000,4,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 31: Simulation results for the case $(n,s,K,\sigma)=(200,5000,6,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 32: Simulation results for the case $(n,s,K,\sigma)=(200,5000,12,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 33: Simulation results for the case $(n,s,K,\sigma)=(200,5000,24,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 34: Simulation results for the case $(n,s,K,\sigma)=(200,5000,52,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 35: Simulation results for the case $(n,s,K,\sigma)=(500,5000,2,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 36: Simulation results for the case $(n,s,K,\sigma)=(500,5000,4,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 37: Simulation results for the case $(n,s,K,\sigma)=(500,5000,6,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 38: Simulation results for the case $(n,s,K,\sigma)=(500,5000,12,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 39: Simulation results for the case $(n,s,K,\sigma)=(500,5000,24,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 40: Simulation results for the case $(n,s,K,\sigma)=(500,5000,52,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 41: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,2,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 42: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,4,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 43: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,6,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 44: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,12,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 45: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,24,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 46: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,52,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
Table 9: Simulation results for $\sigma=1$: the average MSEs for $\bm{\beta}$ and $\bm{B}$, with their associated standard errors in parentheses, are reported. The left panel summarizes the results from the joint screening method; the right panel summarizes the results from the blockwise joint screening method. The results are based on 100 Monte Carlo repetitions.
Proposed: MSE $\bm{\beta}$, MSE $\bm{B}$ | Proposed (block): MSE $\bm{\beta}$, MSE $\bm{B}$
n=200,K=2 1.423(0.096) 0.785(0.009) n=200,K=2 1.390(0.090) 0.793(0.010)
n=500,K=2 0.831(0.069) 0.726(0.008) n=500,K=2 0.892(0.082) 0.734(0.009)
n=1000,K=2 0.591(0.050) 0.676(0.008) n=1000,K=2 0.488(0.028) 0.666(0.006)
n=200,K=4 1.667(0.096) 0.815(0.011) n=200,K=4 1.548(0.088) 0.805(0.010)
n=500,K=4 1.059(0.082) 0.751(0.011) n=500,K=4 1.094(0.090) 0.758(0.012)
n=1000,K=4 0.606(0.057) 0.671(0.008) n=1000,K=4 0.555(0.045) 0.678(0.008)
n=200,K=6 1.955(0.101) 0.826(0.010) n=200,K=6 1.701(0.084) 0.816(0.009)
n=500,K=6 1.155(0.085) 0.749(0.011) n=500,K=6 1.107(0.089) 0.752(0.011)
n=1000,K=6 0.578(0.051) 0.674(0.008) n=1000,K=6 0.551(0.047) 0.672(0.008)
n=200,K=12 2.466(0.096) 0.890(0.039) n=200,K=12 2.223(0.129) 0.838(0.011)
n=500,K=12 1.024(0.082) 0.735(0.010) n=500,K=12 0.927(0.077) 0.727(0.008)
n=1000,K=12 0.570(0.046) 0.673(0.008) n=1000,K=12 0.627(0.057) 0.681(0.009)
n=200,K=24 2.533(0.164) 0.847(0.014) n=200,K=24 2.136(0.138) 0.821(0.010)
n=500,K=24 1.065(0.080) 0.733(0.010) n=500,K=24 1.119(0.088) 0.737(0.011)
n=1000,K=24 0.662(0.050) 0.669(0.008) n=1000,K=24 0.677(0.056) 0.673(0.009)
n=200,K=52 14.650(0.815) 2.034(0.487) n=200,K=52 13.693(0.728) 1.870(0.459)
n=500,K=52 1.816(0.144) 0.775(0.019) n=500,K=52 1.725(0.143) 0.762(0.019)
n=1000,K=52 0.937(0.066) 0.684(0.010) n=1000,K=52 0.861(0.056) 0.675(0.008)

15 Theoretical properties

From here on, we denote $\dot{\bm{\beta}}=(\dot{\beta}_{1},\ldots,\dot{\beta}_{s})^{T}$, $\dot{\bm{B}}$, and $\dot{\bm{C}}_{l}$ as the true values of $\bm{\beta}=(\beta_{1},\ldots,\beta_{s})^{T}$, $\bm{B}$, and $\bm{C}_{l}$, respectively. Furthermore, we denote $\dot{\mathcal{C}}$, $\dot{\mathcal{P}}$, and $\dot{\mathcal{I}}$ as the true index sets of $\mathcal{C}$, $\mathcal{P}$, and $\mathcal{I}$.

15.1 Sure screening property

In this subsection, we study the theoretical properties of our screening procedure. We let $\mathcal{M}_{1}=\{1\leq l\leq s_{n}:\dot{\beta}_{l}^{*}\neq 0\}=\dot{\mathcal{C}}\cup\dot{\mathcal{P}}$, where $\dot{\mathcal{C}}=\{1\leq l\leq s_{n}:\dot{\bm{C}}_{l}\neq\bm{0}\textrm{ and }\dot{\beta}_{l}^{*}\neq 0\}$ and $\dot{\mathcal{P}}=\{1\leq l\leq s_{n}:\dot{\bm{C}}_{l}=\bm{0}\textrm{ and }\dot{\beta}_{l}^{*}\neq 0\}$. Here $\dot{\beta}_{l}^{*}$ and $\dot{\bm{C}}_{l}$ are the true values of $\beta_{l}$ and $\bm{C}_{l}$, respectively, and $\dot{\bm{B}}$ is the true value of $\bm{B}$.

We have the following theorems, where the assumptions needed are included in Section 16.

Theorem 1.

Under Assumptions (A0)–(A3) and (A5), let $\gamma_{1,n}=\alpha D_{1}n^{-\kappa}$ and $\gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}$ with $0<\alpha<1$. Then we have $P(\mathcal{M}_{1}\subset\widehat{\mathcal{M}})\to 1$ as $n\to\infty$.

Since the screening procedure automatically includes all of the significant covariates for small values of $\gamma_{1,n}$ and $\gamma_{2,n}$, it is necessary to consider the size of $\widehat{\mathcal{M}}$, which we quantify in Theorem 2.

Theorem 2.

Under Assumptions (A0)–(A5), when $\gamma_{1,n}=\alpha D_{1}n^{-\kappa}$ and $\gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}$ with $0<\alpha<1$, we have $P(|\widehat{\mathcal{M}}|=O(n^{2\kappa+\tau}))\to 1$ as $n\to\infty$.

Corollary 1.

Under Assumptions (A0)–(A5), when $\gamma_{1,n}=\alpha D_{1}n^{-\kappa}$ and $\gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}$ with $0<\alpha<1$, we have $P(|\widehat{\mathcal{M}}-\widehat{\mathcal{M}}_{1}^{*}|=O(n^{2\kappa+\tau}))\to 1$ as $n\to\infty$.

Theorem 1 shows that if $\gamma_{1,n}$ and $\gamma_{2,n}$ are chosen properly, our screening procedure will include all significant variables with high probability. Theorem 2 guarantees that the size of the selected model from the screening procedure is only of a polynomial order of $n$, even though the original model size is of an exponential order of $n$. Therefore, the false selection rate of our screening procedure vanishes as $n\to\infty$, while the size of $\widehat{\mathcal{M}}$ grows at a polynomial order of $n$, where the order depends on the two constants $\kappa$ and $\tau$ defined in Section 16. Theorem 3 shows that our blockwise screening procedure also enjoys the sure screening property; an illustrative sketch of such a thresholding rule is given after Theorem 3. The proofs of these theorems are collected in Section 18.

Theorem 3.

Assume (A0)–(A3) and (A5) hold, and further assume that the $j$-th block size satisfies $|\mathcal{B}_{j}|=D_{6}n^{\nu_{1}}$ for some constant $D_{6}>0$. Let $\gamma_{1,n}=\alpha D_{1}n^{-\kappa}$, $\gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}$, $\gamma_{3,n}=\alpha D_{1}D_{6}^{-1}n^{-\kappa-\nu_{1}}$, and $\gamma_{4,n}=\alpha D_{1}D_{6}^{-1}(pq)^{1/2}n^{-\kappa-\nu_{1}}$ with $0<\alpha<1$. Then we have $P(\mathcal{M}_{1}\subset\widehat{\mathcal{M}}^{block})\to 1$ as $n\to\infty$.
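To give some intuition for what the thresholds $\gamma_{1,n}$ and $\gamma_{2,n}$ control, the following is an illustrative sketch (in Python/numpy) of a marginal joint screening rule in the spirit of the procedure. The exact screening statistics are defined in the main paper; here we simply assume, for illustration only, that they are the marginal sample covariances of each candidate covariate with the outcome and with the matrix exposure, and all function and variable names are ours.

```python
import numpy as np

def joint_screen(X, y, Z, gamma1, gamma2):
    """Illustrative joint screening sketch: keep covariate l if its
    marginal sample covariance with the outcome exceeds gamma1 in
    absolute value, OR if the operator norm of its marginal sample
    covariance with the matrix exposure exceeds gamma2.
    X: (n, s) covariates, y: (n,) outcome, Z: (n, p, q) exposures."""
    n = X.shape[0]
    selected = []
    for l in range(X.shape[1]):
        omega_out = abs(X[:, l] @ y) / n              # outcome signal
        C_l = np.einsum("i,ijk->jk", X[:, l], Z) / n  # exposure signal
        omega_exp = np.linalg.norm(C_l, 2)            # spectral (operator) norm
        if omega_out >= gamma1 or omega_exp >= gamma2:
            selected.append(l)
    return selected
```

The union ("OR") structure is what distinguishes joint screening from outcome-only or intersection screening: a covariate survives if it is marginally relevant for either the outcome or the exposure.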

15.2 Theory for two-step estimator

In this section, we develop a unified theory for our two-step estimator. In particular, we derive a non-asymptotic bound for the final estimates. We first introduce some notation.

Denote the parameter $\bm{\theta}=\{\bm{\beta}^{T},\mathrm{vec}^{T}(\bm{B})\}^{T}\in\mathbb{R}^{s+pq}$, where $\bm{\beta}\in\mathbb{R}^{s}$ and $\bm{B}\in\mathbb{R}^{p\times q}$. Using this notation, problem (3) can be recast as minimizing $l(\bm{\theta})+P(\bm{\theta})$, where $l(\bm{\theta})=(2n)^{-1}\sum_{i=1}^{n}(Y_{i}-\langle\bm{Z}_{i},\bm{B}\rangle-\sum_{l\in\widehat{\mathcal{M}}}X_{il}\beta_{l})^{2}$ and $P(\bm{\theta})=\lambda_{1}\sum_{l\in\widehat{\mathcal{M}}}|\beta_{l}|+\lambda_{2}\|\bm{B}\|_{*}$. In addition, we let $\dot{\bm{\theta}}=\{\dot{\bm{\beta}}^{T},\mathrm{vec}(\dot{\bm{B}})^{T}\}^{T}$ be the true value of $\bm{\theta}$, where $\dot{\bm{\beta}}$ and $\dot{\bm{B}}$ are the true values of $\bm{\beta}$ and $\bm{B}$, respectively. Let $\widehat{\bm{\theta}}_{\bm{\lambda}}=\{\widehat{\bm{\beta}}^{T},\mathrm{vec}(\widehat{\bm{B}})^{T}\}^{T}$ be the proposed estimator of $\bm{\theta}$, where $\widehat{\bm{\beta}}$ and $\widehat{\bm{B}}$ are the estimators obtained from (3) with tuning parameters $\bm{\lambda}=(\lambda_{1},\lambda_{2})$.
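For concreteness, a minimal sketch of evaluating this penalized objective is given below (Python/numpy assumed; the names X_sel, Z, y, lam1, and lam2 are illustrative, and this is an evaluation of the objective, not the optimization routine itself).

```python
import numpy as np

def penalized_objective(beta, B, X_sel, Z, y, lam1, lam2):
    """Evaluate l(theta) + P(theta): squared-error loss over the screened
    covariates plus a lasso penalty on beta and a nuclear-norm penalty on B.
    X_sel: (n, |M_hat|) screened covariates, Z: (n, p, q), y: (n,)."""
    n = y.shape[0]
    fitted = X_sel @ beta + np.einsum("ijk,jk->i", Z, B)
    loss = np.sum((y - fitted) ** 2) / (2.0 * n)
    penalty = lam1 * np.sum(np.abs(beta)) + lam2 * np.linalg.norm(B, "nuc")
    return loss + penalty
```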

We now give a nonasymptotic error bound for the proposed two-step estimator $\widehat{\bm{\theta}}_{\bm{\lambda}}$:

Theorem 4.

(Nonasymptotic error bounds for the two-step estimator) Suppose Assumptions (A0)–(A9) hold with $2\kappa+\tau<1$ and $\kappa<1/4$, and that $\mathcal{M}_{1}\subset\widehat{\mathcal{M}}$ with $|\widehat{\mathcal{M}}|=O(n^{2\kappa+\tau})$. Conditional on $\widehat{\mathcal{M}}$, there exist positive constants $c_{1},c_{2},c_{3},c_{4}$, $C_{0}$, $C_{1}$, $g_{0}$, and $g_{1}$ such that, for $\lambda_{1}\geq 2\sigma_{0}[2n^{-1}\{\log(\log n)+C_{0}(2\kappa+\tau)\log n\}]^{1/2}$ and $\lambda_{2}\geq 2bs_{2}\sigma_{0}[2n^{-1}\{3\log s_{2}+\log(\log n)\}]^{1/2}+4n^{-1/2}\sigma_{\epsilon}(p^{1/2}+q^{1/2})$, with probability at least $1-c_{1}/\log n-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$, one has

$$\left\|\widehat{\bm{\theta}}_{\bm{\lambda}}-\dot{\bm{\theta}}\right\|_{2}^{2}\leq C_{0}\max\left\{C_{1}\lambda_{1}^{2}n^{2\kappa+\tau},\,\lambda_{2}^{2}r\right\}\iota^{-2}.$$

The bound in Theorem 4 implies that the convergence rate of the proposed estimator $\widehat{\bm{\theta}}_{\bm{\lambda}}$ is $O(\max\{n^{2\kappa+\tau-1},n^{1-2\tau}\})$. Here $\iota$ is a positive constant defined in Assumption (A6) in Section 16, and $r$ is the rank of $\dot{\bm{B}}$. The convergence rate is controlled by $\kappa$ and $\tau$, where $\kappa$ controls the exponential rate of model complexity that can diverge and $\tau$ controls the growth rate of the largest eigenvalue of the population covariance matrix. The proof of the theorem is deferred to Section 18.

16 Assumptions for main theorems

In this section, we state the assumptions for the main theorems. We first make the following assumptions, which are needed for Theorems 1 and 2.

(A0) The covariates $X_{i}$ are independent and identically distributed (i.i.d.) with mean zero and covariance $\Sigma_{x}$. The random errors $\epsilon_{i}$ are i.i.d. with mean zero and variance $\sigma_{\epsilon}^{2}$. Define $\sigma_{l}^{2}=(\Sigma_{x})_{ll}$. The vectorized error matrices $\mathrm{vec}(E_{i})$ are i.i.d. with mean zero and covariance $\Sigma_{e}$. There exists a constant $\sigma_{x}>0$ such that $\|\Sigma_{x}\|_{\infty}\leq\sigma_{x}$. Moreover, $X_{i}$ is independent of $E_{i}=(E_{i,jk})$ and $\epsilon_{i}$.

(A1) There exist some constants $D_{1}>0$ and $b>0$, and $0<\kappa<1/2$, such that
$$\min_{l\in\mathcal{M}_{1}}\left|\mathrm{cov}\left(\sum_{l^{\prime}\in\mathcal{M}_{1}}x_{il^{\prime}}\dot{\beta}_{l^{\prime}}^{*},\,x_{il}\right)\right|\geq D_{1}n^{-\kappa},\qquad \min_{l\in\mathcal{M}_{2}}\left\|\mathrm{cov}\left(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}\dot{\bm{C}}_{l^{\prime}},\,x_{il}\right)\right\|_{op}\geq D_{1}(pq)^{1/2}n^{-\kappa},$$
and $\max\left\{\max_{l\in\mathcal{M}_{2}}\|\dot{\bm{C}}_{l}\|_{\infty},\,\max_{l\in\mathcal{M}_{2}}\|\dot{\bm{C}}_{l}\|_{op},\,\max_{l\in\mathcal{M}_{2}}|\langle\dot{\bm{C}}_{l},\dot{\bm{B}}\rangle|,\,\max_{l\in\mathcal{M}_{1}}|\dot{\beta}_{l}^{*}|\right\}<b$.

(A2) There exist positive constants $D_{2}$ and $D_{3}$ such that
$$\max\left\{E[e^{D_{2}x_{il}^{2}}],\,E[e^{D_{2}E_{i,jk}^{2}}],\,E[e^{D_{2}\langle\bm{E}_{i},\dot{\bm{B}}\rangle^{2}}]\right\}\leq D_{3}$$
for every $1\leq l\leq s_{n}$, $1\leq j\leq p$, and $1\leq k\leq q$. Let $\bm{\epsilon}=(\epsilon_{1},\ldots,\epsilon_{n})^{T}$ denote the $n$-dimensional zero-mean error vector; there exists a constant $\sigma_{0}>0$ such that for any fixed $\bm{v}$ with $\|\bm{v}\|_{2}=1$, $P(|\langle\bm{v},\bm{\epsilon}\rangle|\geq t)\leq 2\exp\left(-\frac{t^{2}}{2\sigma_{0}^{2}}\right)$ for all $t>0$.

(A3) There exists a constant $D_{4}>0$ such that $\log(s_{n})=D_{4}n^{\xi}$ for $\xi\in(0,1-2\kappa)$.

(A4) There exist constants $D_{5}>0$ and $\tau>0$ such that $\lambda_{\max}(\Sigma_{x})\leq D_{5}n^{\tau}$.

(A5) $\log(pq)=o(n^{1-2\kappa})$.

Before we state the assumptions for Theorem 4, we first introduce some notation.

Denote $P(\bm{\theta})=P_{1}(\bm{\beta})+P_{2}(\bm{B})$, where $P_{1}(\bm{\beta})=\lambda_{1}\sum_{l\in\widehat{\mathcal{M}}}|\beta_{l}|$ and $P_{2}(\bm{B})=\lambda_{2}\|\bm{B}\|_{*}$. In addition, let $r=\mathrm{rank}(\dot{\bm{B}})$ be the true rank of the matrix $\dot{\bm{B}}\in\mathbb{R}^{p\times q}$, and consider the class of matrices $\Theta$ that have rank $r\leq\min\{p,q\}$. For any given matrix $\Theta$, we let $\mathrm{row}(\Theta)\subset\mathbb{R}^{q}$ and $\mathrm{col}(\Theta)\subset\mathbb{R}^{p}$ denote its row and column space, respectively. Let $U$ and $V$ be a given pair of $r$-dimensional subspaces $U\subset\mathbb{R}^{p}$ and $V\subset\mathbb{R}^{q}$, respectively.

For a given $\bm{\theta}$ and pair $(U,V)$, we define the subspaces $\Omega_{1}(\mathcal{M}_{1})$, $\overline{\Omega}_{1}(\mathcal{M}_{1})$, $\overline{\Omega}_{1}^{\perp}(\mathcal{M}_{1})$, $\Omega_{2}(U,V)$, $\overline{\Omega}_{2}(U,V)$, and $\overline{\Omega}_{2}^{\perp}(U,V)$ as follows:
$$\Omega_{1}(\mathcal{M}_{1})=\overline{\Omega}_{1}(\mathcal{M}_{1}):=\{\bm{\beta}\in\mathbb{R}^{s}\,|\,\beta_{j}=0\textrm{ for all }j\not\in\mathcal{M}_{1}\},$$
$$\overline{\Omega}_{1}^{\perp}(\mathcal{M}_{1}):=\{\bm{\beta}\in\mathbb{R}^{s}\,|\,\beta_{j}=0\textrm{ for all }j\in\mathcal{M}_{1}\},$$
$$\Omega_{2}(U,V):=\{\Theta\in\mathbb{R}^{p\times q}\,|\,\mathrm{row}(\Theta)\subset V\textrm{ and }\mathrm{col}(\Theta)\subset U\},$$
$$\overline{\Omega}_{2}(U,V):=\{\Theta\in\mathbb{R}^{p\times q}\,|\,\mathrm{row}(\Theta)\subset V\textrm{ or }\mathrm{col}(\Theta)\subset U\},$$
$$\overline{\Omega}_{2}^{\perp}(U,V):=\{\Theta\in\mathbb{R}^{p\times q}\,|\,\mathrm{row}(\Theta)\subset V^{\perp}\textrm{ and }\mathrm{col}(\Theta)\subset U^{\perp}\}.$$

Denote \bm{\Delta}=\{\bm{\Delta}_{1}^{T},\mathrm{vec}(\bm{\Delta}_{2})^{T}\}^{T}\in\mathbb{R}^{s+pq} with \bm{\Delta}_{1}\in\mathbb{R}^{s} and \bm{\Delta}_{2}\in\mathbb{R}^{p\times q}. Then \bm{\Delta}_{1,\overline{\Omega}_{1}}=\arg\min_{\bm{v}\in\overline{\Omega}_{1}}\|\bm{\Delta}_{1}-\bm{v}\|_{2} and \bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}}=\arg\min_{\bm{v}\in\overline{\Omega}_{1}^{\perp}}\|\bm{\Delta}_{1}-\bm{v}\|_{2}; \bm{\Delta}_{2,\overline{\Omega}_{2}}=\arg\min_{\bm{v}\in\overline{\Omega}_{2}}\|\bm{\Delta}_{2}-\bm{v}\|_{F} and \bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}}=\arg\min_{\bm{v}\in\overline{\Omega}_{2}^{\perp}}\|\bm{\Delta}_{2}-\bm{v}\|_{F}. We write \bm{X}_{comp}=(\bm{X},\bm{Z}_{new})\in\mathbb{R}^{n\times(s+pq)} with \bm{Z}_{new}=(\mathrm{vec}(\bm{Z}_{1}),\ldots,\mathrm{vec}(\bm{Z}_{n}))^{T}\in\mathbb{R}^{n\times pq}, and let X_{comp,i} represent the i-th column of \bm{X}_{comp}^{T} for i=1,\ldots,n. Without loss of generality, we assume that \bm{X} has been column-normalized, i.e. \|\bm{x}_{l}\|_{2}/\sqrt{n}=1 for all l=1,\ldots,s.

We need the following assumptions:

(A6) Define

ι:=min|𝚫1,Ω¯1|13|𝚫1,Ω¯1|1𝚫2,Ω¯23𝚫2,Ω¯21ni=1n{|Xcomp,i,𝚫|2𝚫22},\displaystyle\iota:=\min\limits_{{\tiny\begin{array}[]{c}\left|\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}}\right|_{1}\leq 3\left|\bm{\Delta}_{1,\overline{\Omega}_{1}}\right|_{1}\\ \left\|\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}}\right\|_{*}\leq 3\left\|\bm{\Delta}_{2,\overline{\Omega}_{2}}\right\|_{*}\end{array}}}\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\left|\langle X_{comp,i},\bm{\Delta}\rangle\right|^{2}}{\|\bm{\Delta}\|_{2}^{2}}\right\},

and assume that ι\iota is a positive constant.

(A7) Assume max{p,q}/log(n)\max\{p,q\}/\log(n)\to\infty and max{p,q}=o(n12τ)\max\{p,q\}=o(n^{1-2\tau}) as nn\to\infty with τ<1/2\tau<1/2.

(A8) The vectorized error matrices vec(𝑬i)\mathrm{vec}(\bm{E}_{i}) are i.i.d. N(𝟎,𝚺e2)N(\bm{0},\bm{\Sigma}_{e}^{2}), where λmax(𝚺e)CU2<\lambda_{\max}(\bm{\Sigma}_{e})\leq C_{U}^{2}<\infty.

(A9) rank(𝑩˙)=r<min(p,q)\mathrm{rank}(\dot{\bm{B}})=r<\min(p,q) holds.

17 Auxiliary lemmas

In this section, we include the auxiliary lemmas needed for the theorems and their proofs.

Lemma 1.

(Bernstein’s inequality) Let T1,,TnT_{1},\ldots,T_{n} be independent random variable with zero mean such that E(|Ti|m)m!Mm2vi/2E(|T_{i}|^{m})\leq m!M^{m-2}v_{i}/2, for every m2m\geq 2 (and all i) and some constant MM and viv_{i}. Then

P(|i=1nTi|>x)2e12x2v+Mx,P(|\sum_{i=1}^{n}T_{i}|>x)\leq 2e^{-\frac{1}{2}\frac{x^{2}}{v+Mx}},

for v=i=1nviv=\sum_{i=1}^{n}v_{i}.

This is Lemma 2.2.11 from van der Vaart and Wellner (2000) and we omit the proof here.
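As a sanity check, here is a minimal simulation sketch (not part of the paper; the distribution and constants are chosen only for illustration). For T_{i}\sim\mathrm{Uniform}(-1,1) one has |T_{i}|\leq M=1 and E(|T_{i}|^{m})\leq m!M^{m-2}v_{i}/2 with v_{i}=\mathrm{Var}(T_{i})=1/3, so the stated tail bound applies with v=n/3.

```python
# Illustrative sketch only: Monte Carlo check of Bernstein's inequality
# (Lemma 1) for T_i ~ Uniform(-1, 1), where M = 1 and v = n / 3.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 200_000
M, v = 1.0, n / 3.0

S = rng.uniform(-1.0, 1.0, size=(reps, n)).sum(axis=1)
for x in (5.0, 10.0, 15.0):
    empirical = np.mean(np.abs(S) > x)
    bound = 2.0 * np.exp(-0.5 * x**2 / (v + M * x))
    print(f"x={x}: empirical {empirical:.5f} <= bound {bound:.5f}")
```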

Lemma 2.

Under Assumptions (A0), (A1) and (A2), for arbitrary t>0t>0 and for every l,l,j,kl,l^{\prime},j,k, we have that

\displaystyle P\left(\left|\sum_{i=1}^{n}\{x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\}\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+t/D_{2})}\right\},
\displaystyle P\left(\left|\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\},
\displaystyle P\left(\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\},
\displaystyle P\left(\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\}.
Proof of Lemma 2.

Note that the last part of Assumption (A2) implies, by Theorem 3.1 of Rivasplata (2012), that there exist positive constants D_{2}^{\prime} and D_{3}^{\prime} such that E[e^{D_{2}^{\prime}\epsilon_{i}^{2}}]\leq D_{3}^{\prime}. Without loss of generality we may take D_{2}^{\prime}=D_{2} and D_{3}^{\prime}=D_{3}, so that this condition can be unified with the first part of Assumption (A2), which implies that

max{E[eD2xil2],E[eD2Ei,jk2],E[eD2𝑬i,𝑩˙2],E[eD2ϵi2]}D3\max\left\{E[e^{D_{2}x_{il}^{2}}],E[e^{D_{2}E_{i,jk}^{2}}],E[e^{D_{2}\langle\bm{E}_{i},\dot{\bm{B}}\rangle^{2}}],E[e^{D_{2}\epsilon_{i}^{2}}]\right\}\leq D_{3}

for every 1\leq l\leq s_{n}, 1\leq j\leq p and 1\leq k\leq q. Therefore, by Assumptions (A0) and (A2), the bound 2|x_{il}x_{il^{\prime}}|\leq x_{il}^{2}+x_{il^{\prime}}^{2} and the Cauchy–Schwarz inequality, we have

E[eD2|xilxilE(xilxil)|]E[eD2|xilxil|+D2|E(xilxil)|]=eD2|E(xilxil)|E[eD2|xilxil|]\displaystyle E\left[e^{D_{2}|x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)|}\right]\leq E\left[e^{D_{2}|x_{il}x_{il^{\prime}}|+D_{2}|E\left(x_{il}x_{il^{\prime}}\right)|}\right]=e^{D_{2}|E\left(x_{il}x_{il^{\prime}}\right)|}E\left[e^{D_{2}|x_{il}x_{il^{\prime}}|}\right]
eD2σxE[eD2xil2+xil22]eD2σx[E{eD2xil2}E{eD2xil2}]1/2eD2σxD3.\displaystyle\leq e^{D_{2}\sigma_{x}}E\left[e^{D_{2}\frac{x_{il}^{2}+x_{il^{\prime}}^{2}}{2}}\right]\leq e^{D_{2}\sigma_{x}}\left[E\left\{e^{D_{2}x_{il}^{2}}\right\}E\left\{e^{D_{2}x_{il^{\prime}}^{2}}\right\}\right]^{1/2}\leq e^{D_{2}\sigma_{x}}D_{3}.

Then for every m2m\geq 2, one has

E[|xilxilE(xilxil)|m]m!D2mE[eD2|xilxilE(xilxil)|]m!D2meD2σxD3.E\left[|x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})|^{m}\right]\leq\frac{m!}{D_{2}^{m}}E\left[e^{D_{2}|x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})|}\right]\leq\frac{m!}{D_{2}^{m}}e^{D_{2}\sigma_{x}}D_{3}.

It follows from Lemma 1 that

P(|i=1n{xilxilE(xilxil)}|t)2exp{t22(2nD22eD2σxD3+t/D2)}.P(|\sum_{i=1}^{n}\{x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\}|\geq t)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+t/D_{2})}\right\}.

Similarly we obtain

E[eD2|xilEi,jk|]E[eD2xil2+Ei,jk22][E{eD2xil2}E{eD2Ei,jk2}]1/2D3.\displaystyle E\left[e^{D_{2}\left|x_{il}E_{i,jk}\right|}\right]\leq E\left[e^{D_{2}\frac{x_{il}^{2}+E_{i,jk}^{2}}{2}}\right]\leq\left[E\left\{e^{D_{2}x_{il}^{2}}\right\}E\left\{e^{D_{2}E_{i,jk}^{2}}\right\}\right]^{1/2}\leq D_{3}.

Then for every m2m\geq 2, one has

E\left[\left|x_{il}E_{i,jk}\right|^{m}\right]\leq\frac{m!}{D_{2}^{m}}E\left[e^{D_{2}\left|x_{il}E_{i,jk}\right|}\right]\leq\frac{m!}{D_{2}^{m}}D_{3}.

It follows from Lemma 1 that

P(|i=1n(xilEi,jk)|t)2exp{t22(2nD22D3+t/D2)}.P\left(\left|\sum_{i=1}^{n}(x_{il}E_{i,jk})\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\}.

Similarly we have

E[eD2|xil𝑬i,𝑩˙|]\displaystyle E\left[e^{D_{2}\left|x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|}\right] \displaystyle\leq E[eD2xil2+𝑬i,𝑩˙22][E{eD2xil2}E{eD2𝑬i,𝑩˙2}]1/2D3,\displaystyle E\left[e^{D_{2}\frac{x_{il}^{2}+\langle\bm{E}_{i},\dot{\bm{B}}\rangle^{2}}{2}}\right]\leq\left[E\left\{e^{D_{2}x_{il}^{2}}\right\}E\left\{e^{D_{2}\langle\bm{E}_{i},\dot{\bm{B}}\rangle^{2}}\right\}\right]^{1/2}\leq D_{3},
E[eD2|xilϵi|]\displaystyle E\left[e^{D_{2}\left|x_{il}\epsilon_{i}\right|}\right] \displaystyle\leq E[eD2xil2+ϵi22][E{eD2xil2}E{eD2ϵi2}]1/2D3.\displaystyle E\left[e^{D_{2}\frac{x_{il}^{2}+\epsilon_{i}^{2}}{2}}\right]\leq\left[E\left\{e^{D_{2}x_{il}^{2}}\right\}E\left\{e^{D_{2}\epsilon_{i}^{2}}\right\}\right]^{1/2}\leq D_{3}.

Then, following the proof of the second inequality, one has

P(|i=1n(xil𝑬i,𝑩˙)|t)\displaystyle P\left(\left|\sum_{i=1}^{n}\left(x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right)\right|\geq t\right) \displaystyle\leq 2exp{t22(2nD22D3+t/D2)},\displaystyle 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\},
\displaystyle P\left(\left|\sum_{i=1}^{n}(x_{il}\epsilon_{i})\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\}.
∎

The following lemma is a standard result, the Gaussian comparison inequality (Anderson, 1955).

Lemma 3.

Let X and Y be zero-mean Gaussian random vectors with covariance matrices \Sigma_{X} and \Sigma_{Y}, respectively. If \Sigma_{X}-\Sigma_{Y} is positive semi-definite, then for any convex symmetric set C, P(X\in C)\leq P(Y\in C).
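A minimal numerical sketch of Lemma 3 (not part of the paper; the covariance matrices below are arbitrary): enlarging the covariance in the positive semi-definite order can only move Gaussian mass out of a centered symmetric convex set.

```python
# Illustrative sketch only: Monte Carlo check of the Gaussian comparison
# inequality (Lemma 3) in dimension 2, with Sigma_X - Sigma_Y positive
# semi-definite and C the symmetric convex box [-1, 1]^2.
import numpy as np

rng = np.random.default_rng(2)
Sigma_Y = np.array([[1.0, 0.3],
                    [0.3, 1.0]])
Sigma_X = Sigma_Y + np.diag([0.5, 0.2])   # Sigma_X - Sigma_Y is PSD

reps = 500_000
X = rng.multivariate_normal(np.zeros(2), Sigma_X, size=reps)
Y = rng.multivariate_normal(np.zeros(2), Sigma_Y, size=reps)

inside = lambda Z: np.all(np.abs(Z) <= 1.0, axis=1)   # membership in C
print(inside(X).mean(), "<=", inside(Y).mean())       # P(X in C) <= P(Y in C)
```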

18 Proofs of theorems

Proof of Theorem 1.

We can write

P{1(^1^2)}\displaystyle P\left\{\mathcal{M}_{1}\subset\left(\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\right)\right\}
\displaystyle= P\left\{\cap_{l\in\mathcal{M}_{1}}\left\{l\in\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\right\}\right\}
\displaystyle= 1-P\left\{\cup_{l\in\mathcal{M}_{1}}\left\{l\not\in\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\right\}\right\}
\displaystyle\geq 1-\sum_{l\in\mathcal{M}_{1}}P\left(l\not\in\widehat{\mathcal{M}}_{1}^{*},\,l\not\in\widehat{\mathcal{M}}_{2}\right)
=\displaystyle= 1l1P(|βlM^|γ1,n,𝑪^lMopγ2,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n},\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)
\displaystyle\geq 1l12P(𝑪^lMopγ2,n)l12cP(|βlM^|γ1,n).\displaystyle 1-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right).

Firstly, recall that 𝑪˙lM=cov(l2xil𝑪˙l,xil)\dot{\bm{C}}_{l}^{M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}*\dot{\bm{C}}_{l^{\prime}},x_{il}) i.e. C˙l,jkM=cov(l2xilC˙l,jk,\dot{C}_{l,jk}^{M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}\dot{C}_{l^{\prime},jk}, xil)=n1i=1nE(xilZi,jk)x_{il})=n^{-1}\sum_{i=1}^{n}E(x_{il}Z_{i,jk}). For every 1jp1\leq j\leq p, 1kq1\leq k\leq q and 1lsn1\leq l\leq s_{n}, we have

C^l,jkMC˙l,jkM=n1i=1n[xilZi,jkE(xilZi,jk)].\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}=n^{-1}\sum_{i=1}^{n}\left[x_{il}Z_{i,jk}-E(x_{il}Z_{i,jk})\right].

It follows from Assumptions (A0), (A1), (A2) and Lemma 2 that for any t>0t>0, one has

\displaystyle P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|\geq t\right)=P\left(\left|\sum_{i=1}^{n}\left[x_{il}Z_{i,jk}-E(x_{il}Z_{i,jk})\right]\right|\geq nt\right)
=\displaystyle= P(|l2i=1n[xilxilE(xilxil)]C˙l,jk+i=1nxilEi,jk|nt)\displaystyle P\left(\left|\sum_{l^{\prime}\in\mathcal{M}_{2}}\sum_{i=1}^{n}\left[x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\right]\dot{C}_{l^{\prime},jk}+\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq nt\right)
\displaystyle\leq l2P(|i=1n[xilxilE(xilxil)]|ntb(s2+1))+P(|i=1nxilEi,jk|nts2+1)\displaystyle\sum_{l^{\prime}\in\mathcal{M}_{2}}P\left(\left|\sum_{i=1}^{n}\left[x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\right]\right|\geq\frac{nt}{b(s_{2}+1)}\right)+P\left(\left|\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq\frac{nt}{s_{2}+1}\right)
\displaystyle\leq 2s_{2}\exp\left[-\frac{nt^{2}b^{-2}(s_{2}+1)^{-2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}t\}}\right]
+2exp[nt2(s2+1)22{2D22D3+D21(s2+1)1t}].\displaystyle+2\exp\left[-\frac{nt^{2}(s_{2}+1)^{-2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}t\}}\right].

Therefore, for every l2l\in\mathcal{M}_{2}, we have

P(𝑪^lMopγ2,n)P(𝑪^lM𝑪˙lMopD1(pq)1/2nκγ2,n)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)\leq P\left(\|\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}\|_{op}\geq D_{1}(pq)^{1/2}n^{-\kappa}-\gamma_{2,n}\right)
\displaystyle\leq P(𝑪^lM𝑪˙lMF(pq)1/2(1α)D1nκ)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}\|_{F}\geq(pq)^{1/2}(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle= P\left(\sum_{j,k}\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|^{2}\geq pq\left\{(1-\alpha)D_{1}n^{-\kappa}\right\}^{2}\right)
\displaystyle\leq j,kP(|C^l,jkMC˙l,jkM|2{(1α)D1nκ}2)\displaystyle\sum_{j,k}P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|^{2}\geq\left\{(1-\alpha)D_{1}n^{-\kappa}\right\}^{2}\right)
\displaystyle\leq j,kP(|C^l,jkMC˙l,jkM|(1α)D1nκ)\displaystyle\sum_{j,k}P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq 2pq(s2exp[n12κ{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1nκ}]\displaystyle 2pq\left(s_{2}\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}n^{-\kappa}\}}\right]\right.
+exp[n12κ{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1nκ}])\displaystyle\left.+\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}n^{-\kappa}\}}\right]\right)
\displaystyle\leq 2pq(s2exp[n12κ{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1}]\displaystyle 2pq\left(s_{2}\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right]\right.
+exp[n12κ{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1}])\displaystyle\left.+\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right]\right)

Let

d0\displaystyle d_{0} =\displaystyle= min[{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1},\displaystyle\min\left[\frac{\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}},\right.
{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1}],\displaystyle\left.\frac{\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right],

Then, for every l\in\mathcal{M}_{2}, we have

P(𝑪^lMopγ2,n)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right) \displaystyle\leq 2pq(s2+1)exp(d0n12κ),\displaystyle 2pq(s_{2}+1)\exp(-d_{0}n^{1-2\kappa}), (5)

Let us consider P\left(|\widehat{\beta}_{l}^{M}|\leq\gamma_{1,n}\right). Recall that \dot{\beta}_{l}^{M}=\dot{\beta}_{l}^{*M}+\langle\dot{\bm{C}}_{l}^{M},\dot{\bm{B}}\rangle, \dot{\beta}_{l}^{*M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{1}}x_{il^{\prime}}\dot{\beta}_{l^{\prime}}^{*},x_{il}) and \dot{\beta}_{l}^{M}=n^{-1}\sum_{i=1}^{n}E(x_{il}Y_{i}). For every 1\leq l\leq s_{n}, we have

β^lMβ˙lM=n1i=1n{xilYiE(xilYi)}.\widehat{\beta}_{l}^{M}-\dot{\beta}_{l}^{M}=n^{-1}\sum_{i=1}^{n}\left\{x_{il}Y_{i}-E(x_{il}Y_{i})\right\}.

It follows from Assumptions (A0), (A1), (A2) and Lemma 2 that for any t>0t>0, we have

P(|β^lMβ˙lM|t)=P[|i=1n{xilYiE(xilYi)}|nt]\displaystyle P\left(\left|\widehat{\beta}_{l}^{M}-\dot{\beta}_{l}^{M}\right|\geq t\right)=P\left[\left|\sum_{i=1}^{n}\left\{x_{il}Y_{i}-E(x_{il}Y_{i})\right\}\right|\geq nt\right]
=\displaystyle= P[|l1i=1n{xilxilE(xilxil)}β˙lM+l2i=1n{xilxilE(xilxil)}𝑪˙lM,𝑩˙+\displaystyle P\left[\left|\sum_{l^{\prime}\in\mathcal{M}_{1}}\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\dot{\beta}_{l^{\prime}}^{*M}+\sum_{l^{\prime}\in\mathcal{M}_{2}}\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\langle\dot{\bm{C}}_{l^{\prime}}^{M},\dot{\bm{B}}\rangle+\right.\right.
+\displaystyle+ i=1n{xil𝑬i,𝑩˙E(xil)E(𝑬i,𝑩˙)}+i=1nxilϵi|nt]\displaystyle\left.\left.\sum_{i=1}^{n}\left\{x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle-E(x_{il})E\left(\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right)\right\}+\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq nt\right]
\displaystyle\leq P[l1|i=1n{xilxilE(xilxil)}|b+l2|i=1n{xilxilE(xilxil)}|b\displaystyle P\left[\sum_{l^{\prime}\in\mathcal{M}_{1}}\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|b+\sum_{l^{\prime}\in\mathcal{M}_{2}}\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|b\right.
+|i=1nxil𝑬i,𝑩˙|+|i=1nxilϵi|nt]\displaystyle+\left.\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|+\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq nt\right]
\displaystyle\leq l1P[|i=1n{xilxilE(xilxil)}|ntb(s1+s2+2)]\displaystyle\sum_{l^{\prime}\in\mathcal{M}_{1}}P\left[\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|\geq\frac{nt}{b(s_{1}+s_{2}+2)}\right]
+l2P[|i=1n{xilxilE(xilxil)}|ntb(s1+s2+2)]\displaystyle+\sum_{l^{\prime}\in\mathcal{M}_{2}}P\left[\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|\geq\frac{nt}{b(s_{1}+s_{2}+2)}\right]
+P(|i=1nxil𝑬i,𝑩˙|nts1+s2+2)+P(|i=1nxilϵi|nts1+s2+2)\displaystyle+P\left(\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|\geq\frac{nt}{s_{1}+s_{2}+2}\right)+P\left(\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq\frac{nt}{s_{1}+s_{2}+2}\right)
\displaystyle\leq 2(s1+s2)exp[nt2(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1t}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{nt^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}t\right\}}\right]
+4exp[nt2(s1+s2+2)22{2D22D3+D21(s1+s2+2)1t}].\displaystyle+4\exp\left[-\frac{nt^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}t\right\}}\right].

For l\in\mathcal{M}_{1}\cap\mathcal{M}^{c}_{2}, we have \langle\dot{\bm{C}}^{M}_{l},\dot{\bm{B}}\rangle=0. Under Assumption (A1) and the previous derivation, we have

P(|β^Ml|γ1,n)=P(|β^Ml|γ1,n)P(|β˙lM||β^Ml|D1nκγ1,n)\displaystyle P\left(|\widehat{\beta}^{M}_{l}|\leq\gamma_{1,n}\right)=P\left(-|\widehat{\beta}^{M}_{l}|\geq-\gamma_{1,n}\right)\leq P\left(|\dot{\beta}_{l}^{*M}|-|\widehat{\beta}^{M}_{l}|\geq D_{1}n^{-\kappa}-\gamma_{1,n}\right)
=\displaystyle= P(|β˙lM||𝑪˙Ml,𝑩˙||β^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{*M}|-|\langle\dot{\bm{C}}^{M}_{l},\dot{\bm{B}}\rangle|-|\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq P(|β˙lM||β^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{M}|-|\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq P(|β˙lMβ^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{M}-\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq 2(s1+s2)exp[n12κ(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1nκ}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}n^{-\kappa}\right\}}\right]
\displaystyle+4\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}n^{-\kappa}\right\}}\right]
\displaystyle\leq 2(s1+s2)exp[n12κ(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right]
+4exp[n12κ(1α)2D12(s1+s2+2)22{2D22D3+D21(s1+s2+2)1(1α)D1}]\displaystyle+4\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right]

Let

d1\displaystyle d_{1} =\displaystyle= min[(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1},\displaystyle\min\left[\frac{(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}},\right.
(1α)2D12(s1+s2+2)22{2D22D3+D21(s1+s2+2)1(1α)D1}],\displaystyle\left.\frac{(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right],

Then, for each l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}, we have

\displaystyle P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)\leq 2(s_{1}+s_{2}+2)\exp\left(-d_{1}n^{1-2\kappa}\right). (6)

In sum, by Assumption (A5) together with (5) and (6), we have

P{1(^1^2)}\displaystyle P\left\{\mathcal{M}_{1}\subset\left(\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\right)\right\}
\displaystyle\geq 1l12P(𝑪^lMopγ2,n)l12cP(|βlM^|γ1,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)
\displaystyle\geq 12pqs2(s2+1)exp(d0n12κ)2s1(s1+s2+2)exp(d1n12κ)\displaystyle 1-2pqs_{2}(s_{2}+1)\exp(-d_{0}n^{1-2\kappa})-2s_{1}(s_{1}+s_{2}+2)\exp\left(-d_{1}n^{1-2\kappa}\right)
\displaystyle\geq 1d0pqexp(d1n12κ)1, as n,\displaystyle 1-d_{0}^{\prime}pq\exp\left(-d_{1}^{\prime}n^{1-2\kappa}\right)\to 1,\quad\textrm{ as }n\to\infty,

for some positive constants d0d_{0}^{\prime} and d1d_{1}^{\prime}. Therefore, P(1^)1P(\mathcal{M}_{1}\subset\widehat{\mathcal{M}})\to 1 as nn\to\infty. ∎
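For a computational view of what Theorem 1 guarantees, the following is a minimal sketch (not from the paper; all data shapes, seeds and constants are hypothetical) of the marginal screening rule it analyzes, based on the sample quantities \widehat{\beta}_{l}^{M}=n^{-1}\sum_{i}x_{il}Y_{i} and \widehat{\bm{C}}_{l}^{M}=n^{-1}\sum_{i}x_{il}\bm{Z}_{i} used in the proof, with thresholds \gamma_{1,n}=\alpha D_{1}n^{-\kappa} and \gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}.

```python
# Illustrative sketch only: marginal screening with hypothetical data.
# Keep covariate l if |beta_hat_l| >= gamma1 or ||C_hat_l||_op >= gamma2.
import numpy as np

def screen(X, Y, Z, gamma1, gamma2):
    """X: (n, s_n) covariates; Y: (n,) response; Z: (n, p, q) image matrices."""
    n = X.shape[0]
    beta_hat = X.T @ Y / n                        # marginal effects on Y
    C_hat = np.einsum("il,ijk->ljk", X, Z) / n    # marginal effects on Z
    op_norms = np.array([np.linalg.norm(C, ord=2) for C in C_hat])
    return np.flatnonzero((np.abs(beta_hat) >= gamma1) | (op_norms >= gamma2))

rng = np.random.default_rng(3)
n, s, p, q = 200, 1000, 8, 8                      # hypothetical sizes
X = rng.normal(size=(n, s))
X /= X.std(axis=0)                                # approximate column normalization
Y = X[:, 0] + rng.normal(size=n)                  # covariate 0 acts on Y directly
Z = rng.normal(size=(n, p, q)) + 0.8 * X[:, 1][:, None, None]  # covariate 1 acts on Z

alpha, D1, kappa = 0.5, 1.0, 0.25                 # hypothetical constants
gamma1 = alpha * D1 * n**(-kappa)
gamma2 = alpha * D1 * np.sqrt(p * q) * n**(-kappa)
keep = screen(X, Y, Z, gamma1, gamma2)
print(len(keep), 0 in keep, 1 in keep)            # sure screening keeps 0 and 1
```

As in the theory, the rule over-selects (it keeps spurious covariates along with the signals); Theorem 2 is what controls the size of the retained set.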

Proof of Theorem 2.

The proof consists of two steps. In step 1, we will show that P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0})\to 1, where \mathcal{M}^{0}=\mathcal{M}^{0}_{1}\cup\mathcal{M}^{0}_{2}, \mathcal{M}^{0}_{1}=\left\{1\leq l\leq s_{n}:|\dot{\beta}_{l}^{M}|\geq\gamma_{1,n}/2\right\} and \mathcal{M}^{0}_{2}=\left\{1\leq l\leq s_{n}:\left\|\dot{\bm{C}}^{M}_{l}\right\|_{op}\geq\gamma_{2,n}/2\right\}. Recall that \widehat{\mathcal{M}}=\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}=\left\{1\leq l\leq s_{n}:|\widehat{\beta}^{M}_{l}|\geq\gamma_{1,n}\right\}\cup\left\{1\leq l\leq s_{n}:\left\|\widehat{\bm{C}}_{l}^{M}\right\|_{op}\geq\gamma_{2,n}\right\}. Taking \gamma_{1,n}=\alpha D_{1}n^{-\kappa} and \gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa} with 0<\alpha<1, we have

P(^0102)\displaystyle P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0}_{1}\cup\mathcal{M}^{0}_{2})
\displaystyle\geq P[1lsn{|βlM^β˙lM|γ1,n2}1lsn{||𝑪^lM𝑪˙lM||opγ2,n2}]\displaystyle P\left[\cap_{1\leq l\leq s_{n}}\left\{|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\leq\frac{\gamma_{1,n}}{2}\right\}\cap_{1\leq l\leq s_{n}}\left\{||\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}||_{op}\leq\frac{\gamma_{2,n}}{2}\right\}\right]
=\displaystyle= 1P[1lsn{|βlM^β˙lM|γ1,n2}1lsn{||𝑪^lM𝑪˙lM||opγ2,n2}]\displaystyle 1-P\left[\cup_{1\leq l\leq s_{n}}\{|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\geq\frac{\gamma_{1,n}}{2}\}\cup_{1\leq l\leq s_{n}}\{||\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}||_{op}\geq\frac{\gamma_{2,n}}{2}\}\right]
\displaystyle\geq 11lsn{P(|βlM^β˙lM|γ1,n2)+P(||𝑪^lM𝑪˙lM||opγ2,n2)}\displaystyle 1-\sum_{1\leq l\leq s_{n}}\left\{P\left(|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\geq\frac{\gamma_{1,n}}{2}\right)+P\left(||\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}||_{op}\geq\frac{\gamma_{2,n}}{2}\right)\right\}
\displaystyle\geq 11lsnP(|βlM^β˙lM|γ1,n2)1lsnP(||𝑪^lM𝑪˙lM||Fγ2,n2)\displaystyle 1-\sum_{1\leq l\leq s_{n}}P\left(|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\geq\frac{\gamma_{1,n}}{2}\right)-\sum_{1\leq l\leq s_{n}}P\left(||\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}||_{F}\geq\frac{\gamma_{2,n}}{2}\right)
\displaystyle\geq 11lsnP(|βlM^β˙lM|αD1nκ/2)\displaystyle 1-\sum_{1\leq l\leq s_{n}}P\left(|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\geq\alpha D_{1}n^{-\kappa}/2\right)
1lsnj,kP(|C^l,jkMC˙l,jkM|αD1nκ/2)\displaystyle-\sum_{1\leq l\leq s_{n}}\sum_{j,k}P\left(|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}|\geq\alpha D_{1}n^{-\kappa}/2\right)
\displaystyle\geq 12sn{(s1+s2)exp[α2D12(4b)2(s1+s2+2)2n12κ2{2D22eD2σxD3+D21(4b)1(s1+s2+2)1αD1nκ}]\displaystyle 1-2s_{n}\left\{(s_{1}+s_{2})\exp\left[-\frac{\alpha^{2}D_{1}^{2}(4b)^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(4b)^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}n^{-\kappa}\right\}}\right]\right.
+2exp[α2D1222(s1+s2+2)2n12κ2{2D22D3+D21(s1+s2+2)1αD1nκ}]}\displaystyle\left.+2\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}n^{-\kappa}\right\}}\right]\right\}
\displaystyle-2s_{n}pq\left\{s_{2}\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}b^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}n^{-\kappa}\}}\right]\right.
+exp[α2D1222(s2+1)2n12κ2{2D22D3+D21(s2+1)121αD1nκ}]}\displaystyle\left.+\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}n^{-\kappa}\}}\right]\right\}
\displaystyle\geq 12sn{(s1+s2)exp[α2D12(4b)2(s1+s2+2)2n12κ2{2D22eD2σxD3+D21(4b)1(s1+s2+2)1αD1}]\displaystyle 1-2s_{n}\left\{(s_{1}+s_{2})\exp\left[-\frac{\alpha^{2}D_{1}^{2}(4b)^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(4b)^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}\right\}}\right]\right.
+2exp[α2D1222(s1+s2+2)2n12κ2{2D22D3+D21(s1+s2+2)1αD1}]}\displaystyle\left.+2\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}\right\}}\right]\right\}
\displaystyle-2s_{n}pq\left\{s_{2}\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}b^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}\}}\right]\right.
+exp[α2D1222(s2+1)2n12κ2{2D22D3+D21(s2+1)121αD1}]}\displaystyle\left.+\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}\}}\right]\right\}
=\displaystyle= 12exp(D4nξ){(s1+s2)exp[α2D12(4b)2(s1+s2+2)2n12κ2{2D22eD2σxD3+D21(4b)1(s1+s2+2)1αD1}]\displaystyle 1-2\exp(D_{4}n^{\xi})\left\{(s_{1}+s_{2})\exp\left[-\frac{\alpha^{2}D_{1}^{2}(4b)^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(4b)^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}\right\}}\right]\right.
+2exp[α2D1222(s1+s2+2)2n12κ2{2D22D3+D21(s1+s2+2)1αD1}]}\displaystyle\left.+2\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}\right\}}\right]\right\}
\displaystyle-2pq\exp(D_{4}n^{\xi})\left\{s_{2}\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}b^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}\}}\right]\right.
+exp[α2D1222(s2+1)2n12κ2{2D22D3+D21(s2+1)121αD1}]}\displaystyle\left.+\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}\}}\right]\right\}

By Assumptions (A3) and (A5), we have

P(^0102)1d2exp(d3n12κ),P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0}_{1}\cup\mathcal{M}^{0}_{2})\geq 1-d_{2}\exp(-d_{3}n^{1-2\kappa}),

for some positive constants d_{2} and d_{3}. Therefore, we have P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0})\to 1 as n\to\infty.

In step 2, we will show that |0|=O(n2κ+τ)|\mathcal{M}^{0}|=O(n^{2\kappa+\tau}). As |0|=|0102||01|+|02|\left|\mathcal{M}^{0}\right|=\left|\mathcal{M}^{0}_{1}\cup\mathcal{M}^{0}_{2}\right|\leq\left|\mathcal{M}^{0}_{1}\right|+\left|\mathcal{M}^{0}_{2}\right|, we only need to show that both |01|\left|\mathcal{M}^{0}_{1}\right| and |02|\left|\mathcal{M}^{0}_{2}\right| are O(n2κ+τ)O(n^{2\kappa+\tau}).

Define \mathcal{M}^{1}_{1}=\left\{1\leq l\leq s_{n}:\left|\dot{\beta}_{l}^{M}\right|^{2}\geq\gamma_{1,n}^{2}/4\right\}; then \mathcal{M}^{0}_{1}\subset\mathcal{M}^{1}_{1}. By the definition of \mathcal{M}^{1}_{1}, we have

|11|γ1,n2/4\displaystyle\left|\mathcal{M}^{1}_{1}\right|\gamma_{1,n}^{2}/4 \displaystyle\leq l=1sn|β˙lM|2=l=1sn(E[xilYi])2=E[𝒙iYi]2.\displaystyle\sum_{l=1}^{s_{n}}\left|\dot{\beta}_{l}^{M}\right|^{2}=\sum_{l=1}^{s_{n}}\left(E\left[x_{il}Y_{i}\right]\right)^{2}=\left\|E\left[\bm{x}_{i}*Y_{i}\right]\right\|^{2}.

Define \dot{\bm{\beta}}^{*}=(\dot{\beta}_{1}^{*},\ldots,\dot{\beta}_{s_{n}}^{*})^{T} and \dot{\bm{c}}=(\langle\dot{\bm{C}}_{1},\dot{\bm{B}}\rangle,\ldots,\langle\dot{\bm{C}}_{s_{n}},\dot{\bm{B}}\rangle)^{T}; then we can write

Yi\displaystyle Y_{i} =\displaystyle= 𝒙iT(𝜷˙+𝒄˙)+𝑬i,𝑩˙+ϵi.\displaystyle\bm{x}_{i}^{\mathrm{\scriptscriptstyle T}}\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right)+\langle\bm{E}_{i},\dot{\bm{B}}\rangle+\epsilon_{i}.

Multiplying both sides by \bm{x}_{i} and taking expectations yields E\left[\bm{x}_{i}*Y_{i}\right]=\Sigma_{x}\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right). Therefore, we have

\displaystyle\left|\mathcal{M}^{1}_{1}\right|\gamma_{1,n}^{2}/4\leq\left\|\Sigma_{x}\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right)\right\|^{2}\leq\lambda_{max}(\Sigma_{x})\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right)^{T}\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right)\leq 4b^{2}\lambda_{max}(\Sigma_{x}).

By Assumption (A4), we have \left|\mathcal{M}^{1}_{1}\right|\leq 4b^{2}\lambda_{max}(\Sigma_{x})\gamma_{1,n}^{-2}=O(n^{2\kappa+\tau}). This implies that \left|\mathcal{M}^{0}_{1}\right|\leq\left|\mathcal{M}^{1}_{1}\right|=O(n^{2\kappa+\tau}).
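The counting step above is a pigeonhole argument: a vector w can have at most 4\|w\|^{2}/\gamma^{2} coordinates of magnitude at least \gamma/2. A short numerical sketch (not from the paper; the vector below is arbitrary):

```python
# Illustrative sketch only: the pigeonhole bound behind |M_1^1| -- the number
# of coordinates with |w_l| >= gamma/2 is at most 4 * ||w||^2 / gamma^2.
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=500) * rng.binomial(1, 0.05, size=500)  # sparse signal vector
gamma = 0.5
count = np.sum(np.abs(w) >= gamma / 2)
print(count, "<=", 4 * np.sum(w**2) / gamma**2)
```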

Define \mathcal{M}^{1}_{2}=\left\{1\leq l\leq s_{n}:\left\|\dot{\bm{C}}_{l}^{M}\right\|_{F}^{2}\geq\gamma_{2,n}^{2}/4\right\}. As \left\|\dot{\bm{C}}_{l}^{M}\right\|_{op}\leq\left\|\dot{\bm{C}}_{l}^{M}\right\|_{F}, we have \mathcal{M}^{0}_{2}\subset\mathcal{M}^{1}_{2}. By the definition of \mathcal{M}^{1}_{2}, we have

|12|γ2,n2/4\displaystyle\left|\mathcal{M}^{1}_{2}\right|\gamma_{2,n}^{2}/4 \displaystyle\leq l=1sn𝑪˙lM2F\displaystyle\sum_{l=1}^{s_{n}}\left\|\dot{\bm{C}}_{l}^{M}\right\|^{2}_{F}
=\displaystyle= j,kl=1sn(C˙l,jkM)2=j,kl=1sn(E[xilZi,jk])2=j,kE[𝒙iZi,jk]2.\displaystyle\sum_{j,k}\sum_{l=1}^{s_{n}}(\dot{C}_{l,jk}^{M})^{2}=\sum_{j,k}\sum_{l=1}^{s_{n}}\left(E\left[x_{il}Z_{i,jk}\right]\right)^{2}=\sum_{j,k}\left\|E\left[\bm{x}_{i}*Z_{i,jk}\right]\right\|^{2}.

Define \dot{\bm{C}}_{jk}=(\dot{C}_{1,jk},\ldots,\dot{C}_{s_{n},jk})^{T}; then we can write Z_{i,jk}=\bm{x}_{i}^{T}\dot{\bm{C}}_{jk}+E_{i,jk}. Multiplying both sides by \bm{x}_{i} and taking expectations yields E\left[\bm{x}_{i}*Z_{i,jk}\right]=\Sigma_{x}\dot{\bm{C}}_{jk}. Therefore,

\displaystyle\left|\mathcal{M}^{1}_{2}\right|\gamma_{2,n}^{2}/4\leq\sum_{j,k}\left\|\Sigma_{x}\dot{\bm{C}}_{jk}\right\|^{2}\leq\lambda_{max}(\Sigma_{x})\sum_{j,k}\dot{\bm{C}}_{jk}^{T}\dot{\bm{C}}_{jk}\leq pqb^{2}\lambda_{max}(\Sigma_{x}).

By Assumption (A4), we have |12|4pqb2λmax(Σx)γ2,n2=O(n2κ+τ)\left|\mathcal{M}^{1}_{2}\right|\leq 4pqb^{2}\lambda_{max}(\Sigma_{x})\gamma_{2,n}^{-2}=O(n^{2\kappa+\tau}).

Combining the results from the two steps above leads to P\{|\widehat{\mathcal{M}}|=O(n^{2\kappa+\tau})\}\geq P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0})\to 1. ∎

Proof of Theorem 3.

We can write

P{1(^1^2^1block,^2block)}\displaystyle P\left\{\mathcal{M}_{1}\subset\left(\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\cup\widehat{\mathcal{M}}_{1}^{block,*}\cup\widehat{\mathcal{M}}_{2}^{block}\right)\right\}
\displaystyle= P\left\{\cap_{l\in\mathcal{M}_{1}}\left\{l\in\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\cup\widehat{\mathcal{M}}_{1}^{block,*}\cup\widehat{\mathcal{M}}_{2}^{block}\right\}\right\}
\displaystyle= 1-P\left\{\cup_{l\in\mathcal{M}_{1}}\left\{l\not\in\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\cup\widehat{\mathcal{M}}_{1}^{block,*}\cup\widehat{\mathcal{M}}_{2}^{block}\right\}\right\}
\displaystyle\geq 1-\sum_{l\in\mathcal{M}_{1}}P\left(l\not\in\widehat{\mathcal{M}}_{1}^{*},\,l\not\in\widehat{\mathcal{M}}_{2},\,l\not\in\widehat{\mathcal{M}}_{1}^{block,*},\,l\not\in\widehat{\mathcal{M}}_{2}^{block}\right)
=\displaystyle= 1l1P(|βlM^|γ1,n,𝑪^lMopγ2,n,βlblock,M^γ3,n,Clblock,M^γ4,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n},\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n},\widehat{{\beta}_{l}^{block,M}}\leq\gamma_{3,n},\widehat{{C}_{l}^{block,M}}\leq\gamma_{4,n}\right)
\displaystyle\geq 1l12P(𝑪^lMopγ2,n,Clblock,M^γ4,n)l12cP(|βlM^|γ1,n,βlblock,M^γ3,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n},\widehat{{C}_{l}^{block,M}}\leq\gamma_{4,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n},\widehat{{\beta}_{l}^{block,M}}\leq\gamma_{3,n}\right)
\displaystyle\geq 1l12P(𝑪^lMopγ2,n)l12P(Clblock,M^γ4,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\widehat{{C}_{l}^{block,M}}\leq\gamma_{4,n}\right)
l12cP(|βlM^|γ1,n)l12cP(βlblock,M^γ3,n)\displaystyle-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(\widehat{{\beta}_{l}^{block,M}}\leq\gamma_{3,n}\right)

Firstly, recall that 𝑪˙lM=cov(l2xil𝑪˙l,xil)\dot{\bm{C}}_{l}^{M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}*\dot{\bm{C}}_{l^{\prime}},x_{il}) i.e. C˙l,jkM=cov(l2xilC˙l,jk,\dot{C}_{l,jk}^{M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}\dot{C}_{l^{\prime},jk}, xil)=n1i=1nE(xilZi,jk)x_{il})=n^{-1}\sum_{i=1}^{n}E(x_{il}Z_{i,jk}). For every 1jp1\leq j\leq p, 1kq1\leq k\leq q and 1lsn1\leq l\leq s_{n}, we have

C^l,jkMC˙l,jkM=n1i=1n[xilZi,jkE(xilZi,jk)].\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}=n^{-1}\sum_{i=1}^{n}\left[x_{il}Z_{i,jk}-E(x_{il}Z_{i,jk})\right].

It follows from Assumptions (A0), (A1), (A2) and Lemma 2 that for any t>0t>0, one has

\displaystyle P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|\geq t\right)=P\left(\left|\sum_{i=1}^{n}\left[x_{il}Z_{i,jk}-E(x_{il}Z_{i,jk})\right]\right|\geq nt\right)
=\displaystyle= P(|l2i=1n[xilxilE(xilxil)]C˙l,jk+i=1nxilEi,jk|nt)\displaystyle P\left(\left|\sum_{l^{\prime}\in\mathcal{M}_{2}}\sum_{i=1}^{n}\left[x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\right]\dot{C}_{l^{\prime},jk}+\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq nt\right)
\displaystyle\leq l2P(|i=1n[xilxilE(xilxil)]|ntb(s2+1))+P(|i=1nxilEi,jk|nts2+1)\displaystyle\sum_{l^{\prime}\in\mathcal{M}_{2}}P\left(\left|\sum_{i=1}^{n}\left[x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\right]\right|\geq\frac{nt}{b(s_{2}+1)}\right)+P\left(\left|\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq\frac{nt}{s_{2}+1}\right)
\displaystyle\leq 2s_{2}\exp\left[-\frac{nt^{2}b^{-2}(s_{2}+1)^{-2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}t\}}\right]
+2exp[nt2(s2+1)22{2D22D3+D21(s2+1)1t}].\displaystyle+2\exp\left[-\frac{nt^{2}(s_{2}+1)^{-2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}t\}}\right].

Therefore, for every l2l\in\mathcal{M}_{2}, we have

P(𝑪^lMopγ2,n)P(𝑪^lM𝑪˙lMopD1(pq)1/2nκγ2,n)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)\leq P\left(\|\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}\|_{op}\geq D_{1}(pq)^{1/2}n^{-\kappa}-\gamma_{2,n}\right)
\displaystyle\leq P(𝑪^lM𝑪˙lMF(pq)1/2(1α)D1nκ)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}\|_{F}\geq(pq)^{1/2}(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle= P\left(\sum_{j,k}\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|^{2}\geq pq\left\{(1-\alpha)D_{1}n^{-\kappa}\right\}^{2}\right)
\displaystyle\leq j,kP(|C^l,jkMC˙l,jkM|2{(1α)D1nκ}2)\displaystyle\sum_{j,k}P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|^{2}\geq\left\{(1-\alpha)D_{1}n^{-\kappa}\right\}^{2}\right)
\displaystyle\leq j,kP(|C^l,jkMC˙l,jkM|(1α)D1nκ)\displaystyle\sum_{j,k}P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq 2pq(s2exp[n12κ{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1nκ}]\displaystyle 2pq\left(s_{2}\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}n^{-\kappa}\}}\right]\right.
+exp[n12κ{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1nκ}])\displaystyle\left.+\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}n^{-\kappa}\}}\right]\right)
\displaystyle\leq 2pq(s2exp[n12κ{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1}]\displaystyle 2pq\left(s_{2}\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right]\right.
+exp[n12κ{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1}])\displaystyle\left.+\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right]\right)

Let

d0\displaystyle d_{0} =\displaystyle= min[{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1},\displaystyle\min\left[\frac{\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}},\right.
{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1}],\displaystyle\left.\frac{\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right],

Then, for every l\in\mathcal{M}_{2}, we have

P(Clblock,M^γ4,n)P(𝑪^lMopγ2,n)\displaystyle P\left(\widehat{{C}_{l}^{block,M}}\leq\gamma_{4,n}\right)\leq P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right) \displaystyle\leq 2pq(s2+1)exp(d0n12κ),\displaystyle 2pq(s_{2}+1)\exp(-d_{0}n^{1-2\kappa}), (7)

Let us consider P\left(|\widehat{\beta}_{l}^{M}|\leq\gamma_{1,n}\right). Recall that \dot{\beta}_{l}^{M}=\dot{\beta}_{l}^{*M}+\langle\dot{\bm{C}}_{l}^{M},\dot{\bm{B}}\rangle, \dot{\beta}_{l}^{*M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{1}}x_{il^{\prime}}\dot{\beta}_{l^{\prime}}^{*},x_{il}) and \dot{\beta}_{l}^{M}=n^{-1}\sum_{i=1}^{n}E(x_{il}Y_{i}). For every 1\leq l\leq s_{n}, we have

β^lMβ˙lM=n1i=1n{xilYiE(xilYi)}.\widehat{\beta}_{l}^{M}-\dot{\beta}_{l}^{M}=n^{-1}\sum_{i=1}^{n}\left\{x_{il}Y_{i}-E(x_{il}Y_{i})\right\}.

It follows from Assumptions (A0), (A1), (A2) and Lemma 2 that for any t>0t>0, we have

P(|β^lMβ˙lM|t)=P[|i=1n{xilYiE(xilYi)}|nt]\displaystyle P\left(\left|\widehat{\beta}_{l}^{M}-\dot{\beta}_{l}^{M}\right|\geq t\right)=P\left[\left|\sum_{i=1}^{n}\left\{x_{il}Y_{i}-E(x_{il}Y_{i})\right\}\right|\geq nt\right]
=\displaystyle= P[|l1i=1n{xilxilE(xilxil)}β˙lM+l2i=1n{xilxilE(xilxil)}𝑪˙lM,𝑩˙+\displaystyle P\left[\left|\sum_{l^{\prime}\in\mathcal{M}_{1}}\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\dot{\beta}_{l^{\prime}}^{*M}+\sum_{l^{\prime}\in\mathcal{M}_{2}}\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\langle\dot{\bm{C}}_{l^{\prime}}^{M},\dot{\bm{B}}\rangle+\right.\right.
+\displaystyle+ i=1n{xil𝑬i,𝑩˙E(xil)E(𝑬i,𝑩˙)}+i=1nxilϵi|nt]\displaystyle\left.\left.\sum_{i=1}^{n}\left\{x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle-E(x_{il})E\left(\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right)\right\}+\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq nt\right]
\displaystyle\leq P[l1|i=1n{xilxilE(xilxil)}|b+l2|i=1n{xilxilE(xilxil)}|b\displaystyle P\left[\sum_{l^{\prime}\in\mathcal{M}_{1}}\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|b+\sum_{l^{\prime}\in\mathcal{M}_{2}}\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|b\right.
+|i=1nxil𝑬i,𝑩˙|+|i=1nxilϵi|nt]\displaystyle+\left.\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|+\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq nt\right]
\displaystyle\leq l1P[|i=1n{xilxilE(xilxil)}|ntb(s1+s2+2)]\displaystyle\sum_{l^{\prime}\in\mathcal{M}_{1}}P\left[\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|\geq\frac{nt}{b(s_{1}+s_{2}+2)}\right]
+l2P[|i=1n{xilxilE(xilxil)}|ntb(s1+s2+2)]\displaystyle+\sum_{l^{\prime}\in\mathcal{M}_{2}}P\left[\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|\geq\frac{nt}{b(s_{1}+s_{2}+2)}\right]
+P(|i=1nxil𝑬i,𝑩˙|nts1+s2+2)+P(|i=1nxilϵi|nts1+s2+2)\displaystyle+P\left(\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|\geq\frac{nt}{s_{1}+s_{2}+2}\right)+P\left(\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq\frac{nt}{s_{1}+s_{2}+2}\right)
\displaystyle\leq 2(s1+s2)exp[nt2(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1t}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{nt^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}t\right\}}\right]
+4exp[nt2(s1+s2+2)22{2D22D3+D21(s1+s2+2)1t}].\displaystyle+4\exp\left[-\frac{nt^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}t\right\}}\right].

For l\in\mathcal{M}_{1}\cap\mathcal{M}^{c}_{2}, we have \langle\dot{\bm{C}}^{M}_{l},\dot{\bm{B}}\rangle=0. Under Assumption (A1) and the previous derivation, we have

P(|β^Ml|γ1,n)=P(|β^Ml|γ1,n)P(|β˙lM||β^Ml|D1nκγ1,n)\displaystyle P\left(|\widehat{\beta}^{M}_{l}|\leq\gamma_{1,n}\right)=P\left(-|\widehat{\beta}^{M}_{l}|\geq-\gamma_{1,n}\right)\leq P\left(|\dot{\beta}_{l}^{*M}|-|\widehat{\beta}^{M}_{l}|\geq D_{1}n^{-\kappa}-\gamma_{1,n}\right)
=\displaystyle= P(|β˙lM||𝑪˙Ml,𝑩˙||β^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{*M}|-|\langle\dot{\bm{C}}^{M}_{l},\dot{\bm{B}}\rangle|-|\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq P(|β˙lM||β^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{M}|-|\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq P(|β˙lMβ^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{M}-\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq 2(s1+s2)exp[n12κ(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1nκ}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}n^{-\kappa}\right\}}\right]
\displaystyle+4\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}n^{-\kappa}\right\}}\right]
\displaystyle\leq 2(s1+s2)exp[n12κ(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right]
+4exp[n12κ(1α)2D12(s1+s2+2)22{2D22D3+D21(s1+s2+2)1(1α)D1}]\displaystyle+4\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right]

Let

d1\displaystyle d_{1} =\displaystyle= min[(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1},\displaystyle\min\left[\frac{(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}},\right.
(1α)2D12(s1+s2+2)22{2D22D3+D21(s1+s2+2)1(1α)D1}],\displaystyle\left.\frac{(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right],

Then, for each l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}, we have

P(βlblock,M^γ3,n)P(|βlM^|γ1,n)2(s1+s2+2)exp(d1n12κ).\displaystyle P\left(\widehat{{\beta}_{l}^{block,M}}\leq\gamma_{3,n}\right)\leq P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)\leq 2(s_{1}+s_{2}+2)\exp\left(-d_{1}n^{1-2\kappa}\right). (8)

In sum, by Assumption (A5) together with (7) and (8), we have

P{1(^1^2^1block,^2block)}\displaystyle P\left\{\mathcal{M}_{1}\subset\left(\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\cup\widehat{\mathcal{M}}_{1}^{block,*}\cup\widehat{\mathcal{M}}_{2}^{block}\right)\right\}
\displaystyle\geq 12l12P(𝑪^lMopγ2,n)2l12cP(|βlM^|γ1,n)\displaystyle 1-2\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)-2\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)
\displaystyle\geq 14pqs2(s2+1)exp(d0n12κ)4s1(s1+s2+2)exp(d1n12κ)\displaystyle 1-4pqs_{2}(s_{2}+1)\exp(-d_{0}n^{1-2\kappa})-4s_{1}(s_{1}+s_{2}+2)\exp\left(-d_{1}n^{1-2\kappa}\right)
\displaystyle\geq 1d0pqexp(d1n12κ)1, as n,\displaystyle 1-d_{0}^{\prime}pq\exp\left(-d_{1}^{\prime}n^{1-2\kappa}\right)\to 1,\quad\textrm{ as }n\to\infty,

for some positive constants d0d_{0}^{\prime} and d1d_{1}^{\prime}. Therefore, P(1^)1P(\mathcal{M}_{1}\subset\widehat{\mathcal{M}})\to 1 as nn\to\infty. ∎

Proof of Theorem 4.

Denote \bm{\theta}^{\widehat{\mathcal{M}}}=(\bm{\beta}^{\widehat{\mathcal{M}},T},\mathrm{vec}(\bm{B})^{T})^{T}, where \bm{\beta}^{\widehat{\mathcal{M}}}=\{\beta_{l}\}_{l\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|} and \bm{B}\in\mathbb{R}^{p\times q}. In addition, for a given pair \bm{\lambda}=(\lambda_{1},\lambda_{2}), we let \dot{\bm{\theta}}^{\widehat{\mathcal{M}}}=(\dot{\bm{\beta}}^{\widehat{\mathcal{M}},T},\mathrm{vec}(\dot{\bm{B}})^{T})^{T} be the true value of \bm{\theta}^{\widehat{\mathcal{M}}}, with \dot{\bm{\beta}}^{\widehat{\mathcal{M}}} and \dot{\bm{B}} the true values of \bm{\beta}^{\widehat{\mathcal{M}}} and \bm{B} respectively, and \widehat{\bm{\theta}}_{\bm{\lambda}}^{\widehat{\mathcal{M}}}=(\widehat{\bm{\beta}}^{\widehat{\mathcal{M}},T},\mathrm{vec}(\widehat{\bm{B}})^{T})^{T} be the proposed estimator of \bm{\theta}^{\widehat{\mathcal{M}}}, with \widehat{\bm{\beta}}^{\widehat{\mathcal{M}}} and \widehat{\bm{B}} the estimators of \bm{\beta}^{\widehat{\mathcal{M}}} and \bm{B} respectively. Furthermore, we let r=\mathrm{rank}(\dot{\bm{B}}) be the true rank of the matrix \dot{\bm{B}}\in\mathbb{R}^{p\times q}, and consider the class of matrices \Theta of rank at most r\leq\min\{p,q\}. For any given matrix \Theta, we let \mathrm{row}(\Theta)\subset\mathbb{R}^{q} and \mathrm{col}(\Theta)\subset\mathbb{R}^{p} denote its row and column space, respectively. Let U\subset\mathbb{R}^{p} and V\subset\mathbb{R}^{q} be a given pair of r-dimensional subspaces.

For a given \bm{\theta}^{\widehat{\mathcal{M}}} and pair (U,V), we define the subspaces \Omega_{1}(\widehat{\mathcal{M}}), \Omega_{2}(U,V), \overline{\Omega}_{1}(\widehat{\mathcal{M}}), \overline{\Omega}_{2}(U,V), \overline{\Omega}^{\perp}_{1}(\widehat{\mathcal{M}}) and \overline{\Omega}^{\perp}_{2}(U,V) as follows:

Ω1(^)\displaystyle{\Omega}_{1}(\widehat{\mathcal{M}}) =\displaystyle= Ω¯1(^):={𝜷^={βl}l^|^||βl=0 for all l1},\displaystyle\overline{\Omega}_{1}(\widehat{\mathcal{M}}):=\left\{\bm{\beta}^{\widehat{\mathcal{M}}}=\{\beta_{l}\}_{l\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|}|\beta_{l}=0\textrm{ for all }l\not\in\mathcal{M}_{1}\right\},
Ω¯1(^)\displaystyle\overline{\Omega}_{1}^{\perp}(\widehat{\mathcal{M}}) :=\displaystyle:= {𝜷^={βl}l^|^||βl=0 for all l1},\displaystyle\left\{\bm{\beta}^{\widehat{\mathcal{M}}}=\{\beta_{l}\}_{l\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|}|\beta_{l}=0\textrm{ for all }l\in\mathcal{M}_{1}\right\},
Ω2(U,V)\displaystyle{\Omega}_{2}(U,V) :=\displaystyle:= {Θp×q|row(Θ)V, and col(Θ)U},\displaystyle\left\{\Theta\in\mathbb{R}^{p\times q}|\mathrm{row}(\Theta)\subset V,\textrm{ and }\mathrm{col}(\Theta)\subset U\right\},
Ω¯2(U,V)\displaystyle\overline{\Omega}_{2}(U,V) :=\displaystyle:= {Θp×q|row(Θ)V, or col(Θ)U},\displaystyle\left\{\Theta\in\mathbb{R}^{p\times q}|\mathrm{row}(\Theta)\subset V,\textrm{ or }\mathrm{col}(\Theta)\subset U\right\},
Ω¯2(U,V)\displaystyle\overline{\Omega}_{2}^{\perp}(U,V) :=\displaystyle:= {Θp×q|row(Θ)V, and col(Θ)U},\displaystyle\left\{\Theta\in\mathbb{R}^{p\times q}|\mathrm{row}(\Theta)\subset V^{\perp},\textrm{ and }\mathrm{col}(\Theta)\subset U^{\perp}\right\},

where \overline{\Omega}_{1}^{\perp}(\widehat{\mathcal{M}}) and \overline{\Omega}_{2}^{\perp}(U,V) are the orthogonal complements of \overline{\Omega}_{1}(\widehat{\mathcal{M}}) and \overline{\Omega}_{2}(U,V), respectively. For simplicity, we will use \Omega_{1}, \overline{\Omega}_{1} and \overline{\Omega}_{1}^{\perp} to denote \Omega_{1}(\widehat{\mathcal{M}}), \overline{\Omega}_{1}(\widehat{\mathcal{M}}) and \overline{\Omega}_{1}^{\perp}(\widehat{\mathcal{M}}), respectively, and use \Omega_{2}, \overline{\Omega}_{2} and \overline{\Omega}_{2}^{\perp} to denote \Omega_{2}(U,V), \overline{\Omega}_{2}(U,V) and \overline{\Omega}_{2}^{\perp}(U,V). It is easy to see that P_{1} and P_{2} are decomposable with respect to the subspace pairs (\Omega_{1},\overline{\Omega}_{1}^{\perp}) and (\Omega_{2},\overline{\Omega}_{2}^{\perp}), respectively. Therefore the regularizer terms P_{1} and P_{2} satisfy condition (G1) of Negahban et al. (2012).
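A minimal numerical sketch of this decomposability (not from the paper; the vectors and matrices below are arbitrary): the \ell_{1} penalty splits exactly over complementary supports, and the nuclear norm splits exactly over matrices whose row spaces and column spaces are mutually orthogonal.

```python
# Illustrative sketch only: decomposability of P_1 (l1 norm) and P_2 (nuclear
# norm) over a subspace pair, as in condition (G1) of Negahban et al. (2012).
import numpy as np

# l1 norm: u supported on M_1, v on its complement.
u = np.array([1.5, -2.0, 0.0, 0.0])
v = np.array([0.0, 0.0, 0.7, -0.3])
print(np.abs(u + v).sum(), "=", np.abs(u).sum() + np.abs(v).sum())

# Nuclear norm: A and B with orthogonal row spaces and orthogonal column spaces.
A = np.outer([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
B = np.outer([0.0, 1.0, -1.0], [0.0, 0.0, 2.0])
nuc = lambda M: np.linalg.norm(M, ord="nuc")
print(nuc(A + B), "=", nuc(A) + nuc(B))   # equal up to floating-point error
```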

Here we define the function F:|^|+pqF:\mathbb{R}^{|\widehat{\mathcal{M}}|+pq}\to\mathbb{R} by

F(𝚫)\displaystyle F(\bm{\Delta}) :=\displaystyle:= l(𝜽˙^+𝚫)l(𝜽˙^)+λ1{P1(𝜷˙^+𝚫1)P1(𝜷˙^)}\displaystyle l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}+\bm{\Delta})-l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}})+\lambda_{1}\left\{P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}+\bm{\Delta}_{1})-P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\}
+λ2{P2(𝑩˙+𝚫2)P2(𝑩˙)},\displaystyle+\lambda_{2}\left\{P_{2}(\dot{\bm{B}}+\bm{\Delta}_{2})-P_{2}(\dot{\bm{B}})\right\},

where \bm{\Delta}=\{\bm{\Delta}_{1}^{T},\mathrm{vec}(\bm{\Delta}_{2})^{T}\}^{T}\in\mathbb{R}^{|\widehat{\mathcal{M}}|+pq} with \bm{\Delta}_{1}\in\mathbb{R}^{|\widehat{\mathcal{M}}|} and \bm{\Delta}_{2}\in\mathbb{R}^{p\times q}. Next, we will derive a lower bound for F(\bm{\Delta}).

Before we formally prove the result, we first introduce the concept of subspace compatibility constant. For a subspace Ω\Omega, the subspace compatibility constant with respect to the pair (P,)(P,\|\cdot\|) is given by

ψ(Ω):=supuΩ\{0}P(u)u.\psi(\Omega):=\sup_{u\in\Omega\backslash\{0\}}\frac{P(u)}{\|u\|}.

Therefore, we have

ψ1(Ω¯1)\displaystyle\psi_{1}(\overline{\Omega}_{1}) =\displaystyle= sup𝜷^Ω¯1\{0}P1(𝜷^)𝜷^=l^|βl|l^βl2l^|βl|2lM^12l^βl2|^|;\displaystyle\sup_{\bm{\beta}^{\widehat{\mathcal{M}}}\in\overline{\Omega}_{1}\backslash\{0\}}\frac{P_{1}(\bm{\beta}^{\widehat{\mathcal{M}}})}{\|\bm{\beta}^{\widehat{\mathcal{M}}}\|}=\frac{\sum_{l\in\widehat{\mathcal{M}}}|\beta_{l}|}{\sqrt{\sum_{l\in\widehat{\mathcal{M}}}{\beta_{l}^{2}}}}\leq\frac{\sqrt{\sum_{l\in\widehat{\mathcal{M}}}|\beta_{l}|^{2}\sum_{l\in\widehat{M}}1^{2}}}{\sqrt{\sum_{l\in\widehat{\mathcal{M}}}{\beta_{l}^{2}}}}\leq\sqrt{|\widehat{\mathcal{M}}|};
ψ2(Ω¯2)\displaystyle\psi_{2}(\overline{\Omega}_{2}) =\displaystyle= sup𝑩Ω¯2\{0}P2(𝑩)𝑩=𝑩𝑩Fr𝑩F𝑩F=r,\displaystyle\sup_{\bm{B}\in\overline{\Omega}_{2}\backslash\{0\}}\frac{P_{2}(\bm{B})}{\|\bm{B}\|}=\frac{\left\|\bm{B}\right\|_{*}}{\|\bm{B}\|_{F}}\leq\frac{\sqrt{r}\left\|\bm{B}\right\|_{F}}{\|\bm{B}\|_{F}}=\sqrt{r},

where the bound for \psi_{1}(\overline{\Omega}_{1}) follows from the Cauchy–Schwarz inequality. Furthermore, we have
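Both bounds are instances of the Cauchy–Schwarz inequality, applied to the coordinates of \bm{\beta}^{\widehat{\mathcal{M}}} and to the singular values of \bm{B}, respectively; a minimal numerical sketch (not from the paper; the sizes below are arbitrary):

```python
# Illustrative sketch only: psi_1 <= sqrt(|M_hat|) (l1 vs l2 norms) and
# psi_2 <= sqrt(r) (nuclear vs Frobenius norms for a rank-r matrix).
import numpy as np

rng = np.random.default_rng(5)

beta = rng.normal(size=7)                 # |M_hat| = 7 hypothetical coefficients
print(np.abs(beta).sum(), "<=", np.sqrt(beta.size) * np.linalg.norm(beta))

p, q, r = 12, 9, 3
B = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))   # rank-r matrix
print(np.linalg.norm(B, ord="nuc"), "<=",
      np.sqrt(r) * np.linalg.norm(B, ord="fro"))
```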

F(𝚫)\displaystyle F(\bm{\Delta}) =\displaystyle= l(𝜽˙^+𝚫)l(𝜽˙^)+λ1{P1(𝜷˙^+𝚫1)P1(𝜷˙^)}\displaystyle l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}+\bm{\Delta})-l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}})+\lambda_{1}\left\{P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}+\bm{\Delta}_{1})-P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\}
+λ2{P2(𝑩˙+𝚫2)P2(𝑩˙)}\displaystyle+\lambda_{2}\left\{P_{2}(\dot{\bm{B}}+\bm{\Delta}_{2})-P_{2}(\dot{\bm{B}})\right\}
\displaystyle\geq l(𝜽˙^),𝚫+ι𝚫2+λ1{P1(𝜷˙^+𝚫1)P1(𝜷˙^)}\displaystyle\langle\nabla l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}\rangle+\iota\left\|\bm{\Delta}\right\|^{2}+\lambda_{1}\left\{P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}+\bm{\Delta}_{1})-P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\}
+λ2{P2(𝑩˙+𝚫2)P2(𝑩˙)}\displaystyle+\lambda_{2}\left\{P_{2}(\dot{\bm{B}}+\bm{\Delta}_{2})-P_{2}(\dot{\bm{B}})\right\}
\displaystyle\geq l(𝜽˙^),𝚫+ι𝚫2+λ1[P1(𝚫1,Ω¯1)P1(𝚫1,Ω¯1)2P1{(𝜷˙^)Ω¯1}]\displaystyle\langle\nabla l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}\rangle+\iota\left\|\bm{\Delta}\right\|^{2}+\lambda_{1}\left[P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]
+λ2[P2(𝚫2,Ω¯2)P2(𝚫2,Ω¯2)2P2{(𝑩˙)Ω¯2}],\displaystyle+\lambda_{2}\left[P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right],

where the first inequality follows from Assumption (A6) and the second inequality follows from Lemma 3 in Negahban et al. (2012). By the Cauchy–Schwarz inequality applied to P_{k} and its dual P_{k}^{*}, k=1,2, where P_{k}^{*} is defined as the dual norm of P_{k} such that P_{k}^{*}(\bm{\theta})=\sup_{P_{k}(\bm{\eta})\leq 1}\langle\bm{\theta},\bm{\eta}\rangle, we have |\langle\nabla l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}\rangle|\leq|\langle\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}_{1}\rangle|+|\langle\nabla l(\dot{\bm{B}}),\bm{\Delta}_{2}\rangle|\leq P_{1}^{*}\left\{\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\}P_{1}\left(\bm{\Delta}_{1}\right)+P_{2}^{*}\left\{\nabla l(\dot{\bm{B}})\right\}P_{2}\left(\bm{\Delta}_{2}\right). If \lambda_{k}\geq 2P_{k}^{*}\{\nabla l(\cdot)\} holds, where P_{1}^{*}\{\nabla l(\cdot)\}=\left\|\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\|_{\infty} and P_{2}^{*}\{\nabla l(\cdot)\}=\left\|\nabla l(\dot{\bm{B}})\right\|_{op} (Negahban et al., 2012; Kong et al., 2020), one has |\langle\nabla l(\cdot),\bm{\Delta}_{k}\rangle|\leq\frac{1}{2}\lambda_{k}P_{k}(\bm{\Delta}_{k})\leq\frac{1}{2}\lambda_{k}\left\{P_{k}(\bm{\Delta}_{k,\overline{\Omega}_{k}^{\perp}})+P_{k}(\bm{\Delta}_{k,\overline{\Omega}_{k}})\right\}. Therefore, we have

\begin{align*}
F(\bm{\Delta}) &\geq \langle\nabla l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}\rangle+\iota\left\|\bm{\Delta}\right\|^{2}+\lambda_{1}\left[P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]\\
&\quad+\lambda_{2}\left[P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right]\\
&= \langle\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}_{1}\rangle+\langle\nabla l(\dot{\bm{B}}),\bm{\Delta}_{2}\rangle+\iota\left\|\bm{\Delta}\right\|^{2}\\
&\quad+\lambda_{1}\left[P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]\\
&\quad+\lambda_{2}\left[P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right]\\
&\geq -\frac{\lambda_{1}}{2}\left\{P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})+P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})\right\}-\frac{\lambda_{2}}{2}\left\{P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})+P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})\right\}+\iota\left\|\bm{\Delta}\right\|^{2}\\
&\quad+\lambda_{1}\left[P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]\\
&\quad+\lambda_{2}\left[P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right]\\
&= \iota\left\|\bm{\Delta}\right\|^{2}+\lambda_{1}\left[\frac{1}{2}P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-\frac{3}{2}P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]\\
&\quad+\lambda_{2}\left[\frac{1}{2}P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-\frac{3}{2}P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right].
\end{align*}

By the subspace compatibility, we have $P_{k}(\bm{\Delta}_{k,\overline{\Omega}_{k}})\leq\psi_{k}(\overline{\Omega}_{k})\|\bm{\Delta}_{k,\overline{\Omega}_{k}}\|$ for $k=1,2$. Substituting these bounds into the previous inequality, and noting that $P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}=P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}=0$, we obtain

\begin{align*}
F(\bm{\Delta}) &\geq \iota\left\|\bm{\Delta}\right\|^{2}-\sum_{k\in\{1,2\}}\frac{3\lambda_{k}}{2}\psi_{k}(\overline{\Omega}_{k})\|\bm{\Delta}_{k,\overline{\Omega}_{k}}\|\\
&\geq \iota\left\|\bm{\Delta}\right\|^{2}-\sum_{k\in\{1,2\}}\frac{3\lambda_{k}}{2}\psi_{k}(\overline{\Omega}_{k})\|\bm{\Delta}_{k}\|\\
&\geq \iota\left\|\bm{\Delta}\right\|^{2}-3\max_{k\in\{1,2\}}\{\lambda_{k}\psi_{k}(\overline{\Omega}_{k})\}\left\|\bm{\Delta}\right\|.
\end{align*}

The right-hand side is a quadratic function of $\|\bm{\Delta}\|$: writing $m=\max_{k\in\{1,2\}}\{\lambda_{k}\psi_{k}(\overline{\Omega}_{k})\}$, it equals $\|\bm{\Delta}\|(\iota\|\bm{\Delta}\|-3m)$, which is strictly positive whenever $\|\bm{\Delta}\|>3m/\iota$. Therefore, as long as $\|\bm{\Delta}\|^{2}>\frac{9}{\iota^{2}}\max^{2}_{k\in\{1,2\}}\{\lambda_{k}\psi_{k}(\overline{\Omega}_{k})\}$, one has $F(\bm{\Delta})>0$. Hence, by Lemma 4 in Negahban et al. (2012), we can establish that

\begin{align*}
\left\|\widehat{\bm{\theta}}^{\widehat{\mathcal{M}}}_{\bm{\lambda}}-\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}\right\|_{2}^{2}\leq C\max\left\{\lambda_{1}^{2}|\widehat{\mathcal{M}}|,\lambda_{2}^{2}r\right\}\iota^{-2},
\end{align*}

for some constant $C>0$. Since $|\widehat{\mathcal{M}}|\leq C_{1}n^{2\kappa+\tau}$ for some constant $C_{1}>0$ by the sure screening property, it follows that there exists some constant $C_{0}>0$ such that

\begin{align*}
\left\|\widehat{\bm{\theta}}_{\bm{\lambda}}-\dot{\bm{\theta}}\right\|_{2}^{2}\leq C_{0}\max\left\{C_{1}n^{2\kappa+\tau}\lambda_{1}^{2},\lambda_{2}^{2}r\right\}\iota^{-2}.
\end{align*}

Now we calculate $P_{1}^{*}(\cdot)$ and $P_{2}^{*}(\cdot)$. According to Kong et al. (2020) and Negahban et al. (2012), we have $P_{1}^{*}(\cdot)=\left\|\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\|_{\infty}$ and $P_{2}^{*}(\cdot)=\left\|\nabla l(\dot{\bm{B}})\right\|_{op}$, where $\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})=-n^{-1}\sum_{i=1}^{n}\epsilon_{i}X_{i}^{\widehat{\mathcal{M}}}$ and $\nabla l(\dot{\bm{B}})=-n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{Z}_{i}$.

We first calculate $\left\|\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\|_{\infty}$. Denote $\bm{\epsilon}=(\epsilon_{1},\ldots,\epsilon_{n})^{\mathrm{T}}$ and $\bm{X}^{\widehat{\mathcal{M}}}=(X_{1}^{\widehat{\mathcal{M}},\mathrm{T}},\ldots,X_{n}^{\widehat{\mathcal{M}},\mathrm{T}})^{\mathrm{T}}$, where $X_{i}^{\widehat{\mathcal{M}}}=(x_{ij})^{\mathrm{T}}_{j\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|}$ for $i=1,\ldots,n$, and let $\bm{x}_{l}^{\widehat{\mathcal{M}}}$, $l=1,\ldots,|\widehat{\mathcal{M}}|$, denote the $l$-th column of $\bm{X}^{\widehat{\mathcal{M}}}$. Since $\bm{X}$ is column-normalized, that is, $\|\bm{x}_{l}\|_{2}/\sqrt{n}=1$ for all $l=1,\ldots,s$, Assumption (A2) implies that there exists a constant $\sigma_{0}>0$ such that $P\left(\left|\langle\bm{x}_{l}^{\widehat{\mathcal{M}}},\bm{\epsilon}\rangle/n\right|\geq t\right)\leq 2\exp\left(-\frac{nt^{2}}{2\sigma_{0}^{2}}\right)$. Applying the union bound, we have
\begin{align*}
P\left(\left\|-n^{-1}\sum_{i=1}^{n}\epsilon_{i}X_{i}^{\widehat{\mathcal{M}}}\right\|_{\infty}\geq t\right)&=P\left(\left\|\bm{X}^{\widehat{\mathcal{M}},\mathrm{T}}\bm{\epsilon}/n\right\|_{\infty}\geq t\right)=P\left(\sup_{l\in\widehat{\mathcal{M}}}\left|\langle\bm{x}_{l}^{\widehat{\mathcal{M}}},\bm{\epsilon}\rangle/n\right|\geq t\right)\\
&\leq 2\exp\left(-\frac{nt^{2}}{2\sigma_{0}^{2}}+\log|\widehat{\mathcal{M}}|\right)\leq 2\exp\left(-\frac{nt^{2}}{2\sigma_{0}^{2}}+C_{1}(2\kappa+\tau)\log n\right).
\end{align*}
Choosing $t^{2}=2n^{-1}\sigma_{0}^{2}\{\log(\log n)+C_{1}(2\kappa+\tau)\log n\}$, we see that if $\lambda_{1}\geq 2\sigma_{0}[2n^{-1}\{\log(\log n)+C_{1}(2\kappa+\tau)\log n\}]^{1/2}$, then the condition $\lambda_{1}\geq 2P_{1}^{*}(\cdot)$ holds with probability at least $1-c_{1}(\log n)^{-1}$ for some positive constant $c_{1}>0$. A Monte Carlo illustration of this union bound is given below.
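The following sketch (illustrative only, not part of the proof; numpy assumed) compares the empirical tail probability of $\|\bm{X}^{\mathrm{T}}\bm{\epsilon}/n\|_{\infty}$ with the union bound $2|\widehat{\mathcal{M}}|\exp\{-nt^{2}/(2\sigma_{0}^{2})\}$ in the Gaussian case, where $\sigma_{0}$ may be taken as the noise standard deviation; $n$, the model size, $t$, and the number of replications are arbitrary choices.

```python
# Monte Carlo illustration of the union bound behind the choice of lambda_1.
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma, reps = 200, 50, 1.0, 2000   # illustrative values only

X = rng.normal(size=(n, m))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # enforce ||x_l||_2 = sqrt(n)

t = 4.0 * sigma / np.sqrt(n)
hits = 0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    hits += np.abs(X.T @ eps / n).max() >= t

union_bound = 2 * m * np.exp(-n * t**2 / (2 * sigma**2))
print(f"empirical tail {hits / reps:.4f} <= union bound {union_bound:.4f}")
```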

Secondly, we calculate $\left\|\nabla l(\dot{\bm{B}})\right\|_{op}$:

\begin{align*}
\left\|\nabla l(\dot{\bm{B}})\right\|_{op} &= \left\|-n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{Z}_{i}\right\|_{op}=\left\|n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\sum_{l\in\mathcal{M}_{2}}X_{il}\dot{\bm{C}}_{l}+\bm{E}_{i}\right)\right\|_{op}\\
&= \left\|n^{-1}\sum_{l\in\mathcal{M}_{2}}\langle\bm{x}_{l},\bm{\epsilon}\rangle\dot{\bm{C}}_{l}+n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\|_{op}\\
&\leq n^{-1}\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\left\|\dot{\bm{C}}_{l}\right\|_{op}+\left\|n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\|_{op}.
\end{align*}

Under Assumption (A1), $n^{-1}\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\left\|\dot{\bm{C}}_{l}\right\|_{op}\leq n^{-1}b\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|$. Therefore, we have

\begin{align*}
P\left(n^{-1}b\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\geq t\right) &= P\left(\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle/n\right|\geq t/b\right)\\
&\leq P\left[\cup_{l\in\mathcal{M}_{2}}\left\{\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle/n\right|\geq\frac{t}{bs_{2}}\right\}\right]\leq\sum_{l\in\mathcal{M}_{2}}P\left(\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle/n\right|\geq\frac{t}{bs_{2}}\right)\\
&\leq\sum_{l\in\mathcal{M}_{2}}P\left(\sup_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle/n\right|\geq\frac{t}{bs_{2}}\right)\leq 2\exp\left(-\frac{nt^{2}}{2b^{2}s_{2}^{2}\sigma_{0}^{2}}+2\log s_{2}\right).
\end{align*}

Therefore, choosing $t=bs_{2}\sigma_{0}[2n^{-1}\{3\log s_{2}+\log(\log n)\}]^{1/2}$, we have $n^{-1}b\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\leq t$ with probability at least $1-c_{2}/(s_{2}\log n)$ for some positive constant $c_{2}$, so that any $t_{1}\geq t$ also satisfies $t_{1}\geq n^{-1}\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\left\|\dot{\bm{C}}_{l}\right\|_{op}$ on that event; a numerical check of this bound is sketched below.
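A quick Monte Carlo check of this step follows (illustrative only; numpy assumed, and $n$, $s_{2}$, $b$, $\sigma_{0}$ and the replication count are arbitrary choices). The displayed bound is loose but valid at the stated choice of $t$.

```python
# Monte Carlo illustration of the tail bound for n^{-1} b sum_l |<x_l, eps>|.
import numpy as np

rng = np.random.default_rng(3)
n, s2, b, sigma, reps = 200, 10, 1.0, 1.0, 2000   # illustrative values only

X = rng.normal(size=(n, s2))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # enforce ||x_l||_2 = sqrt(n)

# the choice of t used in the proof
t = b * s2 * sigma * np.sqrt(2 * (3 * np.log(s2) + np.log(np.log(n))) / n)
hits = 0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    hits += (b / n) * np.abs(X.T @ eps).sum() >= t

bound = 2 * np.exp(-n * t**2 / (2 * b**2 * s2**2 * sigma**2) + 2 * np.log(s2))
print(f"empirical tail {hits / reps:.4f} <= bound {bound:.4f}")
```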

On the other hand, let $\bm{W}_{i}$ be a $p\times q$ random matrix with i.i.d. standard normal entries. By Assumption (A8) and Lemma 3, conditioning on the $\epsilon_{i}$'s we have

\begin{align*}
P\left(\left\|\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\|_{op}\geq t\right)=P\left(\sup_{\|\bm{A}\|_{*}\leq 1}\left\langle\bm{A},\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\rangle\geq t\right)\leq P\left(\sup_{\|\bm{A}\|_{*}\leq 1}\left\langle\bm{A},\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}\right\rangle\geq\frac{t}{C_{U}}\right),
\end{align*}

since $\Sigma_{e}\preceq C_{U}^{2}I_{pq\times pq}$.

Note that $\sup_{\|\bm{A}\|_{*}\leq 1}\langle\bm{A},\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}\rangle=\left\|\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}\right\|_{op}$, since the operator norm is the dual of the nuclear norm, and that, conditioning on $\bm{\epsilon}$, each entry of the matrix $\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}$ is i.i.d. $N(0,\|\bm{\epsilon}\|_{2}^{2})$. Since $\|\bm{\epsilon}\|_{2}^{2}/\sigma_{\epsilon}^{2}$ is a $\chi^{2}$ random variable with $n$ degrees of freedom, one has

\begin{align*}
P\left(\frac{\|\bm{\epsilon}\|_{2}^{2}}{n\sigma_{\epsilon}^{2}}\geq 4\right)\leq\exp(-n)
\end{align*}

using the tail bound for the $\chi^{2}$ distribution given in the corollary of Lemma 1 of Laurent and Massart (2000). Combining this with standard random matrix theory, we know that $\left\|n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}\right\|_{op}\leq 2n^{-1/2}\sigma_{\epsilon}(p^{1/2}+q^{1/2})$ with probability at least $1-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$, where $c_{3}$ and $c_{4}$ are positive constants. Combining this with the previous step, we have $n^{-1}\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\left\|\dot{\bm{C}}_{l}\right\|_{op}+\left\|n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\|_{op}\leq bs_{2}\sigma_{0}[2n^{-1}\{3\log s_{2}+\log(\log n)\}]^{1/2}+2n^{-1/2}\sigma_{\epsilon}(p^{1/2}+q^{1/2})$ with probability at least $1-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$. Therefore, the choice $\lambda_{2}\geq 2bs_{2}\sigma_{0}[2n^{-1}\{3\log s_{2}+\log(\log n)\}]^{1/2}+4n^{-1/2}\sigma_{\epsilon}(p^{1/2}+q^{1/2})$ guarantees $\lambda_{2}\geq 2P_{2}^{*}(\cdot)$ with probability at least $1-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$. The two probabilistic ingredients of this bound are illustrated numerically below.
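Both ingredients, the $\chi^{2}$ tail bound for $\|\bm{\epsilon}\|_{2}^{2}$ and the $(\sqrt{p}+\sqrt{q})/\sqrt{n}$ scaling of $\|n^{-1}\sum_{i}\epsilon_{i}\bm{W}_{i}\|_{op}$, can be checked by simulation. The sketch below is illustrative only (numpy assumed; $n$, $p$, $q$, $\sigma_{\epsilon}$ and the replication count are arbitrary choices).

```python
# Monte Carlo illustration of the two ingredients in the bound for lambda_2.
import numpy as np

rng = np.random.default_rng(4)
n, p, q, sigma_eps, reps = 100, 15, 20, 1.0, 500   # illustrative values only

chi_sq_hits, op_hits = 0, 0
for _ in range(reps):
    eps = rng.normal(scale=sigma_eps, size=n)
    # chi-square tail: P(||eps||_2^2 >= 4 n sigma_eps^2) should be tiny
    chi_sq_hits += (eps @ eps) >= 4 * n * sigma_eps**2
    # operator norm of n^{-1} sum_i eps_i W_i for i.i.d. standard normal W_i
    W = rng.normal(size=(n, p, q))
    M = np.tensordot(eps, W, axes=1) / n
    threshold = 2 * sigma_eps * (np.sqrt(p) + np.sqrt(q)) / np.sqrt(n)
    op_hits += np.linalg.norm(M, 2) >= threshold

print(f"chi-square exceedance frequency: {chi_sq_hits / reps:.4f}")
print(f"operator-norm exceedance frequency: {op_hits / reps:.4f}")
```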

As a result, the event that both $\lambda_{1}$ and $\lambda_{2}$ satisfy the above inequalities holds with probability at least $1-c_{1}/\log n-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$ for some positive constants $c_{1}$, $c_{2}$, $c_{3}$ and $c_{4}$.

In sum, there exist positive constants $c_{1},c_{2},c_{3},c_{4}$ such that, with probability at least $1-c_{1}/\log n-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}$, one has

\begin{align*}
\left\|\widehat{\bm{\theta}}_{\bm{\lambda}}-\dot{\bm{\theta}}\right\|_{2}^{2}\leq C_{0}\max\left\{C_{1}\lambda_{1}^{2}n^{2\kappa+\tau},\lambda_{2}^{2}r\right\}\iota^{-2},
\end{align*}

for some constants C0,C1>0C_{0},C_{1}>0. This completes the proof.

References

  • Aizenstein et al. (2008) Aizenstein, H. J., R. D. Nebes, J. A. Saxton, J. C. Price, C. A. Mathis, N. D. Tsopelas, S. K. Ziolko, J. A. James, B. E. Snitz, P. R. Houck, et al. (2008). Frequent amyloid deposition without significant cognitive impairment among the elderly. Archives of Neurology 65(11), 1509–1517.
  • Anderson (1955) Anderson, T. W. (1955). The integral of a symmetric unimodal function over a symmetric convex set and some probability inequalities. Proceedings of the American Mathematical Society 6(2), 170–176.
  • Apostolova et al. (2006) Apostolova, L. G., I. D. Dinov, R. A. Dutton, K. M. Hayashi, A. W. Toga, J. L. Cummings, and P. M. Thompson (2006). 3D comparison of hippocampal atrophy in amnestic mild cognitive impairment and Alzheimer’s disease. Brain 129(11), 2867–2873.
  • Apostolova et al. (2010) Apostolova, L. G., L. Mosconi, P. M. Thompson, A. E. Green, K. S. Hwang, A. Ramirez, R. Mistur, W. H. Tsui, and M. J. de Leon (2010). Subregional hippocampal atrophy predicts Alzheimer’s dementia in the cognitively normal. Neurobiology of Aging 31(7), 1077–1088.
  • Barrett et al. (2005) Barrett, J. C., B. Fry, J. Maller, and M. J. Daly (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21(2), 263–265.
  • Barut et al. (2016) Barut, E., J. Fan, and A. Verhasselt (2016). Conditional sure independence screening. Journal of the American Statistical Association 111(515), 1266–1277.
  • Beck and Teboulle (2009) Beck, A. and M. Teboulle (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202.
  • Bertram and Tanzi (2020) Bertram, L. and R. E. Tanzi (2020). Genomic mechanisms in Alzheimer’s disease. Brain Pathology 30(5), 966–977.
  • Bi et al. (2017) Bi, X., L. Yang, T. Li, B. Wang, H. Zhu, and H. Zhang (2017). Genome-wide mediation analysis of psychiatric and cognitive traits through imaging phenotypes. Human Brain Mapping 38(8), 4088–4097.
  • Blackstad et al. (1970) Blackstad, T., K. Brink, J. Hem, and B. June (1970). Distribution of hippocampal mossy fibers in the rat. An experimental study with silver impregnation methods. Journal of Comparative Neurology 138(4), 433–449.
  • Braak and Braak (1998) Braak, H. and E. Braak (1998). Evolution of neuronal changes in the course of Alzheimer’s disease. In Ageing and Dementia, pp.  127–140. Springer.
  • Brookhart et al. (2006) Brookhart, M. A., S. Schneeweiss, K. J. Rothman, R. J. Glynn, J. Avorn, and T. Stürmer (2006). Variable selection for propensity score models. American Journal of Epidemiology 163(12), 1149–1156.
  • Cai et al. (2010) Cai, J.-F., E. J. Candès, and Z. Shen (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4), 1956–1982.
  • Chung et al. (2014) Chung, S. J., M.-J. Kim, J. Kim, Y. J. Kim, S. You, J. Koh, S. Y. Kim, and J.-H. Lee (2014). Exome array study did not identify novel variants in Alzheimer's disease. Neurobiology of Aging 35(8), 1958.e13.
  • Consortium et al. (2012) The 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65.
  • Coon et al. (2007) Coon, K. D., A. J. Myers, D. W. Craig, J. A. Webster, J. V. Pearson, D. H. Lince, V. L. Zismann, T. G. Beach, D. Leung, L. Bryden, et al. (2007). A high-density whole-genome association study reveals that APOE is the major susceptibility gene for sporadic late-onset Alzheimer’s disease. The Journal of Clinical Psychiatry 68(4), 613–618.
  • De Leon et al. (1989) De Leon, M., A. George, L. Stylopoulos, G. Smith, and D. Miller (1989). Early marker for Alzheimer’s disease: the atrophic hippocampus. The Lancet 334(8664), 672–673.
  • Dehman et al. (2015) Dehman, A., C. Ambroise, and P. Neuvial (2015). Performance of a blockwise approach in variable selection using linkage disequilibrium information. BMC Bioinformatics 16(1), 1–14.
  • Du et al. (2018) Du, L., K. Liu, X. Yao, S. L. Risacher, J. Han, L. Guo, A. J. Saykin, and L. Shen (2018). Fast multi-task SCCA learning with feature selection for multi-modal brain imaging genetics. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp.  356–361. IEEE.
  • Dudek et al. (2016) Dudek, S. M., G. M. Alexander, and S. Farris (2016). Rediscovering area CA2: unique properties and functions. Nature Reviews Neuroscience 17(2), 89–102.
  • Fan and Lv (2008) Fan, J. and J. Lv (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911.
  • Ferrari et al. (2014) Ferrari, D. V., M. E. Avila, M. A. Medina, E. Pérez-Palma, B. I. Bustos, M. A. Alarcon, et al. (2014). Wnt/$\beta$-catenin signaling in Alzheimer's disease. CNS & Neurological Disorders-Drug Targets 13(5), 745–754.
  • Ferreira and Klein (2011) Ferreira, S. T. and W. L. Klein (2011). The A$\beta$ oligomer hypothesis for synapse failure and memory loss in Alzheimer's disease. Neurobiology of Learning and Memory 96(4), 529–543.
  • Fox et al. (1996) Fox, N., E. Warrington, P. Freeborough, P. Hartikainen, A. Kennedy, J. Stevens, and M. N. Rossor (1996). Presymptomatic hippocampal atrophy in Alzheimer’s disease: a longitudinal MRI study. Brain 119(6), 2001–2007.
  • Frozza et al. (2018) Frozza, R. L., M. V. Lourenco, and F. G. De Felice (2018). Challenges for Alzheimer’s disease therapy: insights from novel mechanisms beyond memory defects. Frontiers in Neuroscience 12, 37.
  • Gabriel et al. (2002) Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel, J. Higgins, M. DeFelice, A. Lochner, M. Faggart, et al. (2002). The structure of haplotype blocks in the human genome. Science 296(5576), 2225–2229.
  • Gao et al. (2016) Gao, L., Z. Cui, L. Shen, and H.-F. Ji (2016). Shared genetic etiology between type 2 diabetes and Alzheimer’s disease identified by bioinformatics analysis. Journal of Alzheimer’s Disease 50(1), 13–17.
  • Gaugler et al. (2019) Gaugler, J., B. James, T. Johnson, A. Marin, and J. Weuve (2019). 2019 Alzheimer’s disease facts and figures. Alzheimer’s & Dementia 15(3), 321–387.
  • Guerreiro and Bras (2015) Guerreiro, R. and J. Bras (2015). The age factor in Alzheimer’s disease. Genome Medicine 7(1), 106.
  • Guo et al. (2019) Guo, Y., W. Xu, J.-Q. Li, Y.-N. Ou, X.-N. Shen, Y.-Y. Huang, Q. Dong, L. Tan, and J.-T. Yu (2019). Genome-wide association study of hippocampal atrophy rate in non-demented elders. Aging (Albany NY) 11(22), 10468–10484.
  • Hampel et al. (2018) Hampel, H., M.-M. Mesulam, A. C. Cuello, M. R. Farlow, E. Giacobini, G. T. Grossberg, A. S. Khachaturian, A. Vergallo, E. Cavedo, P. J. Snyder, et al. (2018). The cholinergic system in the pathophysiology and treatment of Alzheimer’s disease. Brain 141(7), 1917–1933.
  • Hao and Zhang (2014) Hao, N. and H. H. Zhang (2014). Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association 109(507), 1285–1301.
  • Hao et al. (2017) Hao, X., C. Li, L. Du, X. Yao, J. Yan, S. L. Risacher, A. J. Saykin, L. Shen, and D. Zhang (2017). Mining outcome-relevant brain imaging genetic associations via three-way sparse canonical correlation analysis in Alzheimer’s disease. Scientific Reports 7(1), 1–12.
  • Hastie et al. (2009) Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data mining, Inference, and Prediction (Second ed.). Springer Series in Statistics. Springer, New York.
  • Hesse et al. (2001) Hesse, C., L. Rosengren, N. Andreasen, P. Davidsson, H. Vanderstichele, E. Vanmechelen, and K. Blennow (2001). Transient increase in total tau but not phospho-tau in human cerebrospinal fluid after acute stroke. Neuroscience Letters 297(3), 187–190.
  • Jack et al. (2000) Jack, C., R. C. Petersen, Y. Xu, P. O'Brien, G. Smith, R. Ivnik, B. F. Boeve, E. G. Tangalos, and E. Kokmen (2000). Rates of hippocampal atrophy correlate with change in clinical status in aging and AD. Neurology 55(4), 484–490.
  • Jack et al. (2003) Jack, C., M. Slomkowski, S. Gracon, T. Hoover, J. Felmlee, K. Stewart, Y. Xu, M. Shiung, P. O’Brien, R. Cha, et al. (2003). MRI as a biomarker of disease progression in a therapeutic trial of milameline for AD. Neurology 60(2), 253–260.
  • Jack Jr et al. (2013) Jack Jr, C. R., D. S. Knopman, W. J. Jagust, R. C. Petersen, M. W. Weiner, P. S. Aisen, L. M. Shaw, P. Vemuri, H. J. Wiste, S. D. Weigand, et al. (2013). Tracking pathophysiological processes in Alzheimer’s disease: an updated hypothetical model of dynamic biomarkers. The Lancet Neurology 12(2), 207–216.
  • Jack Jr et al. (2010) Jack Jr, C. R., D. S. Knopman, W. J. Jagust, L. M. Shaw, P. S. Aisen, M. W. Weiner, R. C. Petersen, and J. Q. Trojanowski (2010). Hypothetical model of dynamic biomarkers of the Alzheimer’s pathological cascade. The Lancet Neurology 9(1), 119–128.
  • Khan et al. (2015) Khan, W., C. Aguilar, S. J. Kiddle, O. Doyle, M. Thambisetty, S. Muehlboeck, M. Sattlecker, S. Newhouse, S. Lovestone, R. Dobson, et al. (2015). A subset of cerebrospinal fluid proteins from a multi-analyte panel associated with brain atrophy, disease classification and prediction in Alzheimer’s disease. PLoS One 10(8), e0134368.
  • Kim et al. (2009) Kim, J., J. M. Basak, and D. M. Holtzman (2009). The role of apolipoprotein E in Alzheimer’s disease. Neuron 63(3), 287–303.
  • Kong et al. (2020) Kong, D., B. An, J. Zhang, and H. Zhu (2020). L2RM: Low-rank linear regression models for high-dimensional matrix responses. Journal of the American Statistical Association 115(529), 403–424.
  • Laurent and Massart (2000) Laurent, B. and P. Massart (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 28(5), 1302–1338.
  • Li et al. (2007) Li, S., F. Shi, F. Pu, X. Li, T. Jiang, S. Xie, and Y. Wang (2007). Hippocampal shape analysis of Alzheimer disease based on machine learning methods. American Journal of Neuroradiology 28(7), 1339–1345.
  • Lin et al. (2015) Lin, W., R. Feng, and H. Li (2015). Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics. Journal of the American Statistical Association 110(509), 270–288.
  • Liu et al. (2013) Liu, E., M. Li, W. Wang, and Y. Li (2013). MaCH-Admix: Genotype imputation for admixed populations. Genetic Epidemiology 37, 25–37.
  • Maass et al. (2018) Maass, A., S. N. Lockhart, T. M. Harrison, R. K. Bell, T. Mellinger, K. Swinnerton, S. L. Baker, G. D. Rabinovici, and W. J. Jagust (2018). Entorhinal tau pathology, episodic memory decline, and neurodegeneration in aging. Journal of Neuroscience 38(3), 530–543.
  • Majdi et al. (2020) Majdi, A., S. Sadigh-Eteghad, S. R. Aghsan, F. Farajdokht, S. M. Vatandoust, A. Namvaran, and J. Mahmoudi (2020). Amyloid-$\beta$, tau, and the cholinergic system in Alzheimer's disease: Seeking direction in a tangle of clues. Reviews in the Neurosciences 31(4), 391–413.
  • Mohs et al. (1997) Mohs, R. C., D. Knopman, R. C. Petersen, S. H. Ferris, C. Ernesto, M. Grundman, M. Sano, L. Bieliauskas, D. Geldmacher, C. Clark, et al. (1997). Development of cognitive instruments for use in clinical trials of antidementia drugs: additions to the Alzheimer’s disease assessment scale that broaden its scope. Alzheimer Disease and Associated Disorders 11(suppl 2), S13–21.
  • Morishima-Kawashima and Ihara (2002) Morishima-Kawashima, M. and Y. Ihara (2002). Alzheimer's disease: $\beta$-Amyloid protein and tau. Journal of Neuroscience Research 70(3), 392–401.
  • Negahban et al. (2012) Negahban, S. N., P. Ravikumar, M. J. Wainwright, B. Yu, et al. (2012). A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statistical Science 27(4), 538–557.
  • Nesterov (1998) Nesterov, Y. (1998). Introductory Lectures on Convex Programming, Volume I: Basic Course. Lecture notes.
  • Nitsch et al. (1992) Nitsch, R. M., B. E. Slack, R. J. Wurtman, and J. H. Growdon (1992). Release of Alzheimer amyloid precursor derivatives stimulated by activation of muscarinic acetylcholine receptors. Science 258(5080), 304–307.
  • Noble et al. (2012) Noble, K. G., S. M. Grieve, M. S. Korgaonkar, L. E. Engelhardt, E. Y. Griffith, L. M. Williams, and A. M. Brickman (2012). Hippocampal volume varies with educational attainment across the life-span. Frontiers in Human Neuroscience 6, 307.
  • Paterson et al. (2014) Paterson, R., J. Bartlett, K. Blennow, N. Fox, L. Shaw, J. Trojanowski, H. Zetterberg, and J. Schott (2014). Cerebrospinal fluid markers including trefoil factor 3 are associated with neurodegeneration in amyloid-positive individuals. Translational Psychiatry 4(7), e419.
  • Pedraza et al. (2004) Pedraza, O., D. Bowers, and R. Gilmore (2004). Asymmetry of the hippocampus and amygdala in MRI volumetric measurements of normal adults. Journal of the International Neuropsychological Society 10(5), 664–678.
  • Purcell et al. (2007) Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M. A. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. De Bakker, M. J. Daly, et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3), 559–575.
  • Remnestål et al. (2021) Remnestål, J., S. Bergström, J. Olofsson, E. Sjöstedt, M. Uhlén, K. Blennow, H. Zetterberg, A. Zettergren, S. Kern, I. Skoog, et al. (2021). Association of CSF proteins with tau and amyloid $\beta$ levels in asymptomatic 70-year-olds. Alzheimer's Research & Therapy 13(1), 1–19.
  • Richardson et al. (2018) Richardson, T. S., J. M. Robins, and L. Wang (2018). Discussion of “Data-driven confounder selection via Markov and Bayesian networks” by Häggström. Biometrics 74(2), 403–406.
  • Rivasplata (2012) Rivasplata, O. (2012). Subgaussian random variables: an expository note. Internet publication, PDF.
  • Sargolzaei et al. (2015) Sargolzaei, S., A. Sargolzaei, M. Cabrerizo, G. Chen, M. Goryawala, S. Noei, Q. Zhou, R. Duara, W. Barker, and M. Adjouadi (2015). A practical guideline for intracranial volume estimation in patients with Alzheimer’s disease. BMC Bioinformatics 16(7), 1–10.
  • Schisterman et al. (2009) Schisterman, E. F., S. R. Cole, and R. W. Platt (2009). Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology (Cambridge, Mass.) 20(4), 488.
  • Selkoe and Hardy (2016) Selkoe, D. J. and J. Hardy (2016). The amyloid hypothesis of Alzheimer’s disease at 25 years. EMBO Molecular Medicine 8(6), 595–608.
  • Shi et al. (2013) Shi, J., P. M. Thompson, B. Gutman, and Y. Wang (2013, Sep). Surface fluid registration of conformal representation: application to detect disease burden and genetic influence on hippocampus. NeuroImage 78, 111–134.
  • Shortreed and Ertefaie (2017) Shortreed, S. M. and A. Ertefaie (2017). Outcome-adaptive lasso: variable selection for causal inference. Biometrics 73(4), 1111–1122.
  • Sims et al. (2020) Sims, R., M. Hill, and J. Williams (2020). The multiplex model of the genetics of Alzheimer’s disease. Nature Neuroscience 23(3), 311–322.
  • Takei et al. (2009) Takei, N., A. Miyashita, T. Tsukie, H. Arai, T. Asada, M. Imagawa, M. Shoji, S. Higuchi, K. Urakami, H. Kimura, et al. (2009). Genetic association study on in and around the APOE in late-onset Alzheimer’s disease in Japanese. Genomics 93(5), 441–448.
  • Tang et al. (2021) Tang, D., D. Kong, W. Pan, and L. Wang (2021). Ultra-high dimensional variable selection for doubly robust causal inference. Biometrics (just-accepted).
  • Thompson et al. (2004) Thompson, P. M., K. M. Hayashi, G. I. de Zubicaray, A. L. Janke, S. E. Rose, J. Semple, M. S. Hong, D. H. Herman, D. Gravano, D. M. Doddrell, et al. (2004). Mapping hippocampal and ventricular change in Alzheimer’s disease. NeuroImage 22(4), 1754–1766.
  • Van de Pol et al. (2006) Van de Pol, L., A. Hensel, F. Barkhof, H. Gertz, P. Scheltens, and W. Van Der Flier (2006). Hippocampal atrophy in Alzheimer’s disease: age matters. Neurology 66(2), 236–238.
  • van der Vaart and Wellner (2000) van der Vaart, A. W. and J. A. Wellner (2000). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
  • Vina and Lloret (2010) Vina, J. and A. Lloret (2010). Why women have more Alzheimer's disease than men: gender and mitochondrial toxicity of amyloid-$\beta$ peptide. Journal of Alzheimer's Disease 20(s2), S527–S533.
  • Wall and Pritchard (2003) Wall, J. D. and J. K. Pritchard (2003). Haplotype blocks and linkage disequilibrium in the human genome. Nature Reviews Genetics 4(8), 587–597.
  • Wang et al. (2011) Wang, Y., Y. Song, P. Rajagopalan, T. An, K. Liu, Y.-Y. Chou, B. Gutman, A. W. Toga, P. M. Thompson, ADNI, et al. (2011). Surface-based TBM boosts power to detect disease effects on the brain: an N = 804 ADNI study. NeuroImage 56(4), 1993–2010.
  • Watson (2019) Watson, S. (2019). Brain atrophy (cerebral atrophy). available at: www.healthline.com/health/brain-atrophy (Mar. 2019).
  • Weiner et al. (2013) Weiner, M. W., D. P. Veitch, P. S. Aisen, L. A. Beckett, N. J. Cairns, R. C. Green, D. Harvey, C. R. Jack, W. Jagust, E. Liu, et al. (2013). The Alzheimer’s Disease Neuroimaging Initiative: A review of papers published since its inception. Alzheimer’s & Dementia 9(5), e111–e194.
  • Zhang et al. (1990) Zhang, M., R. Katzman, D. Salmon, H. Jin, G. Cai, Z. Wang, G. Qu, I. Grant, E. Yu, P. Levy, et al. (1990). The prevalence of dementia and Alzheimer’s disease in Shanghai, China: impact of age, gender, and education. Annals of Neurology: Official Journal of the American Neurological Association and the Child Neurology Society 27(4), 428–437.
  • Zhao et al. (2019) Zhao, B., T. Luo, T. Li, Y. Li, J. Zhang, Y. Shan, X. Wang, L. Yang, F. Zhou, Z. Zhu, et al. (2019). Genome-wide association analysis of 19,629 individuals identifies variants influencing regional brain volumes and refines their genetic co-architecture with cognitive and mental health traits. Nature Genetics 51(11), 1637–1644.
  • Zhou et al. (2020) Zhou, F., H. Zhou, T. Li, and H. Zhu (2020). Analysis of secondary phenotypes in multi-group association studies. Biometrics 76(2), 606–618.
  • Zhou and Li (2014) Zhou, H. and L. Li (2014). Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(2), 463–483.
  • Zhou et al. (2018) Zhou, X., Y. Chen, K. Y. Mok, Q. Zhao, K. Chen, Y. Chen, et al. (2018). Identification of genetic risk factors in the Chinese population implicates a role of immune system in Alzheimer’s disease pathogenesis. Proceedings of the National Academy of Sciences 115(8), 1697–1706.