
Mapping the Genetic-Imaging-Clinical Pathway with Applications to Alzheimer’s Disease

Dengdeng Yu
Department of Mathematics, University of Texas at Arlington
Linbo Wang
Department of Statistical Sciences, University of Toronto
Dehan Kong
Department of Statistical Sciences, University of Toronto
Hongtu Zhu
Department of Biostatistics, University of North Carolina, Chapel Hill
for the Alzheimer’s Disease Neuroimaging Initiative*

*Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Abstract

Alzheimer’s disease is a progressive form of dementia that results in problems with memory, thinking, and behavior. It often starts with abnormal aggregation and deposition of $\beta$ amyloid and tau, followed by neuronal damage such as atrophy of the hippocampi, leading to Alzheimer’s disease (AD). The aim of this paper is to map the genetic-imaging-clinical pathway for AD in order to delineate the genetically regulated brain changes that drive disease progression, based on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. We develop a novel two-step approach to delineate the association between high-dimensional 2D hippocampal surface exposures and the Alzheimer’s Disease Assessment Scale (ADAS) cognitive score, while taking into account the ultra-high dimensional clinical and genetic covariates at baseline. Analysis results suggest that the radial distance of each pixel of both hippocampi is negatively associated with the severity of behavioral deficits conditional on observed clinical and genetic covariates. These associations are stronger in the Cornu Ammonis region 1 (CA1) and subiculum subregions than in the Cornu Ammonis region 2 (CA2) and Cornu Ammonis region 3 (CA3) subregions. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

Keywords: 2D surface, behavioral deficits, confounders, hippocampus, variable selection.

1 Introduction

Alzheimer’s disease (AD) is an irreversible brain disorder that slowly destroys memory and thinking skills. According to World Alzheimer Reports (Gaugler et al., 2019), there are around 55 million people worldwide living with Alzheimer’s disease and related dementia. The total global cost of Alzheimer’s disease and related dementia was estimated to be a trillion US dollars, equivalent to 1.1% of global gross domestic product. Alzheimer’s patients often suffer from behavioral deficits, including memory loss and difficulty with thinking, reasoning, and decision making.

In the current model of AD pathogenesis, it is well established that deposition of amyloid plaques is an early event that, in conjunction with tau pathology, causes neuronal damage. Scientists have identified risk genes that may cause the abnormal aggregation and deposition of the amyloid plaques (e.g. Morishima-Kawashima and Ihara, 2002). The neuronal damage typically starts from the hippocampus and results in the first clinical manifestations of the disease in the form of episodic memory deficits (Weiner et al., 2013). Specifically, Jack Jr et al. (2010) presented a hypothetical model for biomarker dynamics in AD pathogenesis, which has been empirically supported by many works in the literature. The model begins with the abnormal deposition of $\beta$ amyloid (A$\beta$) fibrils, as evidenced by a corresponding drop in the levels of soluble A$\beta$-42 in cerebrospinal fluid (CSF) (Aizenstein et al., 2008). After that, neuronal damage begins to occur, as evidenced by increased levels of CSF tau protein (Hesse et al., 2001). Numerous studies have investigated how A$\beta$ and tau impact the hippocampus (e.g. Ferreira and Klein, 2011), known to be fundamentally involved in acquisition, consolidation, and recollection of new episodic memories (Frozza et al., 2018). In particular, as neuronal degeneration progresses, brain atrophy, which starts with hippocampal atrophy (Fox et al., 1996), becomes detectable by magnetic resonance imaging (MRI). Studies from recent years have also found other important CSF proteins that may be related to hippocampal atrophy. For instance, low levels of chromogranin A (CgA) and trefoil factor 3 (TFF3), and high levels of cystatin C (CysC), have been associated with hippocampal atrophy (Khan et al., 2015; Paterson et al., 2014). Indeed, the impact of protein concentrations on behavior can also act through atrophy of other brain regions; for example, entorhinal tau pathology may contribute to episodic memory decline (Maass et al., 2018). As sufficient brain atrophy accumulates, it results in cognitive symptoms and impairment. This process of AD pathogenesis is summarized by the flow chart in Figure 1. Note that it is still debatable how A$\beta$ and tau interact with each other, as mentioned by Jack Jr et al. (2013) and Majdi et al. (2020); however, it is evident that A$\beta$ may still hit a biomarker detection threshold earlier than tau (Jack Jr et al., 2013). In addition, as noted by Hampel et al. (2018), it is likely that highly complex interactions exist among A$\beta$, tau, and the cholinergic system. For instance, associations have been found between CSF biomarkers of amyloid and tau pathology in AD (Remnestål et al., 2021). It has also been found that other factors, such as dysregulation and dysfunction of the Wnt signaling pathway, may contribute to A$\beta$ and tau pathologies (Ferrari et al., 2014). In addition, the M1 and M3 subtypes of muscarinic receptors increase amyloid precursor protein production via the induction of the phospholipase C/protein kinase C pathway and increase BACE expression in AD brains (Nitsch et al., 1992).

Figure 1: A hypothetical model of AD pathogenesis based on Selkoe and Hardy (2016). The double arrows represent the possible interactions among A$\beta$, tau, and the cholinergic system. The red arrow denotes the conditional association we are interested in estimating.

The aim of this paper is to map the genetic-imaging-clinical (GIC) pathway for AD, which is the most important part of the hypothetical model of AD pathogenesis in Figure 1. Histological studies have shown that the hippocampus is particularly vulnerable to Alzheimer’s disease pathology and has already been considerably damaged at the first occurrence of clinical symptoms (Braak and Braak, 1998). Therefore, the hippocampus has become a major focus in Alzheimer’s studies (De Leon et al., 1989). Some neuroscientists even conjecture that the association between hippocampal atrophy and behavioral deficits may be causal, because the former destroys the connections that help neurons communicate and results in a loss of function (Watson, 2019). We are interested in delineating the genetically regulated hippocampal shape that drives AD-related behavioral deficits and disease progression.

To map the GIC pathway, we extract clinical, imaging, and genetic variables from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study as follows. First, we use the Alzheimer’s Disease Assessment Scale (ADAS) cognitive score to quantify behavioral deficits, for which a higher score indicates more severe behavioral deficits. Second, we characterize the exposure of interest, hippocampal shape, by the hippocampal morphometry surface measure, summarized as two $100\times 150$ matrices corresponding to the left/right hippocampi. Each element of the matrices is a continuous-valued variable, representing the radial distance from the corresponding coordinate on the hippocampal surface to the medial core of the hippocampus. Compared with the conventional scalar measure of hippocampal shape (Jack et al., 2003), recent studies show that the additional information contained in the hippocampal morphometry surface measure is valuable for Alzheimer’s diagnosis (Thompson et al., 2004). For example, Li et al. (2007) showed that surface measures of the hippocampus provide more subtle indexes than volume differences in discriminating between patients with Alzheimer’s and healthy control subjects. In our case, with the 2D matrix radial distance measure, one may investigate how local shapes of hippocampal subfields are associated with behavioral deficits. Third, the ADNI study measures ultra-high dimensional genetic covariates and other demographic covariates at baseline. There are more than 6 million genetic variants per subject.

The special data structure of the ADNI data application presents new challenges for statistically mapping the GIC pathway. First, unlike conventional statistical analyses that deal with a scalar exposure, our exposure of interest is represented by high-dimensional 2D hippocampal imaging measures. Second, the dimension of the baseline covariates, which are also potential confounders, is much larger than the sample size. Recently there have been many developments in confounder selection, most of them in the causal inference literature. Studies show that including variables associated only with the exposure and not directly with the outcome except through the exposure (known as instrumental variables) may result in a loss of efficiency in the causal effect estimate (e.g. Schisterman et al., 2009), while including variables related only to the outcome and not the exposure (known as precision variables) may provide efficiency gains (e.g. Brookhart et al., 2006); see Shortreed and Ertefaie (2017), Richardson et al. (2018), Tang et al. (2021), and references therein for an overview.

When a large number of covariates are available, the primary difficulty in mapping the GIC pathway is how to include all the confounders and precision variables, while excluding all the instrumental variables and irrelevant variables (those related to neither outcome nor exposure). We develop a novel two-step approach to estimate the conditional association between the high-dimensional 2D hippocampal surface exposure and the Alzheimer’s behavioral score, while taking into account the ultra-high dimensional baseline covariates. The first step is a fast screening procedure based on both the outcome and exposure models to rule out most of the irrelevant variables. The use of both models in screening is crucial for both computational efficiency and selection accuracy, as we will show in detail in Sections 3.3 and 3.4. The second step is a penalized regression procedure for the outcome generating model to further exclude instrumental and irrelevant variables, and simultaneously estimate the conditional association. Our simulations and ADNI data application demonstrate the effectiveness of the proposed procedure.

Our analysis represents a novel inferential target compared to recent developments in imaging genetics mediation analysis (Bi et al., 2017). Although we consider a similar set of variables, and a similar structure among these variables as illustrated later in Figure 2, our goal is to estimate the conditional association of hippocampal shape with behavioral deficits. In contrast, in mediation analysis, researchers are often interested in the effects of genetic factors on behavioral deficits, and how those effects are mediated through the hippocampus. Direct application of methods developed for imaging genetics mediation analysis to our problem may select genetic factors that are confounders affecting both hippocampal shape and behavioral deficits. In comparison, we aim to include precision variables in the adjustment set, as they may improve efficiency.

The rest of the article is organized as follows. Section 2 includes a detailed data and problem description. We introduce our models and a two-step variable selection procedure in Section 3. We analyze the ADNI data and estimate the association between hippocampal shape and behavioral deficits conditional on observed clinical and genetic covariates in Section 4. Simulations are conducted in Section 5 to evaluate the finite-sample performance of the proposed method. We finish with a discussion in Section 6. The theoretical properties of our procedure are included in Section 15 in the supplementary material.

2 Data and problem description

Understanding how human brains work and how they connect to human behavior is a central goal in medical studies. In this paper, we are interested in studying whether and how hippocampal shape is associated with behavioral deficits in Alzheimer’s studies. We consider the clinical, genetic, imaging, and behavioral measures in the ADNI dataset. The outcome of interest is the Alzheimer’s Disease Assessment Scale cognitive score observed at the 24th month after the baseline measurements. The Alzheimer’s Disease Assessment Scale cognitive 13-item score (ADAS-13) (Mohs et al., 1997) includes 13 items: word recall, naming objects and fingers, following commands, constructional praxis, ideational praxis, orientation, word recognition, remembering test directions, spoken language, comprehension, word-finding difficulty, delayed word recall, and a number cancellation or maze task. A higher ADAS score indicates more severe behavioral deficits.

The exposure of interest is the baseline 2D surface data obtained from the left/right hippocampi. The hippocampal surface data were preprocessed from the raw MRI data, with detailed preprocessing steps included in Section 9.1 of the supplementary material. After preprocessing, we obtained left and right hippocampal shape representations as two $100\times 150$ matrices. The imaging measurement at each pixel is an absolute metric, representing the radial distance from the pixel to the medial core of the hippocampus, measured in millimeters.

In the ADNI data, there are millions of observed covariates that one may need to adjust for, including the whole genome sequencing data from all of the 22 autosomes. We have included detailed genetic preprocessing techniques in Section 9.2 of the supplementary material. After preprocessing, 6,087,205 bi-allelic markers (including SNPs and indels) of 756 subjects were retained in the data analysis.

We excluded those subjects with missing hippocampal shape representations, baseline intracranial volume (ICV) information, or ADAS-13 score observed at Month 24, after which there are 566 subjects left. Our aim is to estimate the association between the hippocampal surface exposure and the ADAS-13 score conditional on clinical measures including age, gender, length of education, ICV, diagnosis status, and 6,087,205 bi-allelic markers.

3 Methodology

3.1 Basic set-up

Suppose we observe independent and identically distributed samples $\{L_i=(X_i,\bm{Z}_i,Y_i),\,1\leq i\leq n\}$ generated from $L$, where $L=(X,\bm{Z},Y)$ has support $\mathcal{L}=(\mathcal{X}\times\mathcal{Z}\times\mathcal{Y})$. Here $\bm{Z}\in\mathcal{Z}\subseteq\mathbb{R}^{p\times q}$ is a 2D-image continuous exposure, $Y\in\mathcal{Y}$ is a continuous outcome of interest, and $X\in\mathcal{X}\subseteq\mathbb{R}^{s}$ denotes a vector of ultra-high dimensional genetic (and clinical) covariates, where we assume $s\gg n$. We are interested in characterizing the association between the 2D exposure $\bm{Z}$ and the outcome $Y$ conditional on the observed covariates $X$.

Figure 2: Directed acyclic graph showing the potential high-dimensional confounders and precision variables $X$ (gold), the possible unmeasured confounders $U$ (light yellow), the 2D imaging exposure $\bm{Z}$ (green), the instrumental variables $X$ (purple), and the outcome of interest $Y$ (blue). The red arrow denotes the association of interest.

Denote $X_i=(X_{i1},\ldots,X_{is})^{\mathrm{T}}$. Without loss of generality, we assume that $X_{il}$ has been standardized for every $1\leq l\leq s$, and that $\bm{Z}_i$ and $Y_i$ have been centered. To map the GIC pathway, we assume the following linear equation models:

$$Y_i = \sum_{l=1}^{s} X_{il}\beta_l + \langle\bm{Z}_i,\bm{B}\rangle + \epsilon_i \quad\textrm{(outcome model)}; \qquad (1)$$
$$\bm{Z}_i = \sum_{l=1}^{s} X_{il}*\bm{C}_l + \bm{E}_i \quad\textrm{(exposure model)}. \qquad (2)$$

In (1), the matrix $\bm{B}\in\mathbb{R}^{p\times q}$ is the main parameter of interest, representing the association between the 2D imaging exposure $\bm{Z}_i$ and the behavioral outcome $Y_i$; $\beta_l$ represents the association between the $l$-th observed covariate $X_{il}$ and the behavioral outcome $Y_i$; and $\epsilon_i$ and $\bm{E}_i$ are random errors that may be correlated. The inner product between two matrices is defined as $\langle\bm{Z}_i,\bm{B}\rangle=\langle\mathrm{vec}(\bm{Z}_i),\mathrm{vec}(\bm{B})\rangle$, where $\mathrm{vec}(\cdot)$ is a vectorization operator that stacks the columns of a matrix into a vector. Model (2), previously introduced in Kong et al. (2020), specifies the relationship between the 2D imaging exposure and the observed covariates. Here $\bm{C}_l$ is a $p\times q$ coefficient matrix characterizing the association between the $l$-th covariate $X_{il}$ and the 2D imaging exposure $\bm{Z}_i$, and $\bm{E}_i$ is a $p\times q$ matrix of random errors with mean $\bm{0}$. The symbol “$*$” denotes element-wise multiplication. Define $\mathcal{M}_1=\{1\leq l\leq s:\beta_l\neq 0\}$ and $\mathcal{M}_2=\{1\leq l\leq s:\bm{C}_l\neq\bm{0}\}$, where we assume $|\mathcal{M}_1|\ll n$ and $|\mathcal{M}_2|\ll n$; here $|\mathcal{M}_1|$ and $|\mathcal{M}_2|$ denote the numbers of elements in $\mathcal{M}_1$ and $\mathcal{M}_2$, respectively.
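The following toy sketch makes models (1) and (2) concrete by generating a single observation; all dimensions, coefficient values, and variable names here are illustrative choices of ours, not the ADNI settings.

```python
# Minimal sketch of models (1)-(2); dimensions and values are toy choices.
import numpy as np

rng = np.random.default_rng(0)
s, p, q = 100, 64, 64                      # far below the ADNI scale
x = rng.standard_normal(s)                 # standardized covariates X_i
C = 0.01 * rng.standard_normal((s, p, q))  # exposure coefficients C_l
B = 0.01 * rng.standard_normal((p, q))     # coefficient image of interest
beta = np.zeros(s)
beta[:3] = [3.0, 1.0, 1.0 / 3.0]

# Exposure model (2): Z_i = sum_l X_il * C_l + E_i
Z = np.tensordot(x, C, axes=1) + 0.2 * rng.standard_normal((p, q))
# Outcome model (1): Y_i = sum_l X_il beta_l + <Z_i, B> + eps_i,
# where <Z_i, B> = <vec(Z_i), vec(B)> is the elementwise-product sum.
Y = x @ beta + np.sum(Z * B) + rng.standard_normal()
```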

To estimate $\bm{B}$, the first step is to perform variable selection in models (1) and (2). We group the covariates $X_l$ into four categories. Let $\mathcal{A}=\{1,\ldots,s\}$; denote by $\mathcal{C}$ the indices of confounders, i.e., variables associated with both the outcome and the exposure; by $\mathcal{P}$ the indices of precision variables, i.e., predictors of the outcome but not the exposure; by $\mathcal{I}$ the indices of instrumental variables, i.e., covariates that are associated only with the exposure and not directly with the outcome except through the exposure; and by $\mathcal{S}$ the indices of irrelevant variables, i.e., covariates related to neither the outcome nor the exposure. Mathematically, $\mathcal{C}=\{l\in\mathcal{A}\,|\,\beta_l\neq 0\textrm{ and }\bm{C}_l\neq 0\}$, $\mathcal{P}=\{l\in\mathcal{A}\,|\,\beta_l\neq 0\textrm{ and }\bm{C}_l=0\}$, $\mathcal{I}=\{l\in\mathcal{A}\,|\,\beta_l=0\textrm{ and }\bm{C}_l\neq 0\}$, and $\mathcal{S}=\{l\in\mathcal{A}\,|\,\beta_l=0\textrm{ and }\bm{C}_l=0\}$. The relationships among the different types of $X$, $\bm{Z}$, and $Y$ are shown in Figure 2, where $U$ denotes possible unmeasured confounders. Since we are interested in characterizing the association between $\bm{Z}$ and $Y$ conditional on $X$, further discussion of $U$ is omitted for the remainder of the paper.

When there are no unobserved confounders $U$, the estimate of $\bm{B}$ has a causal interpretation. In this case, the ideal adjustment set includes all confounders to avoid bias and all precision variables to increase statistical efficiency, while excluding instrumental and irrelevant variables (Brookhart et al., 2006; Shortreed and Ertefaie, 2017). Although we study the conditional association rather than the causal relationship due to possible unobserved confounding, our target adjustment set remains the same. In other words, we aim to retain all covariates in $\mathcal{M}_1=\mathcal{C}\cup\mathcal{P}=\{l\in\mathcal{A}\,|\,\beta_l\neq 0\}$, while excluding covariates in $\mathcal{I}\cup\mathcal{S}=\{l\in\mathcal{A}\,|\,\beta_l=0\}$.
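As a concrete illustration, the sketch below classifies covariate indices into $\mathcal{C}$, $\mathcal{P}$, $\mathcal{I}$, and $\mathcal{S}$ from the (true, in simulation) coefficients; the function name and interface are hypothetical.

```python
import numpy as np

def categorize(beta, C, tol=0.0):
    """Split indices by whether beta_l and C_l are (non)zero, as in the text.
    beta: (s,) outcome coefficients; C: (s, p, q) exposure coefficient images."""
    nz_beta = np.abs(beta) > tol
    nz_C = np.array([np.any(np.abs(Cl) > tol) for Cl in C])
    return {"confounders": np.where(nz_beta & nz_C)[0],    # set C
            "precision":   np.where(nz_beta & ~nz_C)[0],   # set P
            "instruments": np.where(~nz_beta & nz_C)[0],   # set I
            "irrelevant":  np.where(~nz_beta & ~nz_C)[0]}  # set S
```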

3.2 Naive screening methods

To find the nonzero $\beta_l$'s, a straightforward idea is to consider a penalized estimator obtained from the outcome generating model (1), imposing, say, Lasso penalties on the $\beta_l$'s. However, this is computationally infeasible in our ADNI data application, as the number of baseline covariates $s$ is over 6 million. Consequently, it is important to employ a screening procedure (e.g. Fan and Lv, 2008) to reduce the model size. To find covariates $X_l$ associated with the outcome $Y$ conditional on the exposure $\bm{Z}$, one might consider a conditional screening procedure for model (1) (Barut et al., 2016). Specifically, one can fit the model $Y_i=X_{il}\beta_l+\langle\bm{Z}_i,\bm{B}\rangle+\epsilon_i$ for each $1\leq l\leq s$, obtain the marginal estimates $\widehat{\beta}^{MZ}_l$, and then sort the $|\widehat{\beta}^{MZ}_l|$'s for screening. This procedure works well if the exposure $\bm{Z}$ is low dimensional, as one only needs to fit a low-dimensional ordinary least squares (OLS) regression $s$ times. However, in our ADNI data application, the imaging exposure $\bm{Z}$ is of dimension $pq=15{,}000$. As a result, one cannot obtain an OLS estimate since $n<pq$. Thus, to apply the conditional sure independence screening procedure to our application, one would need to solve a penalized regression problem for each $1\leq l\leq s$, such as $\arg\min_{\bm{B},\beta_l}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i-\langle\bm{Z}_i,\bm{B}\rangle-X_{il}\beta_l\right)^2+P_\lambda(\bm{B})\right]$, where $P_\lambda(\bm{B})$ is a penalty on $\bm{B}$. In theory, for each $l\in\{1,\ldots,s\}$, one can obtain the estimate $\widehat{\beta}^{MZ}_{l,\lambda}$ and then rank the $|\widehat{\beta}^{MZ}_{l,\lambda}|$'s. However, this is computationally prohibitive in the ADNI data with $s>6{,}000{,}000$. First, the penalized regression problem is much slower to solve than OLS. Second, selecting the tuning parameter $\lambda$ by grid search substantially increases the computational burden.

Alternatively, one may apply the marginal screening procedure of Fan and Lv (2008) to model (1). Specifically, one may solve the following marginal OLS problem for each $X_{il}$, ignoring the exposure $\bm{Z}_i$: $\arg\min_{\beta_l}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i-X_{il}\beta_l\right)^2\right]$. The marginal OLS estimate has the closed form $\widehat{\beta}^M_l=n^{-1}\sum_{i=1}^{n}X_{il}Y_i$, and one can rank the $|\widehat{\beta}^M_l|$'s for screening. Specifically, the selected sub-model is defined as $\widehat{\mathcal{M}}^*_1=\{1\leq l\leq s:|\widehat{\beta}^M_l|\geq\gamma_{1,n}\}$, where $\gamma_{1,n}$ is a threshold. Computationally, this is much faster than conditional screening for model (1), as we only need to fit a one-dimensional OLS regression $s>6{,}000{,}000$ times. However, this procedure is likely to miss some important confounders. To see this, plugging model (2) into (1) yields

$$Y_i = \sum_{l=1}^{s} X_{il}\left(\beta_l+\langle\bm{C}_l,\bm{B}\rangle\right)+\langle\bm{E}_i,\bm{B}\rangle+\epsilon_i.$$

Even in the ideal case where the $X_{il}$'s are orthogonal for $1\leq l\leq s$, $\widehat{\beta}^M_l$ is not a good estimate of $\beta_l$ because of the bias term $\langle\bm{C}_l,\bm{B}\rangle$. Thus, we may miss some nonzero $\beta_l$'s in the screening step if $\beta_l$ and $\langle\bm{C}_l,\bm{B}\rangle$ are of similar magnitude but opposite sign. We illustrate this point in Figure 5 in Section 5, where conventional marginal screening on (1) fails to capture some of the important confounders.
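For reference, the marginal screening step above reduces to a single matrix product; the sketch below (function name ours) computes all $s$ marginal OLS estimates at once and keeps the top $k$.

```python
import numpy as np

def marginal_outcome_screen(X, Y, k):
    """Return the indices of the k largest |beta_hat_l^M| (the set M_hat_1^*).
    X: (n, s) standardized covariates; Y: (n,) centered outcome."""
    beta_m = X.T @ Y / X.shape[0]            # beta_hat_l^M = n^{-1} sum_i X_il Y_i
    return np.argsort(-np.abs(beta_m))[:k]   # rank by |beta_hat_l^M|
```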

3.3 Joint screening

To overcome the drawbacks of the screening methods discussed in Section 3.2, we develop a joint screening procedure tailored to our ADNI data application. The procedure is not only computationally efficient, but can also select all the important confounders and precision variables with high probability. The key insight is that although we are interested in selecting important variables in the outcome generating model, this can be done much more efficiently by incorporating information from the exposure model. Specifically, let $\widehat{\bm{C}}^M_l=n^{-1}\sum_{i=1}^{n}X_{il}*\bm{Z}_i\in\mathbb{R}^{p\times q}$ be the marginal OLS estimate in model (2) for $l=1,\ldots,s$. Following Kong et al. (2020), the important covariates in model (2) can be selected by $\widehat{\mathcal{M}}_2=\{1\leq l\leq s:\|\widehat{\bm{C}}^M_l\|_{op}\geq\gamma_{2,n}\}$, where $\|\cdot\|_{op}$ is the operator norm of a matrix and $\gamma_{2,n}$ is a threshold.

We define our joint screening set as $\widehat{\mathcal{M}}=\widehat{\mathcal{M}}^*_1\cup\widehat{\mathcal{M}}_2$. Intuitively, most important confounders and precision variables are contained in the set $\widehat{\mathcal{M}}^*_1$. The only exceptions are covariates $X_l$ for which $\beta_l$ and $\langle\bm{C}_l,\bm{B}\rangle$ are of similar magnitude but opposite sign. These $X_l$, however, will be included in $\widehat{\mathcal{M}}_2$, and hence in $\widehat{\mathcal{M}}$, along with instrumental variables with large $\|\bm{C}_l\|_{op}$. In Section 15 of the supplementary material, we show that with properly chosen $\gamma_{1,n}$ and $\gamma_{2,n}$, the joint screening set includes the confounders and precision variables with high probability: $P(\mathcal{M}_1\subset\widehat{\mathcal{M}})\rightarrow 1$ as $n\rightarrow\infty$. In practice, we recommend choosing $\gamma_{1,n}$ and $\gamma_{2,n}$ such that $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|=k$, where $k$ is the smallest integer such that $|\widehat{\mathcal{M}}|\geq\lfloor n/\log(n)\rfloor$. We set the two sets to be of equal size following the convention that the size of a screening set is determined only by the sample size (Fan and Lv, 2008), which is the same for $\widehat{\mathcal{M}}^*_1$ and $\widehat{\mathcal{M}}_2$. Depending on prior knowledge about the sizes and signal strengths of the confounding, precision, and instrumental variables, the sizes of $|\widehat{\mathcal{M}}^*_1|$ and $|\widehat{\mathcal{M}}_2|$ may be chosen differently. In the simulations and real data analyses, we conduct sensitivity analyses by varying the relative sizes of $\widehat{\mathcal{M}}^*_1$ and $\widehat{\mathcal{M}}_2$.

In general, the set $\widehat{\mathcal{M}}$ includes not only the confounders and precision variables in $\mathcal{M}_1=\mathcal{C}\cup\mathcal{P}$, but also instrumental variables in $\mathcal{I}$ and a small subset of the irrelevant variables in $\mathcal{S}$. Nevertheless, the size of $\widehat{\mathcal{M}}$ is greatly reduced compared to the full set of observed covariates. This makes it feasible to perform the second-step procedure, a refined penalized estimation of $\bm{B}$ based on the covariates $\{X_l:l\in\widehat{\mathcal{M}}\}$.
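A minimal sketch of the joint screening rule, assuming the data fit in memory (in the ADNI application the marginal estimates would be computed SNP by SNP); it grows $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|=k$ until the union reaches the target size $\lfloor n/\log(n)\rfloor$.

```python
import numpy as np

def joint_screen(X, Y, Z, target=None):
    """Union of the outcome set M_hat_1^* and the exposure set M_hat_2.
    X: (n, s) covariates; Y: (n,) outcome; Z: (n, p, q) imaging exposure."""
    n, s = X.shape
    target = target or int(n / np.log(n))
    beta_m = np.abs(X.T @ Y) / n                  # |beta_hat_l^M|
    C_m = np.tensordot(X.T, Z, axes=1) / n        # (s, p, q) stack of C_hat_l^M
    C_op = np.array([np.linalg.norm(C_m[l], 2)    # operator norm = top singular value
                     for l in range(s)])
    r1, r2 = np.argsort(-beta_m), np.argsort(-C_op)
    k = 1
    while len(set(r1[:k]) | set(r2[:k])) < target:  # keep |M_1^*| = |M_2| = k
        k += 1
    return sorted(set(r1[:k]) | set(r2[:k]))
```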

3.4 Blockwise joint screening

Linkage disequilibrium (LD) is a ubiquitous biological phenomenon in which genetic variants exhibit a strong blockwise correlation structure (Wall and Pritchard, 2003). If all the SNPs of a particular LD block are important but have relatively weak signals, they may be missed by the screening procedure described in Section 3.3. To utilize the structural information of LD blocks to recover those missed SNPs, we develop the modified screening procedure described below.

Suppose that $X=(X_1,\ldots,X_s)^T$ can be divided into $b$ discrete haplotype blocks: regions of high LD that are separated from other haplotype blocks by many historical recombination events (Wall and Pritchard, 2003). Let the index sets of the $b$ non-overlapping blocks be $\mathcal{B}_1,\ldots,\mathcal{B}_b$ with $\cup_{j=1}^{b}\mathcal{B}_j=\{1,\ldots,s\}$. For $l=1,\ldots,s$, we define

$$\beta^{block,M}_l=\sum_{j=1}^{b}\frac{1(l\in\mathcal{B}_j)}{|\mathcal{B}_j|}\sum_{i\in\mathcal{B}_j}|\beta^M_i|\quad\textrm{and}\quad C^{block,M}_l=\sum_{j=1}^{b}\frac{1(l\in\mathcal{B}_j)}{|\mathcal{B}_j|}\sum_{i\in\mathcal{B}_j}\|\bm{C}^M_i\|_{op},$$
$$\widehat{\beta}^{block,M}_l=\sum_{j=1}^{b}\frac{1(l\in\mathcal{B}_j)}{|\mathcal{B}_j|}\sum_{i\in\mathcal{B}_j}|\widehat{\beta}^M_i|\quad\textrm{and}\quad \widehat{C}^{block,M}_l=\sum_{j=1}^{b}\frac{1(l\in\mathcal{B}_j)}{|\mathcal{B}_j|}\sum_{i\in\mathcal{B}_j}\|\widehat{\bm{C}}^M_i\|_{op},$$

where $1(\cdot)$ is the indicator function of an event. We also define

$$\widehat{\mathcal{M}}^{block,*}_1=\{1\leq l\leq s:\widehat{\beta}^{block,M}_l\geq\gamma_{3,n}\}\quad\textrm{and}\quad\widehat{\mathcal{M}}^{block}_2=\{1\leq l\leq s:\widehat{C}^{block,M}_l\geq\gamma_{4,n}\}.$$

We propose to use the new set $\widehat{\mathcal{M}}^{block}=\widehat{\mathcal{M}}^*_1\cup\widehat{\mathcal{M}}_2\cup\widehat{\mathcal{M}}^{block,*}_1\cup\widehat{\mathcal{M}}^{block}_2$, rather than $\widehat{\mathcal{M}}=\widehat{\mathcal{M}}^*_1\cup\widehat{\mathcal{M}}_2$, to select important covariates. Intuitively, when $|\beta_{l_1}|>|\beta_{l_2}|$, $X_{l_1}$ is much more easily selected than $X_{l_2}$ by $\widehat{\mathcal{M}}^*_1$. Suppose, however, that $l_1\in\mathcal{B}_1$ and $l_2\in\mathcal{B}_2$, with only a small proportion of the $X_l$ in $\mathcal{B}_1$ having $|\beta_l|>0$, whereas a large proportion of the $X_l$ in $\mathcal{B}_2$ have $|\beta_l|>0$. It may well be the case that $\beta^{block,M}_{l_1}<\beta^{block,M}_{l_2}$, meaning that $X_{l_2}$ can be selected more easily than $X_{l_1}$ by $\widehat{\mathcal{M}}^{block,*}_1$. In addition, since $\widehat{\beta}^M_l$ is not a good estimate of $\beta_l$ due to the bias term $\langle\bm{C}_l,\bm{B}\rangle$, $\widehat{\beta}^{block,M}_l$ is not a good estimate of $\beta^{block,M}_l$ either. Therefore, some $X_l$ with nonzero $\beta^{block,M}_l$ may not be included in $\widehat{\mathcal{M}}^{block,*}_1$. Nevertheless, they will be included in $\widehat{\mathcal{M}}^{block}_2$, and hence in $\widehat{\mathcal{M}}^{block}$.

Theoretically, when $\gamma_{1,n}$, $\gamma_{2,n}$, $\gamma_{3,n}$, and $\gamma_{4,n}$ are chosen properly, $P(\mathcal{M}_1\subset\widehat{\mathcal{M}}^{block})\rightarrow 1$ as $n\rightarrow\infty$; see Theorem 3 in Section 15 of the supplementary material. In practice, we recommend choosing $\gamma_{1,n}$, $\gamma_{2,n}$, $\gamma_{3,n}$, and $\gamma_{4,n}$ such that $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|=|\widehat{\mathcal{M}}^{block,*}_1|=|\widehat{\mathcal{M}}^{block}_2|=k$, where $k$ is the smallest integer such that $|\widehat{\mathcal{M}}^{block}|\geq 2\lfloor n/\log(n)\rfloor$. The target size here is twice that suggested in Section 3.3, since we take the union of two additional sets.
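The block-averaged statistics admit a direct implementation; the sketch below (names ours) maps per-SNP marginal statistics to their within-block means, after which the blockwise sets are obtained by thresholding exactly as above.

```python
import numpy as np

def blockwise_stats(beta_m_abs, C_op, blocks):
    """Replace each |beta_hat_l^M| and ||C_hat_l^M||_op by its average over
    the LD block containing l. blocks: list of index arrays B_1, ..., B_b
    partitioning {0, ..., s-1}."""
    beta_block = np.empty_like(beta_m_abs)
    C_block = np.empty_like(C_op)
    for idx in blocks:
        beta_block[idx] = beta_m_abs[idx].mean()
        C_block[idx] = C_op[idx].mean()
    return beta_block, C_block
```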

3.5 Second-step estimation

In this step, we aim to estimate $\bm{B}$ by excluding the instrumental variables in $\mathcal{I}$ and the irrelevant variables in $\mathcal{S}$ from $\widehat{\mathcal{M}}$ (or $\widehat{\mathcal{M}}^{block}$) while keeping the other covariates. This can be done by solving the following optimization problem:

$$\operatorname*{arg\,min}_{\bm{B},\,\{\beta_l,\,l\in\widehat{\mathcal{M}}\}}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i-\langle\bm{Z}_i,\bm{B}\rangle-\sum_{l\in\widehat{\mathcal{M}}}X_{il}\beta_l\right)^{2}+\lambda_1\sum_{l\in\widehat{\mathcal{M}}}|\beta_l|+\lambda_2\|\bm{B}\|_*\right]. \qquad (3)$$

Denote by $(\widehat{\bm{B}},\widehat{\bm{\beta}})$ the solution to the above optimization problem. The Lasso penalty on the $\beta_l$'s is used to exclude the instrumental and irrelevant variables in $\widehat{\mathcal{M}}$, whose corresponding coefficients $\beta_l$ are zero. The nuclear norm penalty $\|\cdot\|_*$, defined as the sum of all singular values of a matrix, is used to obtain a low-rank estimate of $\bm{B}$; the low-rank assumption in estimating 2D structural coefficients is commonly used in the literature (Zhou and Li, 2014; Kong et al., 2020). For the tuning parameters, we use five-fold cross-validation based on a two-dimensional grid search, and select $\lambda_1$ and $\lambda_2$ using the one-standard-error rule (Hastie et al., 2009).
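Problem (3) can be attacked with alternating proximal gradient steps: soft-thresholding for the Lasso part and singular value thresholding for the nuclear norm. The sketch below is ours, not the authors' solver, and assumes a fixed step size small enough for convergence.

```python
import numpy as np

def second_step(Z, Xs, Y, lam1, lam2, step=1e-3, iters=500):
    """Proximal-gradient sketch of (3). Z: (n, p, q) exposures; Xs: (n, m)
    screened covariates; Y: (n,) outcome."""
    n, p, q = Z.shape
    B, beta = np.zeros((p, q)), np.zeros(Xs.shape[1])
    for _ in range(iters):
        resid = Y - np.tensordot(Z, B, axes=([1, 2], [0, 1])) - Xs @ beta
        g_B = -np.tensordot(resid, Z, axes=1) / n      # gradient w.r.t. B
        g_beta = -Xs.T @ resid / n                     # gradient w.r.t. beta
        b = beta - step * g_beta                       # Lasso prox: soft-threshold
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam1, 0.0)
        U, d, Vt = np.linalg.svd(B - step * g_B, full_matrices=False)
        B = U @ np.diag(np.maximum(d - step * lam2, 0.0)) @ Vt  # nuclear prox: SVT
    return B, beta
```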

4 ADNI data applications

We use the data obtained from the ADNI study (adni.loni.usc.edu). The data usage acknowledgement is included in Section 8 of the supplementary material. As described in Section 2, we include 566 subjects from the ADNI1 study. The exposure of interest is the baseline 2D hippocampal surface radial distance measure, which can be represented as a $100\times 150$ matrix for each side of the hippocampus. The outcome of interest is the ADAS-13 score observed at Month 24. The average ADAS-13 score is 20.8 with standard deviation 14.1. The covariates to adjust for include 6,087,205 bi-allelic markers as well as clinical covariates: age, gender, education length, baseline intracranial volume (ICV), and baseline diagnosis status. The average age is 75.5 years with standard deviation 6.6 years, and the average education length is 15.6 years with standard deviation 2.9 years. Among the 566 subjects, 58.1% were female. The average ICV was $1.28\times 10^6$ mm$^3$ with standard deviation $1.35\times 10^5$ mm$^3$. At baseline, there were 175 cognitively normal subjects (184 at Month 24), 268 subjects with mild cognitive impairment (MCI; 157 at Month 24), and 123 subjects with AD (225 at Month 24). Studies have shown that age and gender are main risk factors for Alzheimer’s disease (Vina and Lloret, 2010; Guerreiro and Bras, 2015), with older people and females more likely to develop Alzheimer’s disease. Multiple studies have also shown that the prevalence of dementia is greater among those with little or no education (Zhang et al., 1990). On the other hand, age, gender, and length of education have been found to be strongly associated with the hippocampus (Van de Pol et al., 2006; Jack et al., 2000; Noble et al., 2012). Previous studies (Sargolzaei et al., 2015) suggest that ICV is an important measure that needs to be adjusted for in studies of brain change and AD. In addition, the baseline diagnosis status may help explain the baseline hippocampal shape and the AD status at Month 24. Therefore, we consider age, gender, education length, baseline ICV, and baseline diagnosis status as part of the confounders, and adjust for them in our analysis. We also adjust for population stratification, using the top five principal components of the whole genome data. As both the left and right hippocampi have 2D radial distance measures and the two hippocampi have been found to be asymmetric (Pedraza et al., 2004), we apply our method to the left and right hippocampi separately.

We use the default method (Gabriel et al., 2002) of Haploview (Barrett et al., 2005) and PLINK (Purcell et al., 2007) to form linkage disequilibrium (LD) blocks. Previous studies report that about 50 genetic variants are associated with AD; see the review in Sims et al. (2020) for details. This supports our assumption that $|\mathcal{M}_1|<n$ ($n=566$). On the other hand, a genome-wide association analysis of 19,629 individuals by Zhao et al. (2019) shows that 57 genetic variants are associated with the left hippocampal volume and 54 with the right hippocampal volume. This supports our assumption that $|\mathcal{M}_2|<n$. We therefore apply our blockwise joint screening procedure to the SNPs marginally for each side of the hippocampus, using the outcome $Y_i$ and the exposure $\bm{Z}_i$. We choose the thresholds $\gamma_{1,n}$, $\gamma_{2,n}$, $\gamma_{3,n}$, and $\gamma_{4,n}$ such that $|\widehat{\mathcal{M}}^{block}|=2\lfloor n/\log(n)\rfloor=178$. In Table 3 of the supplementary material, we list the top 20 SNPs corresponding to the left and right hippocampi, respectively. As suggested by one referee, we plot Manhattan-style figures for $\widehat{\mathcal{M}}^*_1$, $\widehat{\mathcal{M}}^{block,*}_1$, $\widehat{\mathcal{M}}_2$, and $\widehat{\mathcal{M}}^{block}_2$ in Figure 7 of the supplementary material, where genomic coordinates are displayed along the x-axis; the y-axis represents the magnitudes of $|\widehat{\beta}^M_l|$, $\widehat{\beta}^{block,M}_l$, $\|\widehat{\bm{C}}^M_l\|_{op}$, and $\widehat{C}^{block,M}_l$; and the horizontal dashed lines represent the threshold values $\gamma_{1,n}$, $\gamma_{2,n}$, $\gamma_{3,n}$, and $\gamma_{4,n}$, respectively.

From Table 3 and Figure 7, one can see that there are quite a few important SNPs for both hippocampi. For example, the top SNP is rs429358 from chromosome 19. This SNP is a C/T single-nucleotide variant (SNV) in the APOE gene. It is also one of the two SNPs that define the well-known APOE alleles, the major genetic risk factor for Alzheimer’s disease (Kim et al., 2009). In addition, a large portion of the SNPs in Table 3 have been found to be strongly associated with Alzheimer’s. These include rs10414043 (Du et al., 2018), an A/G SNV in the APOC1 gene; rs7256200 (Takei et al., 2009), an A/G SNV in the APOC1 gene; rs73052335 (Zhou et al., 2018), an A/C SNV in the APOC1 gene; rs769449 (Chung et al., 2014), an A/G SNV in the APOE gene; rs157594 (Hao et al., 2017), a G/T SNV; rs56131196 (Gao et al., 2016), an A/G SNV in the APOC1 gene; rs111789331 (Gao et al., 2016), an A/T SNV; and rs4420638 (Coon et al., 2007), an A/G SNV in the APOC1 gene.

Among the SNPs found to be associated with Alzheimer’s, some are also directly associated with the hippocampi. For example, Zhou et al. (2020) revealed that the SNPs rs10414043, rs73052335, and rs769449 are among the top SNPs that have significant genetic effects on the volumes of both the left and right hippocampi. Guo et al. (2019) identified the SNP rs56131196 as associated with hippocampal shape.

We then perform our second-step estimation procedure for each side of the hippocampus. Here $X_{\widehat{\mathcal{M}}}$ denotes the SNPs selected in the screening step, the population stratification variables (top five principal components of the whole genome data), and the five clinical measures (age, gender, education length, baseline ICV, and baseline diagnosis status), and $\bm{Z}$ denotes the left/right hippocampal surface image matrix. To visualize the results, we map the estimates $\widehat{\bm{B}}$ corresponding to each side of the hippocampus onto a representative hippocampal surface and plot them in Figure 3(a). We also plot the hippocampal subfields (Apostolova et al., 2006) in Figure 3(b). Here Cornu Ammonis regions 1 (CA1), 2 (CA2), and 3 (CA3) form a strip of pyramidal neurons within the hippocampus proper. CA1, the top portion, known as the “regio superior of Cajal” (Blackstad et al., 1970), consists of small pyramidal neurons. Within the lower portion (the regio inferior of Cajal), which consists of larger pyramidal neurons, there are a smaller area called CA2 and a larger area called CA3, which differ in cytoarchitecture and connectivity. The subiculum is a pivotal structure of the hippocampal formation, positioned between the entorhinal cortex and the CA1 subfield of the hippocampus proper (for a complete review, see Dudek et al., 2016).

From the plots, we can see that all 15,000 entries of $\widehat{\bm{B}}$ for both hippocampi are negative. This implies that the radial distance of each pixel of both hippocampi is negatively associated with the ADAS-13 score, which quantifies the severity of behavioral deficits. The subfields with the strongest associations are mostly CA1 and the subiculum. Existing literature (Apostolova et al., 2010) has found that as Alzheimer’s disease progresses, it first affects the CA1 and subiculum subregions and later the CA2 and CA3 subregions. This can partially explain why the shapes of CA1 and the subiculum may have stronger associations with ADAS-13 scores than the CA2 and CA3 subregions.

Figure 3: Real data results: Panel (a) plots the estimate $\widehat{\bm{B}}$ corresponding to the left hippocampus (left part) and the right hippocampus (right part). Panel (b) plots the hippocampal subfields.

We examine the effect size of the whole hippocampal shape by evaluating the term $\langle\bm{Z}_i,\widehat{\bm{B}}\rangle$. Specifically, we calculate the proportion of variance explained by the imaging exposure as follows:

$$R^2=\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2-\sum_{i=1}^{n}\left(y_i-\bar{y}-\langle\bm{Z}_i,\widehat{\bm{B}}\rangle\right)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}.$$

Our results show that the shapes of the left and right hippocampi account for 5.83% and 4.71% of the total variation in behavioral deficits, respectively. Such effect sizes are quite large compared with those of polygenic risk scores in genetics. In addition, we perform a permutation test to assess whether the $R^2$ statistic is significant. In particular, we randomly permute $\{Y_1,\ldots,Y_n\}$ to obtain $\{Y^*_1,\ldots,Y^*_n\}$, apply our estimation procedure to $(X_i,\bm{Z}_i,Y^*_i)$, obtain $\widehat{\bm{B}}^*$, and calculate $(R^2)^*$. We repeat this 1,000 times to obtain $\{(R^2_{(k)})^*,1\leq k\leq 1000\}$, which mimics the null distribution. Finally, the p-value is calculated as $\frac{1}{1000}\sum_{k=1}^{1000}1\{(R^2_{(k)})^*\geq R^2\}$. The p-values for both hippocampi are less than 0.001, suggesting that the $R^2$'s are significantly different from zero.
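The permutation test has a simple generic form; in the sketch below, `fit_and_r2` is a hypothetical wrapper assumed to run the full two-step procedure and return the $R^2$ statistic.

```python
import numpy as np

def permutation_pvalue(fit_and_r2, X, Z, Y, n_perm=1000, seed=0):
    """Build the null distribution of R^2 by refitting on permuted outcomes."""
    rng = np.random.default_rng(seed)
    r2_obs = fit_and_r2(X, Z, Y)
    r2_null = np.array([fit_and_r2(X, Z, rng.permutation(Y))
                        for _ in range(n_perm)])
    return np.mean(r2_null >= r2_obs)   # fraction of null R^2 >= observed R^2
```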

We also conduct sensitivity analyses by varying the relative sizes of $\widehat{\mathcal{M}}^*_1$ and $\widehat{\mathcal{M}}_2$ in the joint screening procedure. The estimates $\widehat{\bm{B}}$ are similar across different choices of $|\widehat{\mathcal{M}}^*_1|$ and $|\widehat{\mathcal{M}}_2|$; see Section 11 of the supplementary material for details. In addition, we repeated our analysis on the 391 MCI and AD subjects. We have similar findings: the radial distances of each pixel of both hippocampi are mostly negatively associated with the ADAS-13 score, and the subfields with the strongest associations are mostly CA1 and the subiculum; see Section 12 of the supplementary material for details. As suggested by one referee, we also performed the SNP-imaging-outcome mediation analysis proposed by Bi et al. (2017); see Section 13 of the supplementary material for the detailed procedure. Our analysis provides no evidence of a SNP-imaging-outcome mediating relationship.

5 Simulation studies

In this section, we perform simulation studies to evaluate the finite-sample performance of the proposed method. The dimension of the covariates is set to $s=5000$, and the exposure is a $64\times 64$ matrix. Each $X_i\in\mathbb{R}^s$ is independently generated from $N(\bm{0},\bm{\Sigma}_x)$, where $\bm{\Sigma}_x=(\sigma_{x,ll'})$ has an autoregressive structure with $\sigma_{x,ll'}=\rho_1^{|l-l'|}$ for $1\leq l,l'\leq s$ and $\rho_1=0.5$. Define $\bm{B}$ as the $64\times 64$ image shown in Figure 4(a), and $\bm{C}$ as the $64\times 64$ image shown in Figure 4(b). For $\bm{B}$, the black regions of interest (ROIs) are assigned the value 0.0408 and the white ROIs the value 0. For $\bm{C}$, the black ROIs are assigned the value 0.0335 and the white ROIs the value 0. Further, we set $\bm{C}_l=v_l*\bm{C}$, where $v_1=-1/3$, $v_2=-1$, $v_3=-3$, $v_{207}=-3$, $v_{208}=-1$, $v_{209}=-1/3$, and $v_l=0$ for $4\leq l\leq 206$ and $210\leq l\leq s$. We set the parameters $\beta_1=3$, $\beta_2=1$, $\beta_3=1/3$, $\beta_{104}=3$, $\beta_{105}=1$, $\beta_{106}=1/3$, and $\beta_l=0$ for $4\leq l\leq 103$ and $107\leq l\leq s$. In this setting, we have $\mathcal{C}=\{1,2,3\}$, $\mathcal{P}=\{104,105,106\}$, $\mathcal{I}=\{207,208,209\}$, and $\mathcal{S}=\{1,\ldots,5000\}\backslash\{1,2,3,104,105,106,207,208,209\}$.

Figure 4: Panels (a) and (b) plot $\bm{B}$ and $\bm{C}$, respectively. In Panel (a), the value at each pixel is either 0 (white) or 0.0408 (black). In Panel (b), the value at each pixel is either 0 (white) or 0.0335 (black).

The random errors $\mathrm{vec}(\bm{E}_i)$ are independently generated from $N(\bm{0},\bm{\Sigma}_e)$, where we set the standard deviation of every element of $\bm{E}_i$ to $\sigma_e=0.2$ and the correlation between $\bm{E}_{i,jk}$ and $\bm{E}_{i,j'k'}$ to $\rho_2^{|j-j'|+|k-k'|}$ for $1\leq j,k,j',k'\leq 64$ with $\rho_2=0.5$. The random error $\epsilon_i$ is generated independently from $N(0,\sigma^2)$, where we consider $\sigma=1$ or $0.5$. The $Y_i$'s and $\bm{Z}_i$'s are generated from models (1) and (2). We consider three sample sizes: $n=200$, $500$, and $1000$.
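Since the error correlation factorizes as $\rho_2^{|j-j'|+|k-k'|}=\rho_2^{|j-j'|}\cdot\rho_2^{|k-k'|}$, the images $\bm{E}_i$ can be drawn by a row/column (Kronecker) construction; the sketch below is one way to do this, with helper names of our choosing.

```python
import numpy as np

def ar1_chol(m, rho):
    """Cholesky factor of the AR(1) correlation matrix rho^|i-j|."""
    idx = np.arange(m)
    return np.linalg.cholesky(rho ** np.abs(np.subtract.outer(idx, idx)))

def simulate_errors(n, p=64, q=64, rho2=0.5, sigma_e=0.2, seed=0):
    """Draw n error images with corr(E_jk, E_j'k') = rho2^(|j-j'|+|k-k'|):
    E_i = sigma_e * L_r G_i L_c^T with G_i a standard normal matrix."""
    rng = np.random.default_rng(seed)
    Lr, Lc = ar1_chol(p, rho2), ar1_chol(q, rho2)
    G = rng.standard_normal((n, p, q))
    return sigma_e * np.einsum('ij,njk,lk->nil', Lr, G, Lc)
```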

5.1 Simulation for screening

We perform our screening procedure (denoted by “joint”) and report the coverage proportion of $\mathcal{M}_1$, defined as $|\widehat{\mathcal{M}}\cap\mathcal{M}_1|/|\mathcal{M}_1|$, as the size of the selected set $|\widehat{\mathcal{M}}|$ ranges from 1 to 100. In addition, we report the coverage proportion for each of the confounding and precision variables, i.e., each $j$ in the set $\mathcal{M}_1=\{1,2,3,104,105,106\}$. All coverage proportions are averaged over 100 Monte Carlo runs.

To control the size of $\widehat{\mathcal{M}}$, we first set $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|=1$ by specifying appropriate $\widehat{\gamma}_{1,n}$ and $\widehat{\gamma}_{2,n}$. Then we sequentially add two variables, one to $\widehat{\mathcal{M}}^*_1$ by decreasing $\widehat{\gamma}_{1,n}$ and one to $\widehat{\mathcal{M}}_2$ by decreasing $\widehat{\gamma}_{2,n}$, until $|\widehat{\mathcal{M}}|$ reaches 100. Note that we always keep $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|$ in this procedure. We may not obtain all sizes between 1 and 100, because $|\widehat{\mathcal{M}}|$ may increase by up to 2 at each step. Therefore, for the sizes that cannot be reached, we estimate the coverage proportion of $\mathcal{M}_1$ by linear interpolation between the two closest attained sizes.
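A sketch of this bookkeeping, with hypothetical names: grow $k$, record the attained union sizes and coverages, and interpolate the skipped sizes.

```python
import numpy as np

def coverage_curve(r1, r2, M1, max_size=100):
    """r1, r2: covariate indices ranked by the two marginal statistics;
    M1: true confounder/precision index set. Returns the coverage of M1
    as a function of |M_hat|, with skipped sizes linearly interpolated."""
    sizes, covs = [], []
    for k in range(1, max_size + 1):
        sel = set(r1[:k]) | set(r2[:k])     # keep |M_1^*| = |M_2| = k
        if len(sel) > max_size:
            break
        sizes.append(len(sel))
        covs.append(len(sel & set(M1)) / len(M1))
    curve = dict(zip(sizes, covs))          # later k overwrites equal sizes
    xs = sorted(curve)
    grid = np.arange(xs[0], xs[-1] + 1)
    return grid, np.interp(grid, xs, [curve[x] for x in xs])
```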

We compare the proposed joint screening procedure to two competing procedures. The first is an outcome screening procedure that selects the set $\widehat{\mathcal{M}}^*_1$. For a fair comparison, we let $|\widehat{\mathcal{M}}^*_1|$ range from 1 to 100. The second is an intersection screening procedure that selects the set $\widehat{\mathcal{M}}_\cap=\widehat{\mathcal{M}}^*_1\cap\widehat{\mathcal{M}}_2$. We let $|\widehat{\mathcal{M}}_\cap|$ range from 1 to 100, while keeping $|\widehat{\mathcal{M}}^*_1|=|\widehat{\mathcal{M}}_2|$. Similarly, for the sizes that $|\widehat{\mathcal{M}}_\cap|$ cannot reach, we use linear interpolation to estimate the coverage proportions. We plot the results for $(n,s,\sigma)=(200,5000,1)$ in Figure 5. The remaining results, for $(n,s,\sigma)=(200,5000,0.5)$, $(500,5000,1)$, $(500,5000,0.5)$, $(1000,5000,1)$, and $(1000,5000,0.5)$, can be found in Figures 10–14 of the supplementary material.

Figure 5: Simulation results for the case $(n,s,\sigma)=(200,5000,1)$: Panels (a)–(f) plot the average coverage proportion for $X_l$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong outcome and weak exposure predictor, a moderate outcome and moderate exposure predictor, and a weak outcome and strong exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_1=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

From the plots, one can see that both the “intersection” and “outcome” screening methods miss the confounder $X_3$ with very high probability, even as the size of the selected set approaches 100. In contrast, our method selects $X_3$ with high probability when $|\widehat{\mathcal{M}}|$ is still relatively small. For the confounders $X_1$ and $X_2$, all three methods perform similarly. For the precision variables, the “outcome” method and our “joint” method perform similarly, while the “intersection” method performs poorly. Combining these results, our method performs best, as it selects all the confounders and precision variables with high probability. In addition, we find that the coverage proportion of our method increases with the sample size, consistent with the sure independence screening property developed in Section 15 of the supplementary material.

5.2 Simulation for estimation

In this part, we evaluate the performance of our estimation procedure after the first-step screening. For the size of $\widehat{\mathcal{M}}$ in the screening step, we set $|\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor$. We compare the proposed estimate with the oracle estimate, which adjusts for the ideal adjustment set containing only the confounders and precision variables as $X$ and then estimates $\bm{B}$ using the optimization (3) without the $l_1$-regularization. We report the mean squared errors (MSEs) for $\bm{\beta}$ and $\bm{B}$, defined as $\|\bm{\beta}-\widehat{\bm{\beta}}\|_2^2$ and $\|\bm{B}-\widehat{\bm{B}}\|_F^2$, respectively. Table 1 summarizes the average MSEs of the proposed and oracle estimates for $\bm{\beta}$ and $\bm{B}$ over 100 Monte Carlo runs when $n=200$, $500$, and $1000$. The MSE decreases as the sample size increases, and in terms of the primary parameter of interest $\bm{B}$, the proposed estimate is close to the oracle estimate.

Table 1: Simulation results of the proposed joint screening method and the oracle estimates for $\sigma=1$ and $\sigma=0.5$, when $n=200$, $500$, and $1000$: the average MSEs for $\bm{\beta}$ and $\bm{B}$, with their associated standard errors in parentheses. The results are based on 100 Monte Carlo repetitions.

               $\sigma=1.0$                        $\sigma=0.5$
               MSE $\bm{\beta}$   MSE $\bm{B}$     MSE $\bm{\beta}$   MSE $\bm{B}$
n = 200
  Proposed     0.496 (0.021)      0.667 (0.005)    0.276 (0.009)      0.528 (0.005)
  Oracle       0.086 (0.005)      0.624 (0.004)    0.021 (0.001)      0.501 (0.004)
n = 500
  Proposed     0.303 (0.008)      0.574 (0.006)    0.191 (0.005)      0.345 (0.004)
  Oracle       0.036 (0.002)      0.553 (0.005)    0.006 (0.000)      0.340 (0.004)
n = 1000
  Proposed     0.217 (0.004)      0.449 (0.004)    0.128 (0.006)      0.234 (0.002)
  Oracle       0.013 (0.001)      0.460 (0.005)    0.003 (0.000)      0.233 (0.002)

We plot the 2D map of $\widehat{\bm{B}}$, averaged over 100 Monte Carlo runs, in Figure 6(c). For comparison, we also plot the corresponding average oracle estimate $\widehat{\bm{B}}_{\mathrm{oracle}}$ in Figure 6(b) and the true $\bm{B}$ in Figure 6(a). One can see that the proposed method recovers the signal pattern reasonably well.

Figure 6: Panel (a) plots the true $\bm{B}$ (Truth), Panel (b) plots the average of $\widehat{\bm{B}}_{\mathrm{oracle}}$ (Oracle), and Panel (c) plots the average of $\widehat{\bm{B}}$ (Proposed). Here $(n,s,\sigma)=(1000,5000,0.5)$. The value at each pixel is on a gray scale, with 0 as white and 0.0408 as black.

We also report the sensitivity and specificity of the estimates in Section 14.1 of the supplementary material. We find that although the proposed method may not remove all of the instrumental variables, eliminating even just some of the instruments greatly reduces the MSEs of both $\bm{\beta}$ and $\bm{B}$, compared to the method in which we do not impose the $l_1$-regularization on $\bm{\beta}$ in the second-step estimation.

5.3 Screening and estimation using blockwise joint screening

As noted in Section 3.4, genetic variants present a strong blockwise correlation structure due to linkage disequilibrium (LD) (Wall and Pritchard, 2003), and we proposed the blockwise joint screening procedure to utilize the structural information of LD blocks. In this section, we illustrate the performance of this procedure using an adapted simulation based on the settings of Dehman et al. (2015).

For $i=1,\ldots,n$, $X_i\in\mathbb{R}^s$ is independently generated from an $s$-dimensional multivariate normal distribution $N(\bm{0},\bm{\Sigma}_x)$, where $\bm{\Sigma}_x=(\sigma_{x,ll'})$ is block-diagonal: if $l\neq l'$ are in the same block, the covariance $\sigma_{x,ll'}=0.4$; otherwise $\sigma_{x,ll'}=0$; and the diagonal elements $\sigma_{x,ll}$ are all set to 1. We then set $X_{ij}$ to 0, 1, or 2 according to whether $X_{ij}<d_1$, $d_1\leq X_{ij}\leq d_2$, or $X_{ij}>d_2$, where $d_1$ and $d_2$ are thresholds chosen to produce a given minor allele frequency (MAF). For instance, choosing $d_1=\Phi^{-1}(1-6\mathrm{MAF}/5)$ and $d_2=\Phi^{-1}(1-2\mathrm{MAF}/5)$, where $\Phi$ is the c.d.f. of the standard normal distribution, corresponds to a given fixed MAF. To generate more realistic MAF distributions, we simulate genotypes $X_{ij}$ in which the MAF for each $j$ is uniformly sampled between 0.05 and 0.5 (Dehman et al., 2015).
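A sketch of this genotype generator under the stated design (helper name ours; `block_sizes` is assumed to sum to $s$):

```python
import numpy as np
from scipy.stats import norm

def simulate_genotypes(n, s, block_sizes, seed=0):
    """Latent Gaussians with within-block correlation 0.4, cut at d1 < d2
    to give 0/1/2 genotypes with MAF ~ Uniform(0.05, 0.5) per SNP."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, s))
    start = 0
    for m in block_sizes:
        S = np.full((m, m), 0.4)
        np.fill_diagonal(S, 1.0)
        X[:, start:start + m] = rng.multivariate_normal(np.zeros(m), S, size=n)
        start += m
    maf = rng.uniform(0.05, 0.5, size=s)
    d1 = norm.ppf(1 - 6 * maf / 5)          # genotype 0 below d1
    d2 = norm.ppf(1 - 2 * maf / 5)          # genotype 2 above d2
    return (X > d1).astype(int) + (X > d2).astype(int)
```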

Adapting the simulation setting of Section 5.1 to Dehman et al. (2015), we set $s=5000$, with 300 blocks of covariates of sizes 2, 4, 6, 12, 24, and 52, each replicated 50 times. We perform 100 Monte Carlo runs, and the ordering of the blocks is drawn at random for each run. The settings for $\bm{B}$ and $\bm{C}$ remain the same as before: $\bm{B}$ is as in Figure 4(a), and $\bm{C}$ is as in Figure 4(b). Further, we set $\bm{C}_l=v_l*\bm{C}$, where $v_1=-1/3$, $v_2=-1$, $v_3=-3$, $v_{207}=-3$, $v_{208}=-1$, $v_{209}=-1/3$, and $v_l=0$ for $4\leq l\leq 206$ and $210\leq l\leq s$. We set $\beta_1=3$, $\beta_2=1$, $\beta_3=1/3$, $\beta_{104}=3$, $\beta_{105}=1$, $\beta_{106}=1/3$, $\beta_j=1/4$ for $j\in\mathcal{P}_{LD}$, and $\beta_l=0$ otherwise. Here $\mathcal{P}_{LD}$ is a randomly selected block consisting of $K$ consecutive indices from $\{210,211,\ldots,s\}$, where $K\in\{2,4,6,12,24,52\}$. We have $\mathcal{C}=\{1,2,3\}$, $\mathcal{P}=\{104,105,106\}\cup\mathcal{P}_{LD}$, $\mathcal{I}=\{207,208,209\}$, and $\mathcal{S}=\{1,\ldots,5000\}\backslash(\mathcal{C}\cup\mathcal{P}\cup\mathcal{I})$. The other settings are the same as in Section 5.1.

We consider three different sample sizes n=200, 500 and 1000. We first perform the screening procedure and report the coverage proportion of \mathcal{M}_1=\mathcal{C}\cup\mathcal{P}. We also report the coverage proportion for each of the confounding and precision variables. The complete screening results, for s=5000, \sigma=1, n=200,500,1000, and K=2,4,6,12,24,52, can be found in Figures 29 – 46 of the supplementary material.

From the plots, one can see that the blockwise joint screening method (blue dotted line) selects \mathcal{P}_{LD} and \mathcal{M}_1 with higher probability than the original joint screening method (green solid line). Based on these results, the blockwise joint screening method better exploits the block structure of the precision variables when selecting \mathcal{M}_1.

In addition, we evaluate the performance of the two proposed estimation procedures after the first-step screening. For the sizes of \widehat{\mathcal{M}} and \widehat{\mathcal{M}}^{block} in the screening step, we set |\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor for the original joint screening procedure and |\widehat{\mathcal{M}}^{block}|=2\lfloor n/\log(n)\rfloor for the blockwise joint screening procedure. We report the average MSEs for \bm{\beta} and \bm{B} when (n,s,\sigma)=(200,5000,1) in Table 2. The complete results, for s=5000, \sigma=1, n=200,500,1000, and K=2,4,6,12,24,52, can be found in Table 9 of the supplementary material.

In summary, the blockwise joint screening estimate outperforms the original joint screening estimate when the sample size n is small or the block size K of the precision variables is large. In the remaining scenarios, there are no significant differences between the two methods.

Table 2: Simulation results for (n,s,σ)=(200,5000,1)(n,s,\sigma)=(200,5000,1): the average MSEs for 𝜷\bm{\beta} and 𝑩{\bm{B}}, and their associated standard errors in the parentheses are reported. The left panel summarizes the results from the joint screening method; the right panel summarizes the results from the blockwise joint screening method. The results are based on 100 Monte Carlo repetitions.
Proposed MSE 𝜷\bm{\beta} MSE 𝑩{\bm{B}} Proposed (block) MSE 𝜷\bm{\beta} MSE 𝑩{\bm{B}}
K=2 1.423(0.096) 0.785(0.009) K=2 1.390(0.090) 0.793(0.010)
K=4 1.667(0.096) 0.815(0.011) K=4 1.548(0.088) 0.805(0.010)
K=6 1.955(0.101) 0.826(0.010) K=6 1.701(0.084) 0.816(0.009)
K=12 2.466(0.096) 0.890(0.039) K=12 2.223(0.129) 0.838(0.011)
K=24 2.533(0.164) 0.847(0.014) K=24 2.136(0.138) 0.821(0.010)
K=52 14.650(0.815) 2.034(0.487) K=52 13.693(0.728) 1.870(0.459)

In the supplementary material, we assess the variable screening results for various sparsity levels of instrumental variables in Section 14.2, evaluate the performance of our estimation procedure for different covariances of exposure errors in Section 14.3, and assess the sensitivity to the choices of the sizes of \widehat{\mathcal{M}}_1^* and \widehat{\mathcal{M}}_2 in Section 14.4.

6 Discussion

This paper aims at mapping the complex GIC pathway for AD. The unique features of the hippocampal morphometry surface measure data motivate us to develop a computationally efficient two-step screening and estimation procedure, which can select biomarkers among more than 6 million observed covariates and estimate the conditional association simultaneously. If there were no unmeasured confounding, the conditional association we estimate would correspond to the causal effect. This is, however, not the case in the ADNI study because we have unmeasured confounders such as A\beta and tau protein levels. To control for unmeasured confounding and estimate the causal effect, one possible approach is to use the genetic variants as potential instrumental variables (e.g. Lin et al., 2015).

There are a number of other important directions for future work. Firstly, the vast majority of AD, known as “polygenic AD”, is influenced by the actions and interactions of multiple genetic variants simultaneously, likely in concert with non-genetic factors such as environmental exposures and lifestyle choices, among many others (Bertram and Tanzi, 2020). Therefore, various types of interaction effects, such as genetic-genetic and imaging-genetic interactions, could be incorporated into the outcome generating model (1). However, this may significantly increase the computational burden, as the dimension of the genetics-related covariates would increase from 6,087,205 to more than 90 billion if we added all possible imaging-genetics interaction terms. One may consider interaction screening procedures (Hao and Zhang, 2014) as the first step. Secondly, this study simply removes observations with missingness. Accommodating missing exposures, confounders and outcomes under the proposed model framework is of great practical value and worth further investigation. Thirdly, baseline diagnosis status is an important effect modifier, as the effect of hippocampus shape on behavioral measures can differ across the CN/MCI/AD groups. However, the relatively small sample size in the ADNI study does not allow us to conduct a reliable subgroup analysis; such subgroup analyses are pertinent for further exploration when a larger sample size is available. Fourthly, the ADNI dataset contains longitudinal ADAS-13 scores observed at different months, as well as other longitudinal behavioral scores obtained from the Mini-Mental State Examination and the Rey Auditory Verbal Learning Test, which can provide a more comprehensive characterization of the behavioral deficits. Integrating these different scores as a multivariate longitudinal outcome to improve the estimation of the conditional association requires substantial effort for future research. Lastly, one could consider incorporating information from other brain regions. For instance, entorhinal tau pathology may influence episodic memory decline through other brain regions, such as the medial temporal lobe (Maass et al., 2018).

Supplementary Material

Supplementary material available online contains detailed derivations and explanations of the main algorithm, ADNI data usage acknowledgement, image and genetic data preprocessing steps, screening results and sensitivity analyses of the ADNI data application with a subgroup analysis including only MCI and AD patients, detailed procedure and results for the SNP-imaging-outcome mediation analyses, additional simulation results, theoretical properties of the proposed procedure including the main theorems, assumptions needed for our main theorems, and proofs of auxiliary lemmas and main theorems.

Acknowledgement

The authors thank the editor, associate editor and referees for their constructive comments, which have substantially improved the paper. Yu was partially supported by the Canadian Statistical Sciences Institute (CANSSI) postdoctoral fellowship and the start-up fund of University of Texas at Arlington. Wang and Kong were partially supported by the Natural Science and Engineering Research Council of Canada and the CANSSI Collaborative Research Team Grant. Zhu was partially supported by the NIH-R01-MH116527.

Supplementary Material for
“Mapping the Genetic-Imaging-Clinical Pathway with Applications to Alzheimer’s Disease”

Dengdeng Yu
Department of Mathematics, University of Texas at Arlington
Linbo Wang
Department of Statistical Sciences, University of Toronto
Dehan Kong
Department of Statistical Sciences, University of Toronto
Hongtu Zhu
Department of Biostatistics, University of North Carolina, Chapel Hill
for the Alzheimer’s Disease Neuroimaging Initiative

The supplementary file is organized as follows. The detailed description of the main algorithm is included in Section 7. We include the ADNI data usage acknowledgement in Section 8 and the image and genetic data preprocessing steps in Section 9. In Section 10, we list the screening results of the ADNI data application. Section 11 examines the sensitivity of the estimate \widehat{\bm{B}} from the ADNI data application by varying the relative sizes of \widehat{\mathcal{M}}_1^* and \widehat{\mathcal{M}}_2. Section 12 presents a subgroup analysis restricted to the mild cognitive impairment (MCI) and Alzheimer’s disease (AD) patients. We include the detailed SNP-imaging-outcome mediation analysis procedure and results in Section 13. In Section 14, we list additional simulation results. The theoretical properties, including the main theorems of our procedure, are included in Section 15. We state the assumptions needed for the main theorems in Section 16. In Section 17, we include the auxiliary lemmas needed for the theorems and their proofs. We give the detailed proofs of our main theorems in Section 18.

7 Description and derivation of Algorithm 1

To solve the minimization problem (3) of the main paper, we utilize the Nesterov optimal gradient method (Nesterov, 1998), which has been widely used in solving optimization problems for non-smooth and non-convex objective functions (Beck and Teboulle, 2009; Zhou and Li, 2014).

Before we introduce Nesterov’s gradient algorithm, we first state two propositions on shrinkage thresholding formulas (Beck and Teboulle, 2009; Cai et al., 2010).

Proposition 1.

For a given matrix \bm{A} with singular value decomposition \bm{A}=\bm{U}\mathrm{diag}(\bm{a})\bm{V}^{T}, where \bm{a}=(a_1,\ldots,a_r)^{T} with \{a_k\}_{1\leq k\leq r} being \bm{A}'s singular values, the optimal solution to

min𝑩{12𝑩𝑨F2+λ𝑩}\min_{\bm{B}}\left\{\frac{1}{2}\|\bm{B}-\bm{A}\|_{F}^{2}+\lambda\|\bm{B}\|_{*}\right\}

shares the same singular vectors as \bm{A}, and its singular values are b_{k}=(a_{k}-\lambda)_{+} for k=1,\ldots,r, where (a_{k})_{+}=\max(0,a_{k}).

Proposition 2.

For vectors 𝐚=(a1,,ar)T\bm{a}=(a_{1},\ldots,a_{r})^{\mathrm{\scriptscriptstyle T}} and 𝐛=(b1,,br)T\bm{b}=(b_{1},\ldots,b_{r})^{\mathrm{\scriptscriptstyle T}}, the optimal solution to

min𝒃{12𝒃𝒂22+λ𝒃1}\min_{\bm{b}}\left\{\frac{1}{2}\|\bm{b}-\bm{a}\|_{2}^{2}+\lambda\|\bm{b}\|_{1}\right\}

is b_{k}=\textrm{sgn}(a_{k})(|a_{k}|-\lambda)_{+} for k=1,\ldots,r, where \textrm{sgn}(\cdot) denotes the sign function.
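For concreteness, the two propositions correspond to the following proximal (shrinkage thresholding) operators; this is a minimal Python sketch with function names of our own choosing.

```python
import numpy as np

def soft_threshold(a, lam):
    """Proposition 2: elementwise soft thresholding, b_k = sgn(a_k)(|a_k| - lam)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def singular_value_threshold(A, lam):
    """Proposition 1: soft-threshold the singular values of A, keeping its singular vectors."""
    U, svals, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(svals - lam, 0.0)) @ Vt
```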

Nesterov's gradient method utilizes the first-order gradient of the objective function to produce the next iterate from the current search point. Unlike the standard gradient descent algorithm, Nesterov's algorithm generates the next search point by extrapolating from the two previous iterates, which can dramatically improve the convergence rate. The Nesterov gradient algorithm for problem (3) is presented as follows. Denote l(\bm{\beta},\bm{B})=\frac{1}{2n}\sum_{i=1}^{n}\left(Y_{i}-\langle\bm{\beta},X_{i}\rangle-\langle\bm{Z}_{i},\bm{B}\rangle\right)^{2} and P(\bm{\beta},\bm{B})=P_{1}(\bm{\beta})+P_{2}(\bm{B}), where P_{1}(\bm{\beta})=\lambda_{1}\sum_{l}|\beta_{l}| and P_{2}(\bm{B})=\lambda_{2}\|\bm{B}\|_{*}. We also define

g(\bm{\beta},\bm{B}\,|\,\bm{s}^{(t)},\bm{S}^{(t)},\delta)
= l(\bm{s}^{(t)},\bm{S}^{(t)})+\left\langle\nabla l(\bm{s}^{(t)},\bm{S}^{(t)}),\left[(\bm{\beta}-\bm{s}^{(t)})^{T},\{\mathrm{vec}(\bm{B}-\bm{S}^{(t)})\}^{T}\right]^{T}\right\rangle
+(2\delta)^{-1}\left(\left\|\bm{\beta}-\bm{s}^{(t)}\right\|_{2}^{2}+\left\|\bm{B}-\bm{S}^{(t)}\right\|_{F}^{2}\right)+P(\bm{\beta},\bm{B})
= (2\delta)^{-1}\left[\left\|\bm{\beta}-\left\{\bm{s}^{(t)}-\delta\,\partial_{\bm{\beta}}l(\bm{s}^{(t)},\bm{S}^{(t)})\right\}\right\|_{2}^{2}
+\left\|\mathrm{vec}(\bm{B})-\left\{\mathrm{vec}(\bm{S}^{(t)})-\delta\,\partial_{\mathrm{vec}(\bm{B})}l(\bm{s}^{(t)},\bm{S}^{(t)})\right\}\right\|_{2}^{2}\right]
+P(\bm{\beta},\bm{B})+c^{(t)},

where l(𝜷,𝑩)=[(𝜷l)T,{vec(𝑩)l}T]T|^|+pq\nabla l(\bm{\beta},\bm{B})=[(\partial_{\bm{\beta}}l)^{\mathrm{\scriptscriptstyle T}},\{\partial_{\mathrm{vec}(\bm{B})}l\}^{\mathrm{\scriptscriptstyle T}}]^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|+pq} denotes the first-order gradient of l(𝜷,𝑩)l(\bm{\beta},\bm{B}) with respect to [𝜷T,{vec(𝑩)}T]T|^|+pq\left[\bm{\beta}^{\mathrm{\scriptscriptstyle T}},\{\mathrm{vec}(\bm{B})\}^{\mathrm{\scriptscriptstyle T}}\right]^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|+pq}. We define

\partial_{\bm{\beta}}l(\bm{\beta},\bm{B}) = n^{-1}\sum_{i=1}^{n}X_{i}\left(\langle\bm{\beta},X_{i}\rangle+\langle\bm{B},\bm{Z}_{i}\rangle-Y_{i}\right),
\partial_{\mathrm{vec}(\bm{B})}l(\bm{\beta},\bm{B}) = \mathrm{vec}\left\{n^{-1}\sum_{i=1}^{n}\bm{Z}_{i}\left(\langle\bm{\beta},X_{i}\rangle+\langle\bm{B},\bm{Z}_{i}\rangle-Y_{i}\right)\right\},

with \partial_{\bm{\beta}}l(\bm{\beta},\bm{B})\in\mathbb{R}^{|\widehat{\mathcal{M}}|} and \partial_{\mathrm{vec}(\bm{B})}l(\bm{\beta},\bm{B})\in\mathbb{R}^{pq}. Here \bm{s}^{(t)} and \bm{S}^{(t)} are interpolations between \bm{\beta}^{(t-1)} and \bm{\beta}^{(t)}, and between \bm{B}^{(t-1)} and \bm{B}^{(t)}, respectively, which will be defined below; c^{(t)} collects all terms that are irrelevant to \bm{\beta} and \bm{B}, and \delta>0 is a suitable step size. Given the previous search points \bm{s}^{(t)} and \bm{S}^{(t)}, the next iterates minimize g(\bm{\beta},\bm{B}\,|\,\bm{s}^{(t)},\bm{S}^{(t)},\delta); the search points themselves are generated by linearly extrapolating the two previous algorithmic iterates. A key advantage of the Nesterov gradient method is that each iteration has an explicit solution. In fact, minimizing g(\bm{\beta},\bm{B}\,|\,\bm{s}^{(t)},\bm{S}^{(t)},\delta) can be split into two sub-problems: minimizing (2\delta)^{-1}\left\|\bm{\beta}-\left\{\bm{s}^{(t)}-\delta\,\partial_{\bm{\beta}}l(\bm{s}^{(t)},\bm{S}^{(t)})\right\}\right\|_{2}^{2}+\lambda_{1}\sum_{l}|\beta_{l}| and minimizing (2\delta)^{-1}\left\|\mathrm{vec}(\bm{B})-\left\{\mathrm{vec}(\bm{S}^{(t)})-\delta\,\partial_{\mathrm{vec}(\bm{B})}l(\bm{s}^{(t)},\bm{S}^{(t)})\right\}\right\|_{2}^{2}+\lambda_{2}\|\bm{B}\|_{*}. These sub-problems can be solved by the shrinkage thresholding formulas in Propositions 2 and 1, respectively.

Let \bm{X}^{\widehat{\mathcal{M}}}=(X^{\widehat{\mathcal{M}}}_{1},\ldots,X^{\widehat{\mathcal{M}}}_{n})^{T}\in\mathbb{R}^{n\times|\widehat{\mathcal{M}}|}, where X^{\widehat{\mathcal{M}}}_{i}=\{X_{ij}\}^{T}_{j\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|} for i=1,\ldots,n. Define \bm{Z}_{new}=(\mathrm{vec}(\bm{Z}_{1}),\ldots,\mathrm{vec}(\bm{Z}_{n}))^{T}\in\mathbb{R}^{n\times pq} and \bm{X}_{new}=(\bm{X}^{\widehat{\mathcal{M}}},\bm{Z}_{new})\in\mathbb{R}^{n\times(|\widehat{\mathcal{M}}|+pq)}. For a given vector \bm{a}=(a_{1},\ldots,a_{r})^{T}\in\mathbb{R}^{r}, (\bm{a})_{+} is defined as \{(a_{1})_{+},\ldots,(a_{r})_{+}\}^{T}\in\mathbb{R}^{r}, where (a)_{+}=\max(0,a). Similarly, \textrm{sgn}(\bm{a}) is obtained by taking the sign of \bm{a} componentwise. For a given pair of tuning parameters \lambda_{1} and \lambda_{2}, (3) can be solved by Algorithm 1.

Algorithm 1 Shrinkage thresholding algorithm to solve (3)

1. Initialize: \bm{\beta}^{(0)}=\bm{\beta}^{(1)}, \bm{B}^{(0)}=\bm{B}^{(1)}, \alpha^{(0)}=0, \alpha^{(1)}=1, and \delta=n/\lambda_{\textrm{max}}(\bm{X}_{new}^{T}\bm{X}_{new}).

2. Repeat (a) to (f) until the objective function Q(\bm{\beta},\bm{B}) converges:

(a) \bm{s}^{(t)}=\bm{\beta}^{(t)}+\frac{\alpha^{(t-1)}-1}{\alpha^{(t)}}(\bm{\beta}^{(t)}-\bm{\beta}^{(t-1)}),
\bm{S}^{(t)}=\bm{B}^{(t)}+\frac{\alpha^{(t-1)}-1}{\alpha^{(t)}}(\bm{B}^{(t)}-\bm{B}^{(t-1)});

(b) \bm{\beta}_{\textrm{temp}}=\bm{s}^{(t)}-\delta\,\partial l(\bm{s}^{(t)},\bm{S}^{(t)})/\partial\bm{\beta};
\mathrm{vec}(\bm{B}_{\textrm{temp}})=\mathrm{vec}(\bm{S}^{(t)})-\delta\,\partial l(\bm{s}^{(t)},\bm{S}^{(t)})/\partial\,\mathrm{vec}(\bm{B});

(c) Singular value decomposition: \bm{B}_{\textrm{temp}}=\bm{U}\mathrm{diag}(\bm{b})\bm{V}^{T};

(d) \bm{a}_{new}=\textrm{sgn}(\bm{\beta}_{\textrm{temp}})\cdot(|\bm{\beta}_{\textrm{temp}}|-\lambda_{1}\delta\cdot\bm{1})_{+},
\bm{b}_{new}=(\bm{b}-\lambda_{2}\delta\cdot\bm{1})_{+};

(e) \bm{\beta}^{(t+1)}=\bm{a}_{new},
\bm{B}^{(t+1)}=\bm{U}\mathrm{diag}(\bm{b}_{new})\bm{V}^{T};

(f) \alpha^{(t+1)}=\left[1+\sqrt{1+(2\alpha^{(t)})^{2}}\right]/2.

In particular, step 2(a) predicts the search points \bm{s}^{(t)} and \bm{S}^{(t)} by linear extrapolation from the solutions of the previous two iterations, where \alpha^{(t)} is a scalar sequence that plays a critical role in the extrapolation; this sequence is updated in step 2(f) as in the original Nesterov method. Next, steps 2(b) – 2(d) perform a gradient descent step from the current search points to obtain the solution at the current iteration. Specifically, the gradient step minimizes g(\bm{\beta},\bm{B}\,|\,\bm{s}^{(t)},\bm{S}^{(t)},\delta), the regularized first-order approximation to the loss function at the current search points, which is handled as the two sub-problems solved by the shrinkage thresholding formulas in Propositions 2 and 1, respectively, as mentioned above. Finally, step 2(e) enforces the descent property of the next iterate.

A sufficient condition for the convergence of {𝜷(t)}t1\{\bm{\beta}^{(t)}\}_{t\geq 1} and {𝑩(t)}t1\{\bm{B}^{(t)}\}_{t\geq 1} is that the step size δ\delta should be smaller than or equal to 1/Lf1/{L_{f}}, where LfL_{f} is the smallest Lipschitz constant of the function l(𝜷,𝑩)l(\bm{\beta},\bm{B}) (Beck and Teboulle, 2009). In our case, LfL_{f} is equal to λmax(𝑿newT𝑿new)/n\lambda_{\textrm{max}}(\bm{X}_{new}^{\mathrm{\scriptscriptstyle T}}\bm{X}_{new})/n, where λmax()\lambda_{\textrm{max}}(\cdot) denotes the largest eigenvalue of a matrix.
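Putting the pieces together, a minimal Python sketch of Algorithm 1 might look as follows. It reuses soft_threshold and singular_value_threshold from the sketch after Proposition 2, assumes dense arrays X of shape (n, |\widehat{\mathcal{M}}|), Z of shape (n, p, q) and Y of shape (n,), and uses variable names and a stopping rule of our own choosing.

```python
import numpy as np

def nesterov_solve(X, Z, Y, lam1, lam2, max_iter=500, tol=1e-8):
    n, d = X.shape
    _, p, q = Z.shape
    Znew = Z.reshape(n, p * q)        # rows are vec(Z_i)
    Xnew = np.hstack([X, Znew])
    # Step size delta = n / lambda_max(Xnew^T Xnew) = 1 / L_f.
    delta = n / np.linalg.eigvalsh(Xnew.T @ Xnew).max()

    beta = beta_prev = np.zeros(d)
    B = B_prev = np.zeros((p, q))
    alpha, alpha_prev = 1.0, 0.0      # alpha^(1) = 1, alpha^(0) = 0
    obj_prev = np.inf

    def objective(beta, B):
        r = Y - X @ beta - Znew @ B.ravel()
        return (r @ r) / (2 * n) + lam1 * np.abs(beta).sum() \
               + lam2 * np.linalg.svd(B, compute_uv=False).sum()

    for _ in range(max_iter):
        # Step 2(a): extrapolated search points s^(t), S^(t).
        w = (alpha_prev - 1.0) / alpha
        s = beta + w * (beta - beta_prev)
        S = B + w * (B - B_prev)
        # Step 2(b): gradient step from (s, S).
        r = X @ s + Znew @ S.ravel() - Y
        beta_tmp = s - delta * (X.T @ r) / n
        B_tmp = S - delta * (Znew.T @ r).reshape(p, q) / n
        # Steps 2(c)-(e): shrinkage thresholding (Propositions 2 and 1).
        beta_prev, B_prev = beta, B
        beta = soft_threshold(beta_tmp, lam1 * delta)
        B = singular_value_threshold(B_tmp, lam2 * delta)
        # Step 2(f): update the extrapolation sequence.
        alpha_prev, alpha = alpha, (1.0 + np.sqrt(1.0 + (2.0 * alpha) ** 2)) / 2.0
        # Stop when the objective Q(beta, B) stabilizes.
        obj = objective(beta, B)
        if abs(obj_prev - obj) < tol * max(1.0, abs(obj_prev)):
            break
        obj_prev = obj
    return beta, B
```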

8 Data usage acknowledgement

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease. For up-to-date information, see www.adni-info.org.

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

9 Image and genetic data preprocessing

9.1 Image data preprocessing

The hippocampus surface data were preprocessed from the raw MRI data, which were collected across a variety of 1.5 Tesla MRI scanners with protocols individualized for each scanner. Standard T1-weighted images were obtained using volumetric 3-dimensional sagittal MPRAGE or equivalent protocols with varying resolutions. The typical protocol includes: inversion time (TI) = 1000 ms, flip angle = 8°, repetition time (TR) = 2400 ms, and field of view (FOV) = 24 cm, with a 256\times 256\times 170 acquisition matrix in the x-, y-, and z-dimensions, yielding a voxel size of 1.25\times 1.26\times 1.2 mm^3. We adopted a surface fluid registration based hippocampal subregional analysis package (Shi et al., 2013), which uses isothermal coordinates and fluid registration to generate one-to-one hippocampal surface registrations for surface statistics computation. The method introduces two cuts, at the front and back of the hippocampal surface, to convert it into a genus-zero surface with two open boundaries. By using conformal parameterization, it essentially converts a 3D surface registration problem into a 2D image registration problem; the flow induced in the parameter domain establishes high-order correspondences between 3D surfaces. Finally, the radial distance was computed on the registered surfaces. This software package and the associated image processing methods have been adopted and described in Wang et al. (2011).

9.2 Genetic data preprocessing

For the genetic data, we applied the following preprocessing steps to the 756 subjects in the ADNI1 study. The first-line quality control steps include (i) call rate checks per subject and per single nucleotide polymorphism (SNP) marker, (ii) gender check, (iii) sibling pair identification, (iv) the Hardy-Weinberg equilibrium test, (v) marker removal by minor allele frequency, and (vi) population stratification. The second-line preprocessing steps include removal of SNPs with (i) more than 5% missing values, (ii) minor allele frequency smaller than 10%, and (iii) Hardy-Weinberg equilibrium p-value <10^{-6}. The 503,892 SNPs obtained from the 22 autosomes were included for further processing. The MACH-Admix software (http://www.unc.edu/~yunmli/MaCH-Admix/) (Liu et al., 2013) was applied to all subjects to perform genotype imputation, using the 1000G Phase I Integrated Release Version 3 haplotypes (http://www.1000genomes.org) (Consortium et al., 2012) as the reference panel. Quality control was also conducted after imputation, excluding markers with (i) low imputation accuracy (based on the imputation output R^2), (ii) Hardy-Weinberg equilibrium p-value <10^{-6}, and (iii) minor allele frequency <5%.
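For illustration only, the post-imputation marker filters can be sketched as below for a genotype matrix G (rows subjects, columns SNPs, entries 0/1/2 with NaN for missing). We approximate the Hardy-Weinberg equilibrium test by a one-degree-of-freedom chi-square goodness-of-fit statistic; the actual pipeline also uses the tool-specific imputation R^2, which we omit, so this is not a reproduction of the real software's computation.

```python
import numpy as np
from scipy.stats import chi2

def marker_filters(G, maf_min=0.05, hwe_p_min=1e-6):
    n_obs = np.sum(~np.isnan(G), axis=0)
    p = np.nansum(G, axis=0) / (2.0 * n_obs)   # coded-allele frequency per SNP
    maf = np.minimum(p, 1.0 - p)
    # Observed genotype counts and Hardy-Weinberg expected counts.
    n0, n1, n2 = (np.sum(G == k, axis=0) for k in (0, 1, 2))
    e0, e1, e2 = n_obs * (1 - p) ** 2, 2 * n_obs * p * (1 - p), n_obs * p ** 2
    stat = (n0 - e0) ** 2 / e0 + (n1 - e1) ** 2 / e1 + (n2 - e2) ** 2 / e2
    hwe_p = chi2.sf(stat, df=1)
    return (maf >= maf_min) & (hwe_p >= hwe_p_min)   # True = marker kept
```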

10 Screening results of ADNI data applications

In Table 3, we list the top 20 SNPs selected through the blockwise joint screening procedure corresponding to the left and right hippocampi, respectively.

Left hippocampi Right hippocampi
Chromosome number SNP name Chromosome number SNP name
19 rs429358 19 rs429358
7 rs1016394 19 rs10414043
19 rs10414043 14 14:25618120:G_GC
7 rs1181947 19 rs7256200
19 rs7256200 14 rs41470748
22 rs134828 19 rs73052335
19 rs73052335 14 14:25613747:G_GT
7 7:101403195:C_CA 19 rs157594
19 rs157594 14 rs72684825
13 rs12864178 19 rs769449
19 rs769449 6 rs9386934
2 rs13030626 6 rs9374191
19 rs56131196 19 rs56131196
2 rs13030634 6 rs9372261
19 rs4420638 19 rs4420638
2 rs11694935 6 rs73526504
19 rs111789331 19 rs111789331
2 rs11696076 14 rs187421061
19 rs66626994 19 rs66626994
2 rs11692218 13 rs342709
Table 3: The top 20 SNPs selected through the blockwise joint screening procedure. The left two columns correspond to results from the left hippocampi, and the right two columns correspond to results from the right hippocampi.

We plot figures similar to Manhattan plots for \widehat{\mathcal{M}}^{*}_{1}, \widehat{\mathcal{M}}^{block,*}_{1}, \widehat{\mathcal{M}}_{2} and \widehat{\mathcal{M}}^{block}_{2} in Figure 7. Unlike conventional Manhattan plots, where genomic coordinates are displayed along the x-axis and the negative logarithm of the association p-value for each SNP on the y-axis, our analysis does not produce p-values. In these figures, the y-axis therefore represents the magnitudes of |\widehat{\beta}^{M}_{l}|, \widehat{\beta}^{block,M}_{l}, \|\widehat{\bm{C}}_{l}^{M}\|_{op} and \widehat{C}_{l}^{block,M}, and the horizontal dashed line represents the threshold values \gamma_{1,n} (Panel (a)), \gamma_{2,n} (Panel (b)), \gamma_{3,n} (Panel (c)) and \gamma_{4,n} (Panel (d)). In Panels (c) and (d), the left and right figures represent the left and right hippocampi, respectively. The SNPs with |\widehat{\beta}^{M}_{l}|, \widehat{\beta}^{block,M}_{l}, \|\widehat{\bm{C}}_{l}^{M}\|_{op} and \widehat{C}_{l}^{block,M} greater than or equal to \gamma_{1,n}, \gamma_{2,n}, \gamma_{3,n} and \gamma_{4,n}, and hence selected into \widehat{\mathcal{M}}^{*}_{1}, \widehat{\mathcal{M}}^{block,*}_{1}, \widehat{\mathcal{M}}_{2} and \widehat{\mathcal{M}}^{block}_{2} respectively, are highlighted with red diamond symbols.

Figure 7: Real data results: Panels (a) – (d) present |\widehat{\beta}^{M}_{l}|, \widehat{\beta}^{block,M}_{l}, \|\widehat{\bm{C}}_{l}^{M}\|_{op} and \widehat{C}_{l}^{block,M}, with genomic coordinates displayed along the x-axis and the corresponding magnitudes on the y-axis; the horizontal dashed lines represent the threshold values \gamma_{1,n} (Panel (a)), \gamma_{2,n} (Panel (b)), \gamma_{3,n} (Panel (c)) and \gamma_{4,n} (Panel (d)). In Panels (c) and (d), the left and right figures represent the left and right hippocampi, respectively. The SNPs at or above the thresholds, and hence selected into \widehat{\mathcal{M}}^{*}_{1}, \widehat{\mathcal{M}}^{block,*}_{1}, \widehat{\mathcal{M}}_{2} and \widehat{\mathcal{M}}^{block}_{2} respectively, are highlighted with red diamond symbols.

11 Sensitivity analysis of ADNI data applications

In our analysis, we set \widehat{\mathcal{M}}_{1}^{*} and \widehat{\mathcal{M}}_{2} to the same size, following the convention that the size of a screening set is determined only by the sample size (Fan and Lv, 2008), which is common to \widehat{\mathcal{M}}_{1}^{*} and \widehat{\mathcal{M}}_{2}. To assess the sensitivity of our results to this choice, we conduct sensitivity analyses varying the relative sizes of \widehat{\mathcal{M}}_{1}^{*} and \widehat{\mathcal{M}}_{2} in the joint screening procedure. For simplicity, we only consider the joint screening procedure proposed in Section 3.3. Figure 8 shows the estimates \widehat{\bm{B}} corresponding to the left hippocampi (left part) and the right hippocampi (right part) using \widehat{\mathcal{M}}=\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}. Denote the estimates corresponding to |\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=1/2, 1, 2 by \widehat{\bm{B}}^{(0.5)}, \widehat{\bm{B}}^{(1)} and \widehat{\bm{B}}^{(2)}, respectively, where we set |\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor=89. These estimates are plotted in Figure 8 (a), (b) and (c), respectively. In addition, we consider |\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=1 with |\widehat{\mathcal{M}}|=2\lfloor n/\log(n)\rfloor=178; the corresponding estimate, denoted \widetilde{\bm{B}}^{(1)}, is plotted in Figure 8 (d).

Figure 8: Real data results: Panels (a), (b), (c) and (d) plot the estimates \widehat{\bm{B}}^{(0.5)}, \widehat{\bm{B}}^{(1)}, \widehat{\bm{B}}^{(2)} and \widetilde{\bm{B}}^{(1)} corresponding to the left hippocampi (left part) and the right hippocampi (right part).

Furthermore, by defining the relative risk of an estimate 𝑩^\widehat{\bm{B}} as RR(𝑩^)=𝑩^𝑩^(1)F2𝑩^(1)F2\mathrm{RR}(\widehat{\bm{B}})=\frac{\|\widehat{\bm{B}}-\widehat{\bm{B}}^{(1)}\|_{F}^{2}}{\|\widehat{\bm{B}}^{(1)}\|_{F}^{2}}, we report the relative risks of three estimates 𝑩^(0.5)\widehat{\bm{B}}^{(0.5)}, 𝑩^(2)\widehat{\bm{B}}^{(2)} and 𝑩~(1)\widetilde{\bm{B}}^{(1)} in Table 4.

Left hippocampi Right hippocampi
RR(𝑩^(0.5))\mathrm{RR}(\widehat{\bm{B}}^{(0.5)}) 0.0022 0.1074
RR(𝑩^(2))\mathrm{RR}(\widehat{\bm{B}}^{(2)}) 0.2938 0.0907
RR(𝑩~(1))\mathrm{RR}(\widetilde{\bm{B}}^{(1)}) 0.0611 0.0927
Table 4: The relative risks of 𝑩^(0.5)\widehat{\bm{B}}^{(0.5)}, 𝑩^(2)\widehat{\bm{B}}^{(2)} and 𝑩~(1)\widetilde{\bm{B}}^{(1)} for left and right hippocampi.
Number of negative entries Left hippocampi Right hippocampi
𝑩^(0.5)\widehat{\bm{B}}^{(0.5)} 15,000 15,000
𝑩^(1)\widehat{\bm{B}}^{(1)} 15,000 15,000
𝑩^(2)\widehat{\bm{B}}^{(2)} 14,600 15,000
𝑩~(1)\widetilde{\bm{B}}^{(1)} 15,000 15,000
Table 5: Number of negative entries of 𝑩^\widehat{\bm{B}} for left and right hippocampi.
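For concreteness, the two summaries reported in Tables 4 and 5 can be computed as in the following sketch, with \widehat{\bm{B}}^{(1)} as the reference estimate; the function names are ours.

```python
import numpy as np

def relative_risk(B_hat, B_hat_ref):
    """RR(B_hat) = ||B_hat - B_hat_ref||_F^2 / ||B_hat_ref||_F^2 (Table 4)."""
    return np.linalg.norm(B_hat - B_hat_ref, "fro") ** 2 / np.linalg.norm(B_hat_ref, "fro") ** 2

def n_negative(B_hat):
    """Number of negative entries of B_hat (Table 5)."""
    return int((B_hat < 0).sum())
```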

To summarize, the estimate \widehat{\bm{B}} is not very sensitive to the choices of |\widehat{\mathcal{M}}_{1}^{*}| and |\widehat{\mathcal{M}}_{2}|, except for the left hippocampi when |\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=2 and |\widehat{\mathcal{M}}|=89. In fact, as shown in Table 5, when |\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=2, there are 400 entries of the left-hippocampi estimate that are non-negative. We believe this may be due to some confounder variables being missed in the screening step. For instance, we find that rs157582, a previously identified risk locus for Alzheimer's disease (Guo et al., 2019), is adjusted for in estimating \widehat{\bm{B}}^{(0.5)}, \widehat{\bm{B}}^{(1)} and \widetilde{\bm{B}}^{(1)}, but not \widehat{\bm{B}}^{(2)}. In general, however, as demonstrated in Figure 8, the estimates \widehat{\bm{B}} are similar across the different choices of |\widehat{\mathcal{M}}_{1}^{*}| and |\widehat{\mathcal{M}}_{2}|.

12 Subgroup analysis of the ADNI data application

We repeat the analysis on the 391 MCI and AD subjects. The estimates \widehat{\bm{B}}, mapped onto a representative hippocampal surface for each part of the hippocampus, are plotted in Figure 9(a); the hippocampal subfields (Apostolova et al., 2006) are plotted in Figure 9(b). The results are similar to the complete data analysis including all 566 subjects. For example, from these plots, 13,700 entries of \widehat{\bm{B}} for the left hippocampi and all 15,000 entries of \widehat{\bm{B}} for the right hippocampi are negative. This implies that the radial distances at each pixel of both hippocampi are mostly negatively associated with the ADAS-13 score, which depicts the severity of behavioral deficits. Furthermore, the subfields with the strongest associations are still mostly CA1 and subiculum.

Figure 9: Real data results for the MCI and AD subgroup: Panel (a) plots the estimate \widehat{\bm{B}} corresponding to the left hippocampi (left part) and the right hippocampi (right part). Panel (b) plots the hippocampal subfields.

13 Results for mediation analyses

We perform the SNP-imaging-outcome mediation analyses following the same procedure as in Bi et al. (2017). Specifically, in the first step we regress the 30,000 imaging measures against the 6,087,205 SNPs to search for pairs of intermediate imaging measures and genetic variants. The behavioral outcome is then fitted against each candidate genetic variant to identify a direct and significant influence. In the last step, the behavioral outcome is fitted against the identified genetic variant and its associated intermediate imaging measure simultaneously. A mediation relationship is established if (a) the genetic variant is significant in both the first and second steps, (b) the intermediate imaging measure is significant in the last step, and (c) the genetic variant has a smaller coefficient in the last step than in the second step. The total effect of the genetic variant in the second step should be the sum of the direct and indirect effects, which motivates criterion (c) of coefficient comparison. Note that the total effect may not always be greater than the direct effect in the last step when the direct and indirect effects have opposite signs, whereas the causal inference tool proposed in this paper does not have this problem. Similar to Bi et al. (2017), we try to identify pairs of SNP and imaging measure for which the direct effect of the SNP on the behavioral outcome, the effect of the SNP on the imaging measure, and the effect of the imaging measure on the behavioral outcome are all significant. However, no SNP has at least one paired imaging measure (i.e. hippocampal imaging pixel) that is significant. Therefore, our analysis provides no evidence for an SNP-imaging-outcome mediating relationship.
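A minimal sketch of this three-step screen for a single (SNP, imaging measure) pair is given below, assuming statsmodels is available; alpha is a nominal significance level and all names are ours. In the actual analysis, each step is run over all candidate pairs.

```python
import numpy as np
import statsmodels.api as sm

def mediation_screen(g, m, y, alpha=0.05):
    # Step 1: regress the imaging measure on the SNP.
    fit1 = sm.OLS(m, sm.add_constant(g)).fit()
    # Step 2: regress the outcome on the SNP (total effect).
    fit2 = sm.OLS(y, sm.add_constant(g)).fit()
    # Step 3: regress the outcome on the SNP and the imaging measure jointly.
    fit3 = sm.OLS(y, sm.add_constant(np.column_stack([g, m]))).fit()
    snp_significant = fit1.pvalues[1] < alpha and fit2.pvalues[1] < alpha  # criterion (a)
    mediator_significant = fit3.pvalues[2] < alpha                         # criterion (b)
    attenuated = abs(fit3.params[1]) < abs(fit2.params[1])                 # criterion (c)
    return snp_significant and mediator_significant and attenuated
```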

14 Additional results for simulation studies

In this section, we list additional simulation results. In particular, Figures 10 – 14 present the screening results for Section 5.1 with (n,s,\sigma)=(200,5000,0.5), (500,5000,1), (500,5000,0.5), (1000,5000,1) and (1000,5000,0.5), respectively. Section 14.1 presents the sensitivity and specificity analyses for Section 5.2, with the detailed definitions of sensitivity and specificity given there. Section 14.2 presents an additional simulation study considering various sparsity levels of instrumental variables. Section 14.3 presents an additional simulation study considering different covariances of exposure errors. Section 14.4 conducts an additional simulation study varying the relative sizes of \widehat{\mathcal{M}}_{1}^{*} and \widehat{\mathcal{M}}_{2}. Section 14.5 lists additional screening and estimation results for Section 5.3 of the main article.

Figure 10: Simulation results for the case (n,s,\sigma)=(200,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 11: Simulation results for the case (n,s,\sigma)=(500,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 12: Simulation results for the case (n,s,\sigma)=(500,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 13: Simulation results for the case (n,s,\sigma)=(1000,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 14: Simulation results for the case (n,s,\sigma)=(1000,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

14.1 Sensitivity and specificity analyses of simulation

In this subsection, we report the sensitivity and specificity of the estimates. The sensitivity (true positive rate) is defined as |{j:β^j0}1||1|\frac{|\{j:\widehat{\beta}_{j}\neq 0\}\cap\mathcal{M}_{1}|}{|\mathcal{M}_{1}|}, i.e. the proportion of variables in the oracle adjustment set 1\mathcal{M}_{1} that are selected by our estimation procedure. The specificity (true negative rate) is defined as |{j:β^j=0}(𝒮)||𝒮|\frac{|\{j:\widehat{\beta}_{j}=0\}\cap(\mathcal{I}\cup\mathcal{S})|}{|\mathcal{I}\cup\mathcal{S}|}, i.e. the proportion of variables not in the oracle adjustment set 1\mathcal{M}_{1} that are not selected by our estimation procedure. Furthermore, we define the instrumental specificity as |{j:β^j=0}|||\frac{|\{j:\widehat{\beta}_{j}=0\}\cap\mathcal{I}|}{|\mathcal{I}|}, i.e. the proportion of variables in the instrumental set \mathcal{I} that are not selected by our estimation procedure.
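These three metrics can be computed directly from the fitted \widehat{\bm{\beta}}, as in the following sketch, where M1, I and S are integer index arrays for \mathcal{M}_1, \mathcal{I} and \mathcal{S}; names are ours.

```python
import numpy as np

def selection_metrics(beta_hat, M1, I, S):
    selected = beta_hat != 0
    sensitivity = selected[M1].mean()                 # fraction of M1 selected
    specificity = (~selected[np.concatenate([I, S])]).mean()
    instrumental_specificity = (~selected[I]).mean()  # fraction of I excluded
    return sensitivity, specificity, instrumental_specificity
```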

Table 6: Simulation results for \sigma=1 and \sigma=0.5, when n=500. The average \bm{\beta} sensitivity, \bm{\beta} instrumental specificity and \bm{\beta} specificity, the MSE for \bm{\beta}, and the MSE for \bm{B}, with their associated standard errors in parentheses, are reported. The results are based on 100 Monte Carlo repetitions. The “No Lasso” estimate is calculated by including all the selected variables from the screening step and then estimating \bm{B} using the optimization (3) without the l_{1}-regularization. The “Oracle” estimate is calculated by pretending to know the correct set of confounders and precision variables as X and then estimating \bm{B} using the optimization (3) without the l_{1}-regularization.
n = 500 Sensitivity Instrumental specificity Specificity MSE 𝜷\bm{\beta} MSE 𝑩{\bm{B}}
σ\sigma = 1.0
Oracle 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.036(0.002) 0.553(0.005)
Proposed 0.833(0.000) 0.293(0.020) 0.998(0.000) 0.303(0.008) 0.574(0.006)
No Lasso 1.000(0.000) 0.000(0.000) 0.985(0.000) 1.740(0.078) 0.693(0.013)
σ\sigma = 0.5
Oracle 1.000(0.000) 1.000(0.000) 1.000(0.000) 0.006(0.000) 0.340(0.004)
Proposed 0.897(0.008) 0.217(0.017) 0.999(0.000) 0.191(0.005) 0.345(0.004)
No Lasso 1.000(0.000) 0.000(0.000) 0.985(0.000) 0.372(0.017) 0.371(0.004)

In the simulation studies, we report the results for the case n=500, which is close to the sample size of 566 in the real data. From Table 6, one can see that the second step only regularizes out some of the instrumental variables. We conjecture that a better tuning method may exist that regularizes out more instrumental variables while still keeping the confounders and precision variables in the model; we leave this as future research. Nevertheless, one can also see from Table 6 that although the proposed method may not remove all of the instrumental variables, eliminating even just some of the instruments greatly reduces the MSEs of both \bm{\beta} and \bm{B}, compared to the method where we do not impose l_{1}-regularization on \bm{\beta} in the second-step estimation (denoted the “No Lasso” method). In addition, the estimation of \bm{B} is reasonably good compared to the oracle estimates, as shown in Table 1 of the main article.

14.2 Screening under different sparsity levels

We also consider different sparsity levels in the simulation. This is of particular interest for our study since, when there are more instrumental variables than confounders and precision variables, as could well be the case in an imaging-genetics study, the robustness of the proposed method may be undermined. As discussed before, to reduce bias and increase the statistical efficiency of the estimated \bm{B}, the ideal adjustment set should include all confounders and precision variables while excluding instrumental and irrelevant variables. In particular, we consider three scenarios in which the size of the instrumental variable set \mathcal{I} is one, two, and eight times the size of the set \mathcal{M}_{1} of confounders and precision variables.

We set s=5000 and the settings for \bm{B} and \bm{C} remain the same as before: \bm{B} is as in Figure 4(a), and \bm{C} is as in Figure 4(b). Further, we set \bm{C}_{l}=v_{l}*\bm{C}, where v_{1}=-1/3, v_{2}=-1, v_{3}=-3, v_{207}=v_{210}=\ldots=v_{204+6L}=-3, v_{208}=v_{211}=\ldots=v_{205+6L}=-1, v_{209}=v_{212}=\ldots=v_{206+6L}=-1/3, and v_{l}=0 for 4\leq l\leq 206 and 207+6L\leq l\leq s. Here L is a positive integer. We set \beta_{1}=3, \beta_{2}=1, \beta_{3}=1/3, \beta_{104}=3, \beta_{105}=1, \beta_{106}=1/3, and \beta_{l}=0 for 4\leq l\leq 103 and 107\leq l\leq s. In this setting, we have \mathcal{C}=\{1,2,3\}, \mathcal{P}=\{104,105,106\}, \mathcal{I}=\{207,208,209,\ldots,206+6L\} and \mathcal{S}=\{1,\ldots,5000\}\backslash\{1,2,3,104,105,106,207,208,209,\ldots,206+6L\}. Note that \frac{|\mathcal{I}|}{|\mathcal{C}\cup\mathcal{P}|}=\frac{|\mathcal{I}|}{|\mathcal{M}_{1}|}=L. For n=200, we let \sigma=1 and 0.5, and consider three different sparsity levels L=1,2,8. The complete screening results can be found in Figures 16 – 21.

Specifically, as summarized in Figure 15, when the number of instrumental variables is much larger than that of confounders and precision variables, the size of \mathcal{M}_{1}\cup\mathcal{M}_{2} exceeds the number of covariates kept in the first screening step. In this case, our results show that the screening step may include many instrumental variables while missing some confounders and precision variables, which may deteriorate the accuracy and efficiency of the second-step estimation.

Figure 15: Simulation results where the size of the instrumental variable set \mathcal{I} is one, two, and eight times that of \mathcal{M}_{1}, for the case (n,s)=(200,5000): Panels (a), (c) and (e) plot the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\} when \sigma=1, for |\mathcal{I}|/|\mathcal{M}_{1}|=1, 2 and 8, respectively; Panels (b), (d) and (f) plot the same when \sigma=0.5. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 16: Simulation results where the number of instrumental variables equals the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 17: Simulation results where the number of instrumental variables is twice the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 18: Simulation results where the number of instrumental variables is eight times the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,1): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 19: Simulation results where the number of instrumental variables equals the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
Figure 20: Simulation results where the number of instrumental variables is twice the size of \mathcal{M}_{1}, for the case (n,s,\sigma)=(200,5000,0.5): Panels (a) – (f) plot the average coverage proportion for X_{l}, where l=1,2,3,104,105 and 106. Panels (a) – (c) correspond to a strong-outcome/weak-exposure predictor, a moderate-outcome/moderate-exposure predictor, and a weak-outcome/strong-exposure predictor; Panels (d) – (f) correspond to strong, moderate and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set \mathcal{M}_{1}=\{1,2,3,104,105,106\}. The x-axis represents the size of \widehat{\mathcal{M}}, while the y-axis denotes the average proportion. The green solid, red dashed and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 21: Simulation results where the number of instrumental variables is eight times the size of $\mathcal{M}_{1}$ for the case $(n,s,\sigma)=(200,5000,0.5)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

14.3 Screening and estimation under different covariances of exposure errors

We also consider different exposure-error distributions in the simulation. We use the same setting as Section 5.1 of the main paper but take three different covariance structures for the exposure errors $\bm{E}_{i}$. In particular, the random error $\mathrm{vec}(\bm{E}_{i})$ is independently generated from $N(\bm{0},\bm{\Sigma}_{e})$, where we set the standard deviation of every element of $\bm{E}_{i}$ to $\sigma_{e}=0.2$ and the correlation between $\bm{E}_{i,jk}$ and $\bm{E}_{i,j^{\prime}k^{\prime}}$ to $\rho_{2}^{|j-j^{\prime}|+|k-k^{\prime}|}$ for $1\leq j,k,j^{\prime},k^{\prime}\leq 64$. We consider three scenarios, $\rho_{2}=0.2$, $0.5$, and $0.8$, and report the covariates selected in the screening step. We take $\sigma=1$ or $0.5$ and fix the sample size at $n=200$. The complete screening results can be found in Figure 5 as well as Figures 10 and 23–26 here ($\rho_{2}$ is set to $0.5$ in Section 5.1 of the main paper).

Specifically, as summarized in Figure 22, the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$ changes little as $\rho_{2}$ increases.
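To make this error model concrete, the following is a minimal sampling sketch (written in Python/numpy, which is an assumption of ours since the paper ships no code). It uses the fact that the correlation $\rho_{2}^{|j-j^{\prime}|+|k-k^{\prime}|}$ factorizes into a Kronecker product of two AR(1) correlation matrices, so the full $4096\times 4096$ covariance matrix never needs to be formed.

```python
import numpy as np

def ar1_cholesky(dim, rho):
    """Cholesky factor of an AR(1) correlation matrix with entries rho^{|j-k|}."""
    corr = rho ** np.abs(np.subtract.outer(np.arange(dim), np.arange(dim)))
    return np.linalg.cholesky(corr)

def sample_exposure_error(p=64, q=64, rho2=0.5, sigma_e=0.2, rng=None):
    """Draw one p x q error matrix E_i with sd sigma_e and
    corr(E_{jk}, E_{j'k'}) = rho2^{|j-j'| + |k-k'|}.
    Because the correlation factorizes, E = sigma_e * L_p G L_q^T with
    G iid standard normal has exactly this covariance."""
    rng = np.random.default_rng(rng)
    L_p, L_q = ar1_cholesky(p, rho2), ar1_cholesky(q, rho2)
    G = rng.standard_normal((p, q))
    return sigma_e * L_p @ G @ L_q.T

E_i = sample_exposure_error(rho2=0.8, rng=0)  # one draw under the rho2 = 0.8 scenario
```

The Kronecker route is a design choice: $\mathrm{cov}\{\mathrm{vec}(E)\}=\sigma_{e}^{2}(A_{q}\otimes A_{p})$ with $A_{jk}=\rho_{2}^{|j-k|}$, so sampling costs two small Cholesky factorizations instead of one $pq\times pq$ factorization.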

[Panels (a)–(f): $(\rho_{2},\sigma)$ = (0.2, 1), (0.2, 0.5), (0.5, 1), (0.5, 0.5), (0.8, 1), (0.8, 0.5).]
Figure 22: Simulation results where the size of the instrumental variable set $\mathcal{I}$ is the same as, twice, and eight times the size of $\mathcal{M}_{1}$ for the case $(n,s)=(200,5000)$: Panels (a), (c), and (e) plot the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$ when $\sigma=1$ (with $\rho_{2}=0.2$, $0.5$, and $0.8$, respectively); Panels (b), (d), and (f) plot the same when $\sigma=0.5$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

In addition, we evaluate the performance of our estimation procedure after the first-step screening. For the size of $\widehat{\mathcal{M}}$ in the screening step, we set $|\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor$, so that $|\widehat{\mathcal{M}}|=37$ for sample size $n=200$. We report the mean squared errors (MSEs) for $\bm{\beta}$ and $\bm{B}$, defined as $\|\bm{\beta}-\widehat{\bm{\beta}}\|_{2}^{2}$ and $\|\bm{B}-\widehat{\bm{B}}\|_{F}^{2}$, respectively.
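As a small computational aside, the screening-set size and both error metrics are one-liners; the sketch below is ours (Python/numpy assumed, with illustrative function names), not code from the paper.

```python
import numpy as np

def screening_size(n):
    # |M_hat| = floor(n / log(n)); equals 37 for n = 200
    return int(np.floor(n / np.log(n)))

def mse_beta(beta_hat, beta_true):
    # squared l2 error: ||beta - beta_hat||_2^2
    return float(np.sum((beta_hat - beta_true) ** 2))

def mse_B(B_hat, B_true):
    # squared Frobenius error: ||B - B_hat||_F^2
    return float(np.linalg.norm(B_hat - B_true, "fro") ** 2)

print(screening_size(200))  # 37
```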

Table 7 summarizes the average MSEs for $\bm{\beta}$ and $\bm{B}$ over 100 Monte Carlo runs. We can see that the MSEs decrease as $\rho_{2}$ increases. As the nuclear-norm-penalized estimation procedure can be regarded as a form of spatial smoothing, the large correlations among the elements of $\bm{E}_{i}$ actually help with the spatial smoothing, and thus the estimation accuracy improves as $\rho_{2}$ increases.

Table 7: Simulation results for $\sigma=1$ and $\sigma=0.5$: the average MSEs for $\bm{\beta}$ and $\bm{B}$, with their associated standard errors in parentheses, are reported. The results are based on 100 Monte Carlo repetitions.
$\sigma=1.0$: MSE $\bm{\beta}$, MSE $\bm{B}$ | $\sigma=0.5$: MSE $\bm{\beta}$, MSE $\bm{B}$
$\rho_{2}=0.2$: 0.986 (0.099), 0.802 (0.011) | $\rho_{2}=0.2$: 0.397 (0.042), 0.701 (0.006)
$\rho_{2}=0.5$: 0.496 (0.021), 0.667 (0.005) | $\rho_{2}=0.5$: 0.276 (0.009), 0.528 (0.005)
$\rho_{2}=0.8$: 0.252 (0.010), 0.439 (0.007) | $\rho_{2}=0.8$: 0.097 (0.006), 0.305 (0.004)
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 23: Simulation results for the case $(n,s,\sigma,\rho_{2})=(200,5000,1,0.2)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 24: Simulation results for the case $(n,s,\sigma,\rho_{2})=(200,5000,1,0.8)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 25: Simulation results for the case $(n,s,\sigma,\rho_{2})=(200,5000,0.5,0.2)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 26: Simulation results for the case $(n,s,\sigma,\rho_{2})=(200,5000,0.5,0.8)$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

14.4 Screening and estimation under different sizes of $\widehat{\mathcal{M}}_{1}^{*}$ and $\widehat{\mathcal{M}}_{2}$

In addition, we conduct a similar study following the same setting as described in Section 5.1 of the main paper, with $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=2$ or $1/2$. We use $n=500$ here since it is close to the number of observations ($n=566$) in the real data analysis. Specifically, as summarized in Figures 27 and 28, when the ratio $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|$ is taken as $1/2$ or $2$, the performance of the proposed joint screening method is quite similar in the two cases; comparing them with Figure 11, where $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=1$, the screening-step results are likewise quite similar.

[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 27: Simulation results for the case $(n,s,\sigma)=(500,5000,1)$ when $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=1/2$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(f): Precision (strong/medium/weak outcome, zero exposure); Panel (g): overall coverage of $\mathcal{M}_{1}$.]
Figure 28: Simulation results for the case $(n,s,\sigma)=(500,5000,1)$ when $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|=2$: Panels (a)–(f) plot the average coverage proportion for $X_{l}$, where $l=1,2,3,104,105$, and $106$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only. Panel (g) plots the average coverage proportion for the index set $\mathcal{M}_{1}=\{1,2,3,104,105,106\}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The green solid, red dashed, and black dash-dotted lines denote our joint screening method, the outcome screening method, and the intersection screening method, respectively.

Furthermore, we evaluate the performance of our estimation procedure after the first-step screening. For the size of $\widehat{\mathcal{M}}$ in the screening step, we set $|\widehat{\mathcal{M}}|=\lfloor n/\log(n)\rfloor$, so that $|\widehat{\mathcal{M}}|=89$ for sample size $n=500$. We report the mean squared errors (MSEs) for $\bm{\beta}$ and $\bm{B}$, defined as $\|\bm{\beta}-\widehat{\bm{\beta}}\|_{2}^{2}$ and $\|\bm{B}-\widehat{\bm{B}}\|_{F}^{2}$, respectively. As summarized in Table 8, the average MSEs for $\bm{\beta}$ and $\bm{B}$ over 100 Monte Carlo runs are similar across the different choices of $|\widehat{\mathcal{M}}_{1}^{*}|$ and $|\widehat{\mathcal{M}}_{2}|$. Therefore, depending on prior knowledge about the sizes and signal strengths of the confounding, precision, and instrumental variables, one may choose $\widehat{\mathcal{M}}_{1}^{*}$ and $\widehat{\mathcal{M}}_{2}$ differently; the resulting estimates of $\bm{B}$ appear to be similar across these choices.

Table 8: Simulation results of the proposed estimates for $(n,s,\sigma)=(500,5000,1)$, when $|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|$ is taken as 0.5, 1.0, and 2.0: the average MSEs for $\bm{\beta}$ and $\bm{B}$, with their associated standard errors in parentheses, are reported. The results are based on 100 Monte Carlo repetitions.
$|\widehat{\mathcal{M}}_{2}|/|\widehat{\mathcal{M}}_{1}^{*}|$ MSE $\bm{\beta}$ MSE $\bm{B}$
0.5 0.301(0.008) 0.567(0.005)
1.0 0.303(0.008) 0.574(0.006)
2.0 0.302(0.008) 0.574(0.006)

14.5 Screening and estimation using blockwise joint screening

In this section, we list additional screening and estimation results for Section 5.3 of the main article. In particular, the results for the screening step, in which $s=5000$, $\sigma=1$, $n=200,500,1000$, and $K=2,4,6,12,24,52$, can be found in Figures 29–46. The complete results for the second-step estimation under the same settings can be found in Table 9.

[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 29: Simulation results for the case $(n,s,K,\sigma)=(200,5000,2,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 30: Simulation results for the case $(n,s,K,\sigma)=(200,5000,4,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 31: Simulation results for the case $(n,s,K,\sigma)=(200,5000,6,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 32: Simulation results for the case $(n,s,K,\sigma)=(200,5000,12,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 33: Simulation results for the case $(n,s,K,\sigma)=(200,5000,24,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 34: Simulation results for the case $(n,s,K,\sigma)=(200,5000,52,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 35: Simulation results for the case $(n,s,K,\sigma)=(500,5000,2,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 36: Simulation results for the case $(n,s,K,\sigma)=(500,5000,4,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 37: Simulation results for the case $(n,s,K,\sigma)=(500,5000,6,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 38: Simulation results for the case $(n,s,K,\sigma)=(500,5000,12,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 39: Simulation results for the case $(n,s,K,\sigma)=(500,5000,24,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 40: Simulation results for the case $(n,s,K,\sigma)=(500,5000,52,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 41: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,2,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 42: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,4,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 43: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,6,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 44: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,12,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 45: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,24,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
[Panels (a)–(c): Confounder (strong/medium/weak outcome, weak/medium/strong exposure); Panels (d)–(g): Precision (strong/medium/weak/weaker outcome, zero exposure); Panel (h): overall coverage of $\mathcal{M}_{1}$.]
Figure 46: Simulation results for the case $(n,s,K,\sigma)=(1000,5000,52,1)$: Panels (a)–(g) plot the average coverage proportion for $X_{l}$, where $l\in\mathcal{M}_{1}=\{1,2,3,104,105,106\}\cup\mathcal{P}_{LD}$. Panels (a)–(c) correspond to a strong-outcome, weak-exposure predictor, a moderate-outcome, moderate-exposure predictor, and a weak-outcome, strong-exposure predictor; Panels (d)–(f) correspond to strong, moderate, and weak predictors of the outcome only, and Panel (g) plots the average coverage proportion for the index set $\mathcal{P}_{LD}$ of weaker outcome-only predictors. Panel (h) plots the average coverage proportion for the index set $\mathcal{M}_{1}$. The x-axis represents the size of $\widehat{\mathcal{M}}$, while the y-axis denotes the average proportion. The blue dotted, green solid, red dashed, and black dash-dotted lines denote the blockwise joint screening, joint screening, outcome screening, and intersection screening methods, respectively.
Table 9: Simulation results for $\sigma=1$: the average MSEs for $\bm{\beta}$ and $\bm{B}$, with their associated standard errors in parentheses, are reported. The left panel summarizes the results from the joint screening method; the right panel summarizes the results from the blockwise joint screening method. The results are based on 100 Monte Carlo repetitions.
Proposed: MSE $\bm{\beta}$, MSE $\bm{B}$ | Proposed (block): MSE $\bm{\beta}$, MSE $\bm{B}$
n=200,K=2 1.423(0.096) 0.785(0.009) n=200,K=2 1.390(0.090) 0.793(0.010)
n=500,K=2 0.831(0.069) 0.726(0.008) n=500,K=2 0.892(0.082) 0.734(0.009)
n=1000,K=2 0.591(0.050) 0.676(0.008) n=1000,K=2 0.488(0.028) 0.666(0.006)
n=200,K=4 1.667(0.096) 0.815(0.011) n=200,K=4 1.548(0.088) 0.805(0.010)
n=500,K=4 1.059(0.082) 0.751(0.011) n=500,K=4 1.094(0.090) 0.758(0.012)
n=1000,K=4 0.606(0.057) 0.671(0.008) n=1000,K=4 0.555(0.045) 0.678(0.008)
n=200,K=6 1.955(0.101) 0.826(0.010) n=200,K=6 1.701(0.084) 0.816(0.009)
n=500,K=6 1.155(0.085) 0.749(0.011) n=500,K=6 1.107(0.089) 0.752(0.011)
n=1000,K=6 0.578(0.051) 0.674(0.008) n=1000,K=6 0.551(0.047) 0.672(0.008)
n=200,K=12 2.466(0.096) 0.890(0.039) n=200,K=12 2.223(0.129) 0.838(0.011)
n=500,K=12 1.024(0.082) 0.735(0.010) n=500,K=12 0.927(0.077) 0.727(0.008)
n=1000,K=12 0.570(0.046) 0.673(0.008) n=1000,K=12 0.627(0.057) 0.681(0.009)
n=200,K=24 2.533(0.164) 0.847(0.014) n=200,K=24 2.136(0.138) 0.821(0.010)
n=500,K=24 1.065(0.080) 0.733(0.010) n=500,K=24 1.119(0.088) 0.737(0.011)
n=1000,K=24 0.662(0.050) 0.669(0.008) n=1000,K=24 0.677(0.056) 0.673(0.009)
n=200,K=52 14.650(0.815) 2.034(0.487) n=200,K=52 13.693(0.728) 1.870(0.459)
n=500,K=52 1.816(0.144) 0.775(0.019) n=500,K=52 1.725(0.143) 0.762(0.019)
n=1000,K=52 0.937(0.066) 0.684(0.010) n=1000,K=52 0.861(0.056) 0.675(0.008)

15 Theoretical properties

From here on, we denote $\dot{\bm{\beta}}=(\dot{\beta}_{1},\ldots,\dot{\beta}_{s})^{T}$, $\dot{\bm{B}}$, and $\dot{\bm{C}}_{l}$ as the true values of $\bm{\beta}=(\beta_{1},\ldots,\beta_{s})^{T}$, $\bm{B}$, and $\bm{C}_{l}$, respectively. Furthermore, we denote $\dot{\mathcal{C}}$, $\dot{\mathcal{P}}$, and $\dot{\mathcal{I}}$ as the true index sets of $\mathcal{C}$, $\mathcal{P}$, and $\mathcal{I}$.

15.1 Sure screening property

In this subsection, we study the theoretical properties of our screening procedure. We let $\mathcal{M}_{1}=\{1\leq l\leq s_{n}:\dot{\beta}_{l}^{*}\neq 0\}=\dot{\mathcal{C}}\cup\dot{\mathcal{P}}$, where $\dot{\mathcal{C}}=\{1\leq l\leq s_{n}:\dot{\bm{C}}_{l}\neq\bm{0}\textrm{ and }\dot{\beta}_{l}^{*}\neq 0\}$ and $\dot{\mathcal{P}}=\{1\leq l\leq s_{n}:\dot{\bm{C}}_{l}=\bm{0}\textrm{ and }\dot{\beta}_{l}^{*}\neq 0\}$. Here $\dot{\beta}_{l}^{*}$ and $\dot{\bm{C}}_{l}$ are the true values of $\beta_{l}$ and $\bm{C}_{l}$, respectively, and $\dot{\bm{B}}$ is the true value of $\bm{B}$.

We have the following theorems, where the assumptions needed are included in Section 16.

Theorem 1.

Under Assumptions (A0)–(A3) and (A5), let $\gamma_{1,n}=\alpha D_{1}n^{-\kappa}$ and $\gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}$ with $0<\alpha<1$. Then we have $P(\mathcal{M}_{1}\subset\widehat{\mathcal{M}})\to 1$ as $n\to\infty$.

Since the screening procedure automatically includes all of the significant covariates for small values of $\gamma_{1,n}$ and $\gamma_{2,n}$, it is necessary to consider the size of $\widehat{\mathcal{M}}$, which we quantify in Theorem 2.

Theorem 2.

Under Assumptions (A0)–(A5), when $\gamma_{1,n}=\alpha D_{1}n^{-\kappa}$ and $\gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}$ with $0<\alpha<1$, we have $P(|\widehat{\mathcal{M}}|=O(n^{2\kappa+\tau}))\to 1$ as $n\to\infty$.

Corollary 1.

Under Assumptions (A0)–(A5), when $\gamma_{1,n}=\alpha D_{1}n^{-\kappa}$ and $\gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}$ with $0<\alpha<1$, we have $P(|\widehat{\mathcal{M}}-\widehat{\mathcal{M}}_{1}^{*}|=O(n^{2\kappa+\tau}))\to 1$ as $n\to\infty$.

Theorem 1 shows that if $\gamma_{1,n}$ and $\gamma_{2,n}$ are chosen properly, our screening procedure will include all significant variables with high probability. Theorem 2 guarantees that the size of the selected model from the screening procedure is only of a polynomial order of $n$, even though the original model size is of an exponential order of $n$. Therefore, the false selection rate of our screening procedure vanishes as $n\to\infty$, while the size of $\widehat{\mathcal{M}}$ grows at a polynomial order of $n$, where the order depends on the two constants $\kappa$ and $\tau$ defined in Section 16. Theorem 3 shows that our blockwise screening procedure also enjoys the sure screening property; an illustrative sketch of such a thresholding rule is given after Theorem 3. The proofs of these theorems are collected in Section 18.

Theorem 3.

Assume (A0)–(A3) and (A5) hold, and further assume that the $j$-th block size satisfies $|\mathcal{B}_{j}|=D_{6}n^{\nu_{1}}$ for some constant $D_{6}>0$. Let $\gamma_{1,n}=\alpha D_{1}n^{-\kappa}$, $\gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}$, $\gamma_{3,n}=\alpha D_{1}D_{6}^{-1}n^{-\kappa-\nu_{1}}$, and $\gamma_{4,n}=\alpha D_{1}D_{6}^{-1}(pq)^{1/2}n^{-\kappa-\nu_{1}}$ with $0<\alpha<1$. Then we have $P(\mathcal{M}_{1}\subset\widehat{\mathcal{M}}^{block})\to 1$ as $n\to\infty$.
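To give some intuition for what the thresholds $\gamma_{1,n}$ and $\gamma_{2,n}$ control, the following is an illustrative sketch (in Python/numpy) of a marginal joint screening rule in the spirit of the procedure. The exact screening statistics are defined in the main paper; here we simply assume, for illustration only, that they are the marginal sample covariances of each candidate covariate with the outcome and with the matrix exposure, and all function and variable names are ours.

```python
import numpy as np

def joint_screen(X, y, Z, gamma1, gamma2):
    """Illustrative joint screening sketch: keep covariate l if its
    marginal sample covariance with the outcome exceeds gamma1 in
    absolute value, OR if the operator norm of its marginal sample
    covariance with the matrix exposure exceeds gamma2.
    X: (n, s) covariates, y: (n,) outcome, Z: (n, p, q) exposures."""
    n = X.shape[0]
    selected = []
    for l in range(X.shape[1]):
        omega_out = abs(X[:, l] @ y) / n              # outcome signal
        C_l = np.einsum("i,ijk->jk", X[:, l], Z) / n  # exposure signal
        omega_exp = np.linalg.norm(C_l, 2)            # spectral (operator) norm
        if omega_out >= gamma1 or omega_exp >= gamma2:
            selected.append(l)
    return selected
```

The union ("OR") structure is what distinguishes joint screening from outcome-only or intersection screening: a covariate survives if it is marginally relevant for either the outcome or the exposure.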

15.2 Theory for two-step estimator

In this section, we develop a unified theory for our two-step estimator. In particular, we derive a non-asymptotic bound for the final estimates. We first introduce some notation.

Denote the parameter $\bm{\theta}=\{\bm{\beta}^{T},\mathrm{vec}^{T}(\bm{B})\}^{T}\in\mathbb{R}^{s+pq}$, where $\bm{\beta}\in\mathbb{R}^{s}$ and $\bm{B}\in\mathbb{R}^{p\times q}$. Using this notation, problem (3) can be recast as minimizing $l(\bm{\theta})+P(\bm{\theta})$, where $l(\bm{\theta})=(2n)^{-1}\sum_{i=1}^{n}(Y_{i}-\langle\bm{Z}_{i},\bm{B}\rangle-\sum_{l\in\widehat{\mathcal{M}}}X_{il}\beta_{l})^{2}$ and $P(\bm{\theta})=\lambda_{1}\sum_{l\in\widehat{\mathcal{M}}}|\beta_{l}|+\lambda_{2}\|\bm{B}\|_{*}$. In addition, we let $\dot{\bm{\theta}}=\{\dot{\bm{\beta}}^{T},\mathrm{vec}(\dot{\bm{B}})^{T}\}^{T}$ be the true value of $\bm{\theta}$, where $\dot{\bm{\beta}}$ and $\dot{\bm{B}}$ are the true values of $\bm{\beta}$ and $\bm{B}$, respectively. Let $\widehat{\bm{\theta}}_{\bm{\lambda}}=\{\widehat{\bm{\beta}}^{T},\mathrm{vec}(\widehat{\bm{B}})^{T}\}^{T}$ be the proposed estimator of $\bm{\theta}$, where $\widehat{\bm{\beta}}$ and $\widehat{\bm{B}}$ are the estimators obtained from (3) with tuning parameters $\bm{\lambda}=(\lambda_{1},\lambda_{2})$.
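For concreteness, a minimal sketch of evaluating this penalized objective is given below (Python/numpy assumed; the names X_sel, Z, y, lam1, and lam2 are illustrative, and this is an evaluation of the objective, not the optimization routine itself).

```python
import numpy as np

def penalized_objective(beta, B, X_sel, Z, y, lam1, lam2):
    """Evaluate l(theta) + P(theta): squared-error loss over the screened
    covariates plus a lasso penalty on beta and a nuclear-norm penalty on B.
    X_sel: (n, |M_hat|) screened covariates, Z: (n, p, q), y: (n,)."""
    n = y.shape[0]
    fitted = X_sel @ beta + np.einsum("ijk,jk->i", Z, B)
    loss = np.sum((y - fitted) ** 2) / (2.0 * n)
    penalty = lam1 * np.sum(np.abs(beta)) + lam2 * np.linalg.norm(B, "nuc")
    return loss + penalty
```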

We now give a nonasymptotic error bound for the proposed two-step estimator $\widehat{\bm{\theta}}_{\bm{\lambda}}$:

Theorem 4.

(Nonasymptotic error bounds for the two-step estimator) Suppose Assumptions (A0)–(A9) hold with $2\kappa+\tau<1$ and $\kappa<1/4$, and that $\mathcal{M}_{1}\subset\widehat{\mathcal{M}}$ with $|\widehat{\mathcal{M}}|=O(n^{2\kappa+\tau})$. Conditional on $\widehat{\mathcal{M}}$, there exist positive constants $c_{1},c_{2},c_{3},c_{4}$, $C_{0}$, $C_{1}$, $g_{0}$, and $g_{1}$ such that, for $\lambda_{1}\geq 2\sigma_{0}[2n^{-1}\{\log(\log n)+C_{0}(2\kappa+\tau)\log n\}]^{1/2}$ and $\lambda_{2}\geq 2bs_{2}\sigma_{0}[2n^{-1}\{3\log s_{2}+\log(\log n)\}]^{1/2}+4n^{-1/2}\sigma_{\epsilon}(p^{1/2}+q^{1/2})$, with probability at least $1-c_{1}/\log n-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$, one has

$$\left\|\widehat{\bm{\theta}}_{\bm{\lambda}}-\dot{\bm{\theta}}\right\|_{2}^{2}\leq C_{0}\max\left\{C_{1}\lambda_{1}^{2}n^{2\kappa+\tau},\,\lambda_{2}^{2}r\right\}\iota^{-2}.$$

The bound in Theorem 4 implies that the convergence rate of the proposed estimator $\widehat{\bm{\theta}}_{\bm{\lambda}}$ is $O(\max\{n^{2\kappa+\tau-1},n^{1-2\tau}\})$. Here $\iota$ is a positive constant defined in Assumption (A6) in Section 16, and $r$ is the rank of $\dot{\bm{B}}$. The convergence rate is controlled by $\kappa$ and $\tau$, where $\kappa$ controls the exponential rate of model complexity that can diverge and $\tau$ controls the growth rate of the largest eigenvalue of the population covariance matrix. The proof of the theorem is deferred to Section 18.

16 Assumptions for main theorems

In this section, we state the assumptions for the main theorems. We first make the following assumptions, which are needed for Theorems 1 and 2.

(A0) The covariates $X_{i}$ are independent and identically distributed (i.i.d.) with mean zero and covariance $\Sigma_{x}$. The random errors $\epsilon_{i}$ are i.i.d. with mean zero and variance $\sigma_{\epsilon}^{2}$. Define $\sigma_{l}^{2}=(\Sigma_{x})_{ll}$. The vectorized error matrices $\mathrm{vec}(E_{i})$ are i.i.d. with mean zero and covariance $\Sigma_{e}$. There exists a constant $\sigma_{x}>0$ such that $\|\Sigma_{x}\|_{\infty}\leq\sigma_{x}$. Moreover, $X_{i}$ is independent of $E_{i}=(E_{i,jk})$ and $\epsilon_{i}$.

(A1) There exist some constants $D_{1}>0$ and $b>0$, and $0<\kappa<1/2$, such that
$$\min_{l\in\mathcal{M}_{1}}\left|\mathrm{cov}\left(\sum_{l^{\prime}\in\mathcal{M}_{1}}x_{il^{\prime}}\dot{\beta}_{l^{\prime}}^{*},\,x_{il}\right)\right|\geq D_{1}n^{-\kappa},\qquad \min_{l\in\mathcal{M}_{2}}\left\|\mathrm{cov}\left(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}\dot{\bm{C}}_{l^{\prime}},\,x_{il}\right)\right\|_{op}\geq D_{1}(pq)^{1/2}n^{-\kappa},$$
and $\max\left\{\max_{l\in\mathcal{M}_{2}}\|\dot{\bm{C}}_{l}\|_{\infty},\,\max_{l\in\mathcal{M}_{2}}\|\dot{\bm{C}}_{l}\|_{op},\,\max_{l\in\mathcal{M}_{2}}|\langle\dot{\bm{C}}_{l},\dot{\bm{B}}\rangle|,\,\max_{l\in\mathcal{M}_{1}}|\dot{\beta}_{l}^{*}|\right\}<b$.

(A2) There exist positive constants $D_{2}$ and $D_{3}$ such that
$$\max\left\{E[e^{D_{2}x_{il}^{2}}],\,E[e^{D_{2}E_{i,jk}^{2}}],\,E[e^{D_{2}\langle\bm{E}_{i},\dot{\bm{B}}\rangle^{2}}]\right\}\leq D_{3}$$
for every $1\leq l\leq s_{n}$, $1\leq j\leq p$, and $1\leq k\leq q$. Let $\bm{\epsilon}=(\epsilon_{1},\ldots,\epsilon_{n})^{T}$ denote the $n$-dimensional zero-mean error vector; there exists a constant $\sigma_{0}>0$ such that for any fixed $\bm{v}$ with $\|\bm{v}\|_{2}=1$, $P(|\langle\bm{v},\bm{\epsilon}\rangle|\geq t)\leq 2\exp\left(-\frac{t^{2}}{2\sigma_{0}^{2}}\right)$ for all $t>0$.

(A3) There exists a constant $D_{4}>0$ such that $\log(s_{n})=D_{4}n^{\xi}$ for $\xi\in(0,1-2\kappa)$.

(A4) There exist constants $D_{5}>0$ and $\tau>0$ such that $\lambda_{\max}(\Sigma_{x})\leq D_{5}n^{\tau}$.

(A5) $\log(pq)=o(n^{1-2\kappa})$.

Before we state the assumptions for Theorem 4, we first introduce some notation.

Denote $P(\bm{\theta})=P_{1}(\bm{\beta})+P_{2}(\bm{B})$, where $P_{1}(\bm{\beta})=\lambda_{1}\sum_{l\in\widehat{\mathcal{M}}}|\beta_{l}|$ and $P_{2}(\bm{B})=\lambda_{2}\|\bm{B}\|_{*}$. In addition, let $r=\mathrm{rank}(\dot{\bm{B}})$ be the true rank of the matrix $\dot{\bm{B}}\in\mathbb{R}^{p\times q}$, and consider the class of matrices $\Theta$ that have rank $r\leq\min\{p,q\}$. For any given matrix $\Theta$, we let $\mathrm{row}(\Theta)\subset\mathbb{R}^{q}$ and $\mathrm{col}(\Theta)\subset\mathbb{R}^{p}$ denote its row and column space, respectively. Let $U$ and $V$ be a given pair of $r$-dimensional subspaces $U\subset\mathbb{R}^{p}$ and $V\subset\mathbb{R}^{q}$, respectively.

For a given $\bm{\theta}$ and pair $(U,V)$, we define the subspaces $\Omega_{1}(\mathcal{M}_{1})$, $\overline{\Omega}_{1}(\mathcal{M}_{1})$, $\overline{\Omega}_{1}^{\perp}(\mathcal{M}_{1})$, $\Omega_{2}(U,V)$, $\overline{\Omega}_{2}(U,V)$, and $\overline{\Omega}_{2}^{\perp}(U,V)$ as follows:
$$\Omega_{1}(\mathcal{M}_{1})=\overline{\Omega}_{1}(\mathcal{M}_{1}):=\{\bm{\beta}\in\mathbb{R}^{s}\,|\,\beta_{j}=0\textrm{ for all }j\not\in\mathcal{M}_{1}\},$$
$$\overline{\Omega}_{1}^{\perp}(\mathcal{M}_{1}):=\{\bm{\beta}\in\mathbb{R}^{s}\,|\,\beta_{j}=0\textrm{ for all }j\in\mathcal{M}_{1}\},$$
$$\Omega_{2}(U,V):=\{\Theta\in\mathbb{R}^{p\times q}\,|\,\mathrm{row}(\Theta)\subset V\textrm{ and }\mathrm{col}(\Theta)\subset U\},$$
$$\overline{\Omega}_{2}(U,V):=\{\Theta\in\mathbb{R}^{p\times q}\,|\,\mathrm{row}(\Theta)\subset V\textrm{ or }\mathrm{col}(\Theta)\subset U\},$$
$$\overline{\Omega}_{2}^{\perp}(U,V):=\{\Theta\in\mathbb{R}^{p\times q}\,|\,\mathrm{row}(\Theta)\subset V^{\perp}\textrm{ and }\mathrm{col}(\Theta)\subset U^{\perp}\}.$$

Denote \bm{\Delta}=\{\bm{\Delta}_{1}^{T},\mathrm{vec}(\bm{\Delta}_{2})^{T}\}^{T}\in\mathbb{R}^{s+pq} with \bm{\Delta}_{1}\in\mathbb{R}^{s} and \bm{\Delta}_{2}\in\mathbb{R}^{p\times q}. Then \bm{\Delta}_{1,\overline{\Omega}_{1}}=\arg\min_{\bm{v}\in\overline{\Omega}_{1}}\|\bm{\Delta}_{1}-\bm{v}\|_{2} and \bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}}=\arg\min_{\bm{v}\in\overline{\Omega}_{1}^{\perp}}\|\bm{\Delta}_{1}-\bm{v}\|_{2}; \bm{\Delta}_{2,\overline{\Omega}_{2}}=\arg\min_{\bm{v}\in\overline{\Omega}_{2}}\|\bm{\Delta}_{2}-\bm{v}\|_{F} and \bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}}=\arg\min_{\bm{v}\in\overline{\Omega}_{2}^{\perp}}\|\bm{\Delta}_{2}-\bm{v}\|_{F}. We write \bm{X}_{comp}=(\bm{X},\bm{Z}_{new})\in\mathbb{R}^{n\times(s+pq)} with \bm{Z}_{new}=(\mathrm{vec}(\bm{Z}_{1}),\ldots,\mathrm{vec}(\bm{Z}_{n}))^{T}\in\mathbb{R}^{n\times pq}, and let X_{comp,i} represent the i-th column of \bm{X}_{comp}^{T} for i=1,\ldots,n. Without loss of generality, we assume that \bm{X} has been column-normalized, i.e. \|\bm{x}_{l}\|_{2}/\sqrt{n}=1 for all l=1,\ldots,s.

We need the following assumptions:

(A6) Define

ι:=min|𝚫1,Ω¯1|13|𝚫1,Ω¯1|1𝚫2,Ω¯23𝚫2,Ω¯21ni=1n{|Xcomp,i,𝚫|2𝚫22},\displaystyle\iota:=\min\limits_{{\tiny\begin{array}[]{c}\left|\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}}\right|_{1}\leq 3\left|\bm{\Delta}_{1,\overline{\Omega}_{1}}\right|_{1}\\ \left\|\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}}\right\|_{*}\leq 3\left\|\bm{\Delta}_{2,\overline{\Omega}_{2}}\right\|_{*}\end{array}}}\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\left|\langle X_{comp,i},\bm{\Delta}\rangle\right|^{2}}{\|\bm{\Delta}\|_{2}^{2}}\right\},

and assume that ι\iota is a positive constant.

(A7) Assume max{p,q}/log(n)\max\{p,q\}/\log(n)\to\infty and max{p,q}=o(n12τ)\max\{p,q\}=o(n^{1-2\tau}) as nn\to\infty with τ<1/2\tau<1/2.

(A8) The vectorized error matrices vec(𝑬i)\mathrm{vec}(\bm{E}_{i}) are i.i.d. N(𝟎,𝚺e2)N(\bm{0},\bm{\Sigma}_{e}^{2}), where λmax(𝚺e)CU2<\lambda_{\max}(\bm{\Sigma}_{e})\leq C_{U}^{2}<\infty.

(A9) rank(𝑩˙)=r<min(p,q)\mathrm{rank}(\dot{\bm{B}})=r<\min(p,q) holds.

17 Auxiliary lemmas

In this section, we include the auxiliary lemmas needed for the theorems and their proofs.

Lemma 1.

(Bernstein’s inequality) Let T1,,TnT_{1},\ldots,T_{n} be independent random variable with zero mean such that E(|Ti|m)m!Mm2vi/2E(|T_{i}|^{m})\leq m!M^{m-2}v_{i}/2, for every m2m\geq 2 (and all i) and some constant MM and viv_{i}. Then

P(|i=1nTi|>x)2e12x2v+Mx,P(|\sum_{i=1}^{n}T_{i}|>x)\leq 2e^{-\frac{1}{2}\frac{x^{2}}{v+Mx}},

for v=i=1nviv=\sum_{i=1}^{n}v_{i}.

This is Lemma 2.2.11 from van der Vaart and Wellner (2000) and we omit the proof here.
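As a sanity check, here is a minimal simulation sketch (not part of the paper; the distribution and constants are chosen only for illustration). For T_{i}\sim\mathrm{Uniform}(-1,1) one has |T_{i}|\leq M=1 and E(|T_{i}|^{m})\leq m!M^{m-2}v_{i}/2 with v_{i}=\mathrm{Var}(T_{i})=1/3, so the stated tail bound applies with v=n/3.

```python
# Illustrative sketch only: Monte Carlo check of Bernstein's inequality
# (Lemma 1) for T_i ~ Uniform(-1, 1), where M = 1 and v = n / 3.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 200_000
M, v = 1.0, n / 3.0

S = rng.uniform(-1.0, 1.0, size=(reps, n)).sum(axis=1)
for x in (5.0, 10.0, 15.0):
    empirical = np.mean(np.abs(S) > x)
    bound = 2.0 * np.exp(-0.5 * x**2 / (v + M * x))
    print(f"x={x}: empirical {empirical:.5f} <= bound {bound:.5f}")
```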

Lemma 2.

Under Assumptions (A0), (A1) and (A2), for arbitrary t>0t>0 and for every l,l,j,kl,l^{\prime},j,k, we have that

\displaystyle P\left(\left|\sum_{i=1}^{n}\{x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\}\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+t/D_{2})}\right\},
\displaystyle P\left(\left|\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\},
\displaystyle P\left(\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\},
\displaystyle P\left(\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\}.
Proof of Lemma 2.

Note that the last part of Assumption (A2) implies, by Theorem 3.1 of Rivasplata (2012), that there exist positive constants D_{2}^{\prime} and D_{3}^{\prime} such that E[e^{D_{2}^{\prime}\epsilon_{i}^{2}}]\leq D_{3}^{\prime}. Without loss of generality we may take D_{2}^{\prime}=D_{2} and D_{3}^{\prime}=D_{3}, so that this condition can be unified with the first part of Assumption (A2), which implies that

max{E[eD2xil2],E[eD2Ei,jk2],E[eD2𝑬i,𝑩˙2],E[eD2ϵi2]}D3\max\left\{E[e^{D_{2}x_{il}^{2}}],E[e^{D_{2}E_{i,jk}^{2}}],E[e^{D_{2}\langle\bm{E}_{i},\dot{\bm{B}}\rangle^{2}}],E[e^{D_{2}\epsilon_{i}^{2}}]\right\}\leq D_{3}

for every 1\leq l\leq s_{n}, 1\leq j\leq p and 1\leq k\leq q. Therefore, by Assumptions (A0) and (A2), the bound 2|x_{il}x_{il^{\prime}}|\leq x_{il}^{2}+x_{il^{\prime}}^{2} and the Cauchy–Schwarz inequality, we have

E[eD2|xilxilE(xilxil)|]E[eD2|xilxil|+D2|E(xilxil)|]=eD2|E(xilxil)|E[eD2|xilxil|]\displaystyle E\left[e^{D_{2}|x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)|}\right]\leq E\left[e^{D_{2}|x_{il}x_{il^{\prime}}|+D_{2}|E\left(x_{il}x_{il^{\prime}}\right)|}\right]=e^{D_{2}|E\left(x_{il}x_{il^{\prime}}\right)|}E\left[e^{D_{2}|x_{il}x_{il^{\prime}}|}\right]
eD2σxE[eD2xil2+xil22]eD2σx[E{eD2xil2}E{eD2xil2}]1/2eD2σxD3.\displaystyle\leq e^{D_{2}\sigma_{x}}E\left[e^{D_{2}\frac{x_{il}^{2}+x_{il^{\prime}}^{2}}{2}}\right]\leq e^{D_{2}\sigma_{x}}\left[E\left\{e^{D_{2}x_{il}^{2}}\right\}E\left\{e^{D_{2}x_{il^{\prime}}^{2}}\right\}\right]^{1/2}\leq e^{D_{2}\sigma_{x}}D_{3}.

Then for every m2m\geq 2, one has

E[|xilxilE(xilxil)|m]m!D2mE[eD2|xilxilE(xilxil)|]m!D2meD2σxD3.E\left[|x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})|^{m}\right]\leq\frac{m!}{D_{2}^{m}}E\left[e^{D_{2}|x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})|}\right]\leq\frac{m!}{D_{2}^{m}}e^{D_{2}\sigma_{x}}D_{3}.

It follows from Lemma 1 that

P(|i=1n{xilxilE(xilxil)}|t)2exp{t22(2nD22eD2σxD3+t/D2)}.P(|\sum_{i=1}^{n}\{x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\}|\geq t)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+t/D_{2})}\right\}.

Similarly we obtain

E[eD2|xilEi,jk|]E[eD2xil2+Ei,jk22][E{eD2xil2}E{eD2Ei,jk2}]1/2D3.\displaystyle E\left[e^{D_{2}\left|x_{il}E_{i,jk}\right|}\right]\leq E\left[e^{D_{2}\frac{x_{il}^{2}+E_{i,jk}^{2}}{2}}\right]\leq\left[E\left\{e^{D_{2}x_{il}^{2}}\right\}E\left\{e^{D_{2}E_{i,jk}^{2}}\right\}\right]^{1/2}\leq D_{3}.

Then for every m2m\geq 2, one has

E\left[\left|x_{il}E_{i,jk}\right|^{m}\right]\leq\frac{m!}{D_{2}^{m}}E\left[e^{D_{2}\left|x_{il}E_{i,jk}\right|}\right]\leq\frac{m!}{D_{2}^{m}}D_{3}.

It follows from Lemma 1 that

P(|i=1n(xilEi,jk)|t)2exp{t22(2nD22D3+t/D2)}.P\left(\left|\sum_{i=1}^{n}(x_{il}E_{i,jk})\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\}.

Similarly we have

E[eD2|xil𝑬i,𝑩˙|]\displaystyle E\left[e^{D_{2}\left|x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|}\right] \displaystyle\leq E[eD2xil2+𝑬i,𝑩˙22][E{eD2xil2}E{eD2𝑬i,𝑩˙2}]1/2D3,\displaystyle E\left[e^{D_{2}\frac{x_{il}^{2}+\langle\bm{E}_{i},\dot{\bm{B}}\rangle^{2}}{2}}\right]\leq\left[E\left\{e^{D_{2}x_{il}^{2}}\right\}E\left\{e^{D_{2}\langle\bm{E}_{i},\dot{\bm{B}}\rangle^{2}}\right\}\right]^{1/2}\leq D_{3},
E[eD2|xilϵi|]\displaystyle E\left[e^{D_{2}\left|x_{il}\epsilon_{i}\right|}\right] \displaystyle\leq E[eD2xil2+ϵi22][E{eD2xil2}E{eD2ϵi2}]1/2D3.\displaystyle E\left[e^{D_{2}\frac{x_{il}^{2}+\epsilon_{i}^{2}}{2}}\right]\leq\left[E\left\{e^{D_{2}x_{il}^{2}}\right\}E\left\{e^{D_{2}\epsilon_{i}^{2}}\right\}\right]^{1/2}\leq D_{3}.

Then, following the proof of the second inequality, one has

P(|i=1n(xil𝑬i,𝑩˙)|t)\displaystyle P\left(\left|\sum_{i=1}^{n}\left(x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right)\right|\geq t\right) \displaystyle\leq 2exp{t22(2nD22D3+t/D2)},\displaystyle 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\},
\displaystyle P\left(\left|\sum_{i=1}^{n}(x_{il}\epsilon_{i})\right|\geq t\right)\leq 2\exp\left\{-\frac{t^{2}}{2(2nD_{2}^{-2}D_{3}+t/D_{2})}\right\}.
∎

The following lemma is a standard result, the Gaussian comparison inequality (Anderson, 1955).

Lemma 3.

Let X and Y be zero-mean Gaussian random vectors with covariance matrices \Sigma_{X} and \Sigma_{Y}, respectively. If \Sigma_{X}-\Sigma_{Y} is positive semi-definite, then for any convex symmetric set C, P(X\in C)\leq P(Y\in C).
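A minimal numerical sketch of Lemma 3 (not part of the paper; the covariance matrices below are arbitrary): enlarging the covariance in the positive semi-definite order can only move Gaussian mass out of a centered symmetric convex set.

```python
# Illustrative sketch only: Monte Carlo check of the Gaussian comparison
# inequality (Lemma 3) in dimension 2, with Sigma_X - Sigma_Y positive
# semi-definite and C the symmetric convex box [-1, 1]^2.
import numpy as np

rng = np.random.default_rng(2)
Sigma_Y = np.array([[1.0, 0.3],
                    [0.3, 1.0]])
Sigma_X = Sigma_Y + np.diag([0.5, 0.2])   # Sigma_X - Sigma_Y is PSD

reps = 500_000
X = rng.multivariate_normal(np.zeros(2), Sigma_X, size=reps)
Y = rng.multivariate_normal(np.zeros(2), Sigma_Y, size=reps)

inside = lambda Z: np.all(np.abs(Z) <= 1.0, axis=1)   # membership in C
print(inside(X).mean(), "<=", inside(Y).mean())       # P(X in C) <= P(Y in C)
```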

18 Proofs of theorems

Proof of Theorem 1.

We can write

P{1(^1^2)}\displaystyle P\left\{\mathcal{M}_{1}\subset\left(\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\right)\right\}
\displaystyle= P\left\{\cap_{l\in\mathcal{M}_{1}}\left\{l\in\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\right\}\right\}
\displaystyle= 1-P\left\{\cup_{l\in\mathcal{M}_{1}}\left\{l\not\in\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\right\}\right\}
\displaystyle\geq 1-\sum_{l\in\mathcal{M}_{1}}P\left(l\not\in\widehat{\mathcal{M}}_{1}^{*},\,l\not\in\widehat{\mathcal{M}}_{2}\right)
=\displaystyle= 1l1P(|βlM^|γ1,n,𝑪^lMopγ2,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n},\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)
\displaystyle\geq 1l12P(𝑪^lMopγ2,n)l12cP(|βlM^|γ1,n).\displaystyle 1-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right).

Firstly, recall that 𝑪˙lM=cov(l2xil𝑪˙l,xil)\dot{\bm{C}}_{l}^{M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}*\dot{\bm{C}}_{l^{\prime}},x_{il}) i.e. C˙l,jkM=cov(l2xilC˙l,jk,\dot{C}_{l,jk}^{M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}\dot{C}_{l^{\prime},jk}, xil)=n1i=1nE(xilZi,jk)x_{il})=n^{-1}\sum_{i=1}^{n}E(x_{il}Z_{i,jk}). For every 1jp1\leq j\leq p, 1kq1\leq k\leq q and 1lsn1\leq l\leq s_{n}, we have

C^l,jkMC˙l,jkM=n1i=1n[xilZi,jkE(xilZi,jk)].\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}=n^{-1}\sum_{i=1}^{n}\left[x_{il}Z_{i,jk}-E(x_{il}Z_{i,jk})\right].

It follows from Assumptions (A0), (A1), (A2) and Lemma 2 that for any t>0t>0, one has

\displaystyle P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|\geq t\right)=P\left(\left|\sum_{i=1}^{n}\left[x_{il}Z_{i,jk}-E(x_{il}Z_{i,jk})\right]\right|\geq nt\right)
=\displaystyle= P(|l2i=1n[xilxilE(xilxil)]C˙l,jk+i=1nxilEi,jk|nt)\displaystyle P\left(\left|\sum_{l^{\prime}\in\mathcal{M}_{2}}\sum_{i=1}^{n}\left[x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\right]\dot{C}_{l^{\prime},jk}+\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq nt\right)
\displaystyle\leq l2P(|i=1n[xilxilE(xilxil)]|ntb(s2+1))+P(|i=1nxilEi,jk|nts2+1)\displaystyle\sum_{l^{\prime}\in\mathcal{M}_{2}}P\left(\left|\sum_{i=1}^{n}\left[x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\right]\right|\geq\frac{nt}{b(s_{2}+1)}\right)+P\left(\left|\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq\frac{nt}{s_{2}+1}\right)
\displaystyle\leq 2s_{2}\exp\left[-\frac{nt^{2}b^{-2}(s_{2}+1)^{-2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}t\}}\right]
+2exp[nt2(s2+1)22{2D22D3+D21(s2+1)1t}].\displaystyle+2\exp\left[-\frac{nt^{2}(s_{2}+1)^{-2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}t\}}\right].

Therefore, for every l2l\in\mathcal{M}_{2}, we have

P(𝑪^lMopγ2,n)P(𝑪^lM𝑪˙lMopD1(pq)1/2nκγ2,n)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)\leq P\left(\|\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}\|_{op}\geq D_{1}(pq)^{1/2}n^{-\kappa}-\gamma_{2,n}\right)
\displaystyle\leq P(𝑪^lM𝑪˙lMF(pq)1/2(1α)D1nκ)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}\|_{F}\geq(pq)^{1/2}(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle= P\left(\sum_{j,k}\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|^{2}\geq pq\left\{(1-\alpha)D_{1}n^{-\kappa}\right\}^{2}\right)
\displaystyle\leq j,kP(|C^l,jkMC˙l,jkM|2{(1α)D1nκ}2)\displaystyle\sum_{j,k}P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|^{2}\geq\left\{(1-\alpha)D_{1}n^{-\kappa}\right\}^{2}\right)
\displaystyle\leq j,kP(|C^l,jkMC˙l,jkM|(1α)D1nκ)\displaystyle\sum_{j,k}P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq 2pq(s2exp[n12κ{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1nκ}]\displaystyle 2pq\left(s_{2}\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}n^{-\kappa}\}}\right]\right.
+exp[n12κ{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1nκ}])\displaystyle\left.+\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}n^{-\kappa}\}}\right]\right)
\displaystyle\leq 2pq(s2exp[n12κ{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1}]\displaystyle 2pq\left(s_{2}\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right]\right.
+exp[n12κ{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1}])\displaystyle\left.+\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right]\right)

Let

d0\displaystyle d_{0} =\displaystyle= min[{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1},\displaystyle\min\left[\frac{\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}},\right.
{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1}],\displaystyle\left.\frac{\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right],

Then, for every l\in\mathcal{M}_{2}, we have

P(𝑪^lMopγ2,n)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right) \displaystyle\leq 2pq(s2+1)exp(d0n12κ),\displaystyle 2pq(s_{2}+1)\exp(-d_{0}n^{1-2\kappa}), (5)

Let us consider P\left(|\widehat{\beta}_{l}^{M}|\leq\gamma_{1,n}\right). Recall that \dot{\beta}_{l}^{M}=\dot{\beta}_{l}^{*M}+\langle\dot{\bm{C}}_{l}^{M},\dot{\bm{B}}\rangle, \dot{\beta}_{l}^{*M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{1}}x_{il^{\prime}}\dot{\beta}_{l^{\prime}}^{*},x_{il}) and \dot{\beta}_{l}^{M}=n^{-1}\sum_{i=1}^{n}E(x_{il}Y_{i}). For every 1\leq l\leq s_{n}, we have

β^lMβ˙lM=n1i=1n{xilYiE(xilYi)}.\widehat{\beta}_{l}^{M}-\dot{\beta}_{l}^{M}=n^{-1}\sum_{i=1}^{n}\left\{x_{il}Y_{i}-E(x_{il}Y_{i})\right\}.

It follows from Assumptions (A0), (A1), (A2) and Lemma 2 that for any t>0t>0, we have

P(|β^lMβ˙lM|t)=P[|i=1n{xilYiE(xilYi)}|nt]\displaystyle P\left(\left|\widehat{\beta}_{l}^{M}-\dot{\beta}_{l}^{M}\right|\geq t\right)=P\left[\left|\sum_{i=1}^{n}\left\{x_{il}Y_{i}-E(x_{il}Y_{i})\right\}\right|\geq nt\right]
=\displaystyle= P[|l1i=1n{xilxilE(xilxil)}β˙lM+l2i=1n{xilxilE(xilxil)}𝑪˙lM,𝑩˙+\displaystyle P\left[\left|\sum_{l^{\prime}\in\mathcal{M}_{1}}\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\dot{\beta}_{l^{\prime}}^{*M}+\sum_{l^{\prime}\in\mathcal{M}_{2}}\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\langle\dot{\bm{C}}_{l^{\prime}}^{M},\dot{\bm{B}}\rangle+\right.\right.
+\displaystyle+ i=1n{xil𝑬i,𝑩˙E(xil)E(𝑬i,𝑩˙)}+i=1nxilϵi|nt]\displaystyle\left.\left.\sum_{i=1}^{n}\left\{x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle-E(x_{il})E\left(\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right)\right\}+\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq nt\right]
\displaystyle\leq P[l1|i=1n{xilxilE(xilxil)}|b+l2|i=1n{xilxilE(xilxil)}|b\displaystyle P\left[\sum_{l^{\prime}\in\mathcal{M}_{1}}\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|b+\sum_{l^{\prime}\in\mathcal{M}_{2}}\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|b\right.
+|i=1nxil𝑬i,𝑩˙|+|i=1nxilϵi|nt]\displaystyle+\left.\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|+\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq nt\right]
\displaystyle\leq l1P[|i=1n{xilxilE(xilxil)}|ntb(s1+s2+2)]\displaystyle\sum_{l^{\prime}\in\mathcal{M}_{1}}P\left[\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|\geq\frac{nt}{b(s_{1}+s_{2}+2)}\right]
+l2P[|i=1n{xilxilE(xilxil)}|ntb(s1+s2+2)]\displaystyle+\sum_{l^{\prime}\in\mathcal{M}_{2}}P\left[\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|\geq\frac{nt}{b(s_{1}+s_{2}+2)}\right]
+P(|i=1nxil𝑬i,𝑩˙|nts1+s2+2)+P(|i=1nxilϵi|nts1+s2+2)\displaystyle+P\left(\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|\geq\frac{nt}{s_{1}+s_{2}+2}\right)+P\left(\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq\frac{nt}{s_{1}+s_{2}+2}\right)
\displaystyle\leq 2(s1+s2)exp[nt2(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1t}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{nt^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}t\right\}}\right]
+4exp[nt2(s1+s2+2)22{2D22D3+D21(s1+s2+2)1t}].\displaystyle+4\exp\left[-\frac{nt^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}t\right\}}\right].

For l\in\mathcal{M}_{1}\cap\mathcal{M}^{c}_{2}, we have \langle\dot{\bm{C}}^{M}_{l},\dot{\bm{B}}\rangle=0. Under Assumption (A1) and the previous derivation, we have

P(|β^Ml|γ1,n)=P(|β^Ml|γ1,n)P(|β˙lM||β^Ml|D1nκγ1,n)\displaystyle P\left(|\widehat{\beta}^{M}_{l}|\leq\gamma_{1,n}\right)=P\left(-|\widehat{\beta}^{M}_{l}|\geq-\gamma_{1,n}\right)\leq P\left(|\dot{\beta}_{l}^{*M}|-|\widehat{\beta}^{M}_{l}|\geq D_{1}n^{-\kappa}-\gamma_{1,n}\right)
=\displaystyle= P(|β˙lM||𝑪˙Ml,𝑩˙||β^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{*M}|-|\langle\dot{\bm{C}}^{M}_{l},\dot{\bm{B}}\rangle|-|\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq P(|β˙lM||β^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{M}|-|\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq P(|β˙lMβ^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{M}-\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq 2(s1+s2)exp[n12κ(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1nκ}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}n^{-\kappa}\right\}}\right]
\displaystyle+4\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}n^{-\kappa}\right\}}\right]
\displaystyle\leq 2(s1+s2)exp[n12κ(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right]
+4exp[n12κ(1α)2D12(s1+s2+2)22{2D22D3+D21(s1+s2+2)1(1α)D1}]\displaystyle+4\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right]

Let

d1\displaystyle d_{1} =\displaystyle= min[(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1},\displaystyle\min\left[\frac{(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}},\right.
(1α)2D12(s1+s2+2)22{2D22D3+D21(s1+s2+2)1(1α)D1}],\displaystyle\left.\frac{(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right],

Then, for each l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}, we have

\displaystyle P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)\leq 2(s_{1}+s_{2}+2)\exp\left(-d_{1}n^{1-2\kappa}\right). (6)

In sum, by Assumption (A5) together with (5) and (6), we have

P{1(^1^2)}\displaystyle P\left\{\mathcal{M}_{1}\subset\left(\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\right)\right\}
\displaystyle\geq 1l12P(𝑪^lMopγ2,n)l12cP(|βlM^|γ1,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)
\displaystyle\geq 12pqs2(s2+1)exp(d0n12κ)2s1(s1+s2+2)exp(d1n12κ)\displaystyle 1-2pqs_{2}(s_{2}+1)\exp(-d_{0}n^{1-2\kappa})-2s_{1}(s_{1}+s_{2}+2)\exp\left(-d_{1}n^{1-2\kappa}\right)
\displaystyle\geq 1d0pqexp(d1n12κ)1, as n,\displaystyle 1-d_{0}^{\prime}pq\exp\left(-d_{1}^{\prime}n^{1-2\kappa}\right)\to 1,\quad\textrm{ as }n\to\infty,

for some positive constants d0d_{0}^{\prime} and d1d_{1}^{\prime}. Therefore, P(1^)1P(\mathcal{M}_{1}\subset\widehat{\mathcal{M}})\to 1 as nn\to\infty. ∎
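For a computational view of what Theorem 1 guarantees, the following is a minimal sketch (not from the paper; all data shapes, seeds and constants are hypothetical) of the marginal screening rule it analyzes, based on the sample quantities \widehat{\beta}_{l}^{M}=n^{-1}\sum_{i}x_{il}Y_{i} and \widehat{\bm{C}}_{l}^{M}=n^{-1}\sum_{i}x_{il}\bm{Z}_{i} used in the proof, with thresholds \gamma_{1,n}=\alpha D_{1}n^{-\kappa} and \gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa}.

```python
# Illustrative sketch only: marginal screening with hypothetical data.
# Keep covariate l if |beta_hat_l| >= gamma1 or ||C_hat_l||_op >= gamma2.
import numpy as np

def screen(X, Y, Z, gamma1, gamma2):
    """X: (n, s_n) covariates; Y: (n,) response; Z: (n, p, q) image matrices."""
    n = X.shape[0]
    beta_hat = X.T @ Y / n                        # marginal effects on Y
    C_hat = np.einsum("il,ijk->ljk", X, Z) / n    # marginal effects on Z
    op_norms = np.array([np.linalg.norm(C, ord=2) for C in C_hat])
    return np.flatnonzero((np.abs(beta_hat) >= gamma1) | (op_norms >= gamma2))

rng = np.random.default_rng(3)
n, s, p, q = 200, 1000, 8, 8                      # hypothetical sizes
X = rng.normal(size=(n, s))
X /= X.std(axis=0)                                # approximate column normalization
Y = X[:, 0] + rng.normal(size=n)                  # covariate 0 acts on Y directly
Z = rng.normal(size=(n, p, q)) + 0.8 * X[:, 1][:, None, None]  # covariate 1 acts on Z

alpha, D1, kappa = 0.5, 1.0, 0.25                 # hypothetical constants
gamma1 = alpha * D1 * n**(-kappa)
gamma2 = alpha * D1 * np.sqrt(p * q) * n**(-kappa)
keep = screen(X, Y, Z, gamma1, gamma2)
print(len(keep), 0 in keep, 1 in keep)            # sure screening keeps 0 and 1
```

As in the theory, the rule over-selects (it keeps spurious covariates along with the signals); Theorem 2 is what controls the size of the retained set.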

Proof of Theorem 2.

The proof consists of two steps. In step 1, we will show that P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0})\to 1, where \mathcal{M}^{0}=\mathcal{M}^{0}_{1}\cup\mathcal{M}^{0}_{2}, \mathcal{M}^{0}_{1}=\left\{1\leq l\leq s_{n}:|\dot{\beta}_{l}^{M}|\geq\gamma_{1,n}/2\right\} and \mathcal{M}^{0}_{2}=\left\{1\leq l\leq s_{n}:\left\|\dot{\bm{C}}^{M}_{l}\right\|_{op}\geq\gamma_{2,n}/2\right\}. Recall that \widehat{\mathcal{M}}=\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}=\left\{1\leq l\leq s_{n}:|\widehat{\beta}^{M}_{l}|\geq\gamma_{1,n}\right\}\cup\left\{1\leq l\leq s_{n}:\left\|\widehat{\bm{C}}_{l}^{M}\right\|_{op}\geq\gamma_{2,n}\right\}. Taking \gamma_{1,n}=\alpha D_{1}n^{-\kappa} and \gamma_{2,n}=\alpha D_{1}(pq)^{1/2}n^{-\kappa} with 0<\alpha<1, we have

P(^0102)\displaystyle P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0}_{1}\cup\mathcal{M}^{0}_{2})
\displaystyle\geq P[1lsn{|βlM^β˙lM|γ1,n2}1lsn{||𝑪^lM𝑪˙lM||opγ2,n2}]\displaystyle P\left[\cap_{1\leq l\leq s_{n}}\left\{|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\leq\frac{\gamma_{1,n}}{2}\right\}\cap_{1\leq l\leq s_{n}}\left\{||\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}||_{op}\leq\frac{\gamma_{2,n}}{2}\right\}\right]
=\displaystyle= 1P[1lsn{|βlM^β˙lM|γ1,n2}1lsn{||𝑪^lM𝑪˙lM||opγ2,n2}]\displaystyle 1-P\left[\cup_{1\leq l\leq s_{n}}\{|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\geq\frac{\gamma_{1,n}}{2}\}\cup_{1\leq l\leq s_{n}}\{||\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}||_{op}\geq\frac{\gamma_{2,n}}{2}\}\right]
\displaystyle\geq 11lsn{P(|βlM^β˙lM|γ1,n2)+P(||𝑪^lM𝑪˙lM||opγ2,n2)}\displaystyle 1-\sum_{1\leq l\leq s_{n}}\left\{P\left(|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\geq\frac{\gamma_{1,n}}{2}\right)+P\left(||\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}||_{op}\geq\frac{\gamma_{2,n}}{2}\right)\right\}
\displaystyle\geq 11lsnP(|βlM^β˙lM|γ1,n2)1lsnP(||𝑪^lM𝑪˙lM||Fγ2,n2)\displaystyle 1-\sum_{1\leq l\leq s_{n}}P\left(|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\geq\frac{\gamma_{1,n}}{2}\right)-\sum_{1\leq l\leq s_{n}}P\left(||\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}||_{F}\geq\frac{\gamma_{2,n}}{2}\right)
\displaystyle\geq 11lsnP(|βlM^β˙lM|αD1nκ/2)\displaystyle 1-\sum_{1\leq l\leq s_{n}}P\left(|\widehat{{\beta}_{l}^{M}}-\dot{\beta}_{l}^{M}|\geq\alpha D_{1}n^{-\kappa}/2\right)
1lsnj,kP(|C^l,jkMC˙l,jkM|αD1nκ/2)\displaystyle-\sum_{1\leq l\leq s_{n}}\sum_{j,k}P\left(|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}|\geq\alpha D_{1}n^{-\kappa}/2\right)
\displaystyle\geq 12sn{(s1+s2)exp[α2D12(4b)2(s1+s2+2)2n12κ2{2D22eD2σxD3+D21(4b)1(s1+s2+2)1αD1nκ}]\displaystyle 1-2s_{n}\left\{(s_{1}+s_{2})\exp\left[-\frac{\alpha^{2}D_{1}^{2}(4b)^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(4b)^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}n^{-\kappa}\right\}}\right]\right.
+2exp[α2D1222(s1+s2+2)2n12κ2{2D22D3+D21(s1+s2+2)1αD1nκ}]}\displaystyle\left.+2\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}n^{-\kappa}\right\}}\right]\right\}
\displaystyle-2s_{n}pq\left\{s_{2}\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}b^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}n^{-\kappa}\}}\right]\right.
+exp[α2D1222(s2+1)2n12κ2{2D22D3+D21(s2+1)121αD1nκ}]}\displaystyle\left.+\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}n^{-\kappa}\}}\right]\right\}
\displaystyle\geq 12sn{(s1+s2)exp[α2D12(4b)2(s1+s2+2)2n12κ2{2D22eD2σxD3+D21(4b)1(s1+s2+2)1αD1}]\displaystyle 1-2s_{n}\left\{(s_{1}+s_{2})\exp\left[-\frac{\alpha^{2}D_{1}^{2}(4b)^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(4b)^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}\right\}}\right]\right.
+2exp[α2D1222(s1+s2+2)2n12κ2{2D22D3+D21(s1+s2+2)1αD1}]}\displaystyle\left.+2\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}\right\}}\right]\right\}
\displaystyle-2s_{n}pq\left\{s_{2}\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}b^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}\}}\right]\right.
+exp[α2D1222(s2+1)2n12κ2{2D22D3+D21(s2+1)121αD1}]}\displaystyle\left.+\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}\}}\right]\right\}
=\displaystyle= 12exp(D4nξ){(s1+s2)exp[α2D12(4b)2(s1+s2+2)2n12κ2{2D22eD2σxD3+D21(4b)1(s1+s2+2)1αD1}]\displaystyle 1-2\exp(D_{4}n^{\xi})\left\{(s_{1}+s_{2})\exp\left[-\frac{\alpha^{2}D_{1}^{2}(4b)^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(4b)^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}\right\}}\right]\right.
+2exp[α2D1222(s1+s2+2)2n12κ2{2D22D3+D21(s1+s2+2)1αD1}]}\displaystyle\left.+2\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{1}+s_{2}+2)^{-2}n^{1-2\kappa}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}\alpha D_{1}\right\}}\right]\right\}
\displaystyle-2pq\exp(D_{4}n^{\xi})\left\{s_{2}\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}b^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}\}}\right]\right.
+exp[α2D1222(s2+1)2n12κ2{2D22D3+D21(s2+1)121αD1}]}\displaystyle\left.+\exp\left[-\frac{\alpha^{2}D_{1}^{2}2^{-2}(s_{2}+1)^{-2}n^{1-2\kappa}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}2^{-1}\alpha D_{1}\}}\right]\right\}

By Assumptions (A3) and (A5), we have

P(^0102)1d2exp(d3n12κ),P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0}_{1}\cup\mathcal{M}^{0}_{2})\geq 1-d_{2}\exp(-d_{3}n^{1-2\kappa}),

for some positive constants d_{2} and d_{3}. Therefore, we have P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0})\to 1 as n\to\infty.

In step 2, we will show that |0|=O(n2κ+τ)|\mathcal{M}^{0}|=O(n^{2\kappa+\tau}). As |0|=|0102||01|+|02|\left|\mathcal{M}^{0}\right|=\left|\mathcal{M}^{0}_{1}\cup\mathcal{M}^{0}_{2}\right|\leq\left|\mathcal{M}^{0}_{1}\right|+\left|\mathcal{M}^{0}_{2}\right|, we only need to show that both |01|\left|\mathcal{M}^{0}_{1}\right| and |02|\left|\mathcal{M}^{0}_{2}\right| are O(n2κ+τ)O(n^{2\kappa+\tau}).

Define \mathcal{M}^{1}_{1}=\left\{1\leq l\leq s_{n}:\left|\dot{\beta}_{l}^{M}\right|^{2}\geq\gamma_{1,n}^{2}/4\right\}; then \mathcal{M}^{0}_{1}\subset\mathcal{M}^{1}_{1}. By the definition of \mathcal{M}^{1}_{1}, we have

|11|γ1,n2/4\displaystyle\left|\mathcal{M}^{1}_{1}\right|\gamma_{1,n}^{2}/4 \displaystyle\leq l=1sn|β˙lM|2=l=1sn(E[xilYi])2=E[𝒙iYi]2.\displaystyle\sum_{l=1}^{s_{n}}\left|\dot{\beta}_{l}^{M}\right|^{2}=\sum_{l=1}^{s_{n}}\left(E\left[x_{il}Y_{i}\right]\right)^{2}=\left\|E\left[\bm{x}_{i}*Y_{i}\right]\right\|^{2}.

Define \dot{\bm{\beta}}^{*}=(\dot{\beta}_{1}^{*},\ldots,\dot{\beta}_{s_{n}}^{*})^{T} and \dot{\bm{c}}=(\langle\dot{\bm{C}}_{1},\dot{\bm{B}}\rangle,\ldots,\langle\dot{\bm{C}}_{s_{n}},\dot{\bm{B}}\rangle)^{T}; then we can write

Yi\displaystyle Y_{i} =\displaystyle= 𝒙iT(𝜷˙+𝒄˙)+𝑬i,𝑩˙+ϵi.\displaystyle\bm{x}_{i}^{\mathrm{\scriptscriptstyle T}}\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right)+\langle\bm{E}_{i},\dot{\bm{B}}\rangle+\epsilon_{i}.

Multiplying both sides by \bm{x}_{i} and taking expectations yields E\left[\bm{x}_{i}*Y_{i}\right]=\Sigma_{x}\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right). Therefore, we have

\displaystyle\left|\mathcal{M}^{1}_{1}\right|\gamma_{1,n}^{2}/4\leq\left\|\Sigma_{x}\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right)\right\|^{2}\leq\lambda_{max}(\Sigma_{x})\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right)^{T}\left(\dot{\bm{\beta}}^{*}+\dot{\bm{c}}\right)\leq 4b^{2}\lambda_{max}(\Sigma_{x}).

By Assumption (A4), we have \left|\mathcal{M}^{1}_{1}\right|\leq 4b^{2}\lambda_{max}(\Sigma_{x})\gamma_{1,n}^{-2}=O(n^{2\kappa+\tau}). This implies that \left|\mathcal{M}^{0}_{1}\right|\leq\left|\mathcal{M}^{1}_{1}\right|=O(n^{2\kappa+\tau}).
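The counting step above is a pigeonhole argument: a vector w can have at most 4\|w\|^{2}/\gamma^{2} coordinates of magnitude at least \gamma/2. A short numerical sketch (not from the paper; the vector below is arbitrary):

```python
# Illustrative sketch only: the pigeonhole bound behind |M_1^1| -- the number
# of coordinates with |w_l| >= gamma/2 is at most 4 * ||w||^2 / gamma^2.
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=500) * rng.binomial(1, 0.05, size=500)  # sparse signal vector
gamma = 0.5
count = np.sum(np.abs(w) >= gamma / 2)
print(count, "<=", 4 * np.sum(w**2) / gamma**2)
```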

Define \mathcal{M}^{1}_{2}=\left\{1\leq l\leq s_{n}:\left\|\dot{\bm{C}}_{l}^{M}\right\|_{F}^{2}\geq\gamma_{2,n}^{2}/4\right\}. As \left\|\dot{\bm{C}}_{l}^{M}\right\|_{op}\leq\left\|\dot{\bm{C}}_{l}^{M}\right\|_{F}, we have \mathcal{M}^{0}_{2}\subset\mathcal{M}^{1}_{2}. By the definition of \mathcal{M}^{1}_{2}, we have

|12|γ2,n2/4\displaystyle\left|\mathcal{M}^{1}_{2}\right|\gamma_{2,n}^{2}/4 \displaystyle\leq l=1sn𝑪˙lM2F\displaystyle\sum_{l=1}^{s_{n}}\left\|\dot{\bm{C}}_{l}^{M}\right\|^{2}_{F}
=\displaystyle= j,kl=1sn(C˙l,jkM)2=j,kl=1sn(E[xilZi,jk])2=j,kE[𝒙iZi,jk]2.\displaystyle\sum_{j,k}\sum_{l=1}^{s_{n}}(\dot{C}_{l,jk}^{M})^{2}=\sum_{j,k}\sum_{l=1}^{s_{n}}\left(E\left[x_{il}Z_{i,jk}\right]\right)^{2}=\sum_{j,k}\left\|E\left[\bm{x}_{i}*Z_{i,jk}\right]\right\|^{2}.

Define \dot{\bm{C}}_{jk}=(\dot{C}_{1,jk},\ldots,\dot{C}_{s_{n},jk})^{T}; then we can write Z_{i,jk}=\bm{x}_{i}^{T}\dot{\bm{C}}_{jk}+E_{i,jk}. Multiplying both sides by \bm{x}_{i} and taking expectations yields E\left[\bm{x}_{i}*Z_{i,jk}\right]=\Sigma_{x}\dot{\bm{C}}_{jk}. Therefore,

\displaystyle\left|\mathcal{M}^{1}_{2}\right|\gamma_{2,n}^{2}/4\leq\sum_{j,k}\left\|\Sigma_{x}\dot{\bm{C}}_{jk}\right\|^{2}\leq\lambda_{max}(\Sigma_{x})\sum_{j,k}\dot{\bm{C}}_{jk}^{T}\dot{\bm{C}}_{jk}\leq pqb^{2}\lambda_{max}(\Sigma_{x}).

By Assumption (A4), we have |12|4pqb2λmax(Σx)γ2,n2=O(n2κ+τ)\left|\mathcal{M}^{1}_{2}\right|\leq 4pqb^{2}\lambda_{max}(\Sigma_{x})\gamma_{2,n}^{-2}=O(n^{2\kappa+\tau}).

Combining the results from the two steps above leads to P\{|\widehat{\mathcal{M}}|=O(n^{2\kappa+\tau})\}\geq P(\widehat{\mathcal{M}}\subset\mathcal{M}^{0})\to 1. ∎

Proof of Theorem 3.

We can write

P{1(^1^2^1block,^2block)}\displaystyle P\left\{\mathcal{M}_{1}\subset\left(\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\cup\widehat{\mathcal{M}}_{1}^{block,*}\cup\widehat{\mathcal{M}}_{2}^{block}\right)\right\}
\displaystyle= P\left\{\cap_{l\in\mathcal{M}_{1}}\left\{l\in\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\cup\widehat{\mathcal{M}}_{1}^{block,*}\cup\widehat{\mathcal{M}}_{2}^{block}\right\}\right\}
\displaystyle= 1-P\left\{\cup_{l\in\mathcal{M}_{1}}\left\{l\not\in\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\cup\widehat{\mathcal{M}}_{1}^{block,*}\cup\widehat{\mathcal{M}}_{2}^{block}\right\}\right\}
\displaystyle\geq 1-\sum_{l\in\mathcal{M}_{1}}P\left(l\not\in\widehat{\mathcal{M}}_{1}^{*},\,l\not\in\widehat{\mathcal{M}}_{2},\,l\not\in\widehat{\mathcal{M}}_{1}^{block,*},\,l\not\in\widehat{\mathcal{M}}_{2}^{block}\right)
=\displaystyle= 1l1P(|βlM^|γ1,n,𝑪^lMopγ2,n,βlblock,M^γ3,n,Clblock,M^γ4,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n},\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n},\widehat{{\beta}_{l}^{block,M}}\leq\gamma_{3,n},\widehat{{C}_{l}^{block,M}}\leq\gamma_{4,n}\right)
\displaystyle\geq 1l12P(𝑪^lMopγ2,n,Clblock,M^γ4,n)l12cP(|βlM^|γ1,n,βlblock,M^γ3,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n},\widehat{{C}_{l}^{block,M}}\leq\gamma_{4,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n},\widehat{{\beta}_{l}^{block,M}}\leq\gamma_{3,n}\right)
\displaystyle\geq 1l12P(𝑪^lMopγ2,n)l12P(Clblock,M^γ4,n)\displaystyle 1-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\widehat{{C}_{l}^{block,M}}\leq\gamma_{4,n}\right)
l12cP(|βlM^|γ1,n)l12cP(βlblock,M^γ3,n)\displaystyle-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)-\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(\widehat{{\beta}_{l}^{block,M}}\leq\gamma_{3,n}\right)

Firstly, recall that 𝑪˙lM=cov(l2xil𝑪˙l,xil)\dot{\bm{C}}_{l}^{M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}*\dot{\bm{C}}_{l^{\prime}},x_{il}) i.e. C˙l,jkM=cov(l2xilC˙l,jk,\dot{C}_{l,jk}^{M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{2}}x_{il^{\prime}}\dot{C}_{l^{\prime},jk}, xil)=n1i=1nE(xilZi,jk)x_{il})=n^{-1}\sum_{i=1}^{n}E(x_{il}Z_{i,jk}). For every 1jp1\leq j\leq p, 1kq1\leq k\leq q and 1lsn1\leq l\leq s_{n}, we have

C^l,jkMC˙l,jkM=n1i=1n[xilZi,jkE(xilZi,jk)].\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}=n^{-1}\sum_{i=1}^{n}\left[x_{il}Z_{i,jk}-E(x_{il}Z_{i,jk})\right].

It follows from Assumptions (A0), (A1), (A2) and Lemma 2 that for any t>0t>0, one has

\displaystyle P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|\geq t\right)=P\left(\left|\sum_{i=1}^{n}\left[x_{il}Z_{i,jk}-E(x_{il}Z_{i,jk})\right]\right|\geq nt\right)
=\displaystyle= P(|l2i=1n[xilxilE(xilxil)]C˙l,jk+i=1nxilEi,jk|nt)\displaystyle P\left(\left|\sum_{l^{\prime}\in\mathcal{M}_{2}}\sum_{i=1}^{n}\left[x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\right]\dot{C}_{l^{\prime},jk}+\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq nt\right)
\displaystyle\leq l2P(|i=1n[xilxilE(xilxil)]|ntb(s2+1))+P(|i=1nxilEi,jk|nts2+1)\displaystyle\sum_{l^{\prime}\in\mathcal{M}_{2}}P\left(\left|\sum_{i=1}^{n}\left[x_{il}x_{il^{\prime}}-E(x_{il}x_{il^{\prime}})\right]\right|\geq\frac{nt}{b(s_{2}+1)}\right)+P\left(\left|\sum_{i=1}^{n}x_{il}E_{i,jk}\right|\geq\frac{nt}{s_{2}+1}\right)
\displaystyle\leq 2s_{2}\exp\left[-\frac{nt^{2}b^{-2}(s_{2}+1)^{-2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}t\}}\right]
+2exp[nt2(s2+1)22{2D22D3+D21(s2+1)1t}].\displaystyle+2\exp\left[-\frac{nt^{2}(s_{2}+1)^{-2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}t\}}\right].

Therefore, for every l2l\in\mathcal{M}_{2}, we have

P(𝑪^lMopγ2,n)P(𝑪^lM𝑪˙lMopD1(pq)1/2nκγ2,n)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)\leq P\left(\|\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}\|_{op}\geq D_{1}(pq)^{1/2}n^{-\kappa}-\gamma_{2,n}\right)
\displaystyle\leq P(𝑪^lM𝑪˙lMF(pq)1/2(1α)D1nκ)\displaystyle P\left(\|\widehat{{\bm{C}}}_{l}^{M}-\dot{\bm{C}}_{l}^{M}\|_{F}\geq(pq)^{1/2}(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle= P\left(\sum_{j,k}\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|^{2}\geq pq\left\{(1-\alpha)D_{1}n^{-\kappa}\right\}^{2}\right)
\displaystyle\leq j,kP(|C^l,jkMC˙l,jkM|2{(1α)D1nκ}2)\displaystyle\sum_{j,k}P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|^{2}\geq\left\{(1-\alpha)D_{1}n^{-\kappa}\right\}^{2}\right)
\displaystyle\leq j,kP(|C^l,jkMC˙l,jkM|(1α)D1nκ)\displaystyle\sum_{j,k}P\left(\left|\widehat{C}_{l,jk}^{M}-\dot{C}_{l,jk}^{M}\right|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq 2pq(s2exp[n12κ{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1nκ}]\displaystyle 2pq\left(s_{2}\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}n^{-\kappa}\}}\right]\right.
+exp[n12κ{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1nκ}])\displaystyle\left.+\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}n^{-\kappa}\}}\right]\right)
\displaystyle\leq 2pq(s2exp[n12κ{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1}]\displaystyle 2pq\left(s_{2}\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right]\right.
+exp[n12κ{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1}])\displaystyle\left.+\exp\left[-\frac{n^{1-2\kappa}\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right]\right)

Let

d0\displaystyle d_{0} =\displaystyle= min[{(1α)D1b1(s2+1)1}22{2D22eD2σxD3+D21b1(s2+1)1(1α)D1},\displaystyle\min\left[\frac{\left\{(1-\alpha)D_{1}b^{-1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}b^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}},\right.
{(1α)D1(s2+1)1}22{2D22D3+D21(s2+1)1(1α)D1}],\displaystyle\left.\frac{\left\{(1-\alpha)D_{1}(s_{2}+1)^{-1}\right\}^{2}}{2\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{2}+1)^{-1}(1-\alpha)D_{1}\}}\right],

Then, for every l\in\mathcal{M}_{2}, we have

P(Clblock,M^γ4,n)P(𝑪^lMopγ2,n)\displaystyle P\left(\widehat{{C}_{l}^{block,M}}\leq\gamma_{4,n}\right)\leq P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right) \displaystyle\leq 2pq(s2+1)exp(d0n12κ),\displaystyle 2pq(s_{2}+1)\exp(-d_{0}n^{1-2\kappa}), (7)

Let us consider P\left(|\widehat{\beta}_{l}^{M}|\leq\gamma_{1,n}\right). Recall that \dot{\beta}_{l}^{M}=\dot{\beta}_{l}^{*M}+\langle\dot{\bm{C}}_{l}^{M},\dot{\bm{B}}\rangle, \dot{\beta}_{l}^{*M}=cov(\sum_{l^{\prime}\in\mathcal{M}_{1}}x_{il^{\prime}}\dot{\beta}_{l^{\prime}}^{*},x_{il}) and \dot{\beta}_{l}^{M}=n^{-1}\sum_{i=1}^{n}E(x_{il}Y_{i}). For every 1\leq l\leq s_{n}, we have

β^lMβ˙lM=n1i=1n{xilYiE(xilYi)}.\widehat{\beta}_{l}^{M}-\dot{\beta}_{l}^{M}=n^{-1}\sum_{i=1}^{n}\left\{x_{il}Y_{i}-E(x_{il}Y_{i})\right\}.

It follows from Assumptions (A0), (A1), (A2) and Lemma 2 that for any t>0t>0, we have

P(|β^lMβ˙lM|t)=P[|i=1n{xilYiE(xilYi)}|nt]\displaystyle P\left(\left|\widehat{\beta}_{l}^{M}-\dot{\beta}_{l}^{M}\right|\geq t\right)=P\left[\left|\sum_{i=1}^{n}\left\{x_{il}Y_{i}-E(x_{il}Y_{i})\right\}\right|\geq nt\right]
=\displaystyle= P[|l1i=1n{xilxilE(xilxil)}β˙lM+l2i=1n{xilxilE(xilxil)}𝑪˙lM,𝑩˙+\displaystyle P\left[\left|\sum_{l^{\prime}\in\mathcal{M}_{1}}\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\dot{\beta}_{l^{\prime}}^{*M}+\sum_{l^{\prime}\in\mathcal{M}_{2}}\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\langle\dot{\bm{C}}_{l^{\prime}}^{M},\dot{\bm{B}}\rangle+\right.\right.
+\displaystyle+ i=1n{xil𝑬i,𝑩˙E(xil)E(𝑬i,𝑩˙)}+i=1nxilϵi|nt]\displaystyle\left.\left.\sum_{i=1}^{n}\left\{x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle-E(x_{il})E\left(\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right)\right\}+\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq nt\right]
\displaystyle\leq P[l1|i=1n{xilxilE(xilxil)}|b+l2|i=1n{xilxilE(xilxil)}|b\displaystyle P\left[\sum_{l^{\prime}\in\mathcal{M}_{1}}\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|b+\sum_{l^{\prime}\in\mathcal{M}_{2}}\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|b\right.
+|i=1nxil𝑬i,𝑩˙|+|i=1nxilϵi|nt]\displaystyle+\left.\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|+\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq nt\right]
\displaystyle\leq l1P[|i=1n{xilxilE(xilxil)}|ntb(s1+s2+2)]\displaystyle\sum_{l^{\prime}\in\mathcal{M}_{1}}P\left[\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|\geq\frac{nt}{b(s_{1}+s_{2}+2)}\right]
+l2P[|i=1n{xilxilE(xilxil)}|ntb(s1+s2+2)]\displaystyle+\sum_{l^{\prime}\in\mathcal{M}_{2}}P\left[\left|\sum_{i=1}^{n}\left\{x_{il}x_{il^{\prime}}-E\left(x_{il}x_{il^{\prime}}\right)\right\}\right|\geq\frac{nt}{b(s_{1}+s_{2}+2)}\right]
+P(|i=1nxil𝑬i,𝑩˙|nts1+s2+2)+P(|i=1nxilϵi|nts1+s2+2)\displaystyle+P\left(\left|\sum_{i=1}^{n}x_{il}\langle\bm{E}_{i},\dot{\bm{B}}\rangle\right|\geq\frac{nt}{s_{1}+s_{2}+2}\right)+P\left(\left|\sum_{i=1}^{n}x_{il}\epsilon_{i}\right|\geq\frac{nt}{s_{1}+s_{2}+2}\right)
\displaystyle\leq 2(s1+s2)exp[nt2(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1t}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{nt^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}t\right\}}\right]
+4exp[nt2(s1+s2+2)22{2D22D3+D21(s1+s2+2)1t}].\displaystyle+4\exp\left[-\frac{nt^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}t\right\}}\right].

For l\in\mathcal{M}_{1}\cap\mathcal{M}^{c}_{2}, we have \langle\dot{\bm{C}}^{M}_{l},\dot{\bm{B}}\rangle=0. Under Assumption (A1) and the previous derivation, we have

P(|β^Ml|γ1,n)=P(|β^Ml|γ1,n)P(|β˙lM||β^Ml|D1nκγ1,n)\displaystyle P\left(|\widehat{\beta}^{M}_{l}|\leq\gamma_{1,n}\right)=P\left(-|\widehat{\beta}^{M}_{l}|\geq-\gamma_{1,n}\right)\leq P\left(|\dot{\beta}_{l}^{*M}|-|\widehat{\beta}^{M}_{l}|\geq D_{1}n^{-\kappa}-\gamma_{1,n}\right)
=\displaystyle= P(|β˙lM||𝑪˙Ml,𝑩˙||β^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{*M}|-|\langle\dot{\bm{C}}^{M}_{l},\dot{\bm{B}}\rangle|-|\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq P(|β˙lM||β^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{M}|-|\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq P(|β˙lMβ^Ml|(1α)D1nκ)\displaystyle P\left(|\dot{\beta}_{l}^{M}-\widehat{\beta}^{M}_{l}|\geq(1-\alpha)D_{1}n^{-\kappa}\right)
\displaystyle\leq 2(s1+s2)exp[n12κ(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1nκ}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}n^{-\kappa}\right\}}\right]
\displaystyle+4\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}n^{-\kappa}\right\}}\right]
\displaystyle\leq 2(s1+s2)exp[n12κ(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1}]\displaystyle 2(s_{1}+s_{2})\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right]
+4exp[n12κ(1α)2D12(s1+s2+2)22{2D22D3+D21(s1+s2+2)1(1α)D1}]\displaystyle+4\exp\left[-\frac{n^{1-2\kappa}(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right]

Let

d1\displaystyle d_{1} =\displaystyle= min[(1α)2D12(2b)2(s1+s2+2)22{2D22eD2σxD3+D21(2b)1(s1+s2+2)1(1α)D1},\displaystyle\min\left[\frac{(1-\alpha)^{2}D_{1}^{2}(2b)^{-2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}e^{D_{2}\sigma_{x}}D_{3}+D_{2}^{-1}(2b)^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}},\right.
(1α)2D12(s1+s2+2)22{2D22D3+D21(s1+s2+2)1(1α)D1}],\displaystyle\left.\frac{(1-\alpha)^{2}D_{1}^{2}(s_{1}+s_{2}+2)^{-2}}{2\left\{2D_{2}^{-2}D_{3}+D_{2}^{-1}(s_{1}+s_{2}+2)^{-1}(1-\alpha)D_{1}\right\}}\right],

Then, for each l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}, we have

P(βlblock,M^γ3,n)P(|βlM^|γ1,n)2(s1+s2+2)exp(d1n12κ).\displaystyle P\left(\widehat{{\beta}_{l}^{block,M}}\leq\gamma_{3,n}\right)\leq P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)\leq 2(s_{1}+s_{2}+2)\exp\left(-d_{1}n^{1-2\kappa}\right). (8)

In sum, by Assumption (A5) together with (7) and (8), we have

P{1(^1^2^1block,^2block)}\displaystyle P\left\{\mathcal{M}_{1}\subset\left(\widehat{\mathcal{M}}_{1}^{*}\cup\widehat{\mathcal{M}}_{2}\cup\widehat{\mathcal{M}}_{1}^{block,*}\cup\widehat{\mathcal{M}}_{2}^{block}\right)\right\}
\displaystyle\geq 12l12P(𝑪^lMopγ2,n)2l12cP(|βlM^|γ1,n)\displaystyle 1-2\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}}P\left(\|\widehat{{\bm{C}}}_{l}^{M}\|_{op}\leq\gamma_{2,n}\right)-2\sum_{l\in\mathcal{M}_{1}\cap\mathcal{M}_{2}^{c}}P\left(|\widehat{{\beta}_{l}^{M}}|\leq\gamma_{1,n}\right)
\displaystyle\geq 14pqs2(s2+1)exp(d0n12κ)4s1(s1+s2+2)exp(d1n12κ)\displaystyle 1-4pqs_{2}(s_{2}+1)\exp(-d_{0}n^{1-2\kappa})-4s_{1}(s_{1}+s_{2}+2)\exp\left(-d_{1}n^{1-2\kappa}\right)
\displaystyle\geq 1d0pqexp(d1n12κ)1, as n,\displaystyle 1-d_{0}^{\prime}pq\exp\left(-d_{1}^{\prime}n^{1-2\kappa}\right)\to 1,\quad\textrm{ as }n\to\infty,

for some positive constants d0d_{0}^{\prime} and d1d_{1}^{\prime}. Therefore, P(1^)1P(\mathcal{M}_{1}\subset\widehat{\mathcal{M}})\to 1 as nn\to\infty. ∎

Proof of Theorem 4.

Denote \bm{\theta}^{\widehat{\mathcal{M}}}=(\bm{\beta}^{\widehat{\mathcal{M}},T},\mathrm{vec}(\bm{B})^{T})^{T}, where \bm{\beta}^{\widehat{\mathcal{M}}}=\{\beta_{l}\}_{l\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|} and \bm{B}\in\mathbb{R}^{p\times q}. In addition, for a given pair \bm{\lambda}=(\lambda_{1},\lambda_{2}), we let \dot{\bm{\theta}}^{\widehat{\mathcal{M}}}=(\dot{\bm{\beta}}^{\widehat{\mathcal{M}},T},\mathrm{vec}(\dot{\bm{B}})^{T})^{T} be the true value of \bm{\theta}^{\widehat{\mathcal{M}}}, with \dot{\bm{\beta}}^{\widehat{\mathcal{M}}} and \dot{\bm{B}} the true values of \bm{\beta}^{\widehat{\mathcal{M}}} and \bm{B} respectively, and \widehat{\bm{\theta}}_{\bm{\lambda}}^{\widehat{\mathcal{M}}}=(\widehat{\bm{\beta}}^{\widehat{\mathcal{M}},T},\mathrm{vec}(\widehat{\bm{B}})^{T})^{T} be the proposed estimator of \bm{\theta}^{\widehat{\mathcal{M}}}, with \widehat{\bm{\beta}}^{\widehat{\mathcal{M}}} and \widehat{\bm{B}} the estimators of \bm{\beta}^{\widehat{\mathcal{M}}} and \bm{B} respectively. Furthermore, we let r=\mathrm{rank}(\dot{\bm{B}}) be the true rank of the matrix \dot{\bm{B}}\in\mathbb{R}^{p\times q}, and consider the class of matrices \Theta of rank at most r\leq\min\{p,q\}. For any given matrix \Theta, we let \mathrm{row}(\Theta)\subset\mathbb{R}^{q} and \mathrm{col}(\Theta)\subset\mathbb{R}^{p} denote its row and column space, respectively. Let U\subset\mathbb{R}^{p} and V\subset\mathbb{R}^{q} be a given pair of r-dimensional subspaces.

For a given \bm{\theta}^{\widehat{\mathcal{M}}} and pair (U,V), we define the subspaces \Omega_{1}(\widehat{\mathcal{M}}), \Omega_{2}(U,V), \overline{\Omega}_{1}(\widehat{\mathcal{M}}), \overline{\Omega}_{2}(U,V), \overline{\Omega}^{\perp}_{1}(\widehat{\mathcal{M}}) and \overline{\Omega}^{\perp}_{2}(U,V) as follows:

Ω1(^)\displaystyle{\Omega}_{1}(\widehat{\mathcal{M}}) =\displaystyle= Ω¯1(^):={𝜷^={βl}l^|^||βl=0 for all l1},\displaystyle\overline{\Omega}_{1}(\widehat{\mathcal{M}}):=\left\{\bm{\beta}^{\widehat{\mathcal{M}}}=\{\beta_{l}\}_{l\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|}|\beta_{l}=0\textrm{ for all }l\not\in\mathcal{M}_{1}\right\},
Ω¯1(^)\displaystyle\overline{\Omega}_{1}^{\perp}(\widehat{\mathcal{M}}) :=\displaystyle:= {𝜷^={βl}l^|^||βl=0 for all l1},\displaystyle\left\{\bm{\beta}^{\widehat{\mathcal{M}}}=\{\beta_{l}\}_{l\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|}|\beta_{l}=0\textrm{ for all }l\in\mathcal{M}_{1}\right\},
Ω2(U,V)\displaystyle{\Omega}_{2}(U,V) :=\displaystyle:= {Θp×q|row(Θ)V, and col(Θ)U},\displaystyle\left\{\Theta\in\mathbb{R}^{p\times q}|\mathrm{row}(\Theta)\subset V,\textrm{ and }\mathrm{col}(\Theta)\subset U\right\},
Ω¯2(U,V)\displaystyle\overline{\Omega}_{2}(U,V) :=\displaystyle:= {Θp×q|row(Θ)V, or col(Θ)U},\displaystyle\left\{\Theta\in\mathbb{R}^{p\times q}|\mathrm{row}(\Theta)\subset V,\textrm{ or }\mathrm{col}(\Theta)\subset U\right\},
Ω¯2(U,V)\displaystyle\overline{\Omega}_{2}^{\perp}(U,V) :=\displaystyle:= {Θp×q|row(Θ)V, and col(Θ)U},\displaystyle\left\{\Theta\in\mathbb{R}^{p\times q}|\mathrm{row}(\Theta)\subset V^{\perp},\textrm{ and }\mathrm{col}(\Theta)\subset U^{\perp}\right\},

where \overline{\Omega}_{1}^{\perp}(\widehat{\mathcal{M}}) and \overline{\Omega}_{2}^{\perp}(U,V) are the orthogonal complements of \overline{\Omega}_{1}(\widehat{\mathcal{M}}) and \overline{\Omega}_{2}(U,V), respectively. For simplicity, we will use \Omega_{1}, \overline{\Omega}_{1} and \overline{\Omega}_{1}^{\perp} to denote \Omega_{1}(\widehat{\mathcal{M}}), \overline{\Omega}_{1}(\widehat{\mathcal{M}}) and \overline{\Omega}_{1}^{\perp}(\widehat{\mathcal{M}}), respectively, and use \Omega_{2}, \overline{\Omega}_{2} and \overline{\Omega}_{2}^{\perp} to denote \Omega_{2}(U,V), \overline{\Omega}_{2}(U,V) and \overline{\Omega}_{2}^{\perp}(U,V). It is easy to see that P_{1} and P_{2} are decomposable with respect to the subspace pairs (\Omega_{1},\overline{\Omega}_{1}^{\perp}) and (\Omega_{2},\overline{\Omega}_{2}^{\perp}), respectively. Therefore the regularizer terms P_{1} and P_{2} satisfy condition (G1) of Negahban et al. (2012).
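A minimal numerical sketch of this decomposability (not from the paper; the vectors and matrices below are arbitrary): the \ell_{1} penalty splits exactly over complementary supports, and the nuclear norm splits exactly over matrices whose row spaces and column spaces are mutually orthogonal.

```python
# Illustrative sketch only: decomposability of P_1 (l1 norm) and P_2 (nuclear
# norm) over a subspace pair, as in condition (G1) of Negahban et al. (2012).
import numpy as np

# l1 norm: u supported on M_1, v on its complement.
u = np.array([1.5, -2.0, 0.0, 0.0])
v = np.array([0.0, 0.0, 0.7, -0.3])
print(np.abs(u + v).sum(), "=", np.abs(u).sum() + np.abs(v).sum())

# Nuclear norm: A and B with orthogonal row spaces and orthogonal column spaces.
A = np.outer([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
B = np.outer([0.0, 1.0, -1.0], [0.0, 0.0, 2.0])
nuc = lambda M: np.linalg.norm(M, ord="nuc")
print(nuc(A + B), "=", nuc(A) + nuc(B))   # equal up to floating-point error
```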

Here we define the function F:|^|+pqF:\mathbb{R}^{|\widehat{\mathcal{M}}|+pq}\to\mathbb{R} by

F(𝚫)\displaystyle F(\bm{\Delta}) :=\displaystyle:= l(𝜽˙^+𝚫)l(𝜽˙^)+λ1{P1(𝜷˙^+𝚫1)P1(𝜷˙^)}\displaystyle l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}+\bm{\Delta})-l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}})+\lambda_{1}\left\{P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}+\bm{\Delta}_{1})-P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\}
+λ2{P2(𝑩˙+𝚫2)P2(𝑩˙)},\displaystyle+\lambda_{2}\left\{P_{2}(\dot{\bm{B}}+\bm{\Delta}_{2})-P_{2}(\dot{\bm{B}})\right\},

where \bm{\Delta}=\{\bm{\Delta}_{1}^{T},\mathrm{vec}(\bm{\Delta}_{2})^{T}\}^{T}\in\mathbb{R}^{|\widehat{\mathcal{M}}|+pq} with \bm{\Delta}_{1}\in\mathbb{R}^{|\widehat{\mathcal{M}}|} and \bm{\Delta}_{2}\in\mathbb{R}^{p\times q}. Next, we will derive a lower bound for F(\bm{\Delta}).

Before we formally prove the result, we first introduce the concept of subspace compatibility constant. For a subspace Ω\Omega, the subspace compatibility constant with respect to the pair (P,)(P,\|\cdot\|) is given by

ψ(Ω):=supuΩ\{0}P(u)u.\psi(\Omega):=\sup_{u\in\Omega\backslash\{0\}}\frac{P(u)}{\|u\|}.

Therefore, we have

ψ1(Ω¯1)\displaystyle\psi_{1}(\overline{\Omega}_{1}) =\displaystyle= sup𝜷^Ω¯1\{0}P1(𝜷^)𝜷^=l^|βl|l^βl2l^|βl|2lM^12l^βl2|^|;\displaystyle\sup_{\bm{\beta}^{\widehat{\mathcal{M}}}\in\overline{\Omega}_{1}\backslash\{0\}}\frac{P_{1}(\bm{\beta}^{\widehat{\mathcal{M}}})}{\|\bm{\beta}^{\widehat{\mathcal{M}}}\|}=\frac{\sum_{l\in\widehat{\mathcal{M}}}|\beta_{l}|}{\sqrt{\sum_{l\in\widehat{\mathcal{M}}}{\beta_{l}^{2}}}}\leq\frac{\sqrt{\sum_{l\in\widehat{\mathcal{M}}}|\beta_{l}|^{2}\sum_{l\in\widehat{M}}1^{2}}}{\sqrt{\sum_{l\in\widehat{\mathcal{M}}}{\beta_{l}^{2}}}}\leq\sqrt{|\widehat{\mathcal{M}}|};
ψ2(Ω¯2)\displaystyle\psi_{2}(\overline{\Omega}_{2}) =\displaystyle= sup𝑩Ω¯2\{0}P2(𝑩)𝑩=𝑩𝑩Fr𝑩F𝑩F=r,\displaystyle\sup_{\bm{B}\in\overline{\Omega}_{2}\backslash\{0\}}\frac{P_{2}(\bm{B})}{\|\bm{B}\|}=\frac{\left\|\bm{B}\right\|_{*}}{\|\bm{B}\|_{F}}\leq\frac{\sqrt{r}\left\|\bm{B}\right\|_{F}}{\|\bm{B}\|_{F}}=\sqrt{r},

where the bound for \psi_{1}(\overline{\Omega}_{1}) follows from the Cauchy–Schwarz inequality. Furthermore, we have
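Both bounds are instances of the Cauchy–Schwarz inequality, applied to the coordinates of \bm{\beta}^{\widehat{\mathcal{M}}} and to the singular values of \bm{B}, respectively; a minimal numerical sketch (not from the paper; the sizes below are arbitrary):

```python
# Illustrative sketch only: psi_1 <= sqrt(|M_hat|) (l1 vs l2 norms) and
# psi_2 <= sqrt(r) (nuclear vs Frobenius norms for a rank-r matrix).
import numpy as np

rng = np.random.default_rng(5)

beta = rng.normal(size=7)                 # |M_hat| = 7 hypothetical coefficients
print(np.abs(beta).sum(), "<=", np.sqrt(beta.size) * np.linalg.norm(beta))

p, q, r = 12, 9, 3
B = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))   # rank-r matrix
print(np.linalg.norm(B, ord="nuc"), "<=",
      np.sqrt(r) * np.linalg.norm(B, ord="fro"))
```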

F(𝚫)\displaystyle F(\bm{\Delta}) =\displaystyle= l(𝜽˙^+𝚫)l(𝜽˙^)+λ1{P1(𝜷˙^+𝚫1)P1(𝜷˙^)}\displaystyle l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}+\bm{\Delta})-l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}})+\lambda_{1}\left\{P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}+\bm{\Delta}_{1})-P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\}
+λ2{P2(𝑩˙+𝚫2)P2(𝑩˙)}\displaystyle+\lambda_{2}\left\{P_{2}(\dot{\bm{B}}+\bm{\Delta}_{2})-P_{2}(\dot{\bm{B}})\right\}
\displaystyle\geq l(𝜽˙^),𝚫+ι𝚫2+λ1{P1(𝜷˙^+𝚫1)P1(𝜷˙^)}\displaystyle\langle\nabla l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}\rangle+\iota\left\|\bm{\Delta}\right\|^{2}+\lambda_{1}\left\{P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}+\bm{\Delta}_{1})-P_{1}(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\}
+λ2{P2(𝑩˙+𝚫2)P2(𝑩˙)}\displaystyle+\lambda_{2}\left\{P_{2}(\dot{\bm{B}}+\bm{\Delta}_{2})-P_{2}(\dot{\bm{B}})\right\}
\displaystyle\geq l(𝜽˙^),𝚫+ι𝚫2+λ1[P1(𝚫1,Ω¯1)P1(𝚫1,Ω¯1)2P1{(𝜷˙^)Ω¯1}]\displaystyle\langle\nabla l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}\rangle+\iota\left\|\bm{\Delta}\right\|^{2}+\lambda_{1}\left[P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]
+λ2[P2(𝚫2,Ω¯2)P2(𝚫2,Ω¯2)2P2{(𝑩˙)Ω¯2}],\displaystyle+\lambda_{2}\left[P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right],

where the first inequality follows from Assumption (A6) and the second inequality follows from Lemma 3 in Negahban et al. (2012). By the Cauchy–Schwarz inequality applied to P_{k} and its dual P_{k}^{*}, k=1,2, where P_{k}^{*} is defined as the dual norm of P_{k} such that P_{k}^{*}(\bm{\theta})=\sup_{P_{k}(\bm{\eta})\leq 1}\langle\bm{\theta},\bm{\eta}\rangle, we have |\langle\nabla l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}\rangle|\leq|\langle\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}_{1}\rangle|+|\langle\nabla l(\dot{\bm{B}}),\bm{\Delta}_{2}\rangle|\leq P_{1}^{*}\left\{\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\}P_{1}\left(\bm{\Delta}_{1}\right)+P_{2}^{*}\left\{\nabla l(\dot{\bm{B}})\right\}P_{2}\left(\bm{\Delta}_{2}\right). If \lambda_{k}\geq 2P_{k}^{*}\{\nabla l(\cdot)\} holds, where P_{1}^{*}\{\nabla l(\cdot)\}=\left\|\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\|_{\infty} and P_{2}^{*}\{\nabla l(\cdot)\}=\left\|\nabla l(\dot{\bm{B}})\right\|_{op} (Negahban et al., 2012; Kong et al., 2020), one has |\langle\nabla l(\cdot),\bm{\Delta}_{k}\rangle|\leq\frac{1}{2}\lambda_{k}P_{k}(\bm{\Delta}_{k})\leq\frac{1}{2}\lambda_{k}\left\{P_{k}(\bm{\Delta}_{k,\overline{\Omega}_{k}^{\perp}})+P_{k}(\bm{\Delta}_{k,\overline{\Omega}_{k}})\right\}. Therefore, we have

\begin{align*}
F(\bm{\Delta}) &\geq \langle\nabla l(\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}\rangle+\iota\left\|\bm{\Delta}\right\|^{2}+\lambda_{1}\left[P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]\\
&\quad+\lambda_{2}\left[P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right]\\
&= \langle\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}}),\bm{\Delta}_{1}\rangle+\langle\nabla l(\dot{\bm{B}}),\bm{\Delta}_{2}\rangle+\iota\left\|\bm{\Delta}\right\|^{2}\\
&\quad+\lambda_{1}\left[P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]\\
&\quad+\lambda_{2}\left[P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right]\\
&\geq -\frac{\lambda_{1}}{2}\left\{P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})+P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})\right\}-\frac{\lambda_{2}}{2}\left\{P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})+P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})\right\}+\iota\left\|\bm{\Delta}\right\|^{2}\\
&\quad+\lambda_{1}\left[P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]\\
&\quad+\lambda_{2}\left[P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right]\\
&= \iota\left\|\bm{\Delta}\right\|^{2}+\lambda_{1}\left[\frac{1}{2}P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}^{\perp}})-\frac{3}{2}P_{1}(\bm{\Delta}_{1,\overline{\Omega}_{1}})-2P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}\right]\\
&\quad+\lambda_{2}\left[\frac{1}{2}P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}^{\perp}})-\frac{3}{2}P_{2}(\bm{\Delta}_{2,\overline{\Omega}_{2}})-2P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}\right].
\end{align*}

By the subspace compatibility, we have $P_{k}(\bm{\Delta}_{k,\overline{\Omega}_{k}})\leq\psi_{k}(\overline{\Omega}_{k})\|\bm{\Delta}_{k,\overline{\Omega}_{k}}\|$ for $k=1,2$. Substituting these bounds into the previous inequality, and noting that $P_{1}\{(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})_{\overline{\Omega}_{1}^{\perp}}\}=P_{2}\{(\dot{\bm{B}})_{\overline{\Omega}_{2}^{\perp}}\}=0$, we obtain

\begin{align*}
F(\bm{\Delta}) &\geq \iota\left\|\bm{\Delta}\right\|^{2}-\sum_{k\in\{1,2\}}\frac{3\lambda_{k}}{2}\psi_{k}(\overline{\Omega}_{k})\|\bm{\Delta}_{k,\overline{\Omega}_{k}}\|\\
&\geq \iota\left\|\bm{\Delta}\right\|^{2}-\sum_{k\in\{1,2\}}\frac{3\lambda_{k}}{2}\psi_{k}(\overline{\Omega}_{k})\|\bm{\Delta}_{k}\|\\
&\geq \iota\left\|\bm{\Delta}\right\|^{2}-3\max_{k\in\{1,2\}}\{\lambda_{k}\psi_{k}(\overline{\Omega}_{k})\}\left\|\bm{\Delta}\right\|.
\end{align*}

The right-hand side is a quadratic function of $\|\bm{\Delta}\|$: writing $m=\max_{k\in\{1,2\}}\{\lambda_{k}\psi_{k}(\overline{\Omega}_{k})\}$, it equals $\|\bm{\Delta}\|(\iota\|\bm{\Delta}\|-3m)$, which is strictly positive whenever $\|\bm{\Delta}\|>3m/\iota$. Therefore, as long as $\|\bm{\Delta}\|^{2}>\frac{9}{\iota^{2}}\max^{2}_{k\in\{1,2\}}\{\lambda_{k}\psi_{k}(\overline{\Omega}_{k})\}$, one has $F(\bm{\Delta})>0$. Hence, by Lemma 4 in Negahban et al. (2012), we can establish that

\begin{align*}
\left\|\widehat{\bm{\theta}}^{\widehat{\mathcal{M}}}_{\bm{\lambda}}-\dot{\bm{\theta}}^{\widehat{\mathcal{M}}}\right\|_{2}^{2}\leq C\max\left\{\lambda_{1}^{2}|\widehat{\mathcal{M}}|,\lambda_{2}^{2}r\right\}\iota^{-2},
\end{align*}

for some constant $C>0$. Since $|\widehat{\mathcal{M}}|\leq C_{1}n^{2\kappa+\tau}$ for some constant $C_{1}>0$ by the sure screening property, it follows that there exists some constant $C_{0}>0$ such that

\begin{align*}
\left\|\widehat{\bm{\theta}}_{\bm{\lambda}}-\dot{\bm{\theta}}\right\|_{2}^{2}\leq C_{0}\max\left\{C_{1}n^{2\kappa+\tau}\lambda_{1}^{2},\lambda_{2}^{2}r\right\}\iota^{-2}.
\end{align*}

Now we calculate $P_{1}^{*}(\cdot)$ and $P_{2}^{*}(\cdot)$. According to Kong et al. (2020) and Negahban et al. (2012), we have $P_{1}^{*}(\cdot)=\left\|\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\|_{\infty}$ and $P_{2}^{*}(\cdot)=\left\|\nabla l(\dot{\bm{B}})\right\|_{op}$, where $\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})=-n^{-1}\sum_{i=1}^{n}\epsilon_{i}X_{i}^{\widehat{\mathcal{M}}}$ and $\nabla l(\dot{\bm{B}})=-n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{Z}_{i}$.

We first calculate $\left\|\nabla l(\dot{\bm{\beta}}^{\widehat{\mathcal{M}}})\right\|_{\infty}$. Denote $\bm{\epsilon}=(\epsilon_{1},\ldots,\epsilon_{n})^{\mathrm{T}}$ and $\bm{X}^{\widehat{\mathcal{M}}}=(X_{1}^{\widehat{\mathcal{M}},\mathrm{T}},\ldots,X_{n}^{\widehat{\mathcal{M}},\mathrm{T}})^{\mathrm{T}}$, where $X_{i}^{\widehat{\mathcal{M}}}=(x_{ij})^{\mathrm{T}}_{j\in\widehat{\mathcal{M}}}\in\mathbb{R}^{|\widehat{\mathcal{M}}|}$ for $i=1,\ldots,n$, and let $\bm{x}_{l}^{\widehat{\mathcal{M}}}$, $l=1,\ldots,|\widehat{\mathcal{M}}|$, denote the $l$-th column of $\bm{X}^{\widehat{\mathcal{M}}}$. Since $\bm{X}$ is column-normalized, that is, $\|\bm{x}_{l}\|_{2}/\sqrt{n}=1$ for all $l=1,\ldots,s$, Assumption (A2) implies that there exists a constant $\sigma_{0}>0$ such that $P\left(\left|\langle\bm{x}_{l}^{\widehat{\mathcal{M}}},\bm{\epsilon}\rangle/n\right|\geq t\right)\leq 2\exp\left(-\frac{nt^{2}}{2\sigma_{0}^{2}}\right)$. Applying the union bound, we have
\begin{align*}
P\left(\left\|-n^{-1}\sum_{i=1}^{n}\epsilon_{i}X_{i}^{\widehat{\mathcal{M}}}\right\|_{\infty}\geq t\right)&=P\left(\left\|\bm{X}^{\widehat{\mathcal{M}},\mathrm{T}}\bm{\epsilon}/n\right\|_{\infty}\geq t\right)=P\left(\sup_{l\in\widehat{\mathcal{M}}}\left|\langle\bm{x}_{l}^{\widehat{\mathcal{M}}},\bm{\epsilon}\rangle/n\right|\geq t\right)\\
&\leq 2\exp\left(-\frac{nt^{2}}{2\sigma_{0}^{2}}+\log|\widehat{\mathcal{M}}|\right)\leq 2\exp\left(-\frac{nt^{2}}{2\sigma_{0}^{2}}+C_{1}(2\kappa+\tau)\log n\right).
\end{align*}
Choosing $t^{2}=2n^{-1}\sigma_{0}^{2}\{\log(\log n)+C_{1}(2\kappa+\tau)\log n\}$, we see that if $\lambda_{1}\geq 2\sigma_{0}[2n^{-1}\{\log(\log n)+C_{1}(2\kappa+\tau)\log n\}]^{1/2}$, then the condition $\lambda_{1}\geq 2P_{1}^{*}(\cdot)$ holds with probability at least $1-c_{1}(\log n)^{-1}$ for some positive constant $c_{1}>0$. A Monte Carlo illustration of this union bound is given below.
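The following sketch (illustrative only, not part of the proof; numpy assumed) compares the empirical tail probability of $\|\bm{X}^{\mathrm{T}}\bm{\epsilon}/n\|_{\infty}$ with the union bound $2|\widehat{\mathcal{M}}|\exp\{-nt^{2}/(2\sigma_{0}^{2})\}$ in the Gaussian case, where $\sigma_{0}$ may be taken as the noise standard deviation; $n$, the model size, $t$, and the number of replications are arbitrary choices.

```python
# Monte Carlo illustration of the union bound behind the choice of lambda_1.
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma, reps = 200, 50, 1.0, 2000   # illustrative values only

X = rng.normal(size=(n, m))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # enforce ||x_l||_2 = sqrt(n)

t = 4.0 * sigma / np.sqrt(n)
hits = 0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    hits += np.abs(X.T @ eps / n).max() >= t

union_bound = 2 * m * np.exp(-n * t**2 / (2 * sigma**2))
print(f"empirical tail {hits / reps:.4f} <= union bound {union_bound:.4f}")
```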

Secondly, we calculate $\left\|\nabla l(\dot{\bm{B}})\right\|_{op}$:

\begin{align*}
\left\|\nabla l(\dot{\bm{B}})\right\|_{op} &= \left\|-n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{Z}_{i}\right\|_{op}=\left\|n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\sum_{l\in\mathcal{M}_{2}}X_{il}\dot{\bm{C}}_{l}+\bm{E}_{i}\right)\right\|_{op}\\
&= \left\|n^{-1}\sum_{l\in\mathcal{M}_{2}}\langle\bm{x}_{l},\bm{\epsilon}\rangle\dot{\bm{C}}_{l}+n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\|_{op}\\
&\leq n^{-1}\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\left\|\dot{\bm{C}}_{l}\right\|_{op}+\left\|n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\|_{op}.
\end{align*}

Under Assumption (A1), $n^{-1}\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\left\|\dot{\bm{C}}_{l}\right\|_{op}\leq n^{-1}b\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|$. Therefore, we have

\begin{align*}
P\left(n^{-1}b\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\geq t\right) &= P\left(\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle/n\right|\geq t/b\right)\\
&\leq P\left[\cup_{l\in\mathcal{M}_{2}}\left\{\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle/n\right|\geq\frac{t}{bs_{2}}\right\}\right]\leq\sum_{l\in\mathcal{M}_{2}}P\left(\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle/n\right|\geq\frac{t}{bs_{2}}\right)\\
&\leq\sum_{l\in\mathcal{M}_{2}}P\left(\sup_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle/n\right|\geq\frac{t}{bs_{2}}\right)\leq 2\exp\left(-\frac{nt^{2}}{2b^{2}s_{2}^{2}\sigma_{0}^{2}}+2\log s_{2}\right).
\end{align*}

Therefore, choosing $t=bs_{2}\sigma_{0}[2n^{-1}\{3\log s_{2}+\log(\log n)\}]^{1/2}$, we have $n^{-1}b\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\leq t$ with probability at least $1-c_{2}/(s_{2}\log n)$ for some positive constant $c_{2}$, so that any $t_{1}\geq t$ also satisfies $t_{1}\geq n^{-1}\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\left\|\dot{\bm{C}}_{l}\right\|_{op}$ on that event; a numerical check of this bound is sketched below.
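A quick Monte Carlo check of this step follows (illustrative only; numpy assumed, and $n$, $s_{2}$, $b$, $\sigma_{0}$ and the replication count are arbitrary choices). The displayed bound is loose but valid at the stated choice of $t$.

```python
# Monte Carlo illustration of the tail bound for n^{-1} b sum_l |<x_l, eps>|.
import numpy as np

rng = np.random.default_rng(3)
n, s2, b, sigma, reps = 200, 10, 1.0, 1.0, 2000   # illustrative values only

X = rng.normal(size=(n, s2))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # enforce ||x_l||_2 = sqrt(n)

# the choice of t used in the proof
t = b * s2 * sigma * np.sqrt(2 * (3 * np.log(s2) + np.log(np.log(n))) / n)
hits = 0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    hits += (b / n) * np.abs(X.T @ eps).sum() >= t

bound = 2 * np.exp(-n * t**2 / (2 * b**2 * s2**2 * sigma**2) + 2 * np.log(s2))
print(f"empirical tail {hits / reps:.4f} <= bound {bound:.4f}")
```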

On the other hand, let $\bm{W}_{i}$ be a $p\times q$ random matrix with i.i.d. standard normal entries. By Assumption (A8) and Lemma 3, conditioning on the $\epsilon_{i}$'s we have

\begin{align*}
P\left(\left\|\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\|_{op}\geq t\right)=P\left(\sup_{\|\bm{A}\|_{*}\leq 1}\left\langle\bm{A},\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\rangle\geq t\right)\leq P\left(\sup_{\|\bm{A}\|_{*}\leq 1}\left\langle\bm{A},\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}\right\rangle\geq\frac{t}{C_{U}}\right),
\end{align*}

since $\Sigma_{e}\preceq C_{U}^{2}I_{pq\times pq}$.

Note that $\sup_{\|\bm{A}\|_{*}\leq 1}\langle\bm{A},\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}\rangle=\left\|\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}\right\|_{op}$, since the operator norm is the dual of the nuclear norm, and that, conditioning on $\bm{\epsilon}$, each entry of the matrix $\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}$ is i.i.d. $N(0,\|\bm{\epsilon}\|_{2}^{2})$. Since $\|\bm{\epsilon}\|_{2}^{2}/\sigma_{\epsilon}^{2}$ is a $\chi^{2}$ random variable with $n$ degrees of freedom, one has

\begin{align*}
P\left(\frac{\|\bm{\epsilon}\|_{2}^{2}}{n\sigma_{\epsilon}^{2}}\geq 4\right)\leq\exp(-n)
\end{align*}

using the tail bound for the $\chi^{2}$ distribution given in the corollary of Lemma 1 of Laurent and Massart (2000). Combining this with standard random matrix theory, we know that $\left\|n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{W}_{i}\right\|_{op}\leq 2n^{-1/2}\sigma_{\epsilon}(p^{1/2}+q^{1/2})$ with probability at least $1-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$, where $c_{3}$ and $c_{4}$ are positive constants. Combining this with the previous step, we have $n^{-1}\sum_{l\in\mathcal{M}_{2}}\left|\langle\bm{x}_{l},\bm{\epsilon}\rangle\right|\left\|\dot{\bm{C}}_{l}\right\|_{op}+\left\|n^{-1}\sum_{i=1}^{n}\epsilon_{i}\bm{E}_{i}\right\|_{op}\leq bs_{2}\sigma_{0}[2n^{-1}\{3\log s_{2}+\log(\log n)\}]^{1/2}+2n^{-1/2}\sigma_{\epsilon}(p^{1/2}+q^{1/2})$ with probability at least $1-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$. Therefore, the choice $\lambda_{2}\geq 2bs_{2}\sigma_{0}[2n^{-1}\{3\log s_{2}+\log(\log n)\}]^{1/2}+4n^{-1/2}\sigma_{\epsilon}(p^{1/2}+q^{1/2})$ guarantees $\lambda_{2}\geq 2P_{2}^{*}(\cdot)$ with probability at least $1-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$. The two probabilistic ingredients of this bound are illustrated numerically below.
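Both ingredients, the $\chi^{2}$ tail bound for $\|\bm{\epsilon}\|_{2}^{2}$ and the $(\sqrt{p}+\sqrt{q})/\sqrt{n}$ scaling of $\|n^{-1}\sum_{i}\epsilon_{i}\bm{W}_{i}\|_{op}$, can be checked by simulation. The sketch below is illustrative only (numpy assumed; $n$, $p$, $q$, $\sigma_{\epsilon}$ and the replication count are arbitrary choices).

```python
# Monte Carlo illustration of the two ingredients in the bound for lambda_2.
import numpy as np

rng = np.random.default_rng(4)
n, p, q, sigma_eps, reps = 100, 15, 20, 1.0, 500   # illustrative values only

chi_sq_hits, op_hits = 0, 0
for _ in range(reps):
    eps = rng.normal(scale=sigma_eps, size=n)
    # chi-square tail: P(||eps||_2^2 >= 4 n sigma_eps^2) should be tiny
    chi_sq_hits += (eps @ eps) >= 4 * n * sigma_eps**2
    # operator norm of n^{-1} sum_i eps_i W_i for i.i.d. standard normal W_i
    W = rng.normal(size=(n, p, q))
    M = np.tensordot(eps, W, axes=1) / n
    threshold = 2 * sigma_eps * (np.sqrt(p) + np.sqrt(q)) / np.sqrt(n)
    op_hits += np.linalg.norm(M, 2) >= threshold

print(f"chi-square exceedance frequency: {chi_sq_hits / reps:.4f}")
print(f"operator-norm exceedance frequency: {op_hits / reps:.4f}")
```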

As a result, the event that both $\lambda_{1}$ and $\lambda_{2}$ satisfy the above inequalities holds with probability at least $1-c_{1}/\log n-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}-\exp(-n)$ for some positive constants $c_{1}$, $c_{2}$, $c_{3}$ and $c_{4}$.

In sum, there exist positive constants $c_{1},c_{2},c_{3},c_{4}$ such that, with probability at least $1-c_{1}/\log n-c_{2}/(s_{2}\log n)-c_{3}\exp\{-c_{4}(p+q)\}$, one has

\begin{align*}
\left\|\widehat{\bm{\theta}}_{\bm{\lambda}}-\dot{\bm{\theta}}\right\|_{2}^{2}\leq C_{0}\max\left\{C_{1}\lambda_{1}^{2}n^{2\kappa+\tau},\lambda_{2}^{2}r\right\}\iota^{-2},
\end{align*}

for some constants C0,C1>0C_{0},C_{1}>0. This completes the proof.

References

  • Aizenstein et al. (2008) Aizenstein, H. J., R. D. Nebes, J. A. Saxton, J. C. Price, C. A. Mathis, N. D. Tsopelas, S. K. Ziolko, J. A. James, B. E. Snitz, P. R. Houck, et al. (2008). Frequent amyloid deposition without significant cognitive impairment among the elderly. Archives of Neurology 65(11), 1509–1517.
  • Anderson (1955) Anderson, T. W. (1955). The integral of a symmetric unimodal function over a symmetric convex set and some probability inequalities. Proceedings of the American Mathematical Society 6(2), 170–176.
  • Apostolova et al. (2006) Apostolova, L. G., I. D. Dinov, R. A. Dutton, K. M. Hayashi, A. W. Toga, J. L. Cummings, and P. M. Thompson (2006). 3D comparison of hippocampal atrophy in amnestic mild cognitive impairment and Alzheimer’s disease. Brain 129(11), 2867–2873.
  • Apostolova et al. (2010) Apostolova, L. G., L. Mosconi, P. M. Thompson, A. E. Green, K. S. Hwang, A. Ramirez, R. Mistur, W. H. Tsui, and M. J. de Leon (2010). Subregional hippocampal atrophy predicts Alzheimer’s dementia in the cognitively normal. Neurobiology of Aging 31(7), 1077–1088.
  • Barrett et al. (2005) Barrett, J. C., B. Fry, J. Maller, and M. J. Daly (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21(2), 263–265.
  • Barut et al. (2016) Barut, E., J. Fan, and A. Verhasselt (2016). Conditional sure independence screening. Journal of the American Statistical Association 111(515), 1266–1277.
  • Beck and Teboulle (2009) Beck, A. and M. Teboulle (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202.
  • Bertram and Tanzi (2020) Bertram, L. and R. E. Tanzi (2020). Genomic mechanisms in Alzheimer’s disease. Brain Pathology 30(5), 966–977.
  • Bi et al. (2017) Bi, X., L. Yang, T. Li, B. Wang, H. Zhu, and H. Zhang (2017). Genome-wide mediation analysis of psychiatric and cognitive traits through imaging phenotypes. Human Brain Mapping 38(8), 4088–4097.
  • Blackstad et al. (1970) Blackstad, T., K. Brink, J. Hem, and B. June (1970). Distribution of hippocampal mossy fibers in the rat. An experimental study with silver impregnation methods. Journal of Comparative Neurology 138(4), 433–449.
  • Braak and Braak (1998) Braak, H. and E. Braak (1998). Evolution of neuronal changes in the course of Alzheimer’s disease. In Ageing and Dementia, pp.  127–140. Springer.
  • Brookhart et al. (2006) Brookhart, M. A., S. Schneeweiss, K. J. Rothman, R. J. Glynn, J. Avorn, and T. Stürmer (2006). Variable selection for propensity score models. American Journal of Epidemiology 163(12), 1149–1156.
  • Cai et al. (2010) Cai, J.-F., E. J. Candès, and Z. Shen (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4), 1956–1982.
  • Chung et al. (2014) Chung, S. J., M.-J. Kim, J. Kim, Y. J. Kim, S. You, J. Koh, S. Y. Kim, and J.-H. Lee (2014). Exome array study did not identify novel variants in Alzheimer's disease. Neurobiology of Aging 35(8), 1958.e13.
  • Consortium et al. (2012) The 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65.
  • Coon et al. (2007) Coon, K. D., A. J. Myers, D. W. Craig, J. A. Webster, J. V. Pearson, D. H. Lince, V. L. Zismann, T. G. Beach, D. Leung, L. Bryden, et al. (2007). A high-density whole-genome association study reveals that APOE is the major susceptibility gene for sporadic late-onset Alzheimer’s disease. The Journal of Clinical Psychiatry 68(4), 613–618.
  • De Leon et al. (1989) De Leon, M., A. George, L. Stylopoulos, G. Smith, and D. Miller (1989). Early marker for Alzheimer’s disease: the atrophic hippocampus. The Lancet 334(8664), 672–673.
  • Dehman et al. (2015) Dehman, A., C. Ambroise, and P. Neuvial (2015). Performance of a blockwise approach in variable selection using linkage disequilibrium information. BMC Bioinformatics 16(1), 1–14.
  • Du et al. (2018) Du, L., K. Liu, X. Yao, S. L. Risacher, J. Han, L. Guo, A. J. Saykin, and L. Shen (2018). Fast multi-task SCCA learning with feature selection for multi-modal brain imaging genetics. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp.  356–361. IEEE.
  • Dudek et al. (2016) Dudek, S. M., G. M. Alexander, and S. Farris (2016). Rediscovering area CA2: unique properties and functions. Nature Reviews Neuroscience 17(2), 89–102.
  • Fan and Lv (2008) Fan, J. and J. Lv (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911.
  • Ferrari et al. (2014) Ferrari, D. V., M. E. Avila, M. A. Medina, E. Pérez-Palma, B. I. Bustos, M. A. Alarcon, et al. (2014). Wnt/$\beta$-catenin signaling in Alzheimer's disease. CNS & Neurological Disorders-Drug Targets 13(5), 745–754.
  • Ferreira and Klein (2011) Ferreira, S. T. and W. L. Klein (2011). The A$\beta$ oligomer hypothesis for synapse failure and memory loss in Alzheimer's disease. Neurobiology of Learning and Memory 96(4), 529–543.
  • Fox et al. (1996) Fox, N., E. Warrington, P. Freeborough, P. Hartikainen, A. Kennedy, J. Stevens, and M. N. Rossor (1996). Presymptomatic hippocampal atrophy in Alzheimer’s disease: a longitudinal MRI study. Brain 119(6), 2001–2007.
  • Frozza et al. (2018) Frozza, R. L., M. V. Lourenco, and F. G. De Felice (2018). Challenges for Alzheimer’s disease therapy: insights from novel mechanisms beyond memory defects. Frontiers in Neuroscience 12, 37.
  • Gabriel et al. (2002) Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel, J. Higgins, M. DeFelice, A. Lochner, M. Faggart, et al. (2002). The structure of haplotype blocks in the human genome. Science 296(5576), 2225–2229.
  • Gao et al. (2016) Gao, L., Z. Cui, L. Shen, and H.-F. Ji (2016). Shared genetic etiology between type 2 diabetes and Alzheimer’s disease identified by bioinformatics analysis. Journal of Alzheimer’s Disease 50(1), 13–17.
  • Gaugler et al. (2019) Gaugler, J., B. James, T. Johnson, A. Marin, and J. Weuve (2019). 2019 Alzheimer’s disease facts and figures. Alzheimer’s & Dementia 15(3), 321–387.
  • Guerreiro and Bras (2015) Guerreiro, R. and J. Bras (2015). The age factor in Alzheimer’s disease. Genome Medicine 7(1), 106.
  • Guo et al. (2019) Guo, Y., W. Xu, J.-Q. Li, Y.-N. Ou, X.-N. Shen, Y.-Y. Huang, Q. Dong, L. Tan, and J.-T. Yu (2019). Genome-wide association study of hippocampal atrophy rate in non-demented elders. Aging (Albany NY) 11(22), 10468–10484.
  • Hampel et al. (2018) Hampel, H., M.-M. Mesulam, A. C. Cuello, M. R. Farlow, E. Giacobini, G. T. Grossberg, A. S. Khachaturian, A. Vergallo, E. Cavedo, P. J. Snyder, et al. (2018). The cholinergic system in the pathophysiology and treatment of Alzheimer’s disease. Brain 141(7), 1917–1933.
  • Hao and Zhang (2014) Hao, N. and H. H. Zhang (2014). Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association 109(507), 1285–1301.
  • Hao et al. (2017) Hao, X., C. Li, L. Du, X. Yao, J. Yan, S. L. Risacher, A. J. Saykin, L. Shen, and D. Zhang (2017). Mining outcome-relevant brain imaging genetic associations via three-way sparse canonical correlation analysis in Alzheimer’s disease. Scientific Reports 7(1), 1–12.
  • Hastie et al. (2009) Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data mining, Inference, and Prediction (Second ed.). Springer Series in Statistics. Springer, New York.
  • Hesse et al. (2001) Hesse, C., L. Rosengren, N. Andreasen, P. Davidsson, H. Vanderstichele, E. Vanmechelen, and K. Blennow (2001). Transient increase in total tau but not phospho-tau in human cerebrospinal fluid after acute stroke. Neuroscience Letters 297(3), 187–190.
  • Jack et al. (2000) Jack, C., R. C. Petersen, Y. Xu, P. O'Brien, G. Smith, R. Ivnik, B. F. Boeve, E. G. Tangalos, and E. Kokmen (2000). Rates of hippocampal atrophy correlate with change in clinical status in aging and AD. Neurology 55(4), 484–490.
  • Jack et al. (2003) Jack, C., M. Slomkowski, S. Gracon, T. Hoover, J. Felmlee, K. Stewart, Y. Xu, M. Shiung, P. O’Brien, R. Cha, et al. (2003). MRI as a biomarker of disease progression in a therapeutic trial of milameline for AD. Neurology 60(2), 253–260.
  • Jack Jr et al. (2013) Jack Jr, C. R., D. S. Knopman, W. J. Jagust, R. C. Petersen, M. W. Weiner, P. S. Aisen, L. M. Shaw, P. Vemuri, H. J. Wiste, S. D. Weigand, et al. (2013). Tracking pathophysiological processes in Alzheimer’s disease: an updated hypothetical model of dynamic biomarkers. The Lancet Neurology 12(2), 207–216.
  • Jack Jr et al. (2010) Jack Jr, C. R., D. S. Knopman, W. J. Jagust, L. M. Shaw, P. S. Aisen, M. W. Weiner, R. C. Petersen, and J. Q. Trojanowski (2010). Hypothetical model of dynamic biomarkers of the Alzheimer’s pathological cascade. The Lancet Neurology 9(1), 119–128.
  • Khan et al. (2015) Khan, W., C. Aguilar, S. J. Kiddle, O. Doyle, M. Thambisetty, S. Muehlboeck, M. Sattlecker, S. Newhouse, S. Lovestone, R. Dobson, et al. (2015). A subset of cerebrospinal fluid proteins from a multi-analyte panel associated with brain atrophy, disease classification and prediction in Alzheimer’s disease. PLoS One 10(8), e0134368.
  • Kim et al. (2009) Kim, J., J. M. Basak, and D. M. Holtzman (2009). The role of apolipoprotein E in Alzheimer’s disease. Neuron 63(3), 287–303.
  • Kong et al. (2020) Kong, D., B. An, J. Zhang, and H. Zhu (2020). L2RM: Low-rank linear regression models for high-dimensional matrix responses. Journal of the American Statistical Association 115(529), 403–424.
  • Laurent and Massart (2000) Laurent, B. and P. Massart (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 28(5), 1302–1338.
  • Li et al. (2007) Li, S., F. Shi, F. Pu, X. Li, T. Jiang, S. Xie, and Y. Wang (2007). Hippocampal shape analysis of Alzheimer disease based on machine learning methods. American Journal of Neuroradiology 28(7), 1339–1345.
  • Lin et al. (2015) Lin, W., R. Feng, and H. Li (2015). Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics. Journal of the American Statistical Association 110(509), 270–288.
  • Liu et al. (2013) Liu, E., M. Li, W. Wang, and Y. Li (2013). MaCH-Admix: Genotype imputation for admixed populations. Genetic Epidemiology 37, 25–37.
  • Maass et al. (2018) Maass, A., S. N. Lockhart, T. M. Harrison, R. K. Bell, T. Mellinger, K. Swinnerton, S. L. Baker, G. D. Rabinovici, and W. J. Jagust (2018). Entorhinal tau pathology, episodic memory decline, and neurodegeneration in aging. Journal of Neuroscience 38(3), 530–543.
  • Majdi et al. (2020) Majdi, A., S. Sadigh-Eteghad, S. R. Aghsan, F. Farajdokht, S. M. Vatandoust, A. Namvaran, and J. Mahmoudi (2020). Amyloid-$\beta$, tau, and the cholinergic system in Alzheimer's disease: Seeking direction in a tangle of clues. Reviews in the Neurosciences 31(4), 391–413.
  • Mohs et al. (1997) Mohs, R. C., D. Knopman, R. C. Petersen, S. H. Ferris, C. Ernesto, M. Grundman, M. Sano, L. Bieliauskas, D. Geldmacher, C. Clark, et al. (1997). Development of cognitive instruments for use in clinical trials of antidementia drugs: additions to the Alzheimer’s disease assessment scale that broaden its scope. Alzheimer Disease and Associated Disorders 11(suppl 2), S13–21.
  • Morishima-Kawashima and Ihara (2002) Morishima-Kawashima, M. and Y. Ihara (2002). Alzheimer's disease: $\beta$-Amyloid protein and tau. Journal of Neuroscience Research 70(3), 392–401.
  • Negahban et al. (2012) Negahban, S. N., P. Ravikumar, M. J. Wainwright, B. Yu, et al. (2012). A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statistical Science 27(4), 538–557.
  • Nesterov (1998) Nesterov, Y. (1998). Introductory Lectures on Convex Programming, Volume I: Basic Course. Lecture notes.
  • Nitsch et al. (1992) Nitsch, R. M., B. E. Slack, R. J. Wurtman, and J. H. Growdon (1992). Release of Alzheimer amyloid precursor derivatives stimulated by activation of muscarinic acetylcholine receptors. Science 258(5080), 304–307.
  • Noble et al. (2012) Noble, K. G., S. M. Grieve, M. S. Korgaonkar, L. E. Engelhardt, E. Y. Griffith, L. M. Williams, and A. M. Brickman (2012). Hippocampal volume varies with educational attainment across the life-span. Frontiers in Human Neuroscience 6, 307.
  • Paterson et al. (2014) Paterson, R., J. Bartlett, K. Blennow, N. Fox, L. Shaw, J. Trojanowski, H. Zetterberg, and J. Schott (2014). Cerebrospinal fluid markers including trefoil factor 3 are associated with neurodegeneration in amyloid-positive individuals. Translational Psychiatry 4(7), e419.
  • Pedraza et al. (2004) Pedraza, O., D. Bowers, and R. Gilmore (2004). Asymmetry of the hippocampus and amygdala in MRI volumetric measurements of normal adults. Journal of the International Neuropsychological Society 10(5), 664–678.
  • Purcell et al. (2007) Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M. A. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. De Bakker, M. J. Daly, et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3), 559–575.
  • Remnestål et al. (2021) Remnestål, J., S. Bergström, J. Olofsson, E. Sjöstedt, M. Uhlén, K. Blennow, H. Zetterberg, A. Zettergren, S. Kern, I. Skoog, et al. (2021). Association of CSF proteins with tau and amyloid $\beta$ levels in asymptomatic 70-year-olds. Alzheimer's Research & Therapy 13(1), 1–19.
  • Richardson et al. (2018) Richardson, T. S., J. M. Robins, and L. Wang (2018). Discussion of “Data-driven confounder selection via Markov and Bayesian networks” by Häggström. Biometrics 74(2), 403–406.
  • Rivasplata (2012) Rivasplata, O. (2012). Subgaussian random variables: an expository note. Internet publication, PDF.
  • Sargolzaei et al. (2015) Sargolzaei, S., A. Sargolzaei, M. Cabrerizo, G. Chen, M. Goryawala, S. Noei, Q. Zhou, R. Duara, W. Barker, and M. Adjouadi (2015). A practical guideline for intracranial volume estimation in patients with Alzheimer’s disease. BMC Bioinformatics 16(7), 1–10.
  • Schisterman et al. (2009) Schisterman, E. F., S. R. Cole, and R. W. Platt (2009). Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology (Cambridge, Mass.) 20(4), 488.
  • Selkoe and Hardy (2016) Selkoe, D. J. and J. Hardy (2016). The amyloid hypothesis of Alzheimer’s disease at 25 years. EMBO Molecular Medicine 8(6), 595–608.
  • Shi et al. (2013) Shi, J., P. M. Thompson, B. Gutman, and Y. Wang (2013, Sep). Surface fluid registration of conformal representation: application to detect disease burden and genetic influence on hippocampus. NeuroImage 78, 111–134.
  • Shortreed and Ertefaie (2017) Shortreed, S. M. and A. Ertefaie (2017). Outcome-adaptive lasso: variable selection for causal inference. Biometrics 73(4), 1111–1122.
  • Sims et al. (2020) Sims, R., M. Hill, and J. Williams (2020). The multiplex model of the genetics of Alzheimer’s disease. Nature Neuroscience 23(3), 311–322.
  • Takei et al. (2009) Takei, N., A. Miyashita, T. Tsukie, H. Arai, T. Asada, M. Imagawa, M. Shoji, S. Higuchi, K. Urakami, H. Kimura, et al. (2009). Genetic association study on in and around the APOE in late-onset Alzheimer’s disease in Japanese. Genomics 93(5), 441–448.
  • Tang et al. (2021) Tang, D., D. Kong, W. Pan, and L. Wang (2021). Ultra-high dimensional variable selection for doubly robust causal inference. Biometrics (just-accepted).
  • Thompson et al. (2004) Thompson, P. M., K. M. Hayashi, G. I. de Zubicaray, A. L. Janke, S. E. Rose, J. Semple, M. S. Hong, D. H. Herman, D. Gravano, D. M. Doddrell, et al. (2004). Mapping hippocampal and ventricular change in Alzheimer’s disease. NeuroImage 22(4), 1754–1766.
  • Van de Pol et al. (2006) Van de Pol, L., A. Hensel, F. Barkhof, H. Gertz, P. Scheltens, and W. Van Der Flier (2006). Hippocampal atrophy in Alzheimer’s disease: age matters. Neurology 66(2), 236–238.
  • van der Vaart and Wellner (2000) van der Vaart, A. W. and J. A. Wellner (2000). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
  • Vina and Lloret (2010) Vina, J. and A. Lloret (2010). Why women have more Alzheimer's disease than men: gender and mitochondrial toxicity of amyloid-$\beta$ peptide. Journal of Alzheimer's Disease 20(s2), S527–S533.
  • Wall and Pritchard (2003) Wall, J. D. and J. K. Pritchard (2003). Haplotype blocks and linkage disequilibrium in the human genome. Nature Reviews Genetics 4(8), 587–597.
  • Wang et al. (2011) Wang, Y., Y. Song, P. Rajagopalan, T. An, K. Liu, Y.-Y. Chou, B. Gutman, A. W. Toga, P. M. Thompson, ADNI, et al. (2011). Surface-based TBM boosts power to detect disease effects on the brain: an N = 804 ADNI study. NeuroImage 56(4), 1993–2010.
  • Watson (2019) Watson, S. (2019). Brain atrophy (cerebral atrophy). available at: www.healthline.com/health/brain-atrophy (Mar. 2019).
  • Weiner et al. (2013) Weiner, M. W., D. P. Veitch, P. S. Aisen, L. A. Beckett, N. J. Cairns, R. C. Green, D. Harvey, C. R. Jack, W. Jagust, E. Liu, et al. (2013). The Alzheimer’s Disease Neuroimaging Initiative: A review of papers published since its inception. Alzheimer’s & Dementia 9(5), e111–e194.
  • Zhang et al. (1990) Zhang, M., R. Katzman, D. Salmon, H. Jin, G. Cai, Z. Wang, G. Qu, I. Grant, E. Yu, P. Levy, et al. (1990). The prevalence of dementia and Alzheimer’s disease in Shanghai, China: impact of age, gender, and education. Annals of Neurology: Official Journal of the American Neurological Association and the Child Neurology Society 27(4), 428–437.
  • Zhao et al. (2019) Zhao, B., T. Luo, T. Li, Y. Li, J. Zhang, Y. Shan, X. Wang, L. Yang, F. Zhou, Z. Zhu, et al. (2019). Genome-wide association analysis of 19,629 individuals identifies variants influencing regional brain volumes and refines their genetic co-architecture with cognitive and mental health traits. Nature Genetics 51(11), 1637–1644.
  • Zhou et al. (2020) Zhou, F., H. Zhou, T. Li, and H. Zhu (2020). Analysis of secondary phenotypes in multi-group association studies. Biometrics 76(2), 606–618.
  • Zhou and Li (2014) Zhou, H. and L. Li (2014). Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(2), 463–483.
  • Zhou et al. (2018) Zhou, X., Y. Chen, K. Y. Mok, Q. Zhao, K. Chen, Y. Chen, et al. (2018). Identification of genetic risk factors in the Chinese population implicates a role of immune system in Alzheimer’s disease pathogenesis. Proceedings of the National Academy of Sciences 115(8), 1697–1706.