High-Dimensional Inference for Generalized Linear Models
with Hidden Confounding

Jing Ouyang (jingoy@umich.edu)
Kean Ming Tan (keanming@umich.edu)
Gongjun Xu (gongjun@umich.edu)
Department of Statistics
University of Michigan
Ann Arbor, MI 48109, USA

Abstract

Statistical inference for high-dimensional regression models has been extensively studied because of its wide range of applications, from genomics and neuroscience to economics. However, in practice, there are often potential unmeasured confounders associated with both the response and covariates, which can invalidate standard debiasing methods. This paper focuses on a generalized linear regression framework with hidden confounding and proposes a debiasing approach to address this high-dimensional problem by adjusting for the effects induced by the unmeasured confounders. We establish consistency and asymptotic normality for the proposed debiased estimator. The finite sample performance of the proposed method is demonstrated through extensive numerical studies and an application to a genetic data set.

Keywords: High-dimensional inference; Generalized linear model; Latent variable; Unmeasured confounder.

1 Introduction

Statistical inference for high-dimensional regression models has received growing interest due to the increasing number of complex data sets with high-dimensional covariates that are collected across many scientific disciplines, ranging from genomics to social science to econometrics (Peng et al., 2010; Belloni et al., 2012; Fan et al., 2014; Bühlmann et al., 2014). One of the most widely used high-dimensional regression methods is the lasso linear regression, which assumes that the underlying regression coefficients are sparse (Tibshirani, 1996). However, the lasso penalty introduces non-negligible bias that renders high-dimensional statistical inference challenging (Wainwright, 2019). To address this challenge, Zhang and Zhang (2014) and van de Geer et al. (2014) proposed the debiasing method to construct confidence intervals for the lasso regression coefficients: the main idea is to first obtain the lasso estimator, and then correct for its bias using a low-dimensional projection method. We refer the reader to Javanmard and Montanari (2014), Belloni et al. (2014), Ning and Liu (2017), Chernozhukov et al. (2018), among others, for detailed discussions of the debiasing approach, and also Wainwright (2019) for an overview of high-dimensional statistical inference.

The aforementioned studies were established under the assumption that there are no unmeasured confounders associated with both the response and covariates. However, this assumption is often violated in observational studies. For instance, in genetic studies, the effect of certain segments of DNA on gene expression may be confounded by population structure and microarray expression artifacts (Listgarten et al., 2010). Another example is in healthcare studies, where the effect of nutrient intake on the risk of cancer may be confounded by physical wellness, social class, and behavioral factors (Fewell et al., 2007). Without adjusting for the unmeasured confounders, the resulting inferences from the standard debiasing methods could be biased and, consequently, lead to spurious scientific discoveries.

Various methods have been proposed to perform valid statistical inference for regression parameters in the presence of hidden confounders. One commonly used approach is instrumental variables regression, which typically requires domain knowledge to identify valid instrumental variables, making it challenging for high-dimensional applications (Kang et al., 2016; Burgess et al., 2017; Guo et al., 2018; Windmeijer et al., 2019). Recently, under a relaxed linear structural equation modeling framework, Guo et al. (2022) proposed a deconfounding approach for statistical inference for individual regression coefficients. Fan et al. (2023) developed a factor-adjusted debiasing method and conducted statistical estimation and inference for the coefficient vector. However, these two methods rely on the linearity assumption between the response and covariates, which may not hold when the response is not continuous but categorical (binary, count, etc.). Such categorical data commonly occur in genetic, biomedical, and social science applications, and thus alternative methods are needed to perform valid statistical inference in these cases.

To bridge the gap in the existing literature, we propose a novel framework to perform statistical inference in the context of high-dimensional generalized linear models with hidden confounding. The main idea of our proposed method involves estimating the unmeasured confounders using high-dimensional factor analysis techniques and then applying the debiasing method on the regression coefficient of interest, with the estimated unmeasured confounders treated as surrogate variables. This method does not rely on the linear model assumption or any specific model form and it is applicable to more general models beyond the generalized linear models, such as graphical models (Ren et al., 2015; Zhu et al., 2020) and additive hazards models (Lin and Ying, 1994; Lin and Lv, 2013).

Theoretically, we show that under some mild scaling conditions, the estimation errors of the proposed estimators achieve rates comparable to those of the $\ell_{1}$-penalized generalized linear model without unmeasured confounders. We further show that the debiased estimator for the coefficient of interest is asymptotically Gaussian after adjusting for the unmeasured confounders, which results in valid statistical inference for high-dimensional generalized linear models with hidden confounding. It is worth highlighting that when using a factor model to relate covariates and unmeasured confounders, we make more general assumptions on the random noise compared to existing works. Specifically, we allow the random noise to be non-identically distributed. This represents a significant improvement over the majority of previous works, which assume that the random noise is independent and identically distributed (Guo et al., 2022; Fan et al., 2023). Furthermore, unlike existing methods under the linear framework (Guo et al., 2022; Fan et al., 2023), generalized linear models pose new challenges, as many transformation and decomposition techniques commonly used for linear models are not directly applicable. Consequently, more general and refined intermediate results are needed as “building blocks” in the proofs (see Remark 8 in Appendix D.1 for details).

Our paper is organized as follows. In Section 2, we introduce the model setup and provide a comprehensive discussion of related literature. Section 3 presents our two-step approach for estimating the parameter of interest while adjusting for the effect of unmeasured confounders. Section 4 establishes the theoretical properties of the proposed estimator, including estimation consistency and asymptotic normality. In Section 5, we demonstrate the performance of our proposed method and the validity of the theoretical results via extensive simulation studies. We also provide an application to genetic data containing gene expression quantifications and stimulation statuses in mouse bone marrow-derived dendritic cells, where we identify significant gene expressions under stimulations, which are consistent with the experimental findings in genetic studies (Nemetz et al., 1999; Ather and Poynter, 2018; Jang et al., 2018; Toyoshima et al., 2019). Lastly, in Section 6, we provide concluding remarks and outline potential future directions for this research.

2 Generalized Linear Models with Hidden Confounding

In this section, we first set up a generalized linear model with hidden confounding and introduce a scientific application of our model framework. Then we will discuss related high-dimensional models in the existing literature.

2.1 Problem Setup

Consider a high-dimensional regression problem with a univariate response $y$ and a $p$-dimensional vector of observed covariates $\bm{X}$. In addition, assume that there is a $K$-dimensional vector of unmeasured confounders $\bm{U}$ that is related to both $y$ and $\bm{X}$. Without loss of generality, we write $\bm{X}=(D,\bm{Q}^{\mathrm{T}})^{\mathrm{T}}$, where $D\in\mathbb{R}$ is the covariate of interest and $\bm{Q}\in\mathbb{R}^{p-1}$ is a $(p-1)$-dimensional vector of nuisance covariates. Furthermore, we denote $\theta\in\mathbb{R}$ as the univariate parameter of interest, $\bm{v}\in\mathbb{R}^{p-1}$ as parameters for the nuisance covariates, and $\bm{\beta}\in\mathbb{R}^{K}$ as parameters that quantify the effects induced by the unmeasured confounders. The goal is to perform statistical inference on the parameter $\theta$.

We assume that $y$ given $D$, $\bm{Q}$, and the unmeasured confounders $\bm{U}$ follows a generalized linear model with probability density (mass) function:

\[
f(y)=\exp\left[\{y(\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U})-b(\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U})\}/a(\phi)+c(y,\phi)\right], \tag{1}
\]

where $\phi$ is the scale parameter, and $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ are some known functions. As the distribution of $y$ belongs to the exponential family, we have $E(y)=b^{\prime}(\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U})$ and $\operatorname{var}(y)=b^{\prime\prime}(\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U})a(\phi)$. For simplicity, we take $a(\phi)=1$. For notational convenience, let $\bm{Z}=(D,\bm{Q}^{\mathrm{T}},\bm{U}^{\mathrm{T}})^{\mathrm{T}}$ be a vector that includes all observed covariates and the unmeasured confounders, and let $\bm{\eta}=(\theta,\bm{v}^{\mathrm{T}},\bm{\beta}^{\mathrm{T}})^{\mathrm{T}}$ be the corresponding parameters. We now provide three commonly used examples.

Example 1 (Logistic Regression)

Let $y\in\{0,1\}$ be a binary variable. Given covariates $D$, $\bm{Q}$, and unmeasured confounders $\bm{U}$, the response $y$ follows the logistic regression model with $\phi=1$, $a(\phi)=1$, $b(t)=\log\{1+\exp(t)\}$, and $c(y,\phi)=0$.

Example 2 (Poisson Regression)

Let $y\in\{0,1,2,\ldots\}$ be a discrete variable. Given covariates $D$, $\bm{Q}$, and unmeasured confounders $\bm{U}$, the response $y$ follows the Poisson regression model with $\phi=1$, $a(\phi)=1$, $b(t)=\exp(t)$, and $c(y,\phi)=-\log(y!)$.

Example 3 (Linear Regression)

Let $y\in\mathbb{R}$ be a real-valued response variable. Given covariates $D$, $\bm{Q}$, and unmeasured confounders $\bm{U}$, the response $y$ follows the linear regression model $y=\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U}+\varepsilon$ with $E(\varepsilon)=0$ and $\operatorname{var}(\varepsilon)=\sigma^{2}$. The model parameters are $\phi=\sigma^{2}$, $a(\sigma^{2})=\sigma^{2}$, $b(t)=t^{2}/2$, and $c(y,\sigma^{2})=-y^{2}/(2\sigma^{2})-\log(2\pi\sigma^{2})/2$.
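The three examples differ only in the cumulant function $b(t)$, whose first two derivatives give the mean and variance of $y$ when $a(\phi)=1$. The following Python sketch collects $b$, $b^{\prime}$, and $b^{\prime\prime}$ for the three models; the dictionary layout and the names `GLM_FAMILIES` and `mean_and_variance` are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Cumulant function b(t) and its first two derivatives for the three examples.
# The dictionary layout and all names here are illustrative assumptions.
GLM_FAMILIES = {
    "logistic": {
        "b": lambda t: np.logaddexp(0.0, t),             # log(1 + exp(t)), computed stably
        "b1": lambda t: 0.5 * (1.0 + np.tanh(0.5 * t)),  # b'(t) = exp(t) / (1 + exp(t))
        "b2": lambda t: 0.25 / np.cosh(0.5 * t) ** 2,    # b''(t) = b'(t) * (1 - b'(t))
    },
    "poisson": {"b": np.exp, "b1": np.exp, "b2": np.exp},
    "linear": {
        "b": lambda t: 0.5 * t ** 2,
        "b1": lambda t: t,
        "b2": lambda t: np.ones_like(t),
    },
}


def mean_and_variance(family, linear_predictor):
    """Return E(y) = b'(t) and var(y) = b''(t), taking a(phi) = 1."""
    fam = GLM_FAMILIES[family]
    t = np.asarray(linear_predictor, dtype=float)
    return fam["b1"](t), fam["b2"](t)
```

For instance, `mean_and_variance("logistic", [0.0])` evaluates the success probability and Bernoulli variance at a zero linear predictor.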

In addition, we assume that the relationship between covariates $\bm{X}$ and unmeasured confounders $\bm{U}$ is captured by the following model:

\[
\bm{X}=\bm{W}^{\mathrm{T}}\bm{U}+\bm{E}, \tag{2}
\]

where $\bm{W}\in\mathbb{R}^{K\times p}$ is the loading matrix that describes the linear effect of unmeasured confounders $\bm{U}$ on covariates $\bm{X}$, and $\bm{E}$ is the random noise independent of $\bm{U}$. While a model similar to (2) is considered in Guo et al. (2022) and Fan et al. (2023), here we assume model (2) is an approximate factor model, which allows for weak correlation and non-identical distribution of the random noise. This setting is more general than that of many existing works, which assume $\bm{E}$ to be independently and identically distributed (Guo et al., 2022; Fan et al., 2023). We will elaborate on this model setup and expound the assumptions for weakly correlated random noise in Sections 3 and 4.

The aforementioned structural equation modeling framework can be applied to many scientific applications. In the following, we provide one motivating example in genetic studies. Various authors have found that the effect of gene expression in response to environmental conditions, e.g., viral or bacterial stimulation, might be confounded by unmeasured factors (Price et al., 2006; Lazar et al., 2013). One interesting scientific problem is to assess the effect of gene expression responding to the viral stimulation of cells, while adjusting for the confounding effects from the unmeasured variables. The viral stimulation status, a binary variable, is considered as the response variable $y$. Therefore, a generalized linear model for modeling $y$ is preferred. In this example, $y$ is the viral stimulation status, $\bm{X}$ is a vector of high-dimensional gene expression, and $\bm{U}$ is a vector of possible unmeasured confounders. More details of the data application will be provided in Section 5.2.

2.2 Related Models in Existing Literature

In high-dimensional regression, the adjustment for hidden confounding effects is a challenging and intriguing problem. There are various related models in the literature. For instance, to address the difficulty of model selection when $p\gg n$, Paul et al. (2008) assumed that $y$ and $\bm{X}$ are connected via a low-dimensional latent variable model: $y=\bm{\beta}^{\mathrm{T}}\bm{U}+\varepsilon$ and $\bm{X}=\bm{W}^{\mathrm{T}}\bm{U}+\bm{E}$, where the latent factors are associated with the response and the covariates are only used to infer the latent factors; the response and covariates are not directly associated. Different from the latent variable model in Paul et al. (2008), Bai and Ng (2006) added a low-dimensional covariate $\bm{G}$ to the latent variable model, which is expressed as $y=\bm{\beta}^{\mathrm{T}}\bm{U}+\bm{\varrho}^{\mathrm{T}}\bm{G}+\varepsilon$ and $\bm{X}=\bm{W}^{\mathrm{T}}\bm{U}+\bm{E}$. Besides, there are other factor-adjusted models that can be extended to adjust for hidden confounding. For example, Fan et al. (2020) studied the high-dimensional model selection problem when covariates are highly correlated. As most commonly used model selection methods may fail with highly correlated covariates, they used a factor model to reduce the dependency among covariates and proposed a factor-adjusted regularized model selection method. Fan et al. (2020) considered a generalized linear model between response and covariates, which together with the factor model forms a model framework similar to ours. However, the problem they studied is fundamentally different from ours. They did not assume hidden confounding, and the factor model is only used to identify a low-rank part of highly correlated covariates, whereas we focus on the regression problem with unmeasured confounders associated with both response and covariates. Similar factor-adjusted methods have also been studied in other settings (e.g., Gagnon-Bartsch and Speed, 2011; Wang and Blei, 2019); however, as noted in Ćevid et al. (2020), related theoretical justifications are still underdeveloped in the literature.

Recently, many researchers studied the following linear hidden confounding model, which can be viewed as a special case of our framework,

\[
y=\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U}+\varepsilon,\quad \bm{X}=\bm{W}^{\mathrm{T}}\bm{U}+\bm{E}. \tag{3}
\]

Under this model, Kneip and Sarda (2011) used the principal component method to estimate the unmeasured confounders and then applied a selection procedure on a projected model. Ćevid et al. (2020) proposed a method that first performs the spectral transformation pre-processing step and then applies the lasso regression on the transformed response and covariates. However, these two works focused on estimation consistency and did not address inference issues. Several works have investigated statistical inference on the covariate coefficients. For example, Guo et al. (2022) proposed a doubly debiased lasso method to perform statistical inference for $\theta$. Different from the approach of Guo et al. (2022) that implicitly adjusts for the hidden confounding effects, Fan et al. (2023) proposed to first use the principal component method to estimate unmeasured confounders and then construct the bias-corrected estimator for $\|(\theta,\bm{v}^{\mathrm{T}})\|_{\infty}$, which involves the decomposition of the estimation error relying on the linear form of the response and uses the projection of the response onto the factor space. The motivation of Fan et al. (2023) originated from Fan et al. (2020). When covariates are highly correlated, the leading factors are likely to have extra impacts on the response. So they augmented the factor into the sparse linear regression model between response and covariates, which is written as (3). However, these techniques are designed for linear models and may not be directly applicable to generalized linear model settings.

3 Estimation Method

In this section, we propose a novel framework to perform statistical inference for a parameter of interest in the context of high-dimensional generalized linear models with hidden confounding. In the proposed framework, we first estimate the unmeasured confounders using a factor analysis approach. Subsequently, the estimated unmeasured confounders are treated as surrogate variables for fitting a high-dimensional generalized linear model, and a debiased estimator is constructed to perform statistical inference.

Throughout this section, we assume that the observed data $\{y_{i},\bm{X}_{i}\}_{i=1,\dots,n}$ and the unmeasured confounders $\{\bm{U}_{i}\}_{i=1,\dots,n}$ are realizations of (1) and (2). Moreover, the random noise $\bm{E}_{i}=(E_{i1},\ldots,E_{ip})^{\mathrm{T}}$ has mean zero and variance $\bm{\Omega}_{i}=\mathbb{E}(\bm{E}_{i}\bm{E}_{i}^{\mathrm{T}})$. Let $\bm{\Sigma}_{e}=\operatorname{diag}(n^{-1}\sum_{i=1}^{n}\bm{\Omega}_{i})$, where $\operatorname{diag}(\bm{A})$ denotes the diagonal matrix obtained by setting the off-diagonal entries of $\bm{A}$ to zero. In the $p\times p$ diagonal matrix $\bm{\Sigma}_{e}$, we denote the $j$-th diagonal element by $\sigma_{j}^{2}=n^{-1}\sum_{i=1}^{n}\tau_{i,jj}$, where $\tau_{i,jj}$ is the $(j,j)$ element of $\bm{\Omega}_{i}$. The model assumption on the random noise is general, as it does not assume that the random noise $\bm{E}_{i}$ is identically distributed, nor does it require the covariance matrix $\bm{\Omega}_{i}$ to be diagonal. The detailed theoretical assumptions regarding the random noise will be presented in Section 4.

Estimation of the Unmeasured Confounders: In this work, we consider the dimension $K$ as pre-specified. In practice, there are various methods to estimate the dimension of unmeasured confounders, such as the scree plot (Cattell, 1966), the cross-validation method (Owen and Wang, 2016), the information criteria method (Bai and Ng, 2002), and the eigenvalue ratio method (Lam and Yao, 2012; Ahn and Horenstein, 2013), among others. In the implementation of our proposed method, we recommend the parallel analysis approach because of its good finite sample performance, easy implementation, and popularity in scientific applications (Hayton et al., 2004; Costello and Osborne, 2005; Brown, 2015); it has shown superior performance compared to many other methods in various empirical studies (Zwick and Velicer, 1986; Peres-Neto et al., 2005). Detailed discussions on the estimation of the dimension of unmeasured confounders are presented in Appendix C.
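As a rough illustration of the parallel analysis idea, the sketch below compares the eigenvalues of the sample correlation matrix with those obtained after permuting each column of the data independently, and counts the leading eigenvalues that exceed a null quantile. The permutation variant, the 95% quantile, and the function name `parallel_analysis` are assumptions made for illustration; the paper's exact procedure is described in its Appendix C and may differ.

```python
import numpy as np


def parallel_analysis(X, n_perm=50, quantile=0.95, seed=0):
    """Count the leading eigenvalues of the sample correlation matrix that
    exceed the chosen quantile of eigenvalues from column-permuted copies of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    null_eigs = np.empty((n_perm, p))
    for b in range(n_perm):
        # Permuting each column separately breaks cross-column dependence
        # while keeping every marginal distribution intact.
        X_perm = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        corr_perm = np.corrcoef(X_perm, rowvar=False)
        null_eigs[b] = np.sort(np.linalg.eigvalsh(corr_perm))[::-1]
    threshold = np.quantile(null_eigs, quantile, axis=0)
    exceeds = observed > threshold
    # Stop at the first eigenvalue that fails to exceed its null threshold.
    return p if exceeds.all() else int(np.argmin(exceeds))
```

The returned integer can then be used as the working value of $K$ in the steps that follow.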

We employ maximum likelihood estimation to estimate the unmeasured confounders under (2). Without loss of generality, let $\bar{\bm{U}}=n^{-1}\sum_{i=1}^{n}\bm{U}_{i}=\bm{0}$ and let $\bm{S}_{u}=n^{-1}\sum_{i=1}^{n}\bm{U}_{i}\bm{U}_{i}^{\mathrm{T}}$ be the sample variance of $\bm{U}$. Similarly, let $\bm{S}_{x}=n^{-1}\sum_{i=1}^{n}(\bm{X}_{i}-\bar{\bm{X}})(\bm{X}_{i}-\bar{\bm{X}})^{\mathrm{T}}$ be the sample variance of $\bm{X}$, where $\bar{\bm{X}}=n^{-1}\sum_{i=1}^{n}\bm{X}_{i}$. Given unmeasured confounders $\bm{U}_{1},\dots,\bm{U}_{n}$, define an approximation of the population variance of $\bm{X}$ to be $\bm{\Sigma}_{x}=\bm{W}^{\mathrm{T}}\bm{S}_{u}\bm{W}+\bm{\Sigma}_{e}$. Note that $\bm{\Sigma}_{x}$ is not exactly the covariance matrix of $\bm{X}$ because we do not restrict $\bm{\Omega}_{i}$ to be diagonal, whereas $\bm{\Sigma}_{e}$ is defined to be diagonal by setting the off-diagonal entries of $n^{-1}\sum_{i=1}^{n}\bm{\Omega}_{i}$ to zero. Based on the factor model in (2), the maximum likelihood estimators of $\bm{W}$, $\bm{S}_{u}$, and $\bm{\Sigma}_{e}$ are obtained as follows:

\[
(\hat{\bm{W}},\hat{\bm{S}}_{u},\hat{\bm{\Sigma}}_{e})=\underset{\bm{W},\bm{S}_{u},\bm{\Sigma}_{e}}{\operatorname{argmax}}\left\{-(2p)^{-1}\log|\bm{\Sigma}_{x}|-(2p)^{-1}\operatorname{tr}(\bm{S}_{x}\bm{\Sigma}_{x}^{-1})\right\}.
\]

Computationally, we employ the Expectation-Maximization (EM) algorithm to obtain the maximum likelihood estimators, as suggested in Bai and Li (2012) and Bai and Li (2016), where the authors proved that the EM solutions are stationary points of the likelihood function. Specifically, in this iterated EM algorithm, we use the principal components estimator as the initial estimator. Because principal component estimators are shown to be consistent under model assumptions similar to ours (Fan et al., 2013; Wang and Fan, 2017), using them for initialization instead of random initialization helps to improve algorithm efficiency and find more refined estimation results. At the $t$-th iteration, denote the estimators at this step by $\bm{W}^{(t)}$ and $\bm{\Sigma}_{e}^{(t)}$. The EM algorithm updates the estimators to be

\[
\begin{aligned}
\{\bm{W}^{(t+1)}\}^{\mathrm{T}} &= \left[n^{-1}\sum_{i=1}^{n}\mathbb{E}(\bm{X}_{i}\bm{U}_{i}^{\mathrm{T}}\mid\bm{X},\bm{W}^{(t)},\bm{\Sigma}_{e}^{(t)})\right]\left[n^{-1}\sum_{i=1}^{n}\mathbb{E}(\bm{U}_{i}\bm{U}_{i}^{\mathrm{T}}\mid\bm{X},\bm{W}^{(t)},\bm{\Sigma}_{e}^{(t)})\right]^{-1},\\
\bm{\Sigma}_{e}^{(t+1)} &= \operatorname{diag}\left(\bm{S}_{x}-\bm{W}^{(t+1)}\{\bm{W}^{(t)}\}^{\mathrm{T}}\{\bm{\Sigma}_{x}^{(t)}\}^{-1}\bm{S}_{x}\right),
\end{aligned}
\]

where $\bm{\Sigma}_{x}^{(t)}=\{\bm{W}^{(t)}\}^{\mathrm{T}}\bm{W}^{(t)}+\bm{\Sigma}_{e}^{(t)}$,

\[
\begin{aligned}
n^{-1}\sum_{i=1}^{n}\mathbb{E}(\bm{X}_{i}\bm{U}_{i}^{\mathrm{T}}\mid\bm{X},\bm{W}^{(t)},\bm{\Sigma}_{e}^{(t)}) &= \bm{S}_{x}\{\bm{\Sigma}_{e}^{(t)}\}^{-1}\{\bm{W}^{(t)}\}^{\mathrm{T}},\\
n^{-1}\sum_{i=1}^{n}\mathbb{E}(\bm{U}_{i}\bm{U}_{i}^{\mathrm{T}}\mid\bm{X},\bm{W}^{(t)},\bm{\Sigma}_{e}^{(t)}) &= \bm{W}^{(t)}\{\bm{\Sigma}_{e}^{(t)}\}^{-1}\bm{S}_{x}\{\bm{\Sigma}_{e}^{(t)}\}^{-1}\{\bm{W}^{(t)}\}^{\mathrm{T}}\\
&\quad+\mathbb{I}_{K}-\bm{W}^{(t)}\{\bm{\Sigma}_{e}^{(t)}\}^{-1}\{\bm{W}^{(t)}\}^{\mathrm{T}}.
\end{aligned}
\]

The iterations stop when $\|\bm{W}^{(t+1)}-\bm{W}^{(t)}\|_{F}$ and $\|\bm{\Sigma}_{e}^{(t+1)}-\bm{\Sigma}_{e}^{(t)}\|_{F}$ fall below a prescribed tolerance. Let $\bm{W}^{\dagger}$ and $\bm{\Sigma}_{e}^{\dagger}$ be the estimators at the last step, and let $\mathcal{V}$ be the matrix containing the eigenvectors of $p^{-1}\bm{W}^{\dagger}(\bm{\Sigma}_{e}^{\dagger})^{-1}(\bm{W}^{\dagger})^{\mathrm{T}}$ ordered by descending eigenvalues. The maximum likelihood estimators are $\hat{\bm{W}}=\mathcal{V}^{\mathrm{T}}\bm{W}^{\dagger}$ and $\hat{\bm{\Sigma}}_{e}=\bm{\Sigma}_{e}^{\dagger}$. The estimator $\hat{\bm{S}}_{u}$ is obtained after estimating $\hat{\bm{W}}$ and $\hat{\bm{\Sigma}}_{e}$. As $\hat{\bm{S}}_{u}$ is not our focus in this method, we omit its derivation details in this paper.
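To make the iteration concrete, here is a minimal Python sketch of an EM loop for a Gaussian factor model with principal-component initialization and a Frobenius-norm stopping rule. It uses the textbook form of the E-step moments, stores the loading as the $p\times K$ matrix `Lam` (playing the role of $\bm{W}^{\mathrm{T}}$), fixes the factor covariance at the identity, and omits the final eigenvector rotation by $\mathcal{V}$. These choices and all function names are assumptions for illustration; the code is not the authors' implementation and may differ in detail from the updates displayed above.

```python
import numpy as np


def neg_loglik(S_x, Lam, psi):
    """log|Sigma_x| + tr(S_x Sigma_x^{-1}) for Sigma_x = Lam Lam^T + diag(psi)."""
    Sigma = Lam @ Lam.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(Sigma)
    return logdet + np.trace(np.linalg.solve(Sigma, S_x))


def pca_init(S_x, K):
    """Principal-component starting values for the loading and noise variances."""
    vals, vecs = np.linalg.eigh(S_x)
    top = np.argsort(vals)[::-1][:K]
    Lam = vecs[:, top] * np.sqrt(vals[top])             # p x K
    psi = np.maximum(np.diag(S_x - Lam @ Lam.T), 1e-6)  # keep variances positive
    return Lam, psi


def em_factor_model(S_x, K, tol=1e-6, max_iter=1000):
    """EM iterations on the sample covariance S_x; returns W (K x p) and Sigma_e."""
    Lam, psi = pca_init(S_x, K)
    for _ in range(max_iter):
        Sigma = Lam @ Lam.T + np.diag(psi)
        B = np.linalg.solve(Sigma, Lam).T               # K x p, equals Lam^T Sigma^{-1}
        E_xu = S_x @ B.T                                # averaged E(X_i U_i^T | X)
        E_uu = B @ S_x @ B.T + np.eye(K) - B @ Lam      # averaged E(U_i U_i^T | X)
        Lam_new = E_xu @ np.linalg.inv(E_uu)
        psi_new = np.maximum(np.diag(S_x - Lam_new @ B @ S_x), 1e-6)
        done = (np.linalg.norm(Lam_new - Lam) < tol
                and np.linalg.norm(psi_new - psi) < tol)
        Lam, psi = Lam_new, psi_new
        if done:
            break
    return Lam.T, np.diag(psi)
```

Starting from the principal-component values rather than a random draw mirrors the initialization described above; the small floor on the noise variances is a numerical safeguard added in this sketch.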

With $\hat{\bm{W}}$ and $\hat{\bm{\Sigma}}_{e}$, we then estimate $\bm{U}_{i}$ using the generalized least squares estimator:

\[
\hat{\bm{U}}_{i}=(\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\hat{\bm{W}}^{\mathrm{T}})^{-1}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{X}_{i}-\bar{\bm{X}}). \tag{4}
\]
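Equation (4) is a weighted least squares projection of each centered covariate vector onto the rows of $\hat{\bm{W}}$. A minimal Python sketch follows; the function name `gls_confounders` and the convention of passing the diagonal of $\hat{\bm{\Sigma}}_{e}$ as a vector `sigma2_hat` are assumptions made for illustration.

```python
import numpy as np


def gls_confounders(X, W_hat, sigma2_hat):
    """Equation (4): U_hat_i = (W S^{-1} W^T)^{-1} W S^{-1} (X_i - X_bar),
    with S = diag(sigma2_hat). Returns an n x K matrix whose rows are U_hat_i."""
    Xc = X - X.mean(axis=0)        # center each covariate
    WS = W_hat / sigma2_hat        # K x p, equals W S^{-1} for diagonal S
    A = WS @ W_hat.T               # K x K matrix W S^{-1} W^T
    return np.linalg.solve(A, WS @ Xc.T).T
```

Because $\hat{\bm{\Sigma}}_{e}$ is diagonal, its inverse reduces to an elementwise division, so no $p\times p$ matrix needs to be inverted.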

Next, we treat the estimators $\hat{\bm{U}}_{1},\ldots,\hat{\bm{U}}_{n}$ as surrogate variables for the underlying unmeasured confounders to fit a high-dimensional generalized linear model. We then construct a debiased estimator for the parameter of interest by generalizing the decorrelated score method proposed by Ning and Liu (2017). Other debiasing approaches such as that of van de Geer et al. (2014) could also be similarly developed.

Initial $\ell_{1}$-Penalized Estimator: Recall that $\bm{\eta}=(\theta,\bm{v}^{\mathrm{T}},\bm{\beta}^{\mathrm{T}})^{\mathrm{T}}$. For a vector $\bm{r}=(r_{1},\dots,r_{l})^{\mathrm{T}}$, define $\|\bm{r}\|_{1}=\sum_{j=1}^{l}|r_{j}|$. To fit a high-dimensional generalized linear model, we solve the $\ell_{1}$-penalized optimization problem

\[
\hat{\bm{\eta}}=\underset{\bm{\eta}\in\mathbb{R}^{p+K}}{\operatorname{argmin}}\left[-\frac{1}{n}\sum_{i=1}^{n}\{y_{i}(\theta D_{i}+\bm{v}^{\mathrm{T}}\bm{Q}_{i}+\bm{\beta}^{\mathrm{T}}\hat{\bm{U}}_{i})-b(\theta D_{i}+\bm{v}^{\mathrm{T}}\bm{Q}_{i}+\bm{\beta}^{\mathrm{T}}\hat{\bm{U}}_{i})\}+\lambda(|\theta|+\|\bm{v}\|_{1})\right],
\]

where $\hat{\bm{\eta}}=(\hat{\theta},\hat{\bm{v}}^{\mathrm{T}},\hat{\bm{\beta}}^{\mathrm{T}})^{\mathrm{T}}$ and $b(t)$ is a known function, given a specific generalized linear model. Throughout the manuscript, for notational convenience, let $\bm{\zeta}=(\bm{v}^{\mathrm{T}},\bm{\beta}^{\mathrm{T}})^{\mathrm{T}}$ be the regression coefficients for the nuisance covariates and unmeasured confounders. Accordingly, we have $\bm{\eta}=(\theta,\bm{\zeta}^{\mathrm{T}})^{\mathrm{T}}$, with the goal of performing statistical inference on $\theta$. In the following, we construct a debiased estimator for $\theta$, generalizing the approach in Ning and Liu (2017) to situations involving unmeasured confounders.
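For the logistic model of Example 1, the penalized problem above can be solved, for instance, by proximal gradient descent, where the soft-thresholding step is applied only to the coefficients of the observed covariates so that $\bm{\beta}$ remains unpenalized. The sketch below is one possible solver under these assumptions; the solver choice, the fixed step size, and the function names are illustrative and not taken from the paper.

```python
import numpy as np


def soft_threshold(x, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def objective(eta, y, Z, p, lam):
    """Penalized logistic loss; only the first p coefficients are penalized."""
    t = Z @ eta
    return np.mean(np.logaddexp(0.0, t) - y * t) + lam * np.abs(eta[:p]).sum()


def l1_logistic_with_surrogates(y, X, U_hat, lam, n_iter=2000):
    """Proximal gradient for the penalized logistic loss. The first column of X
    is D, the remaining columns are Q; coefficients of U_hat are not penalized."""
    n, p = X.shape
    Z = np.hstack([X, U_hat])
    # Step size 1/L, where L = ||Z||_2^2 / (4n) bounds the Hessian of the loss.
    step = 4.0 * n / np.linalg.norm(Z, 2) ** 2
    eta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = 0.5 * (1.0 + np.tanh(0.5 * (Z @ eta)))   # b'(t) for the logistic model
        eta = eta - step * (Z.T @ (mu - y)) / n       # gradient step on the smooth part
        eta[:p] = soft_threshold(eta[:p], step * lam) # shrink (theta, v) only
    return eta
```

The returned vector is ordered as $(\hat{\theta},\hat{\bm{v}}^{\mathrm{T}},\hat{\bm{\beta}}^{\mathrm{T}})^{\mathrm{T}}$; in practice $\lambda$ would be chosen by a data-driven rule such as cross-validation.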


A Debiased Estimator: Before we unfold the details of constructing a debiased estimator, we introduce some notation. Let $\bm{I}=E\{b^{\prime\prime}(\theta D_{i}+\bm{v}^{\mathrm{T}}\bm{Q}_{i}+\bm{\beta}^{\mathrm{T}}\bm{U}_{i})\bm{Z}_{i}\bm{Z}_{i}^{\mathrm{T}}\}$ be the Fisher information matrix, and let $\bm{w}^{\mathrm{T}}=\bm{I}_{\theta\bm{\zeta}}\bm{I}_{\bm{\zeta}\bm{\zeta}}^{-1}$, where $\bm{I}_{\theta\bm{\zeta}}$ and $\bm{I}_{\bm{\zeta}\bm{\zeta}}$ are the corresponding block matrices of $\bm{I}$. In addition, let $I_{\theta\mid\bm{\zeta}}=E\{b^{\prime\prime}(\theta D_{i}+\bm{v}^{\mathrm{T}}\bm{Q}_{i}+\bm{\beta}^{\mathrm{T}}\bm{U}_{i})D_{i}(D_{i}-\bm{w}^{\mathrm{T}}\bm{M}_{i})\}$ be the partial Fisher information matrix, where $\bm{M}_{i}=(\bm{Q}_{i}^{\mathrm{T}},\bm{U}_{i}^{\mathrm{T}})^{\mathrm{T}}$ is a vector of nuisance covariates and unmeasured confounders. Finally, let $l(\theta,\bm{\zeta})=-n^{-1}\sum_{i=1}^{n}\{y_{i}(\theta D_{i}+\bm{\zeta}^{\mathrm{T}}\dot{\bm{M}}_{i})-b(\theta D_{i}+\bm{\zeta}^{\mathrm{T}}\dot{\bm{M}}_{i})\}$ be the loss function, where $\dot{\bm{M}}_{i}=(\bm{Q}_{i}^{\mathrm{T}},\hat{\bm{U}}_{i}^{\mathrm{T}})^{\mathrm{T}}$.

We define $S(\theta,\bm{\zeta})=\nabla_{\theta}l(\theta,\bm{\zeta})-\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}}l(\theta,\bm{\zeta})$ as the generalized decorrelated score function, where $\nabla_{\theta}l(\theta,\bm{\zeta})$ and $\nabla_{\bm{\zeta}}l(\theta,\bm{\zeta})$ are the partial derivatives of the loss function with respect to $\theta$ and $\bm{\zeta}$, respectively. Different from the existing definition of the decorrelated score function in Ning and Liu (2017), the generalized decorrelated score function takes into account the effects induced by the unmeasured confounders. Specifically, in the presence of unmeasured confounders, the generalized decorrelated score function is uncorrelated with the score function corresponding to the nuisance covariates as well as the unmeasured confounders, i.e., $E\{S(\theta,\bm{\zeta})^{\mathrm{T}}\nabla_{\bm{\zeta}}l(\theta,\bm{\zeta})\}=\bm{I}_{\theta\bm{\zeta}}-\bm{I}_{\theta\bm{\zeta}}\bm{I}_{\bm{\zeta}\bm{\zeta}}^{-1}\bm{I}_{\bm{\zeta}\bm{\zeta}}=\bm{0}$. The debiased estimator of $\theta$ is constructed by solving for $\tilde{\theta}$ from the first-order approximation of the generalized decorrelated score function, $\hat{S}(\hat{\theta},\hat{\bm{\zeta}})+\hat{I}_{\theta\mid\bm{\zeta}}(\tilde{\theta}-\hat{\theta})=0$. From this first-order approximation, we see that to establish the debiased estimator $\tilde{\theta}$, we need to construct two estimators, $\hat{S}(\hat{\theta},\hat{\bm{\zeta}})$ and $\hat{I}_{\theta\mid\bm{\zeta}}$, and the key is to estimate $\bm{w}$.

We estimate $\bm{w}$ by solving the following convex optimization problem:

$$\hat{\bm{w}}=\underset{\bm{w}\in\mathbb{R}^{p+K-1}}{\operatorname{argmin}}~\frac{1}{2n}\sum_{i=1}^{n}\{\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}\bm{\zeta}}l_{i}(\hat{\theta},\hat{\bm{\zeta}})\bm{w}-2\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}\theta}l_{i}(\hat{\theta},\hat{\bm{\zeta}})\}+\lambda^{\prime}\|\bm{w}\|_{1},$$

where $l_{i}(\theta,\bm{\zeta})=-y_{i}(\theta D_{i}+\bm{\zeta}^{\mathrm{T}}\dot{\bm{M}}_{i})+b(\theta D_{i}+\bm{\zeta}^{\mathrm{T}}\dot{\bm{M}}_{i})$ is the $i$th component of the loss function and $\lambda^{\prime}>0$ is a sparsity tuning parameter for $\bm{w}$. Equivalently, the estimator $\hat{\bm{w}}$ is obtained by

$$\hat{\bm{w}}=\underset{\bm{w}\in\mathbb{R}^{p+K-1}}{\operatorname{argmin}}~\frac{1}{2n}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\theta}D_{i}+\hat{\bm{\zeta}}^{\mathrm{T}}\dot{\bm{M}}_{i})(D_{i}-\bm{w}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}+\lambda^{\prime}\|\bm{w}\|_{1}. \tag{5}$$
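Problem (5) is a weighted lasso regression of $D_{i}$ on $\dot{\bm{M}}_{i}$ with weights $b^{\prime\prime}(\cdot)$ evaluated at the initial fit. The following is a minimal sketch of one way to solve it, using plain cyclic coordinate descent; the solver choice and the function names `soft_threshold` and `weighted_nodewise_lasso` are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def weighted_nodewise_lasso(D, M, weights, lam, n_iter=200, tol=1e-8):
    """Minimise (1/2n) * sum_i weights_i * (D_i - w^T M_i)^2 + lam * ||w||_1.

    D       : (n,) covariate of interest
    M       : (n, d) remaining covariates stacked with estimated confounders
    weights : (n,) non-negative weights, playing the role of b''(.) in (5)
    lam     : sparsity tuning parameter (lambda' in the text)
    """
    n, d = M.shape
    w = np.zeros(d)
    resid = np.asarray(D, dtype=float).copy()        # current residual D - M w
    curvature = (weights[:, None] * M ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(d):
            if curvature[j] == 0.0:
                continue                             # column carries no information
            # weighted correlation of column j with the partial residual
            rho = (weights * M[:, j] * resid).sum() / n + curvature[j] * w[j]
            w_new = soft_threshold(rho, lam) / curvature[j]
            if w_new != w[j]:
                resid -= M[:, j] * (w_new - w[j])    # keep the residual in sync
                max_change = max(max_change, abs(w_new - w[j]))
                w[j] = w_new
        if max_change < tol:
            break
    return w


# Toy usage with made-up data: constant weights 0.25 mimic b''(0) for logistic regression.
rng = np.random.default_rng(0)
M_demo = rng.standard_normal((200, 5))
D_demo = M_demo @ np.array([1.0, 0.0, 0.0, -0.5, 0.0]) + 0.1 * rng.standard_normal(200)
w_demo = weighted_nodewise_lasso(D_demo, M_demo, np.full(200, 0.25), lam=0.01)
```

In practice a mature lasso solver that accepts observation weights could replace the inner loop; the sketch only makes the objective in (5) concrete.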

The estimator $\hat{\bm{w}}$ is constructed with the intuition of finding a sparse vector such that the generalized decorrelated score function is approximately zero. This coincides with the intuition to solve for $\tilde{\theta}$ from the first-order approximation of the generalized decorrelated score function. Under the null hypothesis $H_{0}:\theta^{*}=\theta^{0}$, we estimate the generalized decorrelated score function and the partial Fisher information matrix by

$$\begin{aligned}
\hat{S}(\theta^{0},\hat{\bm{\zeta}}) &= -\frac{1}{n}\sum_{i=1}^{n}\{y_{i}-b^{\prime}(\theta^{0}D_{i}+\hat{\bm{v}}^{\mathrm{T}}\bm{Q}_{i}+\hat{\bm{\beta}}^{\mathrm{T}}\hat{\bm{U}}_{i})\}(D_{i}-\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i});\\
\hat{I}_{\theta\mid\bm{\zeta}} &= \frac{1}{n}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\theta}D_{i}+\hat{\bm{v}}^{\mathrm{T}}\bm{Q}_{i}+\hat{\bm{\beta}}^{\mathrm{T}}\hat{\bm{U}}_{i})D_{i}(D_{i}-\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i}).
\end{aligned}$$

The debiased estimator can then be constructed as $\tilde{\theta}=\hat{\theta}-(\hat{I}_{\theta\mid\bm{\zeta}})^{-1}\hat{S}(\hat{\theta},\hat{\bm{\zeta}})$, where $\hat{S}(\hat{\theta},\hat{\bm{\zeta}})$ denotes the estimated score with $\theta^{0}$ replaced by $\hat{\theta}$, in agreement with the first-order approximation above. We will show in Section 4 that the debiased estimator $\tilde{\theta}$ is asymptotically normal. Subsequently, the $(1-\alpha)\times 100\%$ confidence interval for $\theta$ can be constructed as

$$\{\tilde{\theta}-(n\hat{I}_{\theta\mid\bm{\zeta}})^{-1/2}\Phi^{-1}(1-\alpha/2),\ \tilde{\theta}+(n\hat{I}_{\theta\mid\bm{\zeta}})^{-1/2}\Phi^{-1}(1-\alpha/2)\}, \tag{6}$$

where $\Phi(t)$ is the cumulative distribution function of a standard normal random variable.
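The one-step update and the interval (6) can be sketched as follows, assuming the initial estimates and $\hat{\bm{w}}$ are already available and that the score is evaluated at the initial estimate $\hat{\theta}$, as in the first-order approximation. The function name `debiased_ci` and its argument names are hypothetical; `b_prime` and `b_double_prime` stand for $b^{\prime}$ and $b^{\prime\prime}$.

```python
import numpy as np
from statistics import NormalDist


def debiased_ci(y, D, M, theta_hat, zeta_hat, w_hat,
                b_prime, b_double_prime, alpha=0.05):
    """One-step debiased estimate of theta and its (1 - alpha) confidence interval.

    M stacks the nuisance covariates Q_i with the estimated confounders U_hat_i,
    so zeta_hat = (v_hat, beta_hat). The estimated information is assumed positive.
    """
    n = len(y)
    eta = theta_hat * D + M @ zeta_hat                 # linear predictor at the initial fit
    resid_D = D - M @ w_hat                            # D_i - w_hat^T M_i
    score = -np.mean((y - b_prime(eta)) * resid_D)     # estimated decorrelated score
    info = np.mean(b_double_prime(eta) * D * resid_D)  # estimated partial Fisher information
    theta_tilde = theta_hat - score / info
    half_width = NormalDist().inv_cdf(1 - alpha / 2) / np.sqrt(n * info)
    return theta_tilde, (theta_tilde - half_width, theta_tilde + half_width)
```

For logistic regression one would pass `b_prime = lambda t: 1 / (1 + np.exp(-t))` and `b_double_prime = lambda t: b_prime(t) * (1 - b_prime(t))`; for the linear model, the identity and a constant one.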

4 Theoretical Results

Recall that our proposed method yields estimators $\hat{\bm{\eta}}$, $\hat{\bm{w}}$, and the debiased estimator $\tilde{\theta}$. In this section, we first establish upper bounds for the estimation errors of $\hat{\bm{\eta}}$ and $\hat{\bm{w}}$ under the $\ell_{1}$ norm. Subsequently, we show that the debiased estimator $\tilde{\theta}$ is asymptotically normal.

For a vector $\bm{r}=(r_{1},\dots,r_{l})^{\mathrm{T}}$, let $\|\bm{r}\|_{q}=(\sum_{j=1}^{l}|r_{j}|^{q})^{1/q}$ for $q\geq 1$ and let $\|\bm{r}\|_{\infty}=\max_{j=1,\ldots,l}|r_{j}|$. For any matrix $\bm{A}$, let $\lambda_{\max}(\bm{A})$ and $\lambda_{\min}(\bm{A})$ represent the largest and smallest eigenvalues of $\bm{A}$, respectively. Moreover, for sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}\lesssim b_{n}$ if there exists a constant $C>0$ such that $a_{n}\leq Cb_{n}$ for all $n$, and $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$. For a sub-exponential random variable $Y_{1}$, we write $\|Y_{1}\|_{\varphi_{1}}=\inf[s>0:E\{\exp(Y_{1}/s)\}\leq 2]$ as the sub-exponential norm. For a sub-Gaussian random variable $Y_{2}$, we write $\|Y_{2}\|_{\varphi_{2}}=\inf[s>0:E\{\exp(Y_{2}^{2}/s^{2})\}\leq 2]$ as the sub-Gaussian norm. Throughout the manuscript, we use a superscript asterisk to indicate population parameters. In addition, we define $s_{\eta}=\text{card}\{j:\eta^{*}_{j}\neq 0\}$ and $s_{w}=\text{card}\{j:\bm{w}^{*}_{j}\neq 0\}$ as the numbers of nonzero entries of $\bm{\eta}^{*}$ and $\bm{w}^{*}$, respectively. All of our theoretical analyses are performed under the regime in which $n$, $p$, $s_{\eta}$, and $s_{w}$ are allowed to increase, and the number of unmeasured confounders $K$ is fixed.

We start with some conditions on the factor model in (2). Similar conditions were also considered in Bai and Li (2016) in the context of high-dimensional approximate factor model.

Assumption 1

For some large constant $C>0$,

(a) $\mathbb{E}(E_{ij})=0$, $\mathbb{E}(E_{ij}^{8})\leq C$.

(b) $\mathbb{E}(E_{ih}E_{ij})=\tau_{i,hj}$ with $|\tau_{i,hj}|\leq\tau_{hj}$ for some $\tau_{hj}>0$ and all $i=1,\dots,n$, and $\sum_{h=1}^{p}\tau_{hj}\leq C$ for all $j=1,\dots,p$.

(c) $\mathbb{E}(E_{ij}E_{sj})=\rho_{is,j}$ with $|\rho_{is,j}|\leq\rho_{is}$ for some $\rho_{is}>0$ and all $j=1,\dots,p$, and $n^{-1}\sum_{i=1}^{n}\sum_{s=1}^{n}\rho_{is}\leq C$.

(d) For all $j,q=1,\dots,p$,

$$\mathbb{E}\left\{\left|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}[E_{ij}E_{iq}-\mathbb{E}(E_{ij}E_{iq})]\right|^{4}\right\}\leq C.$$

(e) $\|\bm{W}_{j}^{*}\|_{2}\leq C$ and $\sigma_{j}^{2}$ are estimated within the set $[C^{-2},C^{2}]$ for all $j$. For positive definite matrices $\bm{\Gamma}^{*}$ and $\Upsilon^{*}$, $\lim_{p\rightarrow\infty}p^{-1}\bm{W}^{*}(\bm{\Sigma}_{e}^{*})^{-1}(\bm{W}^{*})^{\mathrm{T}}=\bm{\Gamma}^{*}$ and $\lim_{p\rightarrow\infty}p^{-1}\sum_{j=1}^{p}(\sigma_{j}^{*})^{-4}\{(\bm{W}_{j}^{*})^{\mathrm{T}}\otimes(\bm{W}_{j}^{*})^{\mathrm{T}}\}(\bm{W}_{j}^{*}\otimes\bm{W}_{j}^{*})=\Upsilon^{*}$, where $\otimes$ is the Kronecker product.

Assumption 1 is more general than assumptions in classical factor analysis (Anderson and Amemiya, 1988; Bai, 2003; Fan et al., 2013; Bai and Li, 2016). Instead of constraining all the $\bm{E}_{i}$ to have a diagonal covariance matrix, we now only require the higher-order moments of $\bm{E}_{i}$ to be bounded, the diagonal entries $\sigma_{j}^{*}$ to be bounded, and the magnitudes of the correlations among entries of $\bm{E}_{i}$ to be controlled. The conditions are mild as we only control the magnitude of correlations rather than assuming zero correlations as in classical factor analysis. Assumption 1(e) is a regularity condition for restricting the parameters. Overall, Assumption 1 follows standard conditions for the approximate factor model in Bai and Li (2016) and is required for the estimation consistency of the unmeasured confounders.

Compared with the dense confounding assumption (A2) imposed on the linear framework in Guo et al. (2022), Assumption 1 is mild as it holds for a broad regime of $n$ and $p$. Specifically, the dense confounding assumption (A2) in Guo et al. (2022) is related to our Assumption 1(e) that $\lim_{p\rightarrow\infty}p^{-1}\bm{W}^{*}(\bm{\Sigma}_{e}^{*})^{-1}(\bm{W}^{*})^{\mathrm{T}}=\bm{\Gamma}^{*}$, where $\bm{\Gamma}^{*}$ is positive definite. Our assumption follows from the classical factor analysis literature and, as a result, we have $\lambda_{q}(\bm{W}^{*})\asymp\sqrt{p}$, where $\lambda_{q}(\bm{W}^{*})$ is the $q$-th singular value of the factor loading matrix $\bm{W}^{*}$. With $\bm{\gamma}^{\mathrm{T}}=(\theta,\bm{v}^{\mathrm{T}})$ being the coefficient for all covariates and $\gamma_{j}$ being the individual coefficient of interest, the dense confounding assumption mainly requires that

$$\lambda_{q}(\bm{W}_{-j}^{*})\gg\max\left\{M\sqrt{Kpn^{-1}}(\log p)^{3/4},\ \sqrt{MK}p^{1/4}(\log p)^{3/8},\ \sqrt{Kn\log p}\right\},$$

where $\bm{W}_{-j}^{*}$ denotes the factor loading matrix $\bm{W}$ with the $j$-th column removed. Guo et al. (2022) focus on the setting $p\gg n$ and point out that in the high-dimensional regime and under certain settings, the dense confounding assumption holds with high probability. Specifically, when the entries of $\bm{W}$ are i.i.d. sub-Gaussian with zero mean and variance $\sigma_{w}^{2}$, it holds that $\lambda_{q}(\bm{W}_{-j})\asymp\sqrt{p}\sigma_{w}$ and the dense confounding assumption requires $p\gg Kn\log p$ and $\min\{n,p\}\gg K^{3}(\log p)^{3/2}M^{2}$ to make $\sigma_{w}$ diminish to zero. However, this condition is restricted to the high-dimensional regime and may not hold when $p$ is of relatively lower order.

Moreover, without loss of generality, we assume a working identifiability condition: $\bm{S}_{u}=\bm{I}_{K}$ and $p^{-1}\bm{W}^{*}(\bm{\Sigma}_{e}^{*})^{-1}(\bm{W}^{*})^{\mathrm{T}}$ is a diagonal matrix with distinct entries. Note that this working identifiability condition is for presentation purposes only and is not an assumption on the model structure. As presented in Appendix B, when such a working identifiability condition is not satisfied, our theoretical results in Theorems 2 and 3 are still valid. We also illustrate this via simulation in Section 5.1. Next, we impose some assumptions on the generalized linear model with unmeasured confounders in (1).

Assumption 2

(a) The Fisher information $\bm{I}^{*}=E[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{Z}_{i}\bm{Z}_{i}^{\mathrm{T}}]$ satisfies $\lambda_{\min}(\bm{I}^{*})\geq\kappa$, where $\kappa>0$ is some constant.

(b) For some constant $M>0$, $\|\bm{X}_{i}\|_{\infty}\leq M$, $\|\bm{U}_{i}\|_{\infty}\leq M$, $\|\bm{\eta}^{*}\|_{\infty}\leq M$ and $|(\bm{w}_{q}^{*})^{\mathrm{T}}\bm{Q}_{i}|\leq M$, where $\bm{w}_{q}^{*}=(w_{2}^{*},\ldots,w_{p}^{*})^{\mathrm{T}}$.

(c) The term $|y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}|$ is sub-exponential with $\|y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\|_{\varphi_{1}}\leq M$.

(d) Assume that $a_{1}\leq(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\leq a_{2}$ and $0\leq|b^{\prime}(t)|\leq B$ with $|b^{\prime}(t_{1})-b^{\prime}(t)|\leq B|(t_{1}-t)b^{\prime}(t)|$ and $0\leq b^{\prime\prime}(t)\leq B$ with $|b^{\prime\prime}(t_{1})-b^{\prime\prime}(t)|\leq B|t_{1}-t|b^{\prime\prime}(t)$ for constants $a_{1}$, $a_{2}$ and $B$, where $t\in[a_{1}-\epsilon,a_{2}+\epsilon]$ for $\epsilon>0$ and the sequence $t_{1}$ satisfies $|t_{1}-t|=o(1)$.

In the absence of unmeasured confounders, conditions similar to those in Assumption 2 are implied by the conditions of Theorem 3.3 in van de Geer et al. (2014) and Assumption E.1 in Ning and Liu (2017). When $\bm{U}_{i}$ and $\bm{X}_{i}$ are binary or categorical, Assumption 2(b) holds with a constant $M>0$. When $\bm{U}_{i}$ and $\bm{X}_{i}$ are sub-exponential random vectors, Assumption 2(b) holds with $M=cn^{-1/2}(\log p)^{1/2}$ for some constant $c>0$, with probability at least $1-p^{-1}$. Assumption 2(d) imposes mild regularity conditions on the function $b(t)$, and is commonly used in analyzing high-dimensional generalized linear models without unmeasured confounders. We require the function $b(t)$ to be at least twice differentiable and $b^{\prime}(t)$ and $b^{\prime\prime}(t)$ to be bounded. Specifically, $|b^{\prime\prime}(t_{1})-b^{\prime\prime}(t)|\leq B|t_{1}-t|b^{\prime\prime}(t)$ is implied by $|b^{\prime\prime\prime}(t)|\leq Bb^{\prime\prime}(t)$ when $b^{\prime\prime\prime}(t)$ exists, which is a weaker condition than the self-concordance property (Bach, 2010). This boundedness assumption is important for the concentration of the Hessian matrix of the loss function. Assumption 2(d) holds for commonly used generalized linear models. For example, for logistic regression where $b(t)=\log\{1+\exp(t)\}$, Assumption 2(d) holds with $B=1$ and with $a_{1},a_{2}$ extended to infinity. It can be similarly verified to hold with $B=\max\{1,\exp(a_{2}+\epsilon)\}$ for Poisson regression and with $B=\max\{1/(a_{2}+\epsilon)^{2},2/|a_{2}+\epsilon|\}$ for exponential regression with $a_{2}<0$. For the linear model, Assumption 2 can be relaxed as stated in Remark 4.
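As a worked check of the logistic case, the following sketch verifies the sufficient condition $|b^{\prime\prime\prime}(t)|\leq Bb^{\prime\prime}(t)$ with $B=1$. The shorthand $\sigma(t)$ for the logistic function is introduced here for convenience and is not notation from the text.

```latex
\begin{align*}
b(t) &= \log\{1+\exp(t)\}, \qquad \sigma(t) := \frac{\exp(t)}{1+\exp(t)} \in (0,1),\\
b^{\prime}(t) &= \sigma(t),\\
b^{\prime\prime}(t) &= \sigma(t)\{1-\sigma(t)\} \in (0,1/4],\\
b^{\prime\prime\prime}(t) &= \sigma(t)\{1-\sigma(t)\}\{1-2\sigma(t)\},\\
|b^{\prime\prime\prime}(t)| &= b^{\prime\prime}(t)\,|1-2\sigma(t)| \leq b^{\prime\prime}(t).
\end{align*}
```

Both $b^{\prime}$ and $b^{\prime\prime}$ are bounded by 1 on the whole real line, which is consistent with taking $B=1$ without restricting $a_{1}$ and $a_{2}$.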

The theoretical analysis on the unmeasured confounders estimator is important in establishing the theoretical guarantee for our debiased method. As the decomposition techniques commonly used in linear models may not be applicable in generalized linear model settings, it is necessary to establish more general and refined intermediate results as the foundation of our theoretical analysis (see Remark 8 in Appendix D.1 for more details). We first present a uniform convergence result for the estimators of the unmeasured confounders.

Proposition 1

Under Assumptions 1 and 2, if $n,p\rightarrow\infty$, we have

$$\max_{i\in\{1,\dots,n\}}\|\hat{\bm{U}}_{i}-\bm{U}_{i}^{*}\|_{\infty}=O_{p}\left(\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right).$$

From Proposition 1, the estimator of unmeasured confounders $\hat{\bm{U}}_{i}$ uniformly converges to $\bm{U}_{i}$ when $p\gg\log n$, which holds naturally under the high-dimensional regime $p\gg n$. Moreover, the convergence rate $O_{p}\{n^{-1/2}+p^{-1/2}(\log n)^{1/2}\}$ is of a similar order to the convergence rates of principal component estimators (Fan et al., 2013; Wang and Fan, 2017). The estimation results of $\hat{\bm{\eta}}$ and $\hat{\bm{w}}$ depend on the accuracy of the estimator of unmeasured confounders, and we next establish upper bounds on the estimation errors for $\hat{\bm{\eta}}$ and $\hat{\bm{w}}$.

Theorem 2 (Estimation consistency)

Suppose that Assumptions 1 and 2 hold. With $\lambda\asymp\lambda^{\prime}\asymp n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$, if the scaling conditions $n,p\rightarrow\infty$ and $(s_{w}\vee s_{\eta})\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}=o_{p}(1)$ hold, then we have

$$\begin{aligned}
\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}&=O_{p}\left\{s_{\eta}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\right\};\\
\|\hat{\bm{w}}-\bm{w}^{*}\|_{1}&=O_{p}\left\{(s_{w}\vee s_{\eta})\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\right\}.
\end{aligned}$$

The effect of unmeasured confounders enters the rate of the estimation error in our results through the term $p^{-1/2}(\log n)^{1/2}$. Under the high-dimensional setting with $p\gg n$, the unmeasured confounders effect is dominated by $n^{-1/2}(\log p)^{1/2}$ and as a result, the convergence rates in Theorem 2 are of the same order as in the oracle case when the unmeasured confounders are assumed to be known (van de Geer et al., 2014; Ning and Liu, 2017). With the estimation consistency established, we show that the debiased estimator from the proposed estimation method is asymptotically normal, and thus valid confidence intervals can be constructed.
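To make the comparison of the two rate terms concrete, the short sketch below evaluates them for a few $(n,p)$ pairs taken from the simulation ranges in Section 5.1. The helper name `rate_terms` is hypothetical, and the constants hidden in the $O_{p}$ notation are ignored, so only the relative sizes are meaningful.

```python
import math


def rate_terms(n, p):
    """Return (sqrt(log p / n), sqrt(log n / p)): the usual lasso term and
    the term contributed by estimating the unmeasured confounders."""
    return math.sqrt(math.log(p) / n), math.sqrt(math.log(n) / p)


for n, p in [(500, 100), (500, 600), (500, 1500)]:
    lasso_term, confounder_term = rate_terms(n, p)
    dominant = "lasso" if lasso_term >= confounder_term else "confounder"
    print(f"n={n}, p={p}: lasso={lasso_term:.4f}, "
          f"confounder={confounder_term:.4f}, dominant={dominant}")
```

When $p$ is well below $n$ the confounder term is the larger of the two, and it becomes the smaller one once $p$ exceeds $n$.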

Theorem 3 (Asymptotic normality)

Suppose that Assumptions 1 and 2 hold. With $\lambda\asymp\lambda^{\prime}\asymp n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$, if the scaling conditions $n,p\rightarrow\infty$ and $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$ hold, then

$$n^{1/2}(I_{\theta\mid\bm{\zeta}}^{*})^{1/2}(\tilde{\theta}-\theta^{*})\rightarrow_{d}N(0,1). \tag{7}$$

The result in Theorem 3 is similar to existing inference results for high-dimensional generalized linear models without unmeasured confounders (van de Geer et al., 2014; Ning and Liu, 2017; Ma et al., 2021; Shi et al., 2021; Cai et al., 2023). Theorem 3 is established under the condition that the estimation errors of $\hat{\bm{\eta}}$ and $\hat{\bm{w}}$ in Theorem 2 are $o_{p}(1)$. As a consequence of the asymptotic normality result in Theorem 3 and the fact that $\hat{I}_{\theta\mid\bm{\zeta}}$ is a consistent estimator of $I_{\theta\mid\bm{\zeta}}^{*}$, the construction of the confidence interval in (6) is valid.

Remark 4

For the linear model, the estimation consistency and asymptotic normality results in Theorems 2 and 3 hold under less stringent conditions than those in Assumption 2. Suppose Assumption 1 and Assumption 2(a) hold, and assume that the random noises $\varepsilon_{i}$ are independent sub-Gaussian random variables and $\bm{Z}_{i}$ is a sub-Gaussian vector such that $\|\varepsilon_{i}\|_{\varphi_{2}}\leq M$ and $\|Z_{ij}\|_{\varphi_{2}}\leq M$ for some constant $M>0$. By choosing $\lambda\asymp\lambda^{\prime}\asymp n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$, if $n,p\rightarrow\infty$ and $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, the estimation error bounds in Theorem 2 hold and the debiased estimator $\tilde{\theta}$ is asymptotically normal with limiting distribution (7). Similar assumptions are commonly imposed in many existing inference methods for high-dimensional linear models without unmeasured confounders (Zhang and Zhang, 2014; Javanmard and Montanari, 2014) as well as in the presence of confounders (Ćevid et al., 2020; Guo et al., 2022).

Remark 5

The sparsity assumption on $\bm{w}$ is a standard assumption in high-dimensional regression models without unmeasured confounders. It may be more suitable, and even weaker, in our proposed framework with unmeasured confounders. In high-dimensional regression models without unmeasured confounders, the sparsity assumption on $\bm{w}$ is implied by the sparse inverse population Hessian condition (van de Geer et al., 2014; Belloni et al., 2016; Ning and Liu, 2017; Jankova and van de Geer, 2018). Specifically, the sparse inverse population Hessian condition implies the sparsity of the coefficient parameter $\bm{w}$ in a weighted node-wise lasso regression $D_{i}\sim\bm{Q}_{i}$, which is similar to (5) but regresses the covariate of interest $D_{i}$ on the rest of the covariates $\bm{Q}_{i}$.

In our proposed setting with unmeasured confounders, as shown in (5), we assume that in the weighted node-wise lasso regression $D_{i}\sim\bm{Q}_{i}+\bm{U}_{i}$, the coefficient of $\bm{Q}_{i}$, denoted as $\bm{w}_{q}$, is sparse. That is, $\bm{w}_{q}$ is sparse, conditional on the unmeasured confounders $\bm{U}_{i}$. The sparsity assumption is mild and can be satisfied in many settings. For example, under the assumptions in classical factor analysis in which the covariance matrix of the random error $\bm{E}_{i}$ is diagonal, $D_{i}$ is uncorrelated with the other covariates $\bm{Q}_{i}$, conditional on the unmeasured confounders $\bm{U}_{i}$. In this case, we have $\bm{w}_{q}=\bm{0}$, conditional on $\bm{U}_{i}$, and thus the sparsity assumption holds naturally. We also want to point out that the imposed sparsity assumption may be weaker than that of existing work on high-dimensional inference without unmeasured confounders, where the coefficients of the regression $D_{i}\sim\bm{Q}_{i}$ are assumed to be sparse. For instance, we allow $D_{i}$ to be densely correlated with $\bm{Q}_{i}$ marginally. In other words, we only require $\bm{w}_{q}$, the coefficient for $\bm{Q}_{i}$ in $D_{i}\sim\bm{Q}_{i}+\bm{U}_{i}$, to be sparse, conditional on the confounders $\bm{U}_{i}$, while marginally the coefficients of $D_{i}\sim\bm{Q}_{i}$ could be dense.

Remark 6

For simplicity of the theoretical analysis, we assume the dimension of unmeasured confounders $K$ to be fixed and known, which is also assumed in Wang and Fan (2017), Fan et al. (2023), Wang (2022) and many other works. Nevertheless, our theoretical results hold as long as $K$ is consistently estimated. As introduced in Section 3, we use parallel analysis to estimate the dimension of unmeasured confounders in practice. Theoretically, it has been shown that parallel analysis consistently selects the dimension of unmeasured confounders in factor analysis (Dobriban, 2020). Specifically, when the dimension $p$ is relatively large compared to the sample size $n$, each factor loads on not too few variables, and the signal size of unmeasured confounders is not too large, parallel analysis selects the number of factors with probability tending to one (Dobriban, 2020). These conditions can be satisfied under our framework, implying that the dimension of unmeasured confounders is consistently determined. Moreover, empirically we find that the proposed method still provides satisfactory inference results under some overestimation of $K$. Intuitively, as long as the corresponding linear combinations of the true underlying factors $\bm{U}$ in the considered models can be well approximated by those of the estimated $\hat{\bm{U}}$, the developed inference results for $\theta^{*}$ would still hold. To further illustrate this, we perform simulation studies in Appendix C to show that the overestimation of $K$ may not affect the asymptotic properties of the debiased estimator.
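The following sketch shows one common variant of parallel analysis: eigenvalues of the sample correlation matrix are compared with a reference quantile of eigenvalues from simulated Gaussian noise of the same size, and factors are counted until the first eigenvalue that fails to exceed its reference. The function name `parallel_analysis`, the use of the 0.95 quantile, and the number of simulated data sets are assumptions for illustration; the text does not specify these details.

```python
import numpy as np


def parallel_analysis(X, n_sim=50, quantile=0.95, seed=0):
    """Estimate the number of factors K from an (n, p) data matrix X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # observed eigenvalues of the sample correlation matrix, in decreasing order
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    simulated = np.empty((n_sim, p))
    for b in range(n_sim):
        Z = rng.standard_normal((n, p))          # pure-noise data of the same shape
        simulated[b] = np.sort(
            np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    reference = np.quantile(simulated, quantile, axis=0)
    exceeds = observed > reference
    # count leading eigenvalues above the reference, stopping at the first failure
    return p if exceeds.all() else int(np.argmin(exceeds))
```

A permutation-based reference (shuffling each column of `X`) is a frequently used alternative to simulating Gaussian noise.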

5 Numerical Studies

To illustrate the performance of our proposed method, we conduct numerical experiments including simulation studies and an application of our method to a genetic data set.

5.1 Simulations

We consider two different models in (1), a linear regression model and a logistic regression model. We compare our method with two alternative approaches for performing high-dimensional inference: the oracle method, where we perform the debiasing approach assuming that the true values of unmeasured confounders are known; and the naive method, in which the unmeasured confounders are neglected in the estimation and we perform the debiasing approach with the observed covariates only. In addition, for the linear regression setting, we compare the proposed method with the doubly debiased lasso method proposed by Guo et al. (2022). The sparsity tuning parameters for all of the aforementioned methods are selected using 10-fold cross-validation. Our proposed method involves estimating the dimension of the unmeasured confounders, which we estimate using parallel analysis (Horn, 1965; Dinno, 2009). To evaluate the performance across the different methods, we construct the 95% confidence intervals for the parameter of interest, and compute the average confidence interval length and the coverage probability of the true parameter over 300 independent replications.
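The two evaluation metrics can be computed from the per-replication intervals as in the sketch below. The helper name `summarize_intervals` and the toy intervals are made up for illustration and are not results from the paper.

```python
def summarize_intervals(intervals, true_value):
    """Return (coverage probability, average length) for a list of
    (lower, upper) confidence intervals, one per replication."""
    covered = [lower <= true_value <= upper for lower, upper in intervals]
    lengths = [upper - lower for lower, upper in intervals]
    return sum(covered) / len(intervals), sum(lengths) / len(intervals)


# Toy usage with four hypothetical replications and true parameter 0.
toy_intervals = [(-1.0, 1.0), (0.5, 2.0), (-0.2, 0.3), (-3.0, -1.0)]
coverage, avg_length = summarize_intervals(toy_intervals, true_value=0.0)
```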

First, we generate each entry of the unmeasured confounders $\bm{U}_{1},\ldots,\bm{U}_{n}\in\mathbb{R}^{3}$ and each entry of the random error $\bm{E}_{1},\ldots,\bm{E}_{n}\in\mathbb{R}^{p}$ from a standard normal distribution. We set

$$(\bm{W}^{*})^{\mathrm{T}}=\begin{pmatrix}\bm{0.5}_{p/3}&\bm{0}&\bm{0}\\ \bm{0}&\bm{1}_{p/3}&\bm{0}\\ \bm{0}&\bm{0}&\bm{1.5}_{p/3}\end{pmatrix}_{p\times 3},$$

where $\bm{0.5}_{p/3}$, $\bm{1}_{p/3}$, and $\bm{1.5}_{p/3}$ are vectors of dimension $p/3$ with all entries equal to 0.5, 1, and 1.5, respectively (in our simulation, $p/3$ is set to integer values). The $\bm{0}$ in the matrix denotes a vector of dimension $p/3$ with all entries equal to 0. We then generate the covariates $\bm{X}_{i}=(\bm{W}^{*})^{\mathrm{T}}\bm{U}_{i}+\bm{E}_{i}$ with $D_{i}=\bm{X}_{i,1}$ and $\bm{Q}_{i}=\bm{X}_{i,-1}$. It can be verified that the above setting satisfies the working identifiability condition described in Section 4. Later in this section, we also perform additional simulation studies to illustrate the validity of the theory when the working identifiability condition does not hold.

We consider both the linear regression model and the logistic regression model. For linear regression, we generate the response $y_{i}$ according to $y_{i}=\theta^{*}D_{i}+(\bm{v}^{*})^{\mathrm{T}}\bm{Q}_{i}+(\bm{\beta}^{*})^{\mathrm{T}}\bm{U}_{i}+\varepsilon_{i}$, where the random noise $\varepsilon_{i}$ is generated from a standard normal distribution. For logistic regression, the response $y_{i}$ is generated from a Bernoulli distribution with probability $p_{i}(\theta^{*},\bm{v}^{*},\bm{\beta}^{*})=1/(1+\exp[-\{\theta^{*}D_{i}+(\bm{v}^{*})^{\mathrm{T}}\bm{Q}_{i}+(\bm{\beta}^{*})^{\mathrm{T}}\bm{U}_{i}\}])$. The regression coefficients for the unmeasured confounders, parameter of interest, and nuisance parameters are set to be $\bm{\beta}^{*}=(1,1,1)^{\mathrm{T}}$, $\theta^{*}=0$, and $\bm{v}^{*}=(1,\bm{0}_{p-2})^{\mathrm{T}}$, respectively.
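The data-generating process just described can be sketched as follows. The function name `simulate` and the seeding scheme are assumptions; the parameter values are the ones stated in the text.

```python
import numpy as np


def simulate(n, p, model="linear", seed=0):
    """Generate one data set (y, D, Q, U) from the simulation design."""
    if p % 3 != 0:
        raise ValueError("p must be divisible by 3")
    rng = np.random.default_rng(seed)
    K = 3
    U = rng.standard_normal((n, K))              # unmeasured confounders
    E = rng.standard_normal((n, p))              # random errors
    W = np.zeros((K, p))                         # W is K x p, so W^T is p x K
    block = p // 3
    for k, value in enumerate([0.5, 1.0, 1.5]):
        W[k, k * block:(k + 1) * block] = value  # block-structured loadings
    X = U @ W + E                                # row i is (W^T U_i + E_i)^T
    D, Q = X[:, 0], X[:, 1:]                     # covariate of interest, nuisance covariates
    theta = 0.0
    v = np.zeros(p - 1)
    v[0] = 1.0
    beta = np.ones(K)
    linear_predictor = theta * D + Q @ v + U @ beta
    if model == "linear":
        y = linear_predictor + rng.standard_normal(n)
    elif model == "logistic":
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-linear_predictor)))
    else:
        raise ValueError("model must be 'linear' or 'logistic'")
    return y, D, Q, U
```

For example, `simulate(500, 600, model="logistic")` returns one replication at the dimensions used in the bottom panels of the figures.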

Figure 1: Coverage and length of the confidence interval under linear regression, averaged over 300 replications, with varying $p$ and fixed $n=500$ (top), and with varying $n$ and fixed $p=600$ (bottom). Black dashed lines indicate the 0.95 level. Purple solid lines are our proposed method. Blue two-dashed lines represent the oracle case. Green dotted lines indicate the naive method. Orange dashed lines represent the doubly debiased lasso method.

Results for the cases when $n=500$ and $p$ varies from 100 to 1500, and when $p=600$ and $n$ varies from 100 to 1500, are presented in Fig. 1 for the linear regression model and in Fig. 2 for the logistic regression model. From Fig. 1, we see that the naive method suffers from undercoverage since the effects induced by the unmeasured confounders are not taken into account. Our proposed method has coverage at approximately 0.95, similar to that of the oracle method. The proposed method also has confidence interval lengths similar to those of the oracle method. As a comparison, the doubly debiased lasso method suffers from undercoverage when $p$ is relatively small (at the smallest values of $p$ considered, with $n=500$). Our finding is consistent with that of Guo et al. (2022), who indicate that when $p$ is small relative to $n$, the bias from the hidden confounding effects can be relatively large, which leads to undercoverage of their method. As $p$ increases, the coverage of the doubly debiased lasso method approaches 0.95, but its confidence intervals are longer than those of our proposed method, indicating that our method is preferred in this case.

Under the logistic regression model, our proposed method performs similarly to the oracle method in terms of both coverage and length of the confidence intervals. Results for the doubly debiased lasso are not reported as it is not directly applicable to generalized linear models.

Figure 2: Coverage and length of the confidence interval under logistic regression, averaged over 300 replications, with varying $p$ and fixed $n=500$ (top), and with varying $n$ and fixed $p=600$ (bottom). Black dashed lines indicate the 0.95 level. Purple solid lines are our proposed method. Blue two-dashed lines represent the oracle case. Green dotted lines indicate the naive method.

In the remainder of this subsection, we show that our proposed method can still perform well when the working identifiability condition fails. More discussion of the working identifiability condition is given in Appendix B. Specifically, we set $W_{jk}^{*}\sim\text{Unif}[0,1]$ for $j=1,\dots,p$ and $k=1,2,3$. This loading matrix implies that each covariate vector $\bm{X}_{i}$ is related to all the confounders. The identifiability condition fails in this case, as $p^{-1}\bm{W}^{*}(\bm{\Sigma}_{e}^{*})^{-1}(\bm{W}^{*})^{\mathrm{T}}$ is not a diagonal matrix. We continue with the parameter setting from the start of this section, and the data generation process is unchanged except that the original loading matrix is replaced with the new loading matrix whose elements are uniformly distributed.
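A quick numerical check of why the condition fails is sketched below: with uniform loadings the matrix $p^{-1}\bm{W}^{*}(\bm{\Sigma}_{e}^{*})^{-1}(\bm{W}^{*})^{\mathrm{T}}$ has clearly nonzero off-diagonal entries, whereas the block-structured loadings used earlier give a diagonal matrix. The helper name `loading_gram` is hypothetical, and the error covariance is taken to be the identity because the errors are generated as standard normal.

```python
import numpy as np


def loading_gram(W, error_variances):
    """Compute p^{-1} W Sigma_e^{-1} W^T for a K x p loading matrix W and a
    length-p vector of error variances (diagonal Sigma_e is assumed here)."""
    p = W.shape[1]
    return (W / error_variances) @ W.T / p


rng = np.random.default_rng(1)
p, K = 600, 3
W_uniform = rng.uniform(0.0, 1.0, size=(K, p))   # new loadings: Unif[0, 1] entries
W_block = np.zeros((K, p))                       # original block-structured loadings
for k, value in enumerate([0.5, 1.0, 1.5]):
    W_block[k, k * (p // 3):(k + 1) * (p // 3)] = value

G_uniform = loading_gram(W_uniform, np.ones(p))
G_block = loading_gram(W_block, np.ones(p))
print(np.round(G_uniform, 3))   # off-diagonal entries are far from zero
print(np.round(G_block, 3))     # diagonal matrix with distinct diagonal entries
```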

For varying dimensions $n$ and $p$, we construct the confidence intervals using each method over 300 simulations, and then compute the coverage probabilities of the 95% confidence intervals for the true parameter and the average confidence interval lengths. Under the new loading matrix, the results for the linear regression model are shown in Fig. 3 and the results for the logistic regression model are shown in Fig. 4. We observe that the coverage rates of our proposed method still achieve the desired 0.95 level and are close to those of the oracle case. The naive method performs worse than the proposed method, with coverage rates much lower than the 0.95 level in most cases. For linear regression, the doubly debiased lasso method shows slight undercoverage in some cases but otherwise performs well in this setting, with coverage rates close to those of the oracle case, while its average confidence interval length is larger than that of the proposed method.

Figure 3: Coverage and length of the confidence interval under linear regression with the new loading matrix, averaged over 300 replications, with varying $p$ and fixed $n=500$ (top), and with varying $n$ and fixed $p=600$ (bottom). Black dashed lines indicate the 0.95 level. Purple solid lines are our proposed method. Blue two-dashed lines represent the oracle case. Green dotted lines indicate the naive method. Orange dashed lines represent the doubly debiased lasso method.

Figure 4: Coverage and length of the confidence interval under logistic regression with the new loading matrix, averaged over 300 replications, with varying $p$ and fixed $n=500$ (top), and with varying $n$ and fixed $p=600$ (bottom). Black dashed lines indicate the 0.95 level. Purple solid lines are our proposed method. Blue two-dashed lines represent the oracle case. Green dotted lines indicate the naive method.

5.2 Data Application

In this section, we apply the proposed method to a genetic data set containing gene expression quantifications and stimulation statuses in mouse bone marrow derived dendritic cells. The data were previously analyzed in Shalek et al. (2014) and Cai et al. (2023). Specifically, Cai et al. (2023) aimed to find significant gene expressions in response to three stimulations, which are (a) PAM, a synthetic mimic of bacterial lipopeptides; (b) PIC, a viral-like ribonucleic acid; and (c) LPS, a component of bacteria. However, there may be unmeasured confounders that lead to spurious discoveries. Our goal is to investigate the relationship between gene expression levels and the different stimulations while accounting for possible unmeasured confounders.

We start with pre-processing the data: we consider expression profiles after six hours of stimulation and perform gene filtering, expression level transformation, and normalization. The details can be found in Cai et al. (2023). After the pre-processing steps, we have three groups of cells including 64 PAM stimulated cells, 96 PIC stimulated cells, and 96 LPS stimulated cells, respectively. Moreover, each of the three groups contains 96 control cells without any stimulation. The gene expression quantifications are computed on 768 genes for the PAM stimulation group, 697 genes for the PIC stimulation group, and 798 genes for the LPS stimulation group. For each group of cells, the high-dimensional covariates are the gene expression levels, and the stimulation status is a binary response variable recording whether a cell is stimulated.

We fit the data using our proposed method and the naive method, in which we perform the debiasing approach without adjusting for unmeasured confounders. Applying each method, we construct 95% confidence intervals for the regression coefficient of every gene. In addition, we compute the $p$-value and effect size for each gene under each of the two methods. Our proposed method finds 7 possible confounders for the PAM and PIC stimulation groups, and 9 possible confounders for the LPS stimulation group. The 95% confidence intervals constructed with our proposed method for each of the three groups are shown in Fig. 5.

Figure 5: Confidence interval estimation using the proposed method under logistic regression for the stimulation groups of PAM (top), PIC (middle) and LPS (bottom). Purple intervals indicate confidence intervals that do not cover zero. Red intervals are the subset of purple intervals that remain significant after Bonferroni correction.

We see from Fig. 5 that the confidence intervals for some genes do not contain zero, suggesting that they are possibly associated with their respective stimulations. For valid statistical inference, we perform a Bonferroni correction to adjust for multiple hypothesis testing. As a comparison, we perform similar procedures using the confidence intervals constructed with the naive method to identify the significant genes. Genes that are significantly associated with the stimulation after Bonferroni correction under the proposed method and the naive method, respectively, are reported in Table 1. The small $p$-values and large effect sizes indicate that the corresponding genes are strongly associated with their respective stimulations. Some gene codes in Table 1 coincide with the findings in Cai et al. (2023), and their functional consequences for the stimulation are also supported in the existing literature. Specifically, Cai et al. (2023) suggested that there exist significant associations of “IL6” with PAM stimulation, “RSAD2” with PIC stimulation, and “CXCL10” and “IL12B” with LPS stimulation. Applying our proposed method, in which the effects of unmeasured confounders are considered, we identify “IL1B”, “IL6”, and “SAA3” as significant genes for PAM stimulated cells, “RSAD2” as significant for PIC stimulated cells, and “IL12B” as significant for LPS stimulated cells. On the other hand, when we apply the naive method, which does not adjust for unmeasured confounders, we identify the significant genes “IL1B” and “IL6” for PAM stimulated cells, “RSAD2” for PIC stimulated cells and “IL12B” for LPS stimulated cells. Comparing the two, our method that adjusts for unmeasured confounders identifies an additional gene, “SAA3”. In the genetics literature, experimental studies support the finding that “SAA3” plays important roles in immune reactions (Ather and Poynter, 2018). For instance, mice lacking the gene “SAA3” develop metabolic dysfunction along with defects in innate immune development (Ather and Poynter, 2018). This comparison suggests that our proposed method can identify significant genes/variables that are not captured by existing debiasing procedures that do not adjust for unmeasured confounders.
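The screening step described above (per-gene confidence intervals, $p$-values, and a Bonferroni adjustment) can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that each gene comes with a debiased estimate and a standard error, that the $p$-value is the two-sided normal $p$-value of the Wald statistic, and the function name `wald_bonferroni` and the input numbers are hypothetical.

```python
import math
from statistics import NormalDist


def wald_bonferroni(estimates, std_errors, alpha=0.05):
    """Per-coefficient Wald summaries with a Bonferroni-adjusted decision."""
    n_tests = len(estimates)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    summaries = []
    for est, se in zip(estimates, std_errors):
        z = est / se
        # Two-sided normal p-value; erfc stays accurate for very small tails.
        p_value = math.erfc(abs(z) / math.sqrt(2.0))
        lower, upper = est - z_crit * se, est + z_crit * se
        summaries.append({
            "z": z,
            "p_value": p_value,
            "ci": (lower, upper),
            "ci_excludes_zero": lower > 0.0 or upper < 0.0,
            # Bonferroni: compare each p-value with alpha divided by the number of tests.
            "bonferroni_significant": p_value < alpha / n_tests,
        })
    return summaries


# Hypothetical estimates and standard errors for four genes (not from the data set).
results = wald_bonferroni([0.10, 0.90, 2.15, -0.05], [0.20, 0.40, 0.50, 0.30])
for row in results:
    print(row)
```

In this toy input, two intervals exclude zero but only one survives the Bonferroni threshold, which mirrors the distinction between the purple and red intervals in Fig. 5.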

Method            Stimulation   Gene code   $p$-value                 Effect size
Proposed method   PAM           IL1B        $1.743\times 10^{-5}$     4.295
                                IL6         $8.940\times 10^{-11}$    6.484
                                SAA3        $8.800\times 10^{-6}$     4.445
                  PIC           RSAD2       $4.441\times 10^{-16}$    8.144
                  LPS           IL12B       $5.180\times 10^{-9}$     5.841
Naive method      PAM           IL1B        $6.908\times 10^{-7}$     4.963
                                IL6         $2.745\times 10^{-12}$    6.990
                  PIC           RSAD2       $<1\times 10^{-16}$       8.604
                  LPS           IL12B       $4.187\times 10^{-8}$     5.482
Table 1: Significant gene expression associated with stimulations after Bonferroni correction estimated using the proposed method and naive method, respectively.

Moreover, many genetic studies support that, in addition to “SAA3”, the genes discovered by our proposed method all play important roles in the immune response. The gene “IL1B” encodes a protein in the interleukin 1 cytokine family, which is an important mediator of the inflammatory response, and its induction can contribute to inflammatory pain hypersensitivity (Nemetz et al., 1999). The genes “IL6” and “IL12B” are also known to encode cytokines that play key roles in host defense through stimulation and immune reactions (Müller-Berghaus et al., 2004; Toyoshima et al., 2019). The expression of the gene “RSAD2” is important in antiviral innate immune responses, and “RSAD2” is also a powerful stimulator of the adaptive immune response mediated via mature dendritic cells (Jang et al., 2018).

6 Discussion

This manuscript studies the statistical inference problem for the high-dimensional generalized linear model in the presence of hidden confounding bias. We propose a debiasing approach to construct a consistent estimator of the individual coefficient of interest and its corresponding confidence interval, which generalizes the existing debiasing approach to account for the effects induced by the unmeasured confounders. Theoretical properties are also established for the proposed procedure.

The goal of this paper is to conduct statistical inference for the individual coefficient $\theta$. The purpose of adjusting for the unmeasured confounders in the proposed method is to use the latent information for better inference on the covariate coefficient $\theta$, while inference for the coefficient of the unmeasured confounders $\bm{\beta}$ is not the focus of the current work. This problem setting has wide scientific applications. For instance, as introduced in our data application, biologists are interested in the significance of the genes under the stimulations with the unmeasured confounders adjusted. Moreover, in many practical settings with finite samples, consistent estimation of the number of factors may be problematic; our method only requires that the estimated factors approximate the confounders well, and the interpretation of the factors is of less interest in this setting. To investigate how the accuracy of the estimation of the dimension of unmeasured confounders $K$ affects the asymptotic distribution of the debiased estimator, we conducted a simulation study in Appendix C. The results suggest that, in practice, overestimation of $K$ may not affect the asymptotic normality results, which may be because a surplus of estimated confounders can still fully capture the covariate information and would not incur bias in the point and interval estimation.

Nevertheless, inference on the unmeasured confounders is also of great interest in many econometrics and psychometrics applications. For instance, Fan et al. (2023) recently proposed an ANOVA-type procedure for testing whether unmeasured confounders exist. It would be interesting to develop similar testing procedures under generalized linear models for applications in which inference on the unmeasured confounders is of scientific interest.

Another interesting related problem is inference on group-wise covariate coefficients. We next briefly describe how to generalize our method to obtain the group-wise asymptotic distribution of $\max_{j\in G}|\gamma_{j}|$ via a bootstrap-assisted procedure for $\bm{\gamma}=(\theta,\bm{v}^{\mathrm{T}})\in\mathbb{R}^{p}$ and any subset $G\subseteq\{1,\dots,p\}$. The procedure follows the simultaneous inference results of Zhang and Cheng (2017) and is as follows.

  • Step 1: For each $j\in G$, we construct a debiased estimator $\tilde{\gamma}_{j}$ of $\gamma_{j}$ using our proposed method. We generate a sequence of i.i.d. standard normal random variables and denote them as $\{\varpi_{i}\}_{i=1,\dots,n}$.

  • Step 2: Then, under the null hypothesis $H_{0,G}:\gamma_{j}^{*}=\gamma_{j}^{0}$ for $j\in G$, let $W_{G}=\max_{j\in G}|\sum_{i=1}^{n}\hat{I}_{\gamma_{j}\mid\bm{\eta}_{-j}}^{-1}\hat{S}(\gamma_{j}^{0},\hat{\bm{\eta}}_{-j})\varpi_{i}/\sqrt{n}|$, where $\hat{S}(\gamma_{j}^{0},\hat{\bm{\eta}}_{-j})$ can be calculated similarly to $\hat{S}(\theta^{0},\hat{\bm{\zeta}})$ and $\hat{I}_{\gamma_{j}\mid\bm{\eta}_{-j}}$ can be calculated similarly to $\hat{I}_{\theta\mid\bm{\zeta}}$ in Section 3, by treating $\gamma_{j}$ as the coefficient of interest and $\bm{\eta}_{-j}$ as the coefficients of the remaining nuisance covariates and unmeasured confounders.

  • Step 3: Next, we calculate the critical value $C_{G}(\alpha)$ at the $(1-\alpha)$ significance level by $C_{G}(\alpha)=\inf[t\in\mathbb{R}:P\{W_{G}\leq t\mid(y_{i},\bm{X}_{i})_{i=1,\dots,n}\}\geq 1-\alpha]$.

With the debiased estimators and the critical value, we expect a result similar to Theorem 4.1 in Zhang and Cheng (2017), namely that the asymptotic distribution of $\max_{j\in G}\sqrt{n}|\tilde{\gamma}_{j}-\gamma_{j}^{*}|$ satisfies $\sup_{\alpha\in(0,1)}|P\{\max_{j\in G}\sqrt{n}|\tilde{\gamma}_{j}-\gamma_{j}^{*}|>C_{G}(\alpha)\}-\alpha|=o(1)$. While inference on group-wise maximum coefficients is an interesting problem, it is not the focus of this work, and we leave the related theoretical proof for future research.
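The multiplier bootstrap in Steps 1 to 3 can be sketched numerically as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the input `xi` is a hypothetical $n\times|G|$ array whose $(i,j)$ entry is the $i$th observation's contribution $\hat{I}_{\gamma_{j}\mid\bm{\eta}_{-j}}^{-1}\hat{S}_{i}(\gamma_{j}^{0},\hat{\bm{\eta}}_{-j})$, and the conditional quantile $C_{G}(\alpha)$ is approximated by Monte Carlo over `n_boot` draws of the multipliers.

```python
import numpy as np


def bootstrap_critical_value(xi, alpha=0.05, n_boot=2000, seed=0):
    """Monte Carlo approximation of C_G(alpha) for the max-type statistic W_G.

    xi : (n, |G|) array of per-observation, information-scaled score terms
         (a hypothetical input; how it is computed follows Section 3 of the paper).
    """
    xi = np.asarray(xi, dtype=float)
    n = xi.shape[0]
    rng = np.random.default_rng(seed)
    # Each row holds one draw of the i.i.d. N(0, 1) multipliers varpi_1..varpi_n.
    multipliers = rng.standard_normal((n_boot, n))
    # W_G for every draw: max over j of |sum_i xi_ij * varpi_i| / sqrt(n).
    w_g = np.abs(multipliers @ xi).max(axis=1) / np.sqrt(n)
    # Empirical (1 - alpha) quantile of the bootstrap draws.
    return float(np.quantile(w_g, 1.0 - alpha))


# Toy input: a single coordinate with unit contributions, so W_G is |N(0, 1)|.
print(bootstrap_critical_value(np.ones((50, 1))))
```

One would then reject $H_{0,G}$ when $\max_{j\in G}\sqrt{n}|\tilde{\gamma}_{j}-\gamma_{j}^{0}|$ exceeds the returned value; the theoretical justification of this step is left open in the text above.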

Besides the aforementioned extensions, there are several related problems worth investigating in the future. For instance, as discussed in Section 2.2, there are many models in the existing literature related to our problem; in particular, the factor-adjusted method can be extended to the situation involving hidden confounding (Fan et al., 2020). In addition, in our theoretical analysis, we assume that the dimension of unmeasured confounders, $K$, is fixed, which has also been imposed in the existing literature (e.g., Wang and Fan, 2017; Fan et al., 2023). One possible extension is to allow $K$ to grow as $n$ and $p$ increase, which involves generalizing the theoretical results on the maximum likelihood estimation for the unmeasured confounders. It would also be interesting to generalize the factor model to a nonlinear structure and investigate the theoretical properties of the debiased estimator under the generalized factor model (Chen et al., 2020; Liu et al., 2023). Besides generalized linear models, the high-dimensional debiasing technique is also widely studied in a variety of models such as Gaussian graphical models (Ren et al., 2015; Zhu et al., 2020) and additive hazards models (Lin and Ying, 1994; Lin and Lv, 2013). For instance, consider additive hazards models, which are widely used in survival analysis and assume the conditional hazard function at time $t$ to be $\lambda(t\mid D,\bm{Q},\bm{U})=\lambda_{0}(t)+\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U}$, with $D$ and $\bm{Q}$ covariates and $\bm{U}$ unmeasured confounders. The relationship between the covariates $\{D,\bm{Q}\}$ and the unmeasured confounders $\bm{U}$ is also modeled by a linear factor model. Our method can be used to perform inference on $\theta$ under this model setup based on the quadratic loss function. Generalizing our proposed approach to these models to adjust for possible unmeasured confounders is also an interesting direction to investigate.

Appendix A. Preliminaries

We start by introducing some notation and definitions. For a vector $\bm{r}=(r_{1},\dots,r_{l})^{\mathrm{T}}$, we let $\mathcal{S}_{r}=\{j:r_{j}\neq 0\}$, $\|\bm{r}\|_{\infty}=\max_{j=1,\ldots,l}|r_{j}|$, $\|\bm{r}\|_{q}=(\sum_{j=1}^{l}|r_{j}|^{q})^{1/q}$ for $q\geq 1$ and $s_{r}=\text{card}(\mathcal{S}_{r})$. For a matrix $\bm{A}=(a_{ij})_{n\times l}$, let $\|\bm{A}\|_{\infty,1}=\max_{j=1,\ldots,l}\sum_{i=1}^{n}|a_{ij}|$ be the maximum absolute column sum, $\|\bm{A}\|_{1,\infty}=\max_{i=1,\ldots,n}\sum_{j=1}^{l}|a_{ij}|$ be the maximum absolute row sum, $\|\bm{A}\|_{\max}=\max_{i,j}|a_{ij}|$ be the maximum absolute entry of the matrix, $\lambda_{\min}(\bm{A})$ and $\lambda_{\max}(\bm{A})$ be the smallest and largest eigenvalues of $\bm{A}$, and $\|\bm{A}\|_{F}=(\sum_{i=1}^{n}\sum_{j=1}^{l}|a_{ij}|^{2})^{1/2}$ be the Frobenius norm of $\bm{A}$. For sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}\lesssim b_{n}$ if there exists a constant $C>0$ such that $a_{n}\leq Cb_{n}$ for all $n$, and $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$. For any sub-exponential random variable $Y_{1}$, we define the sub-exponential norm as $\|Y_{1}\|_{\varphi_{1}}=\inf[s>0:E\{\exp(Y_{1}/s)\}\leq 2]$. For any sub-Gaussian random variable $Y_{2}$, we define the sub-Gaussian norm as $\|Y_{2}\|_{\varphi_{2}}=\inf[s>0:E\{\exp(Y_{2}^{2}/s^{2})\}\leq 2]$.
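As a quick numerical companion to the notation above, the following sketch evaluates the vector and matrix quantities on small inputs. The function names and the example values are hypothetical and serve only to make the definitions concrete; indices in the code are zero-based, whereas the text counts from one.

```python
import numpy as np


def vector_summaries(r, q=2.0):
    """Support, sparsity, sup-norm and l_q norm of a vector r."""
    r = np.asarray(r, dtype=float)
    support = np.flatnonzero(r)  # S_r, reported with zero-based indices
    return {
        "support": support.tolist(),
        "sparsity": int(support.size),                          # s_r = card(S_r)
        "norm_inf": float(np.abs(r).max()),                     # max_j |r_j|
        "norm_q": float((np.abs(r) ** q).sum() ** (1.0 / q)),   # (sum_j |r_j|^q)^(1/q)
    }


def matrix_norms(A):
    """Matrix norms used in the appendix, for an n x l matrix A."""
    A = np.abs(np.asarray(A, dtype=float))
    return {
        "inf_1": float(A.sum(axis=0).max()),      # maximum absolute column sum
        "one_inf": float(A.sum(axis=1).max()),    # maximum absolute row sum
        "max": float(A.max()),                    # maximum absolute entry
        "fro": float(np.sqrt((A ** 2).sum())),    # Frobenius norm
    }


print(vector_summaries([3.0, 0.0, -4.0]))
print(matrix_norms([[1.0, -2.0, 0.0], [3.0, 4.0, -5.0]]))
```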

Next, we review our model framework. We assume that the response $y$, given the covariate of interest $D$, the nuisance covariates $\bm{Q}$ and the unmeasured confounders $\bm{U}$, follows the generalized linear model with probability density (mass) function

$$f(y)=\exp\left[\{y(\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U})-b(\theta D+\bm{v}^{\mathrm{T}}\bm{Q}+\bm{\beta}^{\mathrm{T}}\bm{U})\}/a(\phi)+c(y,\phi)\right], \tag{A8}$$

and the relationship between $\bm{X}$ and $\bm{U}$ is

$$\bm{X}=\bm{W}^{\mathrm{T}}\bm{U}+\bm{E}. \tag{A9}$$

We let $\bm{Z}=(D,\bm{Q}^{\mathrm{T}},\bm{U}^{\mathrm{T}})^{\mathrm{T}}$ be a vector that includes all of the covariates and the unmeasured confounders, and let $\bm{\eta}^{\mathrm{T}}=(\theta,\bm{v}^{\mathrm{T}},\bm{\beta}^{\mathrm{T}})$ be the corresponding parameters. For notational convenience, we also let $\bm{M}=(\bm{Q}^{\mathrm{T}},\bm{U}^{\mathrm{T}})^{\mathrm{T}}$ with coefficient $\bm{\zeta}=(\bm{v}^{\mathrm{T}},\bm{\beta}^{\mathrm{T}})^{\mathrm{T}}$, so $\bm{\eta}=(\theta,\bm{\zeta}^{\mathrm{T}})^{\mathrm{T}}$. We denote $\bm{\gamma}=(\theta,\bm{v}^{\mathrm{T}})^{\mathrm{T}}$ as the coefficient for the covariates $\bm{X}$, so $\bm{\eta}=(\bm{\gamma}^{\mathrm{T}},\bm{\beta}^{\mathrm{T}})^{\mathrm{T}}$.

Throughout the appendix, we use a superscript asterisk to indicate the population parameters. We assume that the observed data $\{y_{i},\bm{X}_{i}\}_{i=1,\dots,n}$ and the unmeasured confounders $\{\bm{U}_{i}\}_{i=1,\dots,n}$ are realizations of (A8) and (A9). The noise terms of the factor model are denoted as $\{\bm{E}_{i}\}_{i=1,\ldots,n}$ with $\bm{E}_{i}=(E_{ij}:j=1,\ldots,p)$.

In the proposed inferential procedure, we first obtain the maximum likelihood estimator $\hat{\bm{U}}_{i}$ of the unmeasured confounders. With the estimated unmeasured confounders, we let $\dot{\bm{Z}}_{i}=(D_{i},\bm{Q}_{i}^{\mathrm{T}},\hat{\bm{U}}_{i}^{\mathrm{T}})^{\mathrm{T}}$ and $\dot{\bm{M}}_{i}=(\bm{Q}_{i}^{\mathrm{T}},\hat{\bm{U}}_{i}^{\mathrm{T}})^{\mathrm{T}}$, and obtain an initial lasso estimator

$$\hat{\bm{\eta}}=\underset{\bm{\eta}\in\mathbb{R}^{p+K}}{\operatorname{argmin}}~l(\bm{\eta})+\lambda\|\bm{\gamma}\|_{1}, \tag{A10}$$

where the loss function is the negative log-likelihood function defined as

$$l(\bm{\eta})=-\frac{1}{n}\sum_{i=1}^{n}\{y_{i}\bm{\eta}^{\mathrm{T}}\dot{\bm{Z}}_{i}-b(\bm{\eta}^{\mathrm{T}}\dot{\bm{Z}}_{i})\}.$$

The gradient $\nabla l(\bm{\eta})$ and Hessian $\nabla^{2}l(\bm{\eta})$ of the loss function are used repeatedly in our subsequent proofs and are given by

$$\nabla l(\bm{\eta})=-\frac{1}{n}\sum_{i=1}^{n}\{y_{i}-b^{\prime}(\bm{\eta}^{\mathrm{T}}\dot{\bm{Z}}_{i})\}\dot{\bm{Z}}_{i};\quad\nabla^{2}l(\bm{\eta})=\frac{1}{n}\sum_{i=1}^{n}b^{\prime\prime}(\bm{\eta}^{\mathrm{T}}\dot{\bm{Z}}_{i})\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{T}}.$$
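To make the loss, gradient and Hessian above concrete, the following sketch evaluates them for the logistic case, where $b(t)=\log(1+e^{t})$. The choice of the logistic link and all function names are assumptions made for illustration; the array `Z` plays the role of the stacked rows $\dot{\bm{Z}}_{i}^{\mathrm{T}}$.

```python
import numpy as np


def b(t):
    """Cumulant function of the logistic model: log(1 + exp(t)), computed stably."""
    return np.logaddexp(0.0, t)


def b_prime(t):
    """First derivative of b: the logistic mean function."""
    return 1.0 / (1.0 + np.exp(-t))


def b_double_prime(t):
    """Second derivative of b: the logistic variance function."""
    mu = b_prime(t)
    return mu * (1.0 - mu)


def loss(eta, Z, y):
    """Negative log-likelihood l(eta) averaged over the n observations."""
    t = Z @ eta
    return -np.mean(y * t - b(t))


def gradient(eta, Z, y):
    """Gradient of l(eta): -(1/n) * sum_i {y_i - b'(eta^T Z_i)} Z_i."""
    t = Z @ eta
    return -(Z.T @ (y - b_prime(t))) / len(y)


def hessian(eta, Z):
    """Hessian of l(eta): (1/n) * sum_i b''(eta^T Z_i) Z_i Z_i^T."""
    t = Z @ eta
    return (Z.T * b_double_prime(t)) @ Z / Z.shape[0]
```

A finite-difference comparison of `gradient` against `loss` is a simple way to check such formulas before using them inside an optimizer.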

In constructing the debiased estimator, we define $\bm{w}^{*}=\bm{I}_{\theta\bm{\zeta}}^{*}(\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*})^{-1}$ and $\bm{\tau}^{*}=(1,-\bm{w}^{*})^{\mathrm{T}}$, where $\bm{I}_{\theta\bm{\zeta}}^{*}=E[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}D_{i}\bm{M}_{i}^{\mathrm{T}}]$ and $\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}=E[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}]$. We denote two sub-vectors of $\bm{w}^{*}$ as $\bm{w}_{q}^{*}=(w_{2}^{*},\ldots,w_{q}^{*})^{\mathrm{T}}$ and $\bm{w}_{u}^{*}=(w_{p+1}^{*},\ldots,w_{p+K}^{*})^{\mathrm{T}}$. We define the estimator for the sparse vector $\bm{w}$ as

$$\hat{\bm{w}}=\underset{\bm{w}\in\mathbb{R}^{p+K-1}}{\operatorname{argmin}}~\frac{1}{2n}\sum_{i=1}^{n}\{\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}\bm{\zeta}}l_{i}(\hat{\theta},\hat{\bm{\zeta}})\bm{w}-2\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}\theta}l_{i}(\hat{\theta},\hat{\bm{\zeta}})\}+\lambda^{\prime}\|\bm{w}\|_{1}, \tag{A11}$$

where $l_{i}(\theta,\bm{\zeta})=-y_{i}(\theta D_{i}+\bm{\zeta}^{\mathrm{T}}\dot{\bm{M}}_{i})+b(\theta D_{i}+\bm{\zeta}^{\mathrm{T}}\dot{\bm{M}}_{i})$ is equivalent to $l_{i}(\bm{\eta})$, the $i$th component of the loss function. Equivalently, the estimator $\hat{\bm{w}}$ is obtained by

$$\hat{\bm{w}}=\underset{\bm{w}\in\mathbb{R}^{p+K-1}}{\operatorname{argmin}}~\frac{1}{2n}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\theta}D_{i}+\hat{\bm{\zeta}}^{\mathrm{T}}\dot{\bm{M}}_{i})(D_{i}-\bm{w}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}+\lambda^{\prime}\|\bm{w}\|_{1}.$$
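The equivalence between the two forms of $\hat{\bm{w}}$ can be seen by completing the square. The short derivation below is a sketch that uses the shorthand $b_{i}^{\prime\prime}$ (introduced here only for brevity) for $b^{\prime\prime}(\hat{\theta}D_{i}+\hat{\bm{\zeta}}^{\mathrm{T}}\dot{\bm{M}}_{i})$; the leftover term $b_{i}^{\prime\prime}D_{i}^{2}$ does not depend on $\bm{w}$, so both objectives share the same minimizer.

```latex
\begin{aligned}
\nabla_{\bm{\zeta}\bm{\zeta}}l_{i}(\hat{\theta},\hat{\bm{\zeta}})
  &= b_{i}^{\prime\prime}\,\dot{\bm{M}}_{i}\dot{\bm{M}}_{i}^{\mathrm{T}},
\qquad
\nabla_{\bm{\zeta}\theta}l_{i}(\hat{\theta},\hat{\bm{\zeta}})
   = b_{i}^{\prime\prime}\,D_{i}\dot{\bm{M}}_{i},\\
\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}\bm{\zeta}}l_{i}(\hat{\theta},\hat{\bm{\zeta}})\bm{w}
 -2\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}\theta}l_{i}(\hat{\theta},\hat{\bm{\zeta}})
  &= b_{i}^{\prime\prime}\{(\bm{w}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}-2D_{i}\,\bm{w}^{\mathrm{T}}\dot{\bm{M}}_{i}\}\\
  &= b_{i}^{\prime\prime}(D_{i}-\bm{w}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}-b_{i}^{\prime\prime}D_{i}^{2}.
\end{aligned}
```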

The estimator $\hat{\bm{w}}$ is used in estimating the generalized decorrelated score function and the partial Fisher information matrix. Under the null hypothesis $H_{0}:\theta^{*}=\theta^{0}$, the two estimators are

$$\begin{aligned}
\hat{S}(\theta^{0},\hat{\bm{\zeta}}) &= -\frac{1}{n}\sum_{i=1}^{n}\{y_{i}-b^{\prime}(\theta^{0}D_{i}+\hat{\bm{v}}^{\mathrm{T}}\bm{Q}_{i}+\hat{\bm{\beta}}^{\mathrm{T}}\hat{\bm{U}}_{i})\}(D_{i}-\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i});\\
\hat{I}_{\theta\mid\bm{\zeta}} &= \frac{1}{n}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\theta}D_{i}+\hat{\bm{v}}^{\mathrm{T}}\bm{Q}_{i}+\hat{\bm{\beta}}^{\mathrm{T}}\hat{\bm{U}}_{i})D_{i}(D_{i}-\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i}).
\end{aligned}$$
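The two estimators above translate directly into array operations. The sketch below assumes the logistic case for $b^{\prime}$ and $b^{\prime\prime}$ and uses hypothetical function and argument names; `M` stacks the rows $\dot{\bm{M}}_{i}^{\mathrm{T}}$, so `M @ zeta_hat` equals $\hat{\bm{v}}^{\mathrm{T}}\bm{Q}_{i}+\hat{\bm{\beta}}^{\mathrm{T}}\hat{\bm{U}}_{i}$ row by row.

```python
import numpy as np


def b_prime(t):
    """Logistic mean function (an assumed choice of b')."""
    return 1.0 / (1.0 + np.exp(-t))


def b_double_prime(t):
    """Logistic variance function (an assumed choice of b'')."""
    mu = b_prime(t)
    return mu * (1.0 - mu)


def decorrelated_score(theta0, zeta_hat, w_hat, D, M, y):
    """S-hat(theta0, zeta-hat): residuals weighted by the decorrelated covariate."""
    residual = y - b_prime(theta0 * D + M @ zeta_hat)
    return -np.mean(residual * (D - M @ w_hat))


def partial_information(theta_hat, zeta_hat, w_hat, D, M):
    """I-hat_{theta | zeta}: weighted cross moment of D and D - w^T M."""
    weights = b_double_prime(theta_hat * D + M @ zeta_hat)
    return np.mean(weights * D * (D - M @ w_hat))
```

As a sanity check in low dimensions, if `w_hat` solves the unpenalized weighted least squares problem (that is, $\lambda^{\prime}=0$), the normal equations imply that `partial_information` coincides with the weighted mean of squared residuals $(D_{i}-\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}$, which is nonnegative.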

Our theoretical results are established under the asymptotic regime with $n,p\rightarrow\infty$. Regarding the factor model, as stated in Assumption 1 in Section 4 of the main text, we do not assume the random errors $\bm{E}_{i}$ to be identically distributed, nor does the model assumption require the covariance of $\bm{E}_{i}$ to be diagonal. Specifically, we assume for some large constant $C>0$: (a) $\mathbb{E}(E_{ij})=0$, $\mathbb{E}(E_{ij}^{8})\leq C$; (b) $\mathbb{E}(E_{ih}E_{ij})=\tau_{i,hj}$ with $|\tau_{i,hj}|\leq\tau_{hj}$ for some $\tau_{hj}>0$ and all $i=1,\dots,n$, and $\sum_{h=1}^{p}\tau_{hj}\leq C$ for all $j=1,\dots,p$; (c) $\mathbb{E}(E_{ij}E_{sj})=\rho_{is,j}$ with $|\rho_{is,j}|\leq\rho_{is}$ for some $\rho_{is}>0$ and all $j=1,\dots,p$, and $n^{-1}\sum_{i=1}^{n}\sum_{s=1}^{n}\rho_{is}\leq C$; (d) for all $j,q=1,\dots,p$,

$$\mathbb{E}\left\{\left|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}[E_{ij}E_{iq}-\mathbb{E}(E_{ij}E_{iq})]\right|^{4}\right\}\leq C.$$

Regarding the loading matrix, we assume $\|\bm{W}_{j}^{*}\|_{2}\leq C$. There exist positive definite matrices $\bm{\Gamma}^{*}$ and $\Upsilon^{*}$ such that $\lim_{p\rightarrow\infty}p^{-1}\bm{W}^{*}(\bm{\Sigma}_{e}^{*})^{-1}(\bm{W}^{*})^{\mathrm{T}}=\bm{\Gamma}^{*}$ and $\lim_{p\rightarrow\infty}p^{-1}\sum_{j=1}^{p}(\sigma_{j}^{*})^{-4}\{(\bm{W}_{j}^{*})^{\mathrm{T}}\otimes(\bm{W}_{j}^{*})^{\mathrm{T}}\}(\bm{W}_{j}^{*}\otimes\bm{W}_{j}^{*})=\Upsilon^{*}$. In addition, we also assume a working identifiability condition that $\bm{S}_{u}=\bm{I}_{K}$ and $p^{-1}\bm{W}^{*}(\bm{\Sigma}_{e}^{*})^{-1}(\bm{W}^{*})^{\mathrm{T}}$ is a diagonal matrix with distinct entries.

For the assumptions regarding the generalized linear model relating $y$ and $(\bm{X},\bm{U})$, as mentioned in Assumption 2 in Section 4 of the main text, we let $\lambda_{\min}(\bm{I}^{*})\geq\kappa$, where $\bm{I}^{*}=E[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{Z}_{i}\bm{Z}_{i}^{\mathrm{T}}]$ and $\kappa$ is some constant. The unmeasured confounders, the covariates and the coefficient parameters are assumed to be bounded, that is, $\|\bm{U}_{i}\|_{\infty}\leq M$, $\|\bm{X}_{i}\|_{\infty}\leq M$, $\|\bm{\eta}^{*}\|_{\infty}\leq M$, and $|(\bm{w}_{q}^{*})^{\mathrm{T}}\bm{Q}_{i}|\leq M$ for some constant $M>0$. We let $|y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}|$ be sub-exponential with $\|y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\|_{\varphi_{1}}\leq M$. In addition, we assume $a_{1}\leq(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\leq a_{2}$, $0\leq|b^{\prime}(t)|\leq B$ with $|b^{\prime}(t_{1})-b^{\prime}(t)|\leq B|(t_{1}-t)b^{\prime}(t)|$, and $0\leq b^{\prime\prime}(t)\leq B$ with $|b^{\prime\prime}(t_{1})-b^{\prime\prime}(t)|\leq B|t_{1}-t|b^{\prime\prime}(t)$ for constants $a_{1}$, $a_{2}$ and $B$, where $t\in[a_{1}-\epsilon,a_{2}+\epsilon]$ for $\epsilon>0$ and the sequence $t_{1}$ satisfies $|t_{1}-t|=o(1)$.

With the notations and assumptions revisited, we next present the discussion for the working identifiability condition and then the proofs of Theorems 2 and 3.

Appendix B. Discussion on Working Identifiability Condition

In this section, we clarify the working identifiability condition in the main text. Specifically, we show that when the working identifiability condition does not hold, we can transform the model into an identifiable model, and the transformed model has the same parameter of interest $\theta$ as the original model.

When the identifiability condition is not satisfied, the estimators are not fully identifiable in the sense that, for any invertible matrix $\bm{O}$, $\tilde{\bm{W}}=\bm{O}^{\mathrm{T}}\hat{\bm{W}}$ and $\tilde{\bm{S}}_{u}=\bm{O}^{-1}\hat{\bm{S}}_{u}(\bm{O}^{-1})^{\mathrm{T}}$ are also valid maximum likelihood estimators. With $\tilde{\bm{W}}=\bm{O}^{\mathrm{T}}\hat{\bm{W}}$, we can substitute the expression for $\tilde{\bm{W}}$ into (4) to get the relationship between $\tilde{\bm{U}}_{i}$ and $\hat{\bm{U}}_{i}$ as follows,

$$\begin{aligned}
\tilde{\bm{U}}_{i} &= (\tilde{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\tilde{\bm{W}}^{\mathrm{T}})^{-1}\tilde{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{X}_{i}-\bar{\bm{X}})\\
&= (\bm{O}^{\mathrm{T}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\hat{\bm{W}}^{\mathrm{T}}\bm{O})^{-1}\bm{O}^{\mathrm{T}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{X}_{i}-\bar{\bm{X}})\\
&= \bm{O}^{-1}(\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\hat{\bm{W}}^{\mathrm{T}})^{-1}(\bm{O}^{-1})^{\mathrm{T}}\bm{O}^{\mathrm{T}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{X}_{i}-\bar{\bm{X}})\\
&= \bm{O}^{-1}\hat{\bm{U}}_{i}.
\end{aligned}$$
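The identity $\tilde{\bm{U}}_{i}=\bm{O}^{-1}\hat{\bm{U}}_{i}$ can be checked numerically. The sketch below is illustrative only: `factor_scores` is a hypothetical helper that evaluates the generalized least squares formula in the first line of the display for every observation at once, with `W` of size $K\times p$ and the rows of `X` holding $\bm{X}_{i}^{\mathrm{T}}$.

```python
import numpy as np


def factor_scores(W, sigma_e_inv, X):
    """Rows are (W S^{-1} W^T)^{-1} W S^{-1} (X_i - X_bar), one row per observation."""
    Xc = X - X.mean(axis=0)        # center each covariate
    WS = W @ sigma_e_inv           # K x p
    return np.linalg.solve(WS @ W.T, WS @ Xc.T).T


rng = np.random.default_rng(3)
n, p, K = 30, 8, 2
W_hat = rng.standard_normal((K, p))
sigma_e_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, size=p))
X = rng.standard_normal((n, p))
O = rng.standard_normal((K, K)) + 3.0 * np.eye(K)   # an arbitrary invertible matrix

U_hat = factor_scores(W_hat, sigma_e_inv, X)
U_tilde = factor_scores(O.T @ W_hat, sigma_e_inv, X)

# Row-wise, U_tilde_i = O^{-1} U_hat_i, i.e. U_tilde = U_hat (O^{-1})^T in matrix form.
print(np.allclose(U_tilde, U_hat @ np.linalg.inv(O).T))
```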

Suppose $\hat{\bm{W}}$ and $\hat{\bm{U}}_{i}$ are asymptotically unbiased estimators for $\bm{W}^{*}$ and $\bm{U}_{i}^{*}$, respectively. Here we clarify that we use $\bm{S}_{u}^{*}$ and $\bm{U}_{i}^{*}$ to distinguish the unmeasured confounders before the transformation from those of the transformed model; this does not imply that $\bm{S}_{u}^{*}$ and $\bm{U}_{i}^{*}$ are population parameters. Then $\tilde{\bm{W}}$ and $\tilde{\bm{U}}_{i}$ are asymptotically unbiased estimators for $\bm{O}^{\mathrm{T}}\bm{W}^{*}$ and $\bm{O}^{-1}\bm{U}_{i}^{*}$.

When the working identifiability condition does not hold, that is, when $\bm{S}_{u}^{*}\neq\bm{I}_{K}$ and/or $p^{-1}\bm{W}^{*}\bm{\Sigma}_{e}^{-1}(\bm{W}^{*})^{\mathrm{T}}$ is not a diagonal matrix with distinct entries, we can find an invertible matrix $\bm{O}$ that transforms the true parameters to satisfy the assumption. We then build a correspondence between the true model and the model corresponding to the transformed parameters, showing that the two models have the same parameter of interest $\theta$.

The transformed parameters are $\bm{W}^{r}=\bm{O}_{2}^{-1}\bm{O}_{1}^{\mathrm{T}}\bm{W}^{*}$ and $\bm{U}_{i}^{r}=\bm{O}_{2}^{\mathrm{T}}\bm{O}_{1}^{-1}\bm{U}_{i}^{*}$, which are constructed in two steps. In the first step, for any $\bm{S}_{u}^{*}\neq\bm{I}_{K}$, we can find a matrix $\bm{O}_{1}$ such that

$$n^{-1}\sum_{i=1}^{n}\bm{O}_{1}^{-1}(\bm{U}_{i}^{*}-\bar{\bm{U}}^{*})(\bm{U}_{i}^{*}-\bar{\bm{U}}^{*})^{\mathrm{T}}(\bm{O}_{1}^{-1})^{\mathrm{T}}=\bm{O}_{1}^{-1}\bm{S}_{u}^{*}(\bm{O}_{1}^{-1})^{\mathrm{T}}=\bm{I}_{K}.$$

In the second step, since $p^{-1}\bm{W}^{*}\bm{\Sigma}_{e}^{-1}(\bm{W}^{*})^{\mathrm{T}}$ is symmetric, so is $p^{-1}\bm{O}_{1}^{\mathrm{T}}\bm{W}^{*}\bm{\Sigma}_{e}^{-1}(\bm{O}_{1}^{\mathrm{T}}\bm{W}^{*})^{\mathrm{T}}$, and there exists an orthogonal matrix $\bm{O}_{2}$ whose columns are the eigenvectors of the latter, so that we can decompose $p^{-1}\bm{O}_{1}^{\mathrm{T}}\bm{W}^{*}\bm{\Sigma}_{e}^{-1}(\bm{O}_{1}^{\mathrm{T}}\bm{W}^{*})^{\mathrm{T}}=\bm{O}_{2}\bm{\Lambda}\bm{O}_{2}^{\mathrm{T}}$, where $\bm{\Lambda}$ is diagonal with distinct eigenvalues on the diagonal.

We verify that the transformed parameters satisfy the identifiability condition as follows. With $\bm{W}^{r}=\bm{O}_{2}^{-1}\bm{O}_{1}^{\mathrm{T}}\bm{W}^{*}$ and $\bm{U}_{i}^{r}=\bm{O}_{2}^{\mathrm{T}}\bm{O}_{1}^{-1}\bm{U}_{i}^{*}$, we have

$$\bm{S}_{u}^{r}=n^{-1}\sum_{i=1}^{n}\bm{O}_{2}^{\mathrm{T}}\bm{O}_{1}^{-1}(\bm{U}_{i}^{*}-\bar{\bm{U}}^{*})(\bm{U}_{i}^{*}-\bar{\bm{U}}^{*})^{\mathrm{T}}(\bm{O}_{1}^{-1})^{\mathrm{T}}\bm{O}_{2}=\bm{O}_{2}^{\mathrm{T}}\bm{I}_{K}\bm{O}_{2}=\bm{I}_{K},$$

and

$$p^{-1}\bm{W}^{r}\bm{\Sigma}_{e}^{-1}(\bm{W}^{r})^{\mathrm{T}}=p^{-1}\bm{O}_{2}^{-1}\bm{O}_{1}^{\mathrm{T}}\bm{W}^{*}\bm{\Sigma}_{e}^{-1}(\bm{W}^{*})^{\mathrm{T}}\bm{O}_{1}(\bm{O}_{2}^{-1})^{\mathrm{T}}=\bm{O}_{2}^{-1}\bm{O}_{2}\bm{\Lambda}\bm{O}_{2}^{\mathrm{T}}(\bm{O}_{2}^{-1})^{\mathrm{T}}=\bm{\Lambda},$$

is a diagonal matrix with distinct entries.

We let $\bm{O}=\bm{O}_{2}^{\mathrm{T}}\bm{O}_{1}^{-1}$, so accordingly the parameters for the transformed factor model are $\bm{U}^{r}=\bm{O}\bm{U}^{*}$ and $(\bm{W}^{r})^{\mathrm{T}}=(\bm{W}^{*})^{\mathrm{T}}\bm{O}^{-1}$. The relationship between the true model and the transformed model is as follows. The factor model structure corresponding to the true parameters is the same as that corresponding to the transformed parameters, since

$$\bm{X}=(\bm{W}^{r})^{\mathrm{T}}\bm{U}^{r}+\bm{E}=(\bm{W}^{*})^{\mathrm{T}}\bm{O}^{-1}\bm{O}\bm{U}^{*}+\bm{E}=(\bm{W}^{*})^{\mathrm{T}}\bm{U}^{*}+\bm{E}.$$
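The two-step construction of $\bm{O}_{1}$ and $\bm{O}_{2}$ can be reproduced numerically. The sketch below is an illustration under stated choices, not the authors' code: $\bm{O}_{1}$ is taken to be the Cholesky factor of $\bm{S}_{u}^{*}$ (one of many valid choices), $\bm{O}_{2}$ comes from a symmetric eigendecomposition, and the function name `identifiable_transform` is hypothetical. Rows of `U` hold $(\bm{U}_{i}^{*})^{\mathrm{T}}$ and `W` is $K\times p$.

```python
import numpy as np


def identifiable_transform(W, U, sigma_e_inv):
    """Return (W_r, U_r, O) satisfying the working identifiability condition."""
    n, _ = U.shape
    p = W.shape[1]
    Uc = U - U.mean(axis=0)
    S_u = Uc.T @ Uc / n
    # Step 1: S_u = O1 O1^T, hence O1^{-1} S_u (O1^{-1})^T = I_K.
    O1 = np.linalg.cholesky(S_u)
    # Step 2: eigendecomposition of the symmetric matrix p^{-1} O1^T W S^{-1} W^T O1.
    G = O1.T @ W @ sigma_e_inv @ W.T @ O1 / p
    _, O2 = np.linalg.eigh(G)           # G = O2 diag(eigenvalues) O2^T
    O = O2.T @ np.linalg.inv(O1)        # O = O2^T O1^{-1}
    W_r = O2.T @ O1.T @ W               # O2^{-1} = O2^T because O2 is orthogonal
    U_r = U @ O.T                       # row-wise U_i^r = O U_i
    return W_r, U_r, O


rng = np.random.default_rng(0)
n, p, K = 200, 12, 3
U = rng.standard_normal((n, K)) @ np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 2.0]])
W = rng.standard_normal((K, p))
sigma_e_inv = np.diag(1.0 / rng.uniform(0.5, 1.5, size=p))
W_r, U_r, O = identifiable_transform(W, U, sigma_e_inv)

Uc_r = U_r - U_r.mean(axis=0)
G_r = W_r @ sigma_e_inv @ W_r.T / p
print(np.allclose(Uc_r.T @ Uc_r / n, np.eye(K)))   # S_u^r = I_K
print(np.allclose(G_r, np.diag(np.diag(G_r))))     # diagonal after the transformation
print(np.allclose(U_r @ W_r, U @ W))               # common component W^T U is unchanged
```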

The generalized linear framework according to the rotated true parameters is

$$f(y)=\exp\left([y\{\theta^{r}D+(\bm{v}^{r})^{\mathrm{T}}\bm{Q}+(\bm{\beta}^{r})^{\mathrm{T}}\bm{U}^{r}\}-b\{\theta^{r}D+(\bm{v}^{r})^{\mathrm{T}}\bm{Q}+(\bm{\beta}^{r})^{\mathrm{T}}\bm{U}^{r}\}]/a(\phi)+c(y,\phi)\right),$$

whereas the framework according to the true parameters is

$$f(y)=\exp\left([y\{\theta^{*}D+(\bm{v}^{*})^{\mathrm{T}}\bm{Q}+(\bm{\beta}^{*})^{\mathrm{T}}\bm{U}^{*}\}-b\{\theta^{*}D+(\bm{v}^{*})^{\mathrm{T}}\bm{Q}+(\bm{\beta}^{*})^{\mathrm{T}}\bm{U}^{*}\}]/a(\phi)+c(y,\phi)\right).$$

With $\bm{\beta}^{r}=(\bm{O}^{-1})^{\mathrm{T}}\bm{\beta}^{*}$, $\theta^{r}=\theta^{*}$ and $\bm{v}^{r}=\bm{v}^{*}$, we have $(\bm{\beta}^{r})^{\mathrm{T}}\bm{U}^{r}=(\bm{\beta}^{*})^{\mathrm{T}}\bm{O}^{-1}\bm{O}\bm{U}^{*}=(\bm{\beta}^{*})^{\mathrm{T}}\bm{U}^{*}$, so the two frameworks are identical. That is, when the confounders are not identifiable, only the coefficient $\bm{\beta}$ is affected; the parameter of interest does not change, and thus the theoretical results on $\tilde{\theta}$ are not affected.

Appendix C. Estimation of the Dimension of Unmeasured Confounders

In the numerical studies of this paper, we use parallel analysis (Horn, 1965) to estimate the dimension of unmeasured confounders, because in factor analysis it is a popular approach to selecting the number of factors that is accurate and easy to use (Hayton et al., 2004; Costello and Osborne, 2005; Brown, 2015). In the existing literature, several authors have conducted extensive simulation studies to assess the performance of parallel analysis relative to other approaches, and have shown that parallel analysis has better numerical performance in terms of selecting $K$ than many existing approaches (Zwick and Velicer, 1986; Peres-Neto et al., 2005). Furthermore, parallel analysis is also a commonly used statistical tool for dimension reduction (Lin et al., 2016) and multiple testing dependence (Leek and Storey, 2008), and finds wide applications in other scientific disciplines including virology (Quadeer et al., 2014) and genetic studies (Leek and Storey, 2007).

The implementation of parallel analysis is as follows. Given the matrix $\bm{X}_{n\times p}=(D,\bm{Q}^{\mathrm{T}})^{\mathrm{T}}$, we denote the $p$ columns of the design matrix by $\bm{X}_{1},\dots,\bm{X}_{p}$. We then repeatedly generate matrices $\bm{X}_{\pi}$, where each matrix is obtained by randomly permuting the entries of every column $\bm{X}_{j}$ for $j=1,\dots,p$. Next, we select the first factor if the top singular value of $\bm{X}$ is larger than a certain percentile of the top singular values of the permuted matrices $\bm{X}_{\pi}$. If the first factor is selected, we repeat this procedure to determine whether the second factor can be selected. The process is repeated until no further factor is selected. The main intuition behind this approach is that the factor model can be viewed as the sum of a signal (the factors) and noise (the random error). The permutation destroys the original signal structure and turns the data into a matrix of random noise. Thus, identifying factors based on large singular values of $\bm{X}$ can be interpreted as selecting factors that are above the noise level.
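The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch and not the implementation used in the paper: the function name `parallel_analysis`, the column centering, the default of 50 permutations, the 95th percentile, and the choice to compare the $k$-th singular value of $\bm{X}$ with the $k$-th singular values of the permuted matrices are all assumptions made here.

```python
import numpy as np

def parallel_analysis(X, n_perm=50, percentile=95.0, max_factors=None, seed=0):
    """Permutation-based parallel analysis (illustrative sketch).

    Returns the number of leading singular values of X that exceed the given
    percentile of the corresponding singular values of column-permuted copies.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # centering is an assumption of this sketch
    n, p = X.shape
    if max_factors is None:
        max_factors = min(n, p) - 1

    sv = np.linalg.svd(X, compute_uv=False)

    # Singular values of matrices whose columns are permuted independently,
    # which keeps each column's marginal distribution but breaks the
    # dependence across columns.
    perm_sv = np.empty((n_perm, sv.size))
    for b in range(n_perm):
        X_perm = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        perm_sv[b] = np.linalg.svd(X_perm, compute_uv=False)
    threshold = np.percentile(perm_sv, percentile, axis=0)

    # Select factors one at a time and stop at the first failure.
    k = 0
    while k < max_factors and sv[k] > threshold[k]:
        k += 1
    return k
```

On data simulated from a factor model with a few strong factors, the returned value would be used in place of $\hat{K}$; the percentile plays the role of the "certain percentile" mentioned above.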

Besides parallel analysis, there are various other methods to estimate the dimension of unmeasured confounders: the scree plot (Cattell, 1966), which empirically chooses the elbow point in the plot of the eigenvalues in descending order; the cross-validation-based method (Owen and Wang, 2016), which uses random held-out matrices of data to choose the number of factors; methods based on information criteria, including AIC and BIC, to select the number of factors in high-dimensional factor models (Bai and Ng, 2002); and the eigenvalue ratio method (Lam and Yao, 2012; Ahn and Horenstein, 2013), which chooses $\hat{K}=\arg\max_{k\leq\mathcal{K}}\lambda_{k}(\bm{X}\bm{X}^{\mathrm{T}})/\lambda_{k+1}(\bm{X}\bm{X}^{\mathrm{T}})$, where $\lambda_{k}(\bm{X}\bm{X}^{\mathrm{T}})$ denotes the $k$-th largest eigenvalue of $\bm{X}\bm{X}^{\mathrm{T}}$ and $\mathcal{K}$ is a prespecified upper bound, which is often set to $\mathcal{K}=p/2$ in practice. Among these methods, the information-criteria-based method and the eigenvalue ratio method are guaranteed to be consistent under assumptions similar to Assumption 1 in our paper. These assumptions follow the common conditions in the theoretical analysis of approximate factor models, which allow weak correlation among the random errors. Beyond methods for linear factor models, Chen and Li (2021) proposed a joint-likelihood-based information criterion to determine the number of factors for generalized linear factor models. All these selection methods are well-established tools and can be used to select $K$. Our theoretical results hold as long as the dimension of unmeasured confounders is consistently estimated. In our manuscript, we use parallel analysis for its good empirical performance, as illustrated in Peres-Neto et al. (2005).
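For comparison, the eigenvalue ratio rule described above can be sketched as follows. The function name `eigenvalue_ratio` and the handling of the default upper bound are assumptions of this sketch; it uses the fact that the non-zero eigenvalues of $\bm{X}\bm{X}^{\mathrm{T}}$ are the squared singular values of $\bm{X}$.

```python
import numpy as np

def eigenvalue_ratio(X, k_max=None):
    """Eigenvalue ratio estimate of the number of factors (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if k_max is None:
        # Default upper bound p/2, capped so that eigenvalue k_max + 1 exists.
        k_max = max(1, min(p // 2, min(n, p) - 1))

    # Squared singular values of X, sorted in descending order.
    eig = np.linalg.svd(X, compute_uv=False) ** 2

    # ratios[k - 1] = lambda_k / lambda_{k+1} for k = 1, ..., k_max.
    ratios = eig[:k_max] / eig[1:k_max + 1]
    return int(np.argmax(ratios)) + 1
```

Unlike parallel analysis, this rule needs no resampling, but by construction it always returns at least one factor.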

To further investigate how the accuracy of the estimation of $K$ affects the asymptotic distribution of the debiased estimator, we conduct simulation studies in which all estimation settings are the same as in Section 5.1 of the paper, except that we manually replace the $\hat{K}$ estimated by parallel analysis with the specified values 2, 4, 5, and 10. Recall that the true value is $K=3$, so this simulation provides some insight into how overestimation ($\hat{K}=4,5$ and $10$) and underestimation ($\hat{K}=2$) of the unmeasured confounder dimension affect the inference results.

The results under the logistic regression model when (a) $n=500$ and $p$ varies from 30 to 1500, and (b) $p=600$ and $n$ varies from 100 to 1500, are presented in Fig. 6. The results under the linear model in the same regimes of $n$ and $p$ are presented in Fig. 7. From the results, we find that overestimation of $K$ appears not to affect the asymptotic normality results, whereas underestimation of $K$ can influence the asymptotic distribution of the debiased estimator and, in turn, the confidence interval estimation. Intuitively, as long as the corresponding linear combinations of the true underlying factors $\bm{U}$ in the considered models can be well approximated by those of the estimated $\hat{\bm{U}}$, the developed inference results for $\theta^{*}$ still hold.

Figure 6: Coverage and length of the confidence interval under logistic regression with prespecified $\hat{K}=2,4,5$ and 10, averaged over 300 replications, with varying $p$ and fixed $n=500$ (top), and with varying $n$ and fixed $p=600$ (bottom). Black dashed lines indicate the 0.95 level. Orange dashed lines represent the results at $\hat{K}=2$. Green dotted lines represent the results at $\hat{K}=4$. Blue two-dashed lines represent the results at $\hat{K}=5$. Purple solid lines represent the results at $\hat{K}=10$.
Figure 7: Coverage and length of the confidence interval under linear model with prespecified $\hat{K}=2,4,5$ and 10, averaged over 300 replications, with varying $p$ and fixed $n=500$ (top), and with varying $n$ and fixed $p=600$ (bottom). Black dashed lines indicate the 0.95 level. Orange dashed lines represent the results at $\hat{K}=2$. Green dotted lines represent the results at $\hat{K}=4$. Blue two-dashed lines represent the results at $\hat{K}=5$. Purple solid lines represent the results at $\hat{K}=10$.

Appendix D. Proof of Theorem 2

D.1 Estimation Consistency of 𝜼^\hat{\bm{\eta}}

Theorem 2 shows that the estimators $\hat{\bm{\eta}}$ and $\hat{\bm{w}}$ are consistent. Before proving these results, we first introduce some lemmas that will be used in the proofs.

Lemma 7 (Concentration of the Gradient and Hessian)

Under Assumptions 1 and 2 and the scaling condition $n,p\rightarrow\infty$, we have

(i) $\|\nabla l(\bm{\eta}^{*})\|_{\infty}=\|n^{-1}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}]\dot{\bm{Z}}_{i}\|_{\infty}=O_{p}\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}$;

(ii)
$$\begin{aligned}
&\|(\bm{\tau}^{*})^{\mathrm{T}}\nabla^{2}l(\bm{\eta}^{*})-E_{\bm{\eta}^{*}}\{(\bm{\tau}^{*})^{\mathrm{T}}\nabla^{2}l(\bm{\eta}^{*})\}\|_{\infty}\\
&\quad=\Big\|n^{-1}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{T}}-E_{\bm{\eta}^{*}}[(\bm{\tau}^{*})^{\mathrm{T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{T}}]\Big\|_{\infty}\\
&\quad=O_{p}\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}.
\end{aligned}$$

Lemma 7 establishes sub-exponential-type bounds on the gradient and on a linear combination of the Hessian of the loss function. It is motivated by Assumption 3.2 in Ning and Liu (2017). They construct the loss function based on the observed covariates only, whereas in our setting the gradient and the Hessian of the loss function involve not only the observed covariates but also the estimated unmeasured confounders.

Remark 8

As the decomposition techniques commonly used in linear models may not be applicable in generalized linear model settings, it is necessary to establish stronger and more general intermediate results as the foundation of our theoretical analysis. For instance, unlike in linear models, where average estimation consistency of the unmeasured confounders suffices, stronger uniform estimation consistency is necessary for the generalized linear framework. Specifically, in Fan et al. (2023), the linear model form and projection-based techniques enable the reduction of the gradient max-norm to $\|\hat{\bm{E}}^{\mathrm{T}}(\tilde{\bm{y}}-\hat{\bm{E}}\bm{\gamma}^{*})\|_{\infty}$, where $\bm{\gamma}^{*}=(\theta^{*},(\bm{v}^{*})^{\mathrm{T}})^{\mathrm{T}}$ and $\tilde{\bm{y}}=(\bm{I}_{n}-n^{-1}\hat{\bm{U}}\hat{\bm{U}}^{\mathrm{T}})\bm{y}$ is the residual of the response $\bm{y}$ after projecting it onto the column space of $\hat{\bm{U}}$. To show the concentration of the gradient, a key step is to upper bound $\|(\bm{U}^{*})^{\mathrm{T}}\hat{\bm{E}}\bm{\phi}\|_{\infty}$ with $\|\bm{\phi}\|_{2}=1$ by $\|(\hat{\bm{E}}-\bm{E})^{\mathrm{T}}(\bm{U}\bm{H}^{\mathrm{T}}-\hat{\bm{U}})\bm{H}^{-\mathrm{T}}\bm{\phi}\|_{\infty}$ and $\|\bm{E}^{\mathrm{T}}(\bm{U}\bm{H}^{\mathrm{T}}-\hat{\bm{U}})\bm{H}^{-\mathrm{T}}\bm{\phi}\|_{\infty}$, where they apply the Cauchy-Schwarz inequality and use the Frobenius norm of the estimation error of the unmeasured confounders, $\|\bm{U}\bm{H}^{\mathrm{T}}-\hat{\bm{U}}\|_{F}$.
Here $\bm{H}_{K\times K}$ is some transformation matrix and, with a slight abuse of notation, we use $\bm{E}\in\mathbb{R}^{n\times p}$, $\bm{U}\in\mathbb{R}^{n\times K}$ and $\bm{y}\in\mathbb{R}^{n}$ to denote the matrix (vector) forms of the random errors, unmeasured confounders, and responses. In the generalized linear model, however, to bound the gradient max-norm $\|n^{-1}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}]\dot{\bm{Z}}_{i}\|_{\infty}$, we instead apply the Bernstein inequality, which requires a uniform estimation bound on the unmeasured confounders, that is, a bound on $\max_{i}\|\hat{\bm{U}}_{i}-\bm{U}_{i}\|_{\infty}$. We leave the detailed proof of Lemma 7 to Appendix G.1.

We next obtain an upper bound for the estimation error of $\hat{\bm{\eta}}$. To this end, we define the first-order approximation to the loss difference as $D(\hat{\bm{\eta}},\bm{\eta}^{*})=(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})\}$, and we derive upper and lower bounds for $D(\hat{\bm{\eta}},\bm{\eta}^{*})$. As we will show, the $\ell_{2}$ norm $\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{2}$ appears in both bounds; combining them gives an inequality in $\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{2}$ and thus a bound on the estimation error of $\hat{\bm{\eta}}$.

Upper Bound for $D(\hat{\bm{\eta}},\bm{\eta}^{*})$: recall that $\hat{\bm{\eta}}^{\mathrm{T}}=(\hat{\bm{\gamma}}^{\mathrm{T}},\hat{\bm{\beta}}^{\mathrm{T}})$ is a solution to the convex optimization problem in (A10). Thus, we have the optimality conditions $-\nabla l(\hat{\bm{\gamma}})=\lambda\bm{\varrho}$ and $\nabla l(\hat{\bm{\beta}})=0$, where $\bm{\varrho}$ is a $p\times 1$ vector with entries

$$\bm{\varrho}_{j}=\begin{cases}\operatorname{sign}(\hat{\gamma}_{j}), & \hat{\gamma}_{j}\neq 0,\\ {[-1,1]}, & \hat{\gamma}_{j}=0,\end{cases}\qquad(j=1,\ldots,p).\tag{A14}$$

This implies $\|\nabla l(\hat{\bm{\eta}})\|_{\infty}=\|n^{-1}\sum_{i=1}^{n}\dot{\bm{Z}}_{i}\{y_{i}-b^{\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\}\|_{\infty}\leq\lambda$.

Recall that we denote the support set of $\bm{\eta}^{*}$ by $\mathcal{S}_{\eta}=\{j:\eta_{j}^{*}\neq 0\}$. We denote the difference between the estimator $\hat{\bm{\eta}}$ and the true parameter $\bm{\eta}^{*}$ by $\hat{\Delta}=\hat{\bm{\eta}}-\bm{\eta}^{*}$, with sub-vectors $\hat{\Delta}_{\mathcal{S}}=(\hat{\eta}_{j}-\eta_{j}^{*}:j\in\mathcal{S}_{\eta})$ and $\hat{\Delta}_{\bar{\mathcal{S}}}=(\hat{\eta}_{j}-\eta_{j}^{*}:j\notin\mathcal{S}_{\eta})$. Similarly, we denote the sub-vector of $\dot{\bm{Z}}_{i}$ corresponding to the non-zero entries by $\dot{\bm{Z}}_{i,\mathcal{S}}=\{\dot{\bm{Z}}_{ij}:j\in\mathcal{S}_{\eta}\}$ and that corresponding to the zero entries by $\dot{\bm{Z}}_{i,\bar{\mathcal{S}}}=\{\dot{\bm{Z}}_{ij}:j\notin\mathcal{S}_{\eta}\}$. With this notation, the quadratic difference $D(\hat{\bm{\eta}},\bm{\eta}^{*})$ can be written as

$$\begin{aligned}
D(\hat{\bm{\eta}},\bm{\eta}^{*}) &= (\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})\}\\
&= -\frac{1}{n}\sum_{i=1}^{n}\hat{\Delta}_{\mathcal{S}}^{\mathrm{T}}\dot{\bm{Z}}_{i,\mathcal{S}}\{y_{i}-b^{\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\}-\frac{1}{n}\sum_{i=1}^{n}\hat{\Delta}_{\bar{\mathcal{S}}}^{\mathrm{T}}\dot{\bm{Z}}_{i,\bar{\mathcal{S}}}\{y_{i}-b^{\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\}-\hat{\Delta}^{\mathrm{T}}\nabla l(\bm{\eta}^{*})\\
&\leq \lambda\|\hat{\Delta}_{\mathcal{S}}\|_{1}-\lambda\|\hat{\Delta}_{\bar{\mathcal{S}}}\|_{1}+\|\hat{\Delta}\|_{1}\|\nabla l(\bm{\eta}^{*})\|_{\infty},
\end{aligned}$$

where the last inequality follows from the optimality conditions in (A14) and Hölder's inequality.

From Lemma 7, we have $\|\nabla l(\bm{\eta}^{*})\|_{\infty}\lesssim n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$. Since $\|\hat{\Delta}\|_{1}\leq\|\hat{\Delta}_{\mathcal{S}}\|_{1}+\|\hat{\Delta}_{\bar{\mathcal{S}}}\|_{1}$, by taking $\lambda=2c\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}$, where $c>0$ is some constant, we further have

$$\begin{aligned}
D(\hat{\bm{\eta}},\bm{\eta}^{*}) &\leq c\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)(3\|\hat{\Delta}_{\mathcal{S}}\|_{1}-\|\hat{\Delta}_{\bar{\mathcal{S}}}\|_{1})\\
&\leq 3cs_{\eta}^{1/2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\|\hat{\Delta}\|_{2},
\end{aligned}\tag{A15}$$

where the last inequality holds because $\|\hat{\Delta}_{\mathcal{S}}\|_{1}\leq s_{\eta}^{1/2}\|\hat{\Delta}_{\mathcal{S}}\|_{2}\leq s_{\eta}^{1/2}\|\hat{\Delta}\|_{2}$.
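The constant 3 in the first line of (A15) can be traced with a short calculation. The shorthand $r_{n,p}$ is introduced only for this sketch, and we assume that the constant hidden in the bound of Lemma 7 is the same $c$ that appears in the choice of $\lambda$.

```latex
\begin{aligned}
r_{n,p} &:= \sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}},
\qquad \lambda = 2c\,r_{n,p},
\qquad \|\nabla l(\bm{\eta}^{*})\|_{\infty}\leq c\,r_{n,p},\\
D(\hat{\bm{\eta}},\bm{\eta}^{*})
&\leq 2c\,r_{n,p}\left(\|\hat{\Delta}_{\mathcal{S}}\|_{1}-\|\hat{\Delta}_{\bar{\mathcal{S}}}\|_{1}\right)
 + c\,r_{n,p}\left(\|\hat{\Delta}_{\mathcal{S}}\|_{1}+\|\hat{\Delta}_{\bar{\mathcal{S}}}\|_{1}\right)\\
&= c\,r_{n,p}\left(3\|\hat{\Delta}_{\mathcal{S}}\|_{1}-\|\hat{\Delta}_{\bar{\mathcal{S}}}\|_{1}\right).
\end{aligned}
```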

Lower Bound for $D(\hat{\bm{\eta}},\bm{\eta}^{*})$: after establishing the upper bound for the quadratic difference, we next obtain a lower bound for it. Because $l(\bm{\eta})$ is a convex function, $D(\hat{\bm{\eta}},\bm{\eta}^{*})\geq 0$. Combined with the first line of (A15), this gives $\|\hat{\Delta}_{\bar{\mathcal{S}}}\|_{1}\leq 3\|\hat{\Delta}_{\mathcal{S}}\|_{1}$. Based on this result and the restricted strong convexity condition for generalized linear models in Proposition 1 of Loh and Wainwright (2015),

$$D(\hat{\bm{\eta}},\bm{\eta}^{*})\geq\kappa_{2}\|\hat{\Delta}\|_{2}^{2},\tag{A16}$$

for some constant $\kappa_{2}>0$.

Then, combining the upper bound (A15) and the lower bound (A16) for $D(\hat{\bm{\eta}},\bm{\eta}^{*})$, we have

$$\|\hat{\Delta}\|_{2}\leq\frac{3cs_{\eta}^{1/2}}{\kappa_{2}}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right),\tag{A17}$$

and so

$$\begin{aligned}
\|\hat{\Delta}\|_{1} &\leq \|\hat{\Delta}_{\mathcal{S}}\|_{1}+\|\hat{\Delta}_{\bar{\mathcal{S}}}\|_{1}\leq 4\|\hat{\Delta}_{\mathcal{S}}\|_{1}\leq 4s_{\eta}^{1/2}\|\hat{\Delta}_{\mathcal{S}}\|_{2}\leq 4s_{\eta}^{1/2}\|\hat{\Delta}\|_{2}\\
&\leq \frac{12cs_{\eta}}{\kappa_{2}}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right),
\end{aligned}$$

which completes the first part of the proof.
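To get a feel for how the two components of the rate interact, the bounds above can be evaluated numerically. The sketch below is purely illustrative: the function names are hypothetical, and the values of $c$, $\kappa_{2}$ and $s_{\eta}$ are placeholders, since the proof only asserts that suitable constants exist.

```python
import math

def rate(n, p):
    """r(n, p) = sqrt(log p / n) + sqrt(log n / p), the rate in Lemma 7."""
    return math.sqrt(math.log(p) / n) + math.sqrt(math.log(n) / p)

def eta_error_bounds(n, p, s_eta, c=1.0, kappa2=1.0):
    """Return (lambda, l2 bound from (A17), l1 bound) for placeholder constants."""
    r = rate(n, p)
    lam = 2.0 * c * r                            # tuning parameter used in (A15)
    l2 = 3.0 * c * math.sqrt(s_eta) * r / kappa2  # bound (A17)
    l1 = 12.0 * c * s_eta * r / kappa2            # l1 bound following (A17)
    return lam, l2, l1

if __name__ == "__main__":
    for n, p in [(500, 30), (500, 600), (500, 1500)]:
        lam, l2, l1 = eta_error_bounds(n, p, s_eta=5)
        print(f"n={n}, p={p}: lambda={lam:.3f}, l2={l2:.3f}, l1={l1:.3f}")
```

The second term, $\sqrt{\log n/p}$, dominates when $p$ is small relative to $n$, which is why the bounds require both $n$ and $p$ to grow.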

D.2 Estimation Consistency of 𝒘^\hat{\bm{w}}

Before we present the proof of the estimation consistency of $\hat{\bm{w}}$, we introduce additional results that build on the estimation consistency of $\hat{\bm{\eta}}$ and are used in proving the estimation consistency of $\hat{\bm{w}}$.

Lemma 9

Under Assumptions 1 and 2, with $\lambda\asymp\lambda^{\prime}\asymp n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$ and $(s_{\eta}\vee s_{w})(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})=o_{p}(1)$, we have

$$\frac{1}{n}\sum_{i=1}^{n}(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}^{\mathrm{T}}(\hat{\bm{\eta}}-\bm{\eta}^{*})=O_{p}\left\{s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\},\tag{A18}$$

$$\frac{1}{n}\sum_{i=1}^{n}(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{M}}_{i}^{\mathrm{T}}(\hat{\bm{w}}-\bm{w}^{*})=O_{p}\left\{(s_{\eta}\vee s_{w})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\},\tag{A19}$$

$$\frac{1}{n}\sum_{i=1}^{n}(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\dot{\bm{M}}_{i}^{\mathrm{T}}(\hat{\bm{w}}-\bm{w}^{*})=O_{p}\left\{(s_{\eta}\vee s_{w})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\}.\tag{A20}$$

Remark 10

For Lemma 9, we prove the results (A19) and (A20) in the course of proving our main result on the estimation consistency of $\hat{\bm{w}}$, as the two bounds are direct consequences of certain intermediate steps. We leave the proof of (A18) to Appendix G.2.

To prove the consistency of $\hat{\bm{w}}$, we denote $\hat{\bm{\delta}}=\hat{\bm{w}}-\bm{w}^{*}$ and establish the estimation error bound for $\|\hat{\bm{\delta}}\|_{1}$. Recall that $\hat{\bm{w}}$ is a solution to (A11), that is,

$$\hat{\bm{w}}=\underset{\bm{w}}{\operatorname{argmin}}~\frac{1}{2n}\sum_{i=1}^{n}\{\bm{w}^{\mathrm{T}}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\dot{\bm{M}}_{i}\dot{\bm{M}}_{i}^{\mathrm{T}}\bm{w}-2\bm{w}^{\mathrm{T}}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})D_{i}\dot{\bm{M}}_{i}\}+\lambda^{\prime}\|\bm{w}\|_{1}.$$
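Up to an additive constant that does not depend on $\bm{w}$, the objective above is a weighted lasso problem, since the smooth part equals $(2n)^{-1}\sum_{i}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})(D_{i}-\bm{w}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}$ minus a term free of $\bm{w}$. The sketch below solves it by proximal gradient descent (ISTA); this solver, the function names, and the fixed iteration count are assumptions made for illustration and are not necessarily what the authors use. The argument `weights` stands for the values $b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})$.

```python
import numpy as np

def soft_threshold(z, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_w(M, D, weights, lam, n_iter=2000):
    """Minimise (1/2n) sum_i weights_i {(w'M_i)^2 - 2 (w'M_i) D_i} + lam * ||w||_1.

    M: (n, q) array with rows M_i; D: (n,) array; weights: (n,) positive array.
    Assumes M'WM is not identically zero, so the step size is well defined.
    """
    M = np.asarray(M, dtype=float)
    D = np.asarray(D, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n, q = M.shape

    WM = M * weights[:, None]
    H = M.T @ WM / n       # Hessian of the smooth part
    g0 = WM.T @ D / n      # linear term of the smooth part
    step = 1.0 / np.linalg.eigvalsh(H)[-1]  # 1 / largest eigenvalue of H

    w = np.zeros(q)
    for _ in range(n_iter):
        grad = H @ w - g0  # gradient of the smooth part at w
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

At a solution, every coordinate of the smooth-part gradient is bounded by $\lambda^{\prime}$ in absolute value, which is the optimality condition that drives the basic inequality used next.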

By definition, we have

$$\begin{aligned}
&\frac{1}{2n}\sum_{i=1}^{n}\{\hat{\bm{w}}^{\mathrm{T}}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\dot{\bm{M}}_{i}\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{w}}-2\hat{\bm{w}}^{\mathrm{T}}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})D_{i}\dot{\bm{M}}_{i}\}+\lambda^{\prime}\|\hat{\bm{w}}\|_{1}\\
&\quad\leq\frac{1}{2n}\sum_{i=1}^{n}\{(\bm{w}^{*})^{\mathrm{T}}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\dot{\bm{M}}_{i}\dot{\bm{M}}_{i}^{\mathrm{T}}\bm{w}^{*}-2(\bm{w}^{*})^{\mathrm{T}}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})D_{i}\dot{\bm{M}}_{i}\}+\lambda^{\prime}\|\bm{w}^{*}\|_{1}.
\end{aligned}$$

Writing $(\hat{\bm{\delta}}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}=(\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}-\{(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}^{2}-2\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i}\dot{\bm{M}}_{i}^{\mathrm{T}}\bm{w}^{*}+2\{(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}^{2}$, the above inequality can be rearranged as

$$\frac{1}{2n}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})(\hat{\bm{\delta}}^{\mathrm{T}}\dot{\bm{M}}_{i})^{2}\leq\frac{1}{n}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\{D_{i}-(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}}+\lambda^{\prime}\|\bm{w}^{*}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}\|_{1}.\tag{A21}$$
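The rearrangement leading to (A21) can be spelled out as follows. The shorthand $\omega_{i}$, $u_{i}$ and $v_{i}$ is introduced only for this sketch and does not appear elsewhere in the paper.

```latex
\begin{aligned}
&\text{Let } \omega_{i}=b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}),\quad
u_{i}=\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i},\quad
v_{i}=(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i},\quad
\text{so that } u_{i}-v_{i}=\hat{\bm{\delta}}^{\mathrm{T}}\dot{\bm{M}}_{i}.\\[4pt]
&\text{Basic inequality:}\quad
\frac{1}{2n}\sum_{i=1}^{n}\omega_{i}(u_{i}^{2}-v_{i}^{2})
\leq\frac{1}{n}\sum_{i=1}^{n}\omega_{i}D_{i}(u_{i}-v_{i})
+\lambda^{\prime}\|\bm{w}^{*}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}\|_{1}.\\[4pt]
&\text{Identity:}\quad
u_{i}^{2}-v_{i}^{2}=(u_{i}-v_{i})^{2}+2v_{i}(u_{i}-v_{i}).\\[4pt]
&\text{Hence:}\quad
\frac{1}{2n}\sum_{i=1}^{n}\omega_{i}(u_{i}-v_{i})^{2}
\leq\frac{1}{n}\sum_{i=1}^{n}\omega_{i}(D_{i}-v_{i})(u_{i}-v_{i})
+\lambda^{\prime}\|\bm{w}^{*}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}\|_{1}.
\end{aligned}
```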

The proof techniques are mostly motivated by Ning and Liu (2017). To present our proof, we define two quadratic difference terms, $Q(\hat{\bm{w}},\bm{w}^{*})=(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\nabla^{2}l(\hat{\bm{\eta}})(\hat{\bm{w}}-\bm{w}^{*})=n^{-1}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})(\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}})^{2}$ and $Q^{*}(\hat{\bm{w}},\bm{w}^{*})=(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\nabla^{2}l(\bm{\eta}^{*})(\hat{\bm{w}}-\bm{w}^{*})=n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}(\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}})^{2}$. The left-hand side of (A21) equals $Q(\hat{\bm{w}},\bm{w}^{*})/2$, and we next investigate the upper bound for $Q(\hat{\bm{w}},\bm{w}^{*})$ in detail.

For the right-hand side of (A21), we first consider $\lambda^{\prime}\|\bm{w}^{*}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}\|_{1}$. Recall that we denote the support set of $\bm{w}^{*}$ by $\mathcal{S}_{w}=\{j:\bm{w}_{j}^{*}\neq 0\}$. We also denote $\bm{w}^{*}_{\mathcal{S}}=(\bm{w}^{*}_{j}:j\in\mathcal{S}_{w})$, $\bm{w}^{*}_{\bar{\mathcal{S}}}=(\bm{w}^{*}_{j}:j\notin\mathcal{S}_{w})$, $\hat{\bm{\delta}}_{\mathcal{S}}=(\hat{\bm{w}}_{j}-\bm{w}_{j}^{*}:j\in\mathcal{S}_{w})$ and $\hat{\bm{\delta}}_{\bar{\mathcal{S}}}=(\hat{\bm{w}}_{j}-\bm{w}_{j}^{*}:j\notin\mathcal{S}_{w})$, so that $\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}=\|\hat{\bm{w}}_{\bar{\mathcal{S}}}\|_{1}$ and $\|\bm{w}^{*}_{\bar{\mathcal{S}}}\|_{1}=0$. Therefore, by the triangle inequality, we have

$$\begin{aligned}
\lambda^{\prime}\|\bm{w}^{*}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}\|_{1} &= \lambda^{\prime}\|\bm{w}_{\mathcal{S}}^{*}\|_{1}+\lambda^{\prime}\|\bm{w}_{\bar{\mathcal{S}}}^{*}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}_{\mathcal{S}}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}_{\bar{\mathcal{S}}}\|_{1}\\
&= \lambda^{\prime}\|\bm{w}_{\mathcal{S}}^{*}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}_{\mathcal{S}}\|_{1}-\lambda^{\prime}\|\hat{\bm{w}}_{\bar{\mathcal{S}}}\|_{1}\\
&\leq \lambda^{\prime}\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}-\lambda^{\prime}\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}.
\end{aligned}\tag{A22}$$

By Lagrangian duality theory, an equivalent problem for (A11) is

$$\hat{\bm{w}}=\underset{\bm{w}}{\operatorname{argmin}}~\|\bm{w}\|_{1}~\text{ s.t. }~\frac{1}{2n}\sum_{i=1}^{n}\{\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}\bm{\zeta}}\ell_{i}(\hat{\bm{\eta}})\bm{w}-2\bm{w}^{\mathrm{T}}\nabla_{\bm{\zeta}\theta}\ell_{i}(\hat{\bm{\eta}})\}\leq b^{2},$$

for $b>0$. This gives $\|\hat{\bm{w}}\|_{1}\leq\|\bm{w}^{*}\|_{1}$, which together with (A22) further yields $\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}\leq\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}$.

We next consider the first term on the right-hand side of (A21). Denote $I_{1}=n^{-1}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\{D_{i}-(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}}$, which we decompose into two terms, $I_{11}$ and $I_{12}$:

$$\begin{aligned}
I_{1} &= \frac{1}{n}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{D_{i}-(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}}\\
&\quad+\frac{1}{n}\sum_{i=1}^{n}[b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}]\{D_{i}-(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}}\\
&= I_{11}+I_{12}.
\end{aligned}$$

For $I_{11}$, using an argument similar to that in the proof of Lemma 7, we have $\|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{D_{i}-(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}\dot{\bm{M}}_{i}^{\mathrm{T}}\|_{\infty}\lesssim n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$, hence

$$\begin{aligned}
|I_{11}| &\leq \|\hat{\bm{\delta}}\|_{1}\Big\|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{D_{i}-(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}\dot{\bm{M}}_{i}^{\mathrm{T}}\Big\|_{\infty}\\
&\lesssim \left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\|\hat{\bm{\delta}}\|_{1}.
\end{aligned}\tag{A23}$$

For $I_{12}$, by Assumption 2 that $|b^{\prime\prime}(t_{1})-b^{\prime\prime}(t)|\leq B|t_{1}-t|b^{\prime\prime}(t)$, with $t_{1}=\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}$ and $t=(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}$, and by applying the Cauchy-Schwarz inequality, we have

$$\begin{aligned}
|I_{12}| &\leq \left|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}|\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}-(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}|\{D_{i}-(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}}\right|\\
&\leq \left|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}(\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}})^{2}\right|^{1/2}\left|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}^{2}\right|^{1/2}\\
&\lesssim |Q^{*}(\hat{\bm{w}},\bm{w}^{*})|^{1/2}s_{\eta}^{1/2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right),
\end{aligned}\tag{A24}$$

where the last inequality follows from (A18) in Lemma 9.

Combining the upper bounds in (A22), (A23) and (A24) with (A21), we have

$$\begin{aligned}
Q(\hat{\bm{w}},\bm{w}^{*}) &\lesssim \left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\|\hat{\bm{\delta}}\|_{1}\\
&\quad+|Q^{*}(\hat{\bm{w}},\bm{w}^{*})|^{1/2}s_{\eta}^{1/2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\\
&\quad+\lambda^{\prime}\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}-\lambda^{\prime}\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}.
\end{aligned}$$

The above upper bound for $Q(\hat{\bm{w}},\bm{w}^{*})$ involves $Q^{*}(\hat{\bm{w}},\bm{w}^{*})$, so we next investigate the relation between $Q(\hat{\bm{w}},\bm{w}^{*})$ and $Q^{*}(\hat{\bm{w}},\bm{w}^{*})$. Applying Assumption 2 to the difference between $Q(\hat{\bm{w}},\bm{w}^{*})$ and $Q^{*}(\hat{\bm{w}},\bm{w}^{*})$, as we did when bounding the term $I_{12}$, we have

$$\begin{aligned}
|Q(\hat{\bm{w}},\bm{w}^{*})-Q^{*}(\hat{\bm{w}},\bm{w}^{*})| &= \left|\frac{1}{n}\sum_{i=1}^{n}[b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}](\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}})^{2}\right|\\
&\leq \left|\frac{1}{n}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}(\dot{\bm{M}}_{i}^{\mathrm{T}}\hat{\bm{\delta}})^{2}(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\right|\\
&\leq Q^{*}(\hat{\bm{w}},\bm{w}^{*})\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}\max_{i=1,\ldots,n}\|\dot{\bm{Z}}_{i}\|_{\infty}\\
&\lesssim Q^{*}(\hat{\bm{w}},\bm{w}^{*})s_{\eta}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\left(M+\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right),
\end{aligned}$$

where the last inequality follows from the estimation consistency results for $\hat{\bm{\eta}}$ in Appendix D.1, from Proposition 1, which gives $\|\dot{\bm{Z}}_{i}\|_{\infty}=M+O_{p}(n^{-1/2}+p^{-1/2}(\log n)^{1/2})$, and from $s_{\eta}(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})=o_{p}(1)$. From the above result, we further apply the triangle inequality and have

$$Q^{*}(\hat{\bm{w}},\bm{w}^{*})\left\{1-s_{\eta}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\left(M+\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right)\right\}\lesssim Q(\hat{\bm{w}},\bm{w}^{*}).\tag{A26}$$

Combining the above result on $Q^{*}(\hat{\bm{w}},\bm{w}^{*})$ with the upper bound on $Q(\hat{\bm{w}},\bm{w}^{*})$ obtained above from (A21), (A22), (A23) and (A24), and taking $\lambda^{\prime}=2\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}$, we get

$$\begin{aligned}
Q(\hat{\bm{w}},\bm{w}^{*}) &\lesssim |Q(\hat{\bm{w}},\bm{w}^{*})|^{1/2}s_{\eta}^{1/2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\\
&\quad+\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)(3\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}-\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}).
\end{aligned}\tag{A27}$$

The above result will be used to prove (A20), which will later be used to derive the error bound for $\|\hat{\bm{\delta}}\|_{1}$. Consider the following two cases.

Case 1: $|Q(\hat{\bm{w}},\bm{w}^{*})|^{1/2}\lesssim s_{\eta}^{1/2}(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})$. Then (A20) holds directly.

Case 2: $|Q(\hat{\bm{w}},\bm{w}^{*})|^{1/2}\gtrsim s_{\eta}^{1/2}(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})$. Then we have

$$Q(\hat{\bm{w}},\bm{w}^{*})-|Q(\hat{\bm{w}},\bm{w}^{*})|^{1/2}s_{\eta}^{1/2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)>0.$$

Combining this result with (A27), we have $3\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}\geq\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}$. We next use this cone condition to derive a lower bound for $Q(\hat{\bm{w}},\bm{w}^{*})$ in Case 2.

Denote $\tilde{\bm{U}}_{i}^{\mathrm{T}}=(\bm{0}_{p-1},\hat{\bm{U}}_{i}^{\mathrm{T}}-\bm{U}_{i}^{\mathrm{T}})_{p+K-1}$, a vector consisting of $p-1$ zeros followed by the differences between the estimated and the true unmeasured confounders. Hence we have $\dot{\bm{M}}_{i}=\bm{M}_{i}+\tilde{\bm{U}}_{i}$. We first establish a lower bound for $Q(\hat{\bm{w}},\bm{w}^{*})/\|\hat{\bm{\delta}}\|_{2}^{2}$:

\begin{align}
\frac{Q(\hat{\bm{w}},\bm{w}^{*})}{\|\hat{\bm{\delta}}\|_{2}^{2}} &= \frac{\hat{\bm{\delta}}^{\mathrm{T}}\{n^{-1}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\dot{\bm{M}}_{i}\dot{\bm{M}}_{i}^{\mathrm{T}}\}\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}} \tag{A28}\\
&= \frac{\hat{\bm{\delta}}^{\mathrm{T}}\{n^{-1}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}\}\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}+\frac{2\hat{\bm{\delta}}^{\mathrm{T}}\{n^{-1}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}\}\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}} \notag\\
&\quad+\frac{\hat{\bm{\delta}}^{\mathrm{T}}\{n^{-1}\sum_{i=1}^{n}b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\tilde{\bm{U}}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}\}\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}} \notag\\
&= Q_{1}+Q_{2}+Q_{3}. \notag
\end{align}

We next bound the three terms in (A28) separately. For $Q_{1}$, we have

\begin{equation}
Q_{1}=\frac{\hat{\bm{\delta}}^{\mathrm{T}}[n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}]\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}+\frac{1}{n}\sum_{i=1}^{n}\frac{(\bm{M}_{i}^{\mathrm{T}}\hat{\bm{\delta}})^{2}}{\|\hat{\bm{\delta}}\|_{2}^{2}}[b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}]. \tag{A29}
\end{equation}

For the second term in (A29), Assumption 2 gives $|b^{\prime\prime}(t_{1})-b^{\prime\prime}(t)|\leq B|t_{1}-t|b^{\prime\prime}(t)$. Letting $t_{1}=\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}$ and $t=(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}$ and applying Hölder's inequality, we have

\begin{align}
&\max_{i=1,\ldots,n}|b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}| \tag{A30}\\
&\quad\leq \max_{i=1,\ldots,n}Bb^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}|\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}-(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}| \notag\\
&\quad\leq \max_{i=1,\ldots,n}Bb^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}|\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}-(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}+(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}-(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}| \notag\\
&\quad\leq \max_{i=1,\ldots,n}Bb^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}|(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}+(\bm{\eta}^{*})^{\mathrm{T}}(\dot{\bm{Z}}_{i}-\bm{Z}_{i})| \notag\\
&\quad\leq \max_{i=1,\ldots,n}Bb^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}(\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}\|\dot{\bm{Z}}_{i}\|_{\infty}+\|\bm{\eta}^{*}\|_{1}\|\dot{\bm{Z}}_{i}-\bm{Z}_{i}\|_{\infty}). \notag
\end{align}

From Assumption 2, $b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\in[0,B]$. From the estimation consistency of $\hat{\bm{\eta}}$ in Appendix D.1, we have $\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}\lesssim s_{\eta}(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})$. In addition, we have $\|\dot{\bm{Z}}_{i}\|_{\infty}=M+O_{p}(n^{-1/2}+p^{-1/2}(\log n)^{1/2})$. Based on these results and the scaling condition that $n,p\rightarrow\infty$ and $s_{\eta}(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})=o_{p}(1)$, we have

\begin{equation}
\max_{i=1,\ldots,n}|b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}|=o_{p}(1). \tag{A31}
\end{equation}

Hence we have

\begin{equation}
Q_{1}\geq\frac{\hat{\bm{\delta}}^{\mathrm{T}}[n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}]\hat{\bm{\delta}}}{4\|\hat{\bm{\delta}}\|_{2}^{2}} \tag{A32}
\end{equation}

with probability tending to 1, as the second term in (A29) goes to 0. Recall that we denote $\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}=E[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}]$. We have

\begin{align*}
&\frac{\hat{\bm{\delta}}^{\mathrm{T}}[n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}]\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}\\
&\quad= \frac{\hat{\bm{\delta}}^{\mathrm{T}}\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}+\frac{\hat{\bm{\delta}}^{\mathrm{T}}[n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}]\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}\\
&\quad\geq \lambda_{\min}(\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*})-\left|\frac{\hat{\bm{\delta}}^{\mathrm{T}}[n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}]\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}\right|\\
&\quad\geq \kappa-\frac{\|\hat{\bm{\delta}}\|_{1}^{2}\|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}\|_{\max}}{\|\hat{\bm{\delta}}\|_{2}^{2}},
\end{align*}

where the last inequality uses the eigenvalue bound $\lambda_{\min}(\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*})\geq\kappa$ and the elementary inequality $|\bm{a}^{\mathrm{T}}\bm{A}\bm{a}|\leq\|\bm{a}\|_{1}^{2}\|\bm{A}\|_{\max}$. Recall that we have obtained the cone condition $3\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}\geq\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}$, which results in $\|\hat{\bm{\delta}}\|_{1}^{2}\leq 16\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}^{2}\leq 16s_{w}\|\hat{\bm{\delta}}\|_{2}^{2}$. Moreover, using Bernstein's inequality, it can be shown that $\|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}\|_{\max}=O_{p}\{n^{-1/2}(\log p)^{1/2}\}$. As $(s_{\eta}\vee s_{w})(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})=o_{p}(1)$, the term $\|\hat{\bm{\delta}}\|_{1}^{2}\|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\bm{M}_{i}^{\mathrm{T}}-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}\|_{\max}/\|\hat{\bm{\delta}}\|_{2}^{2}$ is $o_{p}(1)$, which together with (A32) shows that $Q_{1}\geq\kappa/4$ with probability tending to 1.
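The passage from the cone condition to $\|\hat{\bm{\delta}}\|_{1}^{2}\leq 16s_{w}\|\hat{\bm{\delta}}\|_{2}^{2}$ can be checked numerically. The following is a minimal Python sketch and is not part of the original proof; the dimension, the support size, the random vectors and the helper name `cone_ratio` are illustrative assumptions.

```python
import numpy as np

def cone_ratio(delta, s):
    """Return ||delta||_1^2 / (s * ||delta||_2^2); the claim is that this is <= 16."""
    return np.sum(np.abs(delta)) ** 2 / (s * np.sum(delta ** 2))

rng = np.random.default_rng(0)
dim, s = 200, 5          # hypothetical ambient dimension and support size
worst = 0.0
for _ in range(1000):
    delta = np.zeros(dim)
    delta[:s] = rng.normal(size=s)                 # entries on the support S
    off = np.abs(rng.normal(size=dim - s))         # entries off the support
    # Rescale so that ||delta_Sbar||_1 = c * 3 * ||delta_S||_1 with c in [0, 1),
    # which enforces the cone condition 3 ||delta_S||_1 >= ||delta_Sbar||_1.
    off *= rng.uniform() * 3 * np.sum(np.abs(delta[:s])) / np.sum(off)
    delta[s:] = off
    worst = max(worst, cone_ratio(delta, s))

print("largest observed ratio:", worst)
```

The observed ratio never exceeds 16 because $\|\hat{\bm{\delta}}\|_{1}\leq 4\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}\leq 4s_{w}^{1/2}\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{2}\leq 4s_{w}^{1/2}\|\hat{\bm{\delta}}\|_{2}$ on the cone.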

For $Q_{2}$, we can write it as

\begin{equation}
Q_{2}=\frac{2\hat{\bm{\delta}}^{\mathrm{T}}[n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}]\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}+\frac{2}{n}\sum_{i=1}^{n}\frac{\hat{\bm{\delta}}^{\mathrm{T}}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}[b^{\prime\prime}(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}]. \tag{A33}
\end{equation}

Based on (A31) and

\[
\frac{\hat{\bm{\delta}}^{\mathrm{T}}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}\hat{\bm{\delta}}}{\|\hat{\bm{\delta}}\|_{2}^{2}}\leq\frac{\|\hat{\bm{\delta}}\|_{1}^{2}\|\bm{M}_{i}\|_{\infty}\|\tilde{\bm{U}}_{i}\|_{\infty}}{\|\hat{\bm{\delta}}\|_{2}^{2}}\lesssim s_{w}\left(\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right),
\]

and under the scaling condition $(s_{\eta}\vee s_{w})(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})=o_{p}(1)$, the second term in (A33) goes to 0 and thus

\begin{align}
Q_{2} &\geq \frac{\hat{\bm{\delta}}^{\mathrm{T}}[n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}]\hat{\bm{\delta}}}{2\|\hat{\bm{\delta}}\|_{2}^{2}} \tag{A34}\\
&\geq \frac{\lambda_{\min}(\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*})}{2}-\left|\frac{\hat{\bm{\delta}}^{\mathrm{T}}[n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}]\hat{\bm{\delta}}}{2\|\hat{\bm{\delta}}\|_{2}^{2}}\right| \notag\\
&\geq \frac{\kappa}{2}-\frac{\|\hat{\bm{\delta}}\|_{1}^{2}\|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}\|_{\max}}{\|\hat{\bm{\delta}}\|_{2}^{2}}. \notag
\end{align}

Moreover, we have

\begin{align}
&\|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}\|_{\max} \tag{A35}\\
&\quad\leq \|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}-\mathbb{E}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}]\|_{\max} \notag\\
&\qquad+\|\mathbb{E}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}]-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}\|_{\max} \notag\\
&\quad\lesssim \sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}, \notag
\end{align}

where the last inequality follows from techniques similar to those in the proof of Lemma 7, condition (ii), which give $\|n^{-1}\sum_{i=1}^{n}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}-\mathbb{E}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}]\|_{\max}=O_{p}\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}$, and from Proposition 1, which gives $\|\mathbb{E}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}\bm{M}_{i}\tilde{\bm{U}}_{i}^{\mathrm{T}}]-\bm{I}_{\bm{\zeta}\bm{\zeta}}^{*}\|_{\max}=O_{p}\{n^{-1/2}+p^{-1/2}(\log n)^{1/2}\}$. Because $\|\hat{\bm{\delta}}\|_{1}^{2}\leq 16s_{w}\|\hat{\bm{\delta}}\|_{2}^{2}$, under the scaling condition $(s_{\eta}\vee s_{w})\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}=o_{p}(1)$, combining (A34) and (A35) gives $Q_{2}\geq\kappa/2$ with probability tending to 1. Similarly, we have $Q_{3}\geq\kappa/4$.

Summarizing the above results, we have $Q_{1}+Q_{2}+Q_{3}\geq\kappa$ with probability tending to 1 as $n,p\rightarrow\infty$ and $(s_{\eta}\vee s_{w})\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}=o_{p}(1)$. Substituting this bound on $Q_{1}$, $Q_{2}$ and $Q_{3}$ into (A28), and because $\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}^{2}\leq s_{w}\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{2}^{2}\leq s_{w}\|\hat{\bm{\delta}}\|_{2}^{2}$, we have

\begin{equation}
Q(\hat{\bm{w}},\bm{w}^{*})\geq\kappa s_{w}^{-1}\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}^{2}. \tag{A36}
\end{equation}

Recall that we have proved (A27), which can be written as

\begin{align*}
&Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}\left\{Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}-s_{\eta}^{1/2}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}\right\}\\
&\quad\lesssim \left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}(3\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}-\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1})\\
&\quad\lesssim 3\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}.
\end{align*}

From (A36), we have $\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}^{2}\leq s_{w}\kappa^{-1}Q(\hat{\bm{w}},\bm{w}^{*})$ with probability tending to 1. Substituting this result into the above inequality, we have

\begin{align*}
&Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}\left\{Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}-s_{\eta}^{1/2}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}\right\}\\
&\quad\lesssim \frac{3s_{w}^{1/2}}{\kappa^{1/2}}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}.
\end{align*}

Cancelling the term $Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}$, we have

\[
Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}-s_{\eta}^{1/2}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}\lesssim\frac{3s_{w}^{1/2}}{\kappa^{1/2}}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2},
\]

which gives the following upper bound for $Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}$:

\begin{equation}
Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}\lesssim(s_{w}\vee s_{\eta})^{1/2}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}. \tag{A37}
\end{equation}

This completes the proof of (A20) in Case 2, so (A20) holds in both cases. In both Case 1 and Case 2, replacing $Q(\hat{\bm{w}},\bm{w}^{*})$ with $Q^{*}(\hat{\bm{w}},\bm{w}^{*})$ by means of (A26), we get (A19):

\[
Q^{*}(\hat{\bm{w}},\bm{w}^{*})=O_{p}\left\{(s_{w}\vee s_{\eta})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\}.
\]
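For readers who want the constants made explicit, the cancellation step that leads to (A37) can be spelled out as follows. This is an illustrative sketch only: the shorthand $r_{n,p}$ and the constant $C$, which stands for the constant hidden in $\lesssim$, are introduced here and do not appear in the original argument.

```latex
\begin{align*}
r_{n,p} &:= \left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2},\\
Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}
 &\leq s_{\eta}^{1/2}\,r_{n,p}+\frac{3C\,s_{w}^{1/2}}{\kappa^{1/2}}\,r_{n,p}
 \leq \left(1+\frac{3C}{\kappa^{1/2}}\right)(s_{w}\vee s_{\eta})^{1/2}\,r_{n,p}.
\end{align*}
```

Squaring both sides gives $Q(\hat{\bm{w}},\bm{w}^{*})\lesssim(s_{w}\vee s_{\eta})\,r_{n,p}^{2}$, which is the rate stated for $Q^{*}(\hat{\bm{w}},\bm{w}^{*})$ up to constants.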

Finally, to prove the estimation consistency of $\hat{\bm{w}}$, or equivalently, to derive the error bound for $\|\hat{\bm{\delta}}\|_{1}$, we again consider two situations. If the cone condition $6\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}\geq\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}$ holds, we apply techniques similar to those in Case 2 to derive (A36). Therefore

\begin{align*}
\|\hat{\bm{\delta}}\|_{1}\leq\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}+\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}\leq 7\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1} &\lesssim s_{w}^{1/2}Q(\hat{\bm{w}},\bm{w}^{*})^{1/2}\\
&\lesssim s_{w}^{1/2}(s_{\eta}\vee s_{w})^{1/2}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}\\
&\lesssim (s_{\eta}\vee s_{w})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}.
\end{align*}

Otherwise, we have $6\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}\leq\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}$. From (A27), we have

\begin{align*}
Q(\hat{\bm{w}},\bm{w}^{*}) &\lesssim |Q(\hat{\bm{w}},\bm{w}^{*})|^{1/2}s_{\eta}^{1/2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\\
&\quad+\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)(3\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}-\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1})\\
&\lesssim |Q(\hat{\bm{w}},\bm{w}^{*})|^{1/2}s_{\eta}^{1/2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)-\frac{1}{2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1},
\end{align*}

which together with $6\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}\leq\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}$ gives

\begin{align*}
\|\hat{\bm{\delta}}\|_{1} &\leq \frac{7}{6}\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}\\
&\leq \frac{7}{3}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)^{-1}\left\{|Q(\hat{\bm{w}},\bm{w}^{*})|^{1/2}s_{\eta}^{1/2}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)-Q(\hat{\bm{w}},\bm{w}^{*})\right\}\\
&\lesssim (s_{\eta}\vee s_{w})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2},
\end{align*}

where the last inequality follows from (A20).

In conclusion, whether $6\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}\geq\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}$ or $6\|\hat{\bm{\delta}}_{\mathcal{S}}\|_{1}\leq\|\hat{\bm{\delta}}_{\bar{\mathcal{S}}}\|_{1}$, we have

\[
\|\hat{\bm{\delta}}\|_{1}=O_{p}\left\{(s_{w}\vee s_{\eta})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)^{1/2}\right\}.
\]

Appendix E. Proof of Theorem 3

Throughout the rest of the appendix, we denote

\[
\hat{\bm{H}}=(\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\hat{\bm{W}}^{\mathrm{T}})^{-1},\quad\hat{\bm{H}}_{p}=p\times\hat{\bm{H}}=(p^{-1}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\hat{\bm{W}}^{\mathrm{T}})^{-1}.
\]

From Corollary S.1 in Bai and Li (2016), we have $\hat{\bm{H}}_{p}=O_{p}(1)$ and $\hat{\bm{H}}=O_{p}(p^{-1})$. These results play an important role in the following proofs. We next introduce a few technical lemmas used in the proof of Theorem 3.
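To make the scaling of $\hat{\bm{H}}$ and $\hat{\bm{H}}_{p}$ concrete, the following minimal Python sketch computes both matrices for simulated inputs. It is an illustration under stated assumptions, not the authors' implementation: the loadings are drawn as i.i.d. standard normal, $\hat{\bm{\Sigma}}_{e}$ is taken to be diagonal with entries in $[0.5, 2]$, and the helper name `h_matrices` is hypothetical.

```python
import numpy as np

def h_matrices(W, sigma2):
    """W: K x p loading matrix; sigma2: length-p vector of noise variances
    (the diagonal of Sigma_e). Returns (H, H_p) with H_p = p * H."""
    p = W.shape[1]
    A = (W / sigma2) @ W.T          # W Sigma_e^{-1} W^T, a K x K matrix
    H = np.linalg.inv(A)
    return H, p * H

rng = np.random.default_rng(1)
K = 3
norms = {}
for p in (100, 1000, 10000):
    W = rng.normal(size=(K, p))               # simulated loadings (assumption)
    sigma2 = rng.uniform(0.5, 2.0, size=p)    # simulated noise variances (assumption)
    H, H_p = h_matrices(W, sigma2)
    # Spectral norms: H shrinks like 1/p while H_p stays of constant order.
    norms[p] = (np.linalg.norm(H, 2), np.linalg.norm(H_p, 2))

for p, (h, hp) in norms.items():
    print(p, h, hp)
```

In this simulated setting the spectral norm of `H` decays roughly like $1/p$ while that of `H_p` stays of constant order, in line with $\hat{\bm{H}}=O_{p}(p^{-1})$ and $\hat{\bm{H}}_{p}=O_{p}(1)$.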

Lemma 11 (Smoothness of loss function)

Suppose that Assumptions 1–2 hold. With $\lambda\asymp\lambda^{\prime}\asymp n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$ and $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, the following conditions hold.

(iii) $(\bm{\tau}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})-\nabla^{2}l(\bm{\eta}^{*})(\hat{\bm{\eta}}-\bm{\eta}^{*})\}=o_{p}(n^{-1/2})$;

(iv) $(\hat{\bm{\tau}}-\bm{\tau}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})\}=o_{p}(n^{-1/2})$.

Lemma 11 shows that the loss function is smooth, in the sense that it is twice differentiable around the true parameter value. The conditions hold for quadratic loss functions, and also for non-quadratic loss functions provided that the function $b(\cdot)$ is properly constrained.

Lemma 12 (Central limit theorem of score function)

Under Assumptions 1–2, with $\lambda\asymp\lambda^{\prime}\asymp n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$ and $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, it holds that

\begin{equation}
n^{1/2}(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\bm{\eta}^{*})(I_{\theta\mid\bm{\zeta}}^{*})^{-1/2}\rightarrow_{d}N(0,1). \tag{A38}
\end{equation}

Lemma 12 implies that a linear combination of the components of the gradient of the loss function, that is, of the score function, is asymptotically normal.

Lemma 13 (Partial information estimator consistency)

Suppose that Assumptions 1–2 hold. With $\lambda\asymp\lambda^{\prime}\asymp n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}$, if $n,p\rightarrow\infty$ and $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, then the estimator of the partial information, $\hat{I}_{\theta\mid\bm{\zeta}}=\nabla_{\theta\theta}^{2}l(\hat{\bm{\eta}})-\hat{\bm{w}}^{\mathrm{T}}\nabla_{\bm{\zeta}\theta}^{2}l(\hat{\bm{\eta}})$, is consistent:

\[
\hat{I}_{\theta\mid\bm{\zeta}}-I_{\theta\mid\bm{\zeta}}^{*}=o_{p}(1).
\]

With these results, we now prove the asymptotic normality of the debiased estimator, which generalizes the proof of Theorem 3.2 in Ning and Liu (2017) to the setting with unmeasured confounders.

The goal is to show that the debiased estimator $\tilde{\theta}=\hat{\theta}-\hat{I}_{\theta\mid\bm{\zeta}}^{-1}\hat{S}(\hat{\bm{\eta}})$ is asymptotically normal. First, (A38) holds by Lemma 12. Therefore, it suffices to show that

\[
n^{1/2}|(\tilde{\theta}-\theta^{*})(I_{\theta\mid\bm{\zeta}}^{*})^{1/2}+(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\bm{\eta}^{*})(I_{\theta\mid\bm{\zeta}}^{*})^{-1/2}|=o_{p}(1),
\]

which is equivalent to showing

\[
n^{1/2}|(\tilde{\theta}-\theta^{*})I_{\theta\mid\bm{\zeta}}^{*}+(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\bm{\eta}^{*})|=o_{p}(1),
\]

since $I_{\theta\mid\bm{\zeta}}^{*}$ is a constant. Note that $\hat{S}(\hat{\bm{\eta}})=\hat{\bm{\tau}}^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})$ by the definition of the estimated decorrelated score function. We next decompose the left-hand side of the above expression and apply the triangle inequality:

\begin{align*}
&n^{1/2}|\{\hat{\theta}-\hat{I}_{\theta\mid\bm{\zeta}}^{-1}\hat{S}(\hat{\bm{\eta}})-\theta^{*}\}I_{\theta\mid\bm{\zeta}}^{*}+(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\bm{\eta}^{*})|\\
&\quad= n^{1/2}|(\hat{\theta}-\theta^{*})I_{\theta\mid\bm{\zeta}}^{*}-(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})+(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})-\hat{\bm{\tau}}^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})+\hat{\bm{\tau}}^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})\\
&\qquad-I_{\theta\mid\bm{\zeta}}^{*}\hat{I}_{\theta\mid\bm{\zeta}}^{-1}\hat{\bm{\tau}}^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})+(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\bm{\eta}^{*})|\\
&\quad\leq n^{1/2}|(\hat{\theta}-\theta^{*})I_{\theta\mid\bm{\zeta}}^{*}-(\bm{\tau}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})\}|\\
&\qquad+n^{1/2}|(\hat{\bm{\tau}}-\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})|+n^{1/2}|(I_{\theta\mid\bm{\zeta}}^{*}\hat{I}_{\theta\mid\bm{\zeta}}^{-1}-1)\hat{\bm{\tau}}^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})|\\
&\quad=: I_{1}+I_{2}+I_{3}.
\end{align*}

By Lemma 11, condition (iii), we bound $I_{1}$ as

\begin{align*}
I_{1} &= n^{1/2}|(\hat{\theta}-\theta^{*})I_{\theta\mid\bm{\zeta}}^{*}-(\bm{\tau}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})\}|\\
&\leq n^{1/2}|(\hat{\theta}-\theta^{*})I_{\theta\mid\bm{\zeta}}^{*}-(\bm{\tau}^{*})^{\mathrm{T}}\nabla^{2}l(\bm{\eta}^{*})(\hat{\bm{\eta}}-\bm{\eta}^{*})|+o_{p}(1)\\
&\leq n^{1/2}|(\hat{\theta}-\theta^{*})I_{\theta\mid\bm{\zeta}}^{*}-\{\nabla_{\theta\theta}^{2}l(\bm{\eta}^{*})-(\bm{w}^{*})^{\mathrm{T}}\nabla_{\bm{\zeta}\theta}^{2}l(\bm{\eta}^{*})\}(\hat{\theta}-\theta^{*})\\
&\qquad-\{\nabla_{\theta\bm{\zeta}}^{2}l(\bm{\eta}^{*})-(\bm{w}^{*})^{\mathrm{T}}\nabla_{\bm{\zeta}\bm{\zeta}}^{2}l(\bm{\eta}^{*})\}(\hat{\bm{\zeta}}-\bm{\zeta}^{*})|+o_{p}(1)\\
&\leq n^{1/2}\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}\|T\|_{\infty}+o_{p}(1),
\end{align*}

where the last inequality follows from Hölder's inequality with $T=\{I_{\theta\mid\bm{\zeta}}^{*}-\nabla_{\theta\theta}^{2}l(\bm{\eta}^{*})+(\bm{w}^{*})^{\mathrm{T}}\nabla_{\bm{\zeta}\theta}^{2}l(\bm{\eta}^{*}),\ \nabla_{\theta\bm{\zeta}}^{2}l(\bm{\eta}^{*})-(\bm{w}^{*})^{\mathrm{T}}\nabla_{\bm{\zeta}\bm{\zeta}}^{2}l(\bm{\eta}^{*})\}^{\mathrm{T}}$. As a consequence of Lemma 7, condition (ii), we have $\|T\|_{\infty}=O_{p}\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}$. In addition, Theorem 2 gives $\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}=O_{p}\{s_{\eta}(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})\}$. Hence, under the scaling condition $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, we have

\begin{equation}
I_{1}\lesssim\sqrt{n}\,s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)+o_{p}(1)=o_{p}(1). \tag{A39}
\end{equation}
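The rate in (A39) matches the scaling condition once the factor $\sqrt{n}$ is distributed over the two terms. The short computation below is an illustrative sketch; the shorthand $a_{n}$ and $b_{n}$ is introduced here only for convenience.

```latex
\begin{align*}
a_{n}&:=n^{-1/2}(\log p)^{1/2},\qquad b_{n}:=p^{-1/2}(\log n)^{1/2},\\
(a_{n}+b_{n})^{2}&\leq 2(a_{n}^{2}+b_{n}^{2})=2\left(\frac{\log p}{n}+\frac{\log n}{p}\right),\\
\sqrt{n}\,s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)&=s_{\eta}\left(n^{-1/2}\log p+p^{-1}n^{1/2}\log n\right).
\end{align*}
```

The second line bounds the product of the two $O_{p}$ rates for $\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}$ and $\|T\|_{\infty}$, and the right-hand side of the last line is $o_{p}(1)$ under the stated scaling condition.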

By Lemma 11, condition (iv), we bound $I_{2}$ as

\begin{align}
I_{2} &= n^{1/2}|(\hat{\bm{\tau}}-\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})| \tag{A40}\\
&\leq n^{1/2}|(\hat{\bm{\tau}}-\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\bm{\eta}^{*})|+o_{p}(1) \notag\\
&\leq n^{1/2}\|\hat{\bm{\tau}}-\bm{\tau}^{*}\|_{1}\|\nabla l(\bm{\eta}^{*})\|_{\infty}+o_{p}(1). \notag
\end{align}

From Theorem 2, we have $\|\hat{\bm{\tau}}-\bm{\tau}^{*}\|_{1}=\|\hat{\bm{w}}-\bm{w}^{*}\|_{1}=O_{p}\{(s_{w}\vee s_{\eta})(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})\}$. From Lemma 7, condition (i), we have $\|\nabla l(\bm{\eta}^{*})\|_{\infty}=O_{p}\{n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2}\}$. Hence, under the condition $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, we have

\begin{equation}
I_{2}\lesssim\sqrt{n}(s_{w}\vee s_{\eta})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)+o_{p}(1)=o_{p}(1). \tag{A41}
\end{equation}

For $I_{3}$, since $\hat{I}_{\theta\mid\bm{\zeta}}-I_{\theta\mid\bm{\zeta}}^{*}=o_{p}(1)$ from Lemma 13, we have $I_{\theta\mid\bm{\zeta}}^{*}\hat{I}_{\theta\mid\bm{\zeta}}^{-1}=O_{p}(1)$. From (A40) and because $\bm{\tau}^{*}$ is fixed, $I_{2}=o_{p}(1)$ implies that $n^{1/2}|\hat{\bm{\tau}}^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})|=o_{p}(1)$. Hence we have

\begin{equation}
I_{3}=n^{1/2}|(I_{\theta\mid\bm{\zeta}}^{*}\hat{I}_{\theta\mid\bm{\zeta}}^{-1}-1)\hat{\bm{\tau}}^{\mathrm{T}}\nabla l(\hat{\bm{\eta}})|=o_{p}(1). \tag{A42}
\end{equation}

Combining (A39), (A41) and (A42), we obtain

\[
n^{1/2}|(\tilde{\theta}-\theta^{*})I_{\theta\mid\bm{\zeta}}^{*}+(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\bm{\eta}^{*})|\leq I_{1}+I_{2}+I_{3}=o_{p}(1),
\]

which completes the proof.
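The one-step update $\tilde{\theta}=\hat{\theta}-\hat{I}_{\theta\mid\bm{\zeta}}^{-1}\hat{S}(\hat{\bm{\eta}})$ analysed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the loss is taken to be the negative GLM log-likelihood $l(\bm{\eta})=-n^{-1}\sum_{i}\{y_{i}\bm{\eta}^{\mathrm{T}}\bm{Z}_{i}-b(\bm{\eta}^{\mathrm{T}}\bm{Z}_{i})\}$, the nuisance design `Q` stands in for the remaining covariates together with the estimated confounders, and the names `one_step_debias`, `D`, `Q`, `b1` and `b2` are hypothetical.

```python
import numpy as np

def one_step_debias(theta_hat, zeta_hat, w, D, Q, y, b1, b2):
    """One-step debiased estimate of theta.
    D: (n,) covariate of interest; Q: (n, q) nuisance covariates;
    b1, b2: first and second derivatives of the cumulant function b."""
    n = len(y)
    lin = theta_hat * D + Q @ zeta_hat
    resid = b1(lin) - y                     # derivative of the loss in the linear predictor
    g_theta = np.mean(resid * D)            # gradient with respect to theta
    g_zeta = Q.T @ resid / n                # gradient with respect to zeta
    score = g_theta - w @ g_zeta            # decorrelated score, tau = (1, -w)
    wt = b2(lin)
    info = np.mean(wt * D * D) - w @ (Q.T @ (wt * D) / n)   # partial information
    return theta_hat - score / info

# Sanity check with a linear model, b(t) = t^2 / 2, where the exact sample
# projection w makes the one-step estimate coincide with least squares.
rng = np.random.default_rng(2)
n, q = 500, 4
Q = rng.normal(size=(n, q))
D = 0.3 * (Q @ rng.normal(size=q)) + rng.normal(size=n)
y = 1.5 * D + Q @ rng.normal(size=q) + rng.normal(size=n)
w = np.linalg.solve(Q.T @ Q, Q.T @ D)
theta_tilde = one_step_debias(0.0, np.zeros(q), w, D, Q, y, lambda t: t, np.ones_like)
theta_ols = np.linalg.lstsq(np.column_stack([D, Q]), y, rcond=None)[0][0]
print(theta_tilde, theta_ols)
```

In the quadratic case the decorrelated score removes the dependence on the nuisance estimate exactly, so the update recovers the least-squares coefficient from any starting point; for a general $b(\cdot)$ the same code applies with sparse estimates of $\hat{\bm{\eta}}$ and $\hat{\bm{w}}$, and the correction holds only asymptotically.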

Appendix F. Proof of Proposition 1

To prove Proposition 1, we show that the estimated unmeasured confounders are bounded with convergence guarantees. From the expression of the estimator of the unmeasured confounders in (4) of the main text, and recalling that $\hat{\bm{H}}=(\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\hat{\bm{W}}^{\mathrm{T}})^{-1}$ and $\hat{\bm{H}}_{p}=p\times\hat{\bm{H}}=(p^{-1}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\hat{\bm{W}}^{\mathrm{T}})^{-1}$, we apply the triangle inequality to obtain

\begin{align*}
\max_{i=1,\ldots,n}\|\hat{\bm{U}}_{i}-\bm{U}_{i}\|_{\infty} &= \max_{i=1,\ldots,n}\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\{(\bm{W}^{*}-\hat{\bm{W}})^{\mathrm{T}}\bm{U}_{i}+\bm{E}_{i}\}\|_{\infty}\\
&\leq \max_{i=1,\ldots,n}\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{W}^{*}-\hat{\bm{W}})^{\mathrm{T}}\bm{U}_{i}\|_{\infty}+\max_{i=1,\ldots,n}\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\bm{E}_{i}\|_{\infty}.
\end{align*}

For the first term in (LABEL:eq:max_uhat), we have

\begin{align*}
\max_{i=1,\ldots,n}\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{W}^{*}-\hat{\bm{W}})^{\mathrm{T}}\bm{U}_{i}\|_{\infty} &\leq \|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{W}^{*}-\hat{\bm{W}})^{\mathrm{T}}\|_{1,\infty}\max_{i=1,\ldots,n}\|\bm{U}_{i}\|_{1}\\
&\leq \sqrt{K}\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{W}^{*}-\hat{\bm{W}})^{\mathrm{T}}\|_{F}\max_{i=1,\ldots,n}K\|\bm{U}_{i}\|_{\infty}.
\end{align*}

From Assumption 2, we have $\|\bm{U}_{i}\|_{\infty}\leq M$ for all $i=1,\dots,n$ and some constant $M>0$. For the norm $\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{W}^{*}-\hat{\bm{W}})^{\mathrm{\scriptscriptstyle T}}\|_{F}$, we apply the Cauchy-Schwarz inequality to bound the matrix norm:

$$
\begin{aligned}
\left\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{W}^{*}-\hat{\bm{W}})^{\mathrm{\scriptscriptstyle T}}\right\|_{F}
&=\left\|\hat{\bm{H}}_{p}\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}\hat{\bm{W}}_{j}(\bm{W}_{j}^{*}-\hat{\bm{W}}_{j})^{\mathrm{\scriptscriptstyle T}}\right\|_{F}\\
&\leq\|\hat{\bm{H}}_{p}\|_{F}\left(\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}\|\hat{\bm{W}}_{j}\|_{2}^{2}\right)^{1/2}\left(\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}\|\bm{W}_{j}^{*}-\hat{\bm{W}}_{j}\|_{2}^{2}\right)^{1/2}\\
&=O_{p}\left(\frac{1}{\sqrt{n}}+\frac{1}{p}\right),
\end{aligned}
$$

where the last equality follows because $\|\hat{\bm{H}}_{p}\|_{F}=O_{p}(1)$ from Corollary S.1(b), $(p^{-1}\sum_{j=1}^{p}\hat{\sigma}_{j}^{-2}\|\hat{\bm{W}}_{j}\|_{2}^{2})^{1/2}=O_{p}(1)$ from Corollary S.1(a), and $(p^{-1}\sum_{j=1}^{p}\hat{\sigma}_{j}^{-2}\|\bm{W}_{j}^{*}-\hat{\bm{W}}_{j}\|_{2}^{2})^{1/2}=O_{p}(n^{-1/2}+p^{-1})$ from Proposition 1 in Bai and Li (2016). Therefore

$$\max_{i=1,\ldots,n}\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}(\bm{W}^{*}-\hat{\bm{W}})^{\mathrm{\scriptscriptstyle T}}\bm{U}_{i}\|_{\infty}=O_{p}\left(\frac{1}{\sqrt{n}}+\frac{1}{p}\right). \tag{A44}$$

For the second term in the same decomposition of $\max_{i=1,\ldots,n}\|\hat{\bm{U}}_{i}-\bm{U}_{i}\|_{\infty}$, we have

$$
\begin{aligned}
\max_{i=1,\ldots,n}\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\bm{E}_{i}\|_{\infty}
&\leq\|\hat{\bm{H}}_{p}\|_{1,\infty}\max_{i=1,\ldots,n}\|p^{-1}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\bm{E}_{i}\|_{1}\\
&\leq\sqrt{K}\|\hat{\bm{H}}_{p}\|_{F}\max_{i=1,\ldots,n}K\|p^{-1}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\bm{E}_{i}\|_{\infty}.
\end{aligned}
\tag{A45}
$$

For $p^{-1}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\bm{E}_{i}$, we express the $K$-dimensional vector as

$$
\begin{aligned}
\frac{1}{p}\sum_{j=1}^{p}\hat{\sigma}_{j}^{-2}\hat{\bm{W}}_{j}E_{ij}
&=\frac{1}{p}\sum_{j=1}^{p}(\sigma_{j}^{*})^{-2}\bm{W}_{j}^{*}E_{ij}+\frac{1}{p}\sum_{j=1}^{p}\left\{\frac{1}{\hat{\sigma}_{j}^{2}}-\frac{1}{(\sigma_{j}^{*})^{2}}\right\}\bm{W}_{j}^{*}E_{ij}\\
&\quad+\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}(\hat{\bm{W}}_{j}-\bm{W}_{j}^{*})E_{ij}\\
&=:R_{1}+R_{2}+R_{3}.
\end{aligned}
\tag{A46}
$$

We next bound the term $R_{1}=p^{-1}\sum_{j=1}^{p}(\sigma_{j}^{*})^{-2}\bm{W}_{j}^{*}E_{ij}$ and then show that the other two terms, $R_{2}$ and $R_{3}$, are of smaller order and dominated by $R_{1}$. From Assumption 2, we have $\|\bm{U}_{i}\|_{\infty}\leq M$, $\|\bm{W}_{j}^{*}\|_{2}\leq C$ and $\|\bm{X}_{i}\|_{\infty}\leq M$. Applying the triangle inequality, we have

$$
\begin{aligned}
\|\bm{E}_{i}\|_{\infty}
&=\|\bm{X}_{i}-(\bm{W}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{U}_{i}\|_{\infty}\\
&\leq\|\bm{X}_{i}\|_{\infty}+\|(\bm{W}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{U}_{i}\|_{\infty}\\
&\leq M+\max_{j=1,\ldots,p}\|\bm{W}_{j}^{*}\|_{1}\|\bm{U}_{i}\|_{1}\\
&\leq M+\max_{j=1,\ldots,p}\sqrt{K}\|\bm{W}_{j}^{*}\|_{2}K\|\bm{U}_{i}\|_{\infty}\\
&\leq M(1+CK^{3/2}),
\end{aligned}
$$

where the last three inequalities follow from Hölder-type bounds on inner products and the standard comparisons $\|\bm{v}\|_{1}\leq\sqrt{K}\|\bm{v}\|_{2}$ and $\|\bm{v}\|_{1}\leq K\|\bm{v}\|_{\infty}$ for $\bm{v}\in\mathbb{R}^{K}$.
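The chain of norm inequalities above can be checked numerically. The following is a minimal sketch, assuming illustrative values of $K$, $p$, $M$ and $C$ that are not taken from the paper; the function name `norm_chain` is hypothetical. It evaluates $\|(\bm{W}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{U}_{i}\|_{\infty}$ and the two intermediate bounds for one random draw.

```python
import math
import random

def norm_chain(W, U):
    """Return (lhs, l1_bound, l2_bound) for the chain bounding ||W^T U||_inf.

    W is a list of p loading vectors W_j (each of length K); U has length K.
    """
    K = len(U)
    # ||W^T U||_inf = max_j |W_j^T U|
    lhs = max(abs(sum(w * u for w, u in zip(Wj, U))) for Wj in W)
    # max_j ||W_j||_1 * ||U||_1
    l1_bound = max(sum(abs(w) for w in Wj) for Wj in W) * sum(abs(u) for u in U)
    # max_j sqrt(K) ||W_j||_2 * K ||U||_inf
    l2_bound = (max(math.sqrt(K) * math.sqrt(sum(w * w for w in Wj)) for Wj in W)
                * K * max(abs(u) for u in U))
    return lhs, l1_bound, l2_bound

rng = random.Random(0)
K, p, M, C = 3, 50, 1.0, 2.0   # illustrative values, not taken from the paper
# Each coordinate lies in [-C/sqrt(K), C/sqrt(K)], so ||W_j||_2 <= C.
W = [[rng.uniform(-1, 1) * C / math.sqrt(K) for _ in range(K)] for _ in range(p)]
U = [rng.uniform(-M, M) for _ in range(K)]   # ||U||_inf <= M
lhs, l1_bound, l2_bound = norm_chain(W, U)
final_bound = C * K ** 1.5 * M               # the term M * C * K^{3/2}
print(lhs, l1_bound, l2_bound, final_bound)
```

The printed values should be non-decreasing from left to right, mirroring the order of the inequalities; the check covers a single draw and is not a substitute for the argument.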

Since $|E_{ij}|\leq M(1+CK^{3/2})$, the variable $E_{ij}$ is bounded and hence sub-exponential. Combining this bound with $|W_{jk}^{*}|\leq\|\bm{W}_{j}^{*}\|_{\infty}\leq\|\bm{W}_{j}^{*}\|_{2}\leq C$ and $C^{-2}\leq(\sigma_{j}^{*})^{-2}\leq C^{2}$, we have $|(\sigma_{j}^{*})^{-2}W_{jk}^{*}E_{ij}|\leq MC^{3}(1+CK^{3/2})$ for $j=1,\dots,p$ and $k=1,\dots,K$. Hence $(\sigma_{j}^{*})^{-2}W_{jk}^{*}E_{ij}$ is a sub-Gaussian, and thus sub-exponential, random variable. As $E_{ij}$ has mean zero, so does $(\sigma_{j}^{*})^{-2}W_{jk}^{*}E_{ij}$, and by the Bernstein inequality,

$$
\begin{aligned}
&P\left(\left|\frac{1}{p}\sum_{j=1}^{p}(\sigma_{j}^{*})^{-2}W_{jk}^{*}E_{ij}\right|\geq t\right)\\
&\leq 2\exp\left[-C^{\prime\prime}\min\left\{\frac{t^{2}}{M^{2}C^{6}(1+CK^{3/2})^{2}},\frac{t}{MC^{3}(1+CK^{3/2})}\right\}p\right].
\end{aligned}
$$

Applying the union bound over $i=1,\ldots,n$ and $k=1,\ldots,K$, we have

$$
\begin{aligned}
&P\left(\max_{i=1,\ldots,n}\left\|\frac{1}{p}\sum_{j=1}^{p}(\sigma_{j}^{*})^{-2}\bm{W}_{j}^{*}E_{ij}\right\|_{\infty}\geq t\right)\\
&\leq 2nK\exp\left[-C^{\prime\prime}\min\left\{\frac{t^{2}}{M^{2}C^{6}(1+CK^{3/2})^{2}},\frac{t}{MC^{3}(1+CK^{3/2})}\right\}p\right],
\end{aligned}
$$

where $C^{\prime\prime}>0$ is a constant. At $t=MC^{3}(1+CK^{3/2})p^{-1/2}(\log n)^{1/2}$, the inequality

$$\max_{i=1,\ldots,n}\left\|\frac{1}{p}\sum_{j=1}^{p}(\sigma_{j}^{*})^{-2}\bm{W}_{j}^{*}E_{ij}\right\|_{\infty}\leq MC^{3}(1+CK^{3/2})\left(\frac{\log n}{p}\right)^{1/2} \tag{A47}$$

holds with probability at least $1-n^{-1}$.
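To see how the union bound behaves at a threshold of order $(\log n/p)^{1/2}$, the right-hand side of the tail bound can be evaluated directly. The sketch below is an illustration under stated assumptions: the function names, the multiplier `a` on the threshold, and the value `c = 1.0` standing in for the unspecified constant $C^{\prime\prime}$ are all hypothetical, and `scale` plays the role of $MC^{3}(1+CK^{3/2})$.

```python
import math

def union_bernstein_bound(t, scale, p, n, K, c=1.0):
    """Evaluate 2 n K exp[-c * min{(t/scale)^2, t/scale} * p].

    c stands in for the unspecified constant C''; its value here is assumed.
    """
    r = t / scale
    return 2.0 * n * K * math.exp(-c * min(r * r, r) * p)

def threshold(scale, p, n, a=1.0):
    """Threshold t = a * scale * sqrt(log n / p); a is a hypothetical multiplier."""
    return a * scale * math.sqrt(math.log(n) / p)

n, p, K, scale, c = 1000, 5000, 3, 2.0, 1.0   # illustrative values only
for a in (1.0, 2.0):
    t = threshold(scale, p, n, a)
    print(a, t, union_bernstein_bound(t, scale, p, n, K, c))
```

When $t/\text{scale}\leq 1$ the quadratic branch of the minimum is active and the bound simplifies to $2Kn^{1-ca^{2}}$, so how fast it vanishes depends on the product of the unspecified constant and the squared multiplier.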

For the term $R_{2}$, we apply the Cauchy-Schwarz inequality and have

$$
\begin{aligned}
&\left\|\frac{1}{p}\sum_{j=1}^{p}\left\{\frac{1}{\hat{\sigma}_{j}^{2}}-\frac{1}{(\sigma_{j}^{*})^{2}}\right\}\bm{W}_{j}^{*}E_{ij}\right\|_{2}\\
&\leq\left[\frac{1}{p}\sum_{j=1}^{p}\left\{\frac{1}{\hat{\sigma}_{j}^{2}}-\frac{1}{(\sigma_{j}^{*})^{2}}\right\}^{2}\right]^{1/2}\left\{\frac{1}{p}\sum_{j=1}^{p}E_{ij}(\bm{W}_{j}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{W}_{j}^{*}E_{ij}\right\}^{1/2}.
\end{aligned}
$$

By Assumption 1(b), $\hat{\sigma}_{j}^{-2}$ and $(\sigma_{j}^{*})^{-2}$ are bounded in $[C^{-2},C^{2}]$; as a result, we have

$$\left[\frac{1}{p}\sum_{j=1}^{p}\left\{\frac{1}{\hat{\sigma}_{j}^{2}}-\frac{1}{(\sigma_{j}^{*})^{2}}\right\}^{2}\right]^{1/2}\leq 2C^{2},$$

and

$$
\begin{aligned}
\max_{i=1,\dots,n}\left\|\frac{1}{p}\sum_{j=1}^{p}\left\{\frac{1}{\hat{\sigma}_{j}^{2}}-\frac{1}{(\sigma_{j}^{*})^{2}}\right\}\bm{W}_{j}^{*}E_{ij}\right\|_{\infty}
&\leq\max_{i=1,\dots,n}\left\|\frac{1}{p}\sum_{j=1}^{p}\left\{\frac{1}{\hat{\sigma}_{j}^{2}}-\frac{1}{(\sigma_{j}^{*})^{2}}\right\}\bm{W}_{j}^{*}E_{ij}\right\|_{2}\\
&\leq 2C^{2}\max_{i=1,\dots,n}\left\{\frac{1}{p}\sum_{j=1}^{p}E_{ij}(\bm{W}_{j}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{W}_{j}^{*}E_{ij}\right\}^{1/2}.
\end{aligned}
$$

Because $|W_{jk}^{*}W_{jk^{\prime}}^{*}E_{ij}^{2}|\leq M^{2}C^{2}(1+CK^{3/2})^{2}$, the Bernstein inequality gives

$$
\begin{aligned}
&P\left(\left|\frac{1}{p}\sum_{j=1}^{p}W_{jk}^{*}W_{jk^{\prime}}^{*}E_{ij}^{2}\right|\geq t\right)\\
&\leq 2\exp\left[-C^{\prime\prime}\min\left\{\frac{t^{2}}{M^{4}C^{4}(1+CK^{3/2})^{4}},\frac{t}{M^{2}C^{2}(1+CK^{3/2})^{2}}\right\}p\right],
\end{aligned}
$$

and then applying the union bound gives

$$
\begin{aligned}
&P\left(\max_{i=1,\dots,n}\left\{\frac{1}{p}\sum_{j=1}^{p}E_{ij}(\bm{W}_{j}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{W}_{j}^{*}E_{ij}\right\}\geq t\right)\\
&\leq 2nK^{2}\exp\left[-C^{\prime\prime}\min\left\{\frac{t^{2}}{M^{4}C^{4}(1+CK^{3/2})^{4}},\frac{t}{M^{2}C^{2}(1+CK^{3/2})^{2}}\right\}p\right].
\end{aligned}
$$

From the above probability bound, we have $\max_{i=1,\dots,n}\left\{p^{-1}\sum_{j=1}^{p}E_{ij}(\bm{W}_{j}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{W}_{j}^{*}E_{ij}\right\}=O_{p}(p^{-1/2}(\log n)^{1/2})$ and

$$\max_{i=1,\dots,n}\left\|\frac{1}{p}\sum_{j=1}^{p}\left\{\frac{1}{\hat{\sigma}_{j}^{2}}-\frac{1}{(\sigma_{j}^{*})^{2}}\right\}\bm{W}_{j}^{*}E_{ij}\right\|_{\infty}=O_{p}\left\{\left(\frac{\log n}{p}\right)^{1/4}\right\}. \tag{A48}$$

For the term $R_{3}$, we apply the Cauchy-Schwarz inequality and get

$$\left\|\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}(\hat{\bm{W}}_{j}-\bm{W}_{j}^{*})E_{ij}\right\|_{2}\leq\left(\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}\|\hat{\bm{W}}_{j}-\bm{W}_{j}^{*}\|_{2}^{2}\right)^{1/2}\left(\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}E_{ij}^{2}\right)^{1/2}. \tag{A49}$$

By Proposition 1 in Bai and Li (2016), the first factor satisfies

$$\left(\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}\|\hat{\bm{W}}_{j}-\bm{W}_{j}^{*}\|_{2}^{2}\right)^{1/2}=O_{p}\left(\frac{1}{\sqrt{n}}+\frac{1}{p}\right).$$

Using the same combination of the Bernstein inequality and the union bound as for $R_{1}$ and $R_{2}$, we can show that $\max_{i=1,\dots,n}(p^{-1}\sum_{j=1}^{p}\hat{\sigma}_{j}^{-2}E_{ij}^{2})^{1/2}=O_{p}(p^{-1/4}(\log n)^{1/4})$, and therefore

$$
\begin{aligned}
&\max_{i=1,\dots,n}\left\|\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}(\hat{\bm{W}}_{j}-\bm{W}_{j}^{*})E_{ij}\right\|_{\infty}\\
&\leq\max_{i=1,\dots,n}\left\|\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}(\hat{\bm{W}}_{j}-\bm{W}_{j}^{*})E_{ij}\right\|_{2}\\
&\leq\left(\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}\|\hat{\bm{W}}_{j}-\bm{W}_{j}^{*}\|_{2}^{2}\right)^{1/2}\max_{i=1,\dots,n}\left(\frac{1}{p}\sum_{j=1}^{p}\frac{1}{\hat{\sigma}_{j}^{2}}E_{ij}^{2}\right)^{1/2}\\
&=O_{p}\left\{\frac{(\log n)^{1/4}}{p^{1/4}}\left(\frac{1}{\sqrt{n}}+\frac{1}{p}\right)\right\}.
\end{aligned}
\tag{A50}
$$

Combining (A46), (A47), (A48) and (A50), we have

$$\max_{i=1,\ldots,n}\left\|\frac{1}{p}\sum_{j=1}^{p}\hat{\sigma}_{j}^{-2}\hat{\bm{W}}_{j}E_{ij}\right\|_{\infty}=O_{p}\left\{\sqrt{\frac{\log n}{p}}\right\}.$$

In addition, from Corollary S.1 (a) in Bai and Li (2016), we have $\hat{\bm{H}}_{p}=(\bm{\Gamma}^{*})^{-1}+o_{p}(1)$, where $\|(\bm{\Gamma}^{*})^{-1}\|_{\text{sp}}=\lambda_{\max}^{1/2}\{(\bm{\Gamma}^{*})^{-{\mathrm{\scriptscriptstyle T}}}(\bm{\Gamma}^{*})^{-1}\}=\lambda_{\max}\{(\bm{\Gamma}^{*})^{-1}\}$ is a finite constant because $(\bm{\Gamma}^{*})^{-1}$ is symmetric and positive definite. Substituting these results into (A45), we have

$$\max_{i=1,\ldots,n}\|\hat{\bm{H}}\hat{\bm{W}}\hat{\bm{\Sigma}}_{e}^{-1}\bm{E}_{i}\|_{\infty}\lesssim K^{3/2}\lambda_{\max}\{(\bm{\Gamma}^{*})^{-1}\}O_{p}\left\{\sqrt{\frac{\log n}{p}}\right\}. \tag{A51}$$

Combining (A44) and (A51) with the triangle-inequality decomposition of $\max_{i=1,\ldots,n}\|\hat{\bm{U}}_{i}-\bm{U}_{i}\|_{\infty}$ at the start of this proof, we obtain

$$\max_{i=1,\ldots,n}\|\hat{\bm{U}}_{i}-\bm{U}_{i}\|_{\infty}=O_{p}\left(\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right).$$
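The $(\log n/p)^{1/2}$ part of this rate can be illustrated with a small simulation. The sketch below is a simplified illustration, not the estimator analysed in the proof: it assumes a single factor ($K=1$), uses the true loadings and noise variances in place of $\hat{\bm{W}}$ and $\hat{\bm{\Sigma}}_{e}$ (so only the term involving $\bm{E}_{i}$ is present), and draws bounded uniform noise. The function name and all numerical values are hypothetical.

```python
import math
import random

def max_confounder_error(n, p, rng):
    """Max over i of |U_hat_i - U_i| for an oracle weighted least-squares
    estimate of a scalar factor (K = 1) with known loadings and variances."""
    W = [rng.uniform(0.5, 1.5) for _ in range(p)]   # loadings W_j (assumed known)
    sigma2 = [1.0] * p                              # noise variances (assumed known)
    H = 1.0 / sum(w * w / s for w, s in zip(W, sigma2))   # (W Sigma^{-1} W^T)^{-1}
    half_width = math.sqrt(3.0)    # Uniform(-sqrt(3), sqrt(3)) has variance 1
    worst = 0.0
    for _ in range(n):
        U = rng.uniform(-1.0, 1.0)                  # bounded latent factor
        X = [w * U + rng.uniform(-half_width, half_width) for w in W]
        U_hat = H * sum(w * x / s for w, x, s in zip(W, X, sigma2))
        worst = max(worst, abs(U_hat - U))
    return worst

rng = random.Random(0)
n = 100
err_small_p = max_confounder_error(n, 50, rng)
err_large_p = max_confounder_error(n, 3200, rng)
rate_small_p = math.sqrt(math.log(n) / 50)
rate_large_p = math.sqrt(math.log(n) / 3200)
print(err_small_p, rate_small_p)
print(err_large_p, rate_large_p)
```

In this oracle setting the maximal error should shrink as $p$ grows and stay within a small multiple of $(\log n/p)^{1/2}$; the $n^{-1/2}$ part of the rate comes from estimating the loadings and is not captured here.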

Appendix G. Proofs of Lemmas

G.1 Proof of Lemma 7

Proof of Condition (i). To prove Condition (i) in Lemma 7, we need to show that

$$\left\|-\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}]\dot{\bm{Z}}_{i}\right\|_{\infty}=O_{p}\left\{\left(\frac{\log p}{n}\right)^{1/2}+\left(\frac{\log n}{p}\right)^{1/2}\right\}.$$

From Assumption 2, $y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}$ is a sub-exponential random variable with mean 0 and $\|y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\|_{\varphi_{1}}\leq M$. In addition, it is assumed that $a_{1}\leq(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\leq a_{2}$ and $0\leq b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\leq B$. Denote $\max_{i}\|\hat{\bm{U}}_{i}-\bm{U}_{i}^{*}\|_{\infty}=\max_{i}\|\dot{\bm{Z}}_{i}-\bm{Z}_{i}\|_{\infty}=\epsilon$. From Proposition 1, $\epsilon=O_{p}(n^{-1/2}+p^{-1/2}(\log n)^{1/2})$. Since $\|\dot{\bm{Z}}_{i}-\bm{Z}_{i}\|_{\infty}\leq\epsilon$ for every $i$, it can be shown that $|(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}(\bm{Z}_{i}-\dot{\bm{Z}}_{i})|\leq a_{2}+s_{\eta}M\epsilon$.

To prove condition (i), we look for a sequence $t$ such that the following probability tends to 0:

$$
\begin{aligned}
&P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}]\dot{\bm{Z}}_{i}\right\|_{\infty}\geq t\right)\\
&\leq P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}]\hat{\bm{U}}_{i}\right\|_{\infty}\geq t\right)+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}]\bm{X}_{i}\right\|_{\infty}\geq t\right)\\
&\leq P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\bm{U}_{i}\right\|_{\infty}\geq t_{1}-\delta_{1}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}]\bm{U}_{i}\right\|_{\infty}\geq\delta_{1}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}](\hat{\bm{U}}_{i}-\bm{U}_{i})\right\|_{\infty}\geq t_{2}-\delta_{2}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}](\hat{\bm{U}}_{i}-\bm{U}_{i})\right\|_{\infty}\geq\delta_{2}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\bm{X}_{i}\right\|_{\infty}\geq t_{3}-\delta_{3}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}]\bm{X}_{i}\right\|_{\infty}\geq\delta_{3}\right)\\
&=:P_{1}+P_{2}+P_{3}+P_{4}+P_{5}+P_{6},
\end{aligned}
\tag{A52}
$$

where $t=\max\{t_{1}+t_{2},t_{3}\}$.

We let $\delta_{1}=\delta_{3}=BM(a_{2}+s_{\eta}M\epsilon)$ and $\delta_{2}=B\epsilon(a_{2}+s_{\eta}M\epsilon)$; it can then be verified that the probabilities $P_{2}$, $P_{4}$ and $P_{6}$ tend to 0 under $\epsilon=O_{p}(n^{-1/2}+p^{-1/2}(\log n)^{1/2})$. Next, we examine $P_{1}$, $P_{3}$ and $P_{5}$ to determine $t_{1}$, $t_{2}$ and $t_{3}$ separately.

For $P_{1}$, the variables $[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]U_{ik}$, $i=1,\ldots,n$, are independent mean-zero sub-exponential random variables with $\|[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]U_{ik}\|_{\varphi_{1}}\leq 2M^{2}$, so by the Bernstein inequality we have

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]U_{ik}\right|\geq t_{1}-\delta_{1}\right)\leq 2\exp\left\{-\tilde{M}^{\prime\prime}\min\left(\frac{(t_{1}-\delta_{1})^{2}}{4M^{4}},\frac{t_{1}-\delta_{1}}{2M^{2}}\right)n\right\}.$$

Then, applying the union bound over $k=1,\ldots,K$, we have

$$
\begin{aligned}
&P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\bm{U}_{i}\right\|_{\infty}\geq t_{1}-\delta_{1}\right)\\
&\leq 2K\exp\left\{-\tilde{M}^{\prime\prime}\min\left(\frac{(t_{1}-\delta_{1})^{2}}{4M^{4}},\frac{t_{1}-\delta_{1}}{2M^{2}}\right)n\right\},
\end{aligned}
$$

so at $t_{1}-\delta_{1}=M^{2}n^{-1/2}$, the probability $P_{1}$ tends to 0. Using similar techniques, we can show that at $t_{2}-\delta_{2}=M\epsilon n^{-1/2}$, the probability $P_{3}$ tends to 0, and at $t_{3}-\delta_{3}=M^{2}n^{-1/2}(\log p)^{1/2}$, the probability $P_{5}$ tends to 0. Hence at

$$
\begin{aligned}
t&=\max\{t_{1}+t_{2},t_{3}\}\\
&=\max\Big\{\frac{M^{2}}{\sqrt{n}}+BM(a_{2}+s_{\eta}M\epsilon)+\frac{M\epsilon}{\sqrt{n}}+B\epsilon(a_{2}+s_{\eta}M\epsilon),\\
&\qquad\qquad M^{2}\sqrt{\frac{\log p}{n}}+BM(a_{2}+s_{\eta}M\epsilon)\Big\},
\end{aligned}
$$

the probability (A52) tends to 0. As $t_{3}$ dominates $t_{1}+t_{2}$ in the above expression, we obtain

$$\left\|\frac{1}{n}\sum_{i=1}^{n}[y_{i}-b^{\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}]\dot{\bm{Z}}_{i}\right\|_{\infty}=O_{p}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right),$$

which completes the proof for condition (i).
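The splitting step behind (A52) rests on an elementary event inclusion: whenever $|a+b|\geq t$, either $|a|\geq t-\delta$ or $|b|\geq\delta$, so $P(|a+b|\geq t)\leq P(|a|\geq t-\delta)+P(|b|\geq\delta)$. The sketch below checks this on simulated pairs; the function name, the Gaussian draws and the values of `t` and `delta` are hypothetical choices for illustration, with `a` playing the role of the main term and `b` the perturbation caused by replacing $\bm{Z}_{i}$ with $\dot{\bm{Z}}_{i}$.

```python
import random

def split_counts(samples, t, delta):
    """Count the three events in
    P(|a + b| >= t) <= P(|a| >= t - delta) + P(|b| >= delta)."""
    lhs = sum(1 for a, b in samples if abs(a + b) >= t)
    main = sum(1 for a, b in samples if abs(a) >= t - delta)
    pert = sum(1 for a, b in samples if abs(b) >= delta)
    return lhs, main, pert

rng = random.Random(1)
# a: main term; b: small perturbation (both distributions are illustrative)
samples = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 0.1)) for _ in range(10000)]
lhs, main, pert = split_counts(samples, t=2.0, delta=0.3)
print(lhs, main, pert)
```

Because the inclusion holds for every individual pair, the first count can never exceed the sum of the other two, whatever the distribution of the samples.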

Proof of Condition (ii). To show that condition (ii) in Lemma 7 holds, we use a technique similar to that in the proof of condition (i) and decompose the probability as follows. Recall that $\bm{\tau}^{*}=(1,-(\bm{w}^{*})^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$ and that the two sub-vectors of $\bm{w}^{*}$ are denoted by $\bm{w}_{q}^{*}=(w_{2}^{*},\ldots,w_{q}^{*})^{\mathrm{\scriptscriptstyle T}}$ and $\bm{w}_{u}^{*}=(w_{p+1}^{*},\ldots,w_{p+K}^{*})^{\mathrm{\scriptscriptstyle T}}$. Denote $\bm{\tau}_{q}^{*}=(1,-(\bm{w}_{q}^{*})^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}=(1,-w_{2}^{*},\dots,-w_{q}^{*})^{\mathrm{\scriptscriptstyle T}}$.

$$
\begin{aligned}
&P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t\right)\\
&\leq P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\hat{\bm{U}}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\hat{\bm{U}}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\hat{\bm{U}}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\hat{\bm{U}}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\bm{X}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\bm{X}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\bm{X}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}\bm{X}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t\right)\\
&\leq P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t_{4}-\delta_{4}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\hat{\bm{U}}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}\right.\right.\\
&\qquad\qquad\left.\left.-E_{\eta^{*}}\left[(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\hat{\bm{U}}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq\delta_{4}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t_{5}-\delta_{5}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\hat{\bm{U}}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}\right.\right.\\
&\qquad\qquad\left.\left.-E_{\eta^{*}}\left[(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\hat{\bm{U}}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq\delta_{5}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\bm{X}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\bm{X}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t_{6}-\delta_{6}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\bm{X}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}\right.\right.\\
&\qquad\qquad\left.\left.-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\bm{X}_{i}\hat{\bm{U}}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq\delta_{6}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\bm{X}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\bm{X}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq t_{7}-\delta_{7}\right)\\
&\quad+P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\bm{X}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}\right.\right.\\
&\qquad\qquad\left.\left.-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}[b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}]\bm{X}_{i}\bm{X}_{i}^{\mathrm{\scriptscriptstyle T}}\right]\right\|_{\infty}\geq\delta_{7}\right)\\
&=:R_{1}+R_{2}+R_{3}+R_{4}+R_{5}+R_{6}+R_{7}+R_{8},
\end{aligned}
\tag{A53}
$$

where $t=\max\{t_{4},t_{5},t_{6},t_{7}\}$.

As in the proof of condition (i), we let $\max_{i}\|\hat{\bm{U}}_{i}-\bm{U}_{i}^{*}\|_{\infty}=\max_{i}\|\dot{\bm{Z}}_{i}-\bm{Z}_{i}\|_{\infty}=\epsilon$ with $\epsilon=O_{p}(n^{-1/2}+p^{-1/2}(\log n)^{1/2})$. From Assumption 2, we have $b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\in[0,B]$ and $|(\bm{\tau}_{q}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{X}_{i}|\leq 2M$; as a result, $|(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}\hat{\bm{U}}_{i}|\leq K\|\bm{w}^{*}\|_{\infty}(M+\epsilon)$. In addition, $|b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\dot{\bm{Z}}_{i}\}-b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}|\leq|b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}|\,|(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}(\dot{\bm{Z}}_{i}-\bm{Z}_{i})|\leq B(a_{2}+s_{\eta}M\epsilon)$.

We let

$$
\begin{aligned}
\delta_{4}&=2BM(a_{2}+s_{\eta}M\epsilon)\{K\|\bm{w}^{*}\|_{\infty}(M+\epsilon)\};\\
\delta_{5}&=2B(M+\epsilon)(a_{2}+s_{\eta}M\epsilon)\{K\|\bm{w}^{*}\|_{\infty}(M+\epsilon)\};\\
\delta_{6}&=4BM(M+\epsilon)(a_{2}+s_{\eta}M\epsilon);\\
\delta_{7}&=4BM^{2}(a_{2}+s_{\eta}M\epsilon).
\end{aligned}
$$

It can be shown that $R_{2}$, $R_{4}$, $R_{6}$ and $R_{8}$ tend to 0 under $\epsilon=O_{p}(n^{-1/2}+p^{-1/2}(\log n)^{1/2})$. We then determine $t_{4}$, $t_{5}$, $t_{6}$ and $t_{7}$ such that the probability (A53) tends to 0.

For $R_{1}$, the variables $(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}X_{ij}-E_{\eta^{*}}[(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}X_{ij}]$, $i=1,\ldots,n$, are independent mean-zero sub-exponential random variables and

$$
\begin{aligned}
&\|(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}X_{ij}-E_{\eta^{*}}[(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}X_{ij}]\|_{\varphi_{1}}\\
&\leq 2BMK\|\bm{w}^{*}\|_{\infty}(M+\epsilon)=:M_{c}(M+\epsilon),
\end{aligned}
$$

where we denote $M_{c}=2BMK\|\bm{w}^{*}\|_{\infty}$ for notational simplicity. By the Bernstein inequality, we have

$$
\begin{aligned}
&P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left((\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}X_{ij}-E_{\eta^{*}}[(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}X_{ij}]\right)\right|\geq t_{4}-\delta_{4}\right)\\
&\leq 2\exp\left\{-\tilde{M}^{\prime\prime}\min\left(\frac{(t_{4}-\delta_{4})^{2}}{M_{c}^{2}(M+\epsilon)^{2}},\frac{t_{4}-\delta_{4}}{M_{c}^{2}(M+\epsilon)^{2}}\right)n\right\}.
\end{aligned}
$$

Then, applying the union bound over $j=1,\ldots,p$, we have

$$
\begin{aligned}
&P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}\bm{X}_{i}-E_{\eta^{*}}[(\bm{w}_{u}^{*})^{\mathrm{\scriptscriptstyle T}}b^{\prime\prime}\{(\bm{\eta}^{*})^{\mathrm{\scriptscriptstyle T}}\bm{Z}_{i}\}\hat{\bm{U}}_{i}\bm{X}_{i}]\right\|_{\infty}\geq t_{4}-\delta_{4}\right)\\
&\leq 2p\exp\left\{-\tilde{M}^{\prime\prime}\min\left(\frac{(t_{4}-\delta_{4})^{2}}{M_{c}^{2}(M+\epsilon)^{2}},\frac{t_{4}-\delta_{4}}{M_{c}^{2}(M+\epsilon)^{2}}\right)n\right\},
\end{aligned}
$$

so at $t_{4}-\delta_{4}=M_{c}(M+\epsilon)n^{-1/2}(\log p)^{1/2}$, the probability $R_{1}$ tends to 0. Using similar techniques, we can show that at $t_{5}-\delta_{5}=2BK\|\bm{w}^{*}\|_{\infty}(M+\epsilon)^{2}n^{-1/2}$, the probability $R_{3}$ tends to 0; at $t_{6}-\delta_{6}=4BM(M+\epsilon)n^{-1/2}$, the probability $R_{5}$ tends to 0; and at $t_{7}-\delta_{7}=4BM^{2}n^{-1/2}(\log p)^{1/2}$, the probability $R_{7}$ tends to 0. Hence at

\begin{align*}
t &= \max\{t_{4},t_{5},t_{6},t_{7}\} \\
&= \max\Big\{M_{c}(M+\epsilon)n^{-1/2}(\log p)^{1/2}+2BM(a_{2}+s_{\eta}M\epsilon)\{K\|\bm{w}^{*}\|_{\infty}(M+\epsilon)\}, \\
&\qquad\quad 2BK\|\bm{w}^{*}\|_{\infty}(M+\epsilon)^{2}n^{-1/2}+2B(M+\epsilon)(a_{2}+s_{\eta}M\epsilon)\{K\|\bm{w}^{*}\|_{\infty}(M+\epsilon)\}, \\
&\qquad\quad 4BM(M+\epsilon)n^{-1/2}+4BM(M+\epsilon)(a_{2}+s_{\eta}M\epsilon), \\
&\qquad\quad 4BM^{2}n^{-1/2}(\log p)^{1/2}+4BM^{2}(a_{2}+s_{\eta}M\epsilon)\Big\},
\end{align*}

the probability in (A53) tends to 0. Since $t_{7}$ dominates $t_{4}$, $t_{5}$ and $t_{6}$, we have

\begin{align*}
&\left\|\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{T}}-E_{\eta^{*}}\left[\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{T}}\right]\right\|_{\infty} \\
&\qquad=O_{p}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right).
\end{align*}

This completes the proof for Condition (ii).
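To make the choice of threshold for $R_{1}$ concrete, the following worked substitution (a sketch, not part of the original argument) plugs $t_{4}-\delta_{4}=M_{c}(M+\epsilon)n^{-1/2}(\log p)^{1/2}$ into the union bound. It assumes that $\log p/n$ is small enough for the quadratic term to attain the minimum, and that the Bernstein constant satisfies $\tilde{M}''>1$. The latter is an assumption made here for illustration; it can otherwise be arranged by inflating the threshold by a sufficiently large constant factor.

```latex
% Assumption for illustration: \tilde{M}'' > 1 and the quadratic branch of the min is active.
\begin{align*}
\frac{(t_{4}-\delta_{4})^{2}}{M_{c}^{2}(M+\epsilon)^{2}}
  &= \frac{M_{c}^{2}(M+\epsilon)^{2}\,n^{-1}\log p}{M_{c}^{2}(M+\epsilon)^{2}}
   = \frac{\log p}{n}, \\
2p\exp\left\{-\tilde{M}''\,\frac{\log p}{n}\,n\right\}
  &= 2p\cdot p^{-\tilde{M}''}
   = 2\,p^{\,1-\tilde{M}''}\longrightarrow 0
   \quad\text{as } p\to\infty \text{ when } \tilde{M}''>1 .
\end{align*}
```

The factor $p$ contributed by the union bound is what forces the $(\log p)^{1/2}$ term in the threshold: the exponent must decay at least like a multiple of $\log p$ to offset it.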

G.2 Proof of Lemma 9

In this proof, we denote

\[
H_{\eta}=\frac{1}{n}\sum_{i=1}^{n}(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}^{\mathrm{T}}(\hat{\bm{\eta}}-\bm{\eta}^{*}).
\]

We continue to use the notation and results of Appendix D.1. Recall that we define $D(\hat{\bm{\eta}},\bm{\eta}^{*})=(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})\}$. To show (A18), we consider the difference between $D(\hat{\bm{\eta}},\bm{\eta}^{*})$ and $H_{\eta}$ and apply the mean value theorem with $\tilde{\bm{\eta}}=\xi\hat{\bm{\eta}}+(1-\xi)\bm{\eta}^{*}$ for some $\xi\in[0,1]$ to get

\begin{align*}
&|D(\hat{\bm{\eta}},\bm{\eta}^{*})-H_{\eta}| \\
&= \left|(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})\}-\frac{1}{n}\sum_{i=1}^{n}(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}^{\mathrm{T}}(\hat{\bm{\eta}}-\bm{\eta}^{*})\right| \\
&= \left|(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\left[\nabla^{2}l(\tilde{\bm{\eta}})-\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{T}}\right](\hat{\bm{\eta}}-\bm{\eta}^{*})\right| \\
&= \left|\frac{1}{n}\sum_{i=1}^{n}[b''(\tilde{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}]\{\dot{\bm{Z}}_{i}^{\mathrm{T}}(\hat{\bm{\eta}}-\bm{\eta}^{*})\}^{2}\right| \\
&\leq B\left|\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{\dot{\bm{Z}}_{i}^{\mathrm{T}}(\hat{\bm{\eta}}-\bm{\eta}^{*})\}^{2}\right|\max_{i}\left|(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\right| \\
&\lesssim B\left|\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{\dot{\bm{Z}}_{i}^{\mathrm{T}}(\hat{\bm{\eta}}-\bm{\eta}^{*})\}^{2}\right|\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}\max_{i=1,\ldots,n}\|\dot{\bm{Z}}_{i}\|_{\infty} \\
&\lesssim H_{\eta}s_{\eta}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\left(M+\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right), \tag{A55}
\end{align*}

where the first inequality is based on Assumption 2(3) that $|b''(t_{1})-b''(t)|\leq B|t_{1}-t|b''(t)$ with $t_{1}=\tilde{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}$ and $t=(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}$. The last inequality (A55) follows from the estimation error bound in the proof of Theorem 2 that $\|\hat{\bm{\eta}}-\bm{\eta}^{*}\|_{1}\lesssim s_{\eta}(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})$ and from the result derived from Proposition 1 that $\max_{i}\|\dot{\bm{Z}}_{i}\|_{\infty}=M+O_{p}(n^{-1/2}+p^{-1/2}(\log n)^{1/2})$. Under the assumption that $s_{\eta}(n^{-1/2}(\log p)^{1/2}+p^{-1/2}(\log n)^{1/2})=o_{p}(1)$, we have

\begin{align*}
H_{\eta}\left\{1-s_{\eta}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\left(M+\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right)\right\}
&\lesssim D(\hat{\bm{\eta}},\bm{\eta}^{*}) \\
&\lesssim 9c^{2}s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right),
\end{align*}

where the last inequality follows by combining (A15) and (A17). Therefore

\[
H_{\eta}=O_{p}\left\{s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\}.
\]

The two inequalities (A19) and (A20) are shown to hold in Appendix D.2. Hence the proof for Lemma 9 is complete.
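The passage from the bound on $|D(\hat{\bm{\eta}},\bm{\eta}^{*})-H_{\eta}|$ to the rate for $H_{\eta}$ can be spelled out as follows. This is a sketch under the stated sparsity assumption; the shorthand $r_{n}$ and the generic constant $C$ are introduced here for illustration and do not appear in the original.

```latex
% r_n collects the vanishing factor that multiplies H_\eta in (A55); C is a generic constant.
\begin{align*}
r_{n} &:= s_{\eta}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)
          \left(M+\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right)=o_{p}(1), \\
(1-C\,r_{n})\,H_{\eta}
      &\le H_{\eta}-|D(\hat{\bm{\eta}},\bm{\eta}^{*})-H_{\eta}|
       \le D(\hat{\bm{\eta}},\bm{\eta}^{*}), \\
H_{\eta} &\le \frac{D(\hat{\bm{\eta}},\bm{\eta}^{*})}{1-C\,r_{n}}
       \le 2\,D(\hat{\bm{\eta}},\bm{\eta}^{*})
       \quad\text{on the event } \{C\,r_{n}\le 1/2\}.
\end{align*}
```

Since $b''\geq 0$ gives $H_{\eta}\geq 0$, and the event $\{C\,r_{n}\leq 1/2\}$ has probability tending to one, the upper bound on $D(\hat{\bm{\eta}},\bm{\eta}^{*})$ transfers to $H_{\eta}$ up to a constant.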

G.3 Proof of Lemma 11

As shown in the preliminaries, we have

\[
\nabla l(\bm{\eta})=\frac{1}{n}\sum_{i=1}^{n}\{-y_{i}+b'(\bm{\eta}^{\mathrm{T}}\dot{\bm{Z}}_{i})\}\dot{\bm{Z}}_{i},\quad\nabla^{2}l(\bm{\eta})=\frac{1}{n}\sum_{i=1}^{n}b''(\bm{\eta}^{\mathrm{T}}\dot{\bm{Z}}_{i})\dot{\bm{Z}}_{i}\dot{\bm{Z}}_{i}^{\mathrm{T}}.
\]

Proof of Condition (iii). For condition (iii), we apply the mean value theorem with $\tilde{\bm{\eta}}=\xi\hat{\bm{\eta}}+(1-\xi)\bm{\eta}^{*}$ for some $\xi\in[0,1]$; the left-hand side is then

\begin{align*}
&|(\bm{\tau}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})-\nabla^{2}l(\bm{\eta}^{*})(\hat{\bm{\eta}}-\bm{\eta}^{*})\}| \tag{A56} \\
&= |(\bm{\tau}^{*})^{\mathrm{T}}\{\nabla^{2}l(\tilde{\bm{\eta}})-\nabla^{2}l(\bm{\eta}^{*})\}(\hat{\bm{\eta}}-\bm{\eta}^{*})| \\
&= \left|\frac{1}{n}\sum_{i=1}^{n}[b''(\tilde{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}]\dot{\bm{Z}}_{i}^{\mathrm{T}}(\hat{\bm{\eta}}-\bm{\eta}^{*})(\bm{\tau}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\right| \\
&\leq B\left|\frac{1}{n}\sum_{i=1}^{n}\xi(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\dot{\bm{Z}}_{i}^{\mathrm{T}}(\hat{\bm{\eta}}-\bm{\eta}^{*})\right|\max_{i}|(\bm{\tau}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}| \\
&\lesssim s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\left\{M+O_{p}\left(\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right)\right\}, \tag{A57}
\end{align*}

where the first inequality in (A56) is based on Assumption 2(4) that $|b''(t_{1})-b''(t)|\leq B|t_{1}-t|b''(t)$ with $t_{1}=\tilde{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}$ and $t=(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}$. The last inequality (A57) follows from (A18) of Lemma 9 and from the results of Proposition 1.

As $n,p\rightarrow\infty$, under the scaling condition that $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, condition (iii) holds since

\begin{align*}
&\sqrt{n}\,|(\bm{\tau}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})-\nabla^{2}l(\bm{\eta}^{*})(\hat{\bm{\eta}}-\bm{\eta}^{*})\}| \\
&\lesssim \sqrt{n}\,s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\left\{M+O_{p}\left(\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right)\right\} \\
&= o_{p}(1).
\end{align*}
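As an illustration of the Lipschitz-type condition on $b''$ used in (A56), consider the logistic model. The derivation below is a sketch that checks the inequality for $b(t)=\log(1+e^{t})$ under the additional restriction $|t_{1}-t|\leq 1$; the restriction and the resulting constant $B=e-1$ are assumptions made here for illustration and need not match the exact form of Assumption 2.

```latex
% Logistic model: b(t) = \log(1+e^{t}), with \sigma(t) = e^{t}/(1+e^{t}).
% Assumption for illustration: |t_1 - t| \le 1, which yields B = e - 1.
\begin{align*}
b''(t) &= \sigma(t)\{1-\sigma(t)\}, \qquad
b'''(t) = b''(t)\{1-2\sigma(t)\}, \qquad |b'''(t)|\le b''(t), \\
\left|\log b''(t_{1})-\log b''(t)\right|
  &= \left|\int_{t}^{t_{1}}\frac{b'''(s)}{b''(s)}\,ds\right|\le |t_{1}-t|, \\
|b''(t_{1})-b''(t)|
  &\le \left(e^{|t_{1}-t|}-1\right)b''(t)
   \le (e-1)\,|t_{1}-t|\,b''(t)\quad\text{for } |t_{1}-t|\le 1 .
\end{align*}
```

The last step uses convexity of $x\mapsto e^{x}-1$ on $[0,1]$, so that $e^{x}-1\leq (e-1)x$ there.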

Proof of Condition (iv). We multiply the left-hand side of condition (iv) by $n^{1/2}$ and apply the mean value theorem with $\tilde{\bm{\eta}}=\xi\hat{\bm{\eta}}+(1-\xi)\bm{\eta}^{*}$. Then we have

\begin{align*}
&n^{1/2}|(\hat{\bm{\tau}}-\bm{\tau}^{*})^{\mathrm{T}}\{\nabla l(\hat{\bm{\eta}})-\nabla l(\bm{\eta}^{*})\}| \\
&= n^{1/2}\left|(\hat{\bm{\tau}}-\bm{\tau}^{*})^{\mathrm{T}}\frac{1}{n}\sum_{i=1}^{n}\left[b'(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b'\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\right]\dot{\bm{Z}}_{i}\right| \\
&= n^{1/2}\left|\frac{1}{n}\sum_{i=1}^{n}b''(\tilde{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})(\hat{\bm{\tau}}-\bm{\tau}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\right| \\
&\leq n^{1/2}\left|\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}^{2}\right|^{1/2} \\
&\quad\times\left|\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}^{2}\right|^{1/2} \tag{A58} \\
&\lesssim n^{1/2}\left\{s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\}^{1/2}\left\{(s_{\eta}\vee s_{w})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\}^{1/2} \tag{A59} \\
&\lesssim n^{1/2}(s_{w}\vee s_{\eta})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)=o_{p}(1), \tag{A60}
\end{align*}

where (A58) follows from the Cauchy-Schwarz inequality, (A59) from (A18) and (A19) in Lemma 9, and (A60) from the condition $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$. This completes the proof of Lemma 11.
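The step from (A59) to (A60) is pure arithmetic. A short worked version follows, using the shorthand $a_{n}$ (introduced here for illustration, not in the original) for the common rate.

```latex
% a_n is shorthand for the common rate appearing in (A59) and (A60).
\begin{align*}
a_{n} &:= \frac{\log p}{n}+\frac{\log n}{p}, \\
\{s_{\eta}a_{n}\}^{1/2}\{(s_{\eta}\vee s_{w})a_{n}\}^{1/2}
  &\le (s_{w}\vee s_{\eta})\,a_{n}\qquad\text{since } s_{\eta}\le s_{w}\vee s_{\eta}, \\
n^{1/2}(s_{w}\vee s_{\eta})\,a_{n}
  &= (s_{w}\vee s_{\eta})\left(n^{-1/2}\log p+p^{-1}n^{1/2}\log n\right).
\end{align*}
```

The right-hand side of the last line is exactly the quantity that the scaling condition requires to be $o_{p}(1)$.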

G.4 Proof of Lemma 12

According to the definitions in the preliminaries, we have

\begin{align*}
&(\bm{\tau}^{*})^{\mathrm{T}}\nabla l(\bm{\eta}^{*})(I_{\theta\mid\bm{\zeta}}^{*})^{-1/2} \\
&= \frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}[-y_{i}+b'\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2} \\
&= \frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}[-y_{i}+b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2} \\
&\quad+\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}[b'\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}-b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2} \\
&= \frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}\bm{Z}_{i}[-y_{i}+b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2} \\
&\quad+\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{T}}(\hat{\bm{U}}_{i}-\bm{U}_{i})[-y_{i}+b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2} \\
&\quad+\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}\bm{Z}_{i}[b'\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}-b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2} \\
&\quad+\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{T}}(\hat{\bm{U}}_{i}-\bm{U}_{i})[b'\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}-b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2} \\
&=: G_{1}+G_{2}+G_{3}+G_{4}.
\end{align*}

Because $|(\bm{\tau}^{*})^{\mathrm{T}}\bm{Z}_{i}|$ is bounded and $[-y_{i}+b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}]$ is sub-exponential by Assumption 2, the terms $(\bm{\tau}^{*})^{\mathrm{T}}\bm{Z}_{i}[-y_{i}+b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2}$ are independent and have finite moments. We apply the Berry-Esseen theorem to show that $G_{1}\rightarrow_{d}N(0,1)$.

For $G_{2}$, we apply techniques similar to those in the proof of condition (i) of Lemma 7, using Bernstein's inequality, and it can be verified that

\[
G_{2}=\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{T}}(\hat{\bm{U}}_{i}-\bm{U}_{i})[-y_{i}+b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2}=O_{p}\left(\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right).
\]

As a result of Proposition 1, we have

\[
G_{3}=\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}\bm{Z}_{i}[b'\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}-b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2}=O_{p}\left(\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right),
\]

and

\[
G_{4}=\frac{1}{n}\sum_{i=1}^{n}(\bm{w}_{u}^{*})^{\mathrm{T}}(\hat{\bm{U}}_{i}-\bm{U}_{i})[b'\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}-b'\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2}=O_{p}\left(\frac{1}{n}+\frac{\log n}{p}\right).
\]

Under the scaling condition that $n,p\rightarrow\infty$ and $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, we have $G_{2}\rightarrow_{p}0$, $G_{3}\rightarrow_{p}0$ and $G_{4}\rightarrow_{p}0$. Applying Slutsky's theorem, we have

\[
\frac{1}{n}\sum_{i=1}^{n}(\bm{\tau}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}[-y_{i}+b'\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}](I_{\theta\mid\bm{\zeta}}^{*})^{-1/2}\rightarrow_{d}N(0,1).
\]

This completes the proof of central limit theorem for the score function.

G.5 Proof of Lemma 13

In this proof, we show that the partial information

\[
I_{\theta\mid\bm{\zeta}}^{*}=E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}D_{i}\{D_{i}-(\bm{w}^{*})^{\mathrm{T}}\bm{M}_{i}\}]
\]

is consistently estimated by

\[
\hat{I}_{\theta\mid\bm{\zeta}}=\frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})D_{i}(D_{i}-\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i}).
\]

To see this, we write the difference between them as

\begin{align*}
\hat{I}_{\theta\mid\bm{\zeta}}-I_{\theta\mid\bm{\zeta}}^{*}
&= \frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})D_{i}^{2}-E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}D_{i}^{2}] \\
&\quad-\left\{\frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i}-E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}(\bm{w}^{*})^{\mathrm{T}}\bm{M}_{i}D_{i}]\right\} \\
&= L_{1}-L_{2}.
\end{align*}

For $L_{1}$, we decompose it into

\begin{align*}
L_{1} &= \frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})D_{i}^{2}-E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}D_{i}^{2}] \\
&= \frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}D_{i}^{2}-E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}D_{i}^{2}] \\
&\quad+\frac{1}{n}\sum_{i=1}^{n}[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}-b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}]D_{i}^{2} \\
&\quad+\frac{1}{n}\sum_{i=1}^{n}[b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}]D_{i}^{2} \\
&= L_{11}+L_{12}+L_{13}.
\end{align*}

Applying techniques similar to those in the proof of condition (ii) of Lemma 7, using Bernstein's inequality, we can show that

\[
L_{11}=\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}D_{i}^{2}-E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}D_{i}^{2}]=O_{p}\left(\frac{1}{\sqrt{n}}\right).
\]

As a result of Proposition 1,

\[
L_{12}=\frac{1}{n}\sum_{i=1}^{n}[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}-b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}]D_{i}^{2}=O_{p}\left(\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right).
\]

Using the estimation consistency results in Theorem 2, we have

\begin{align*}
L_{13} &= \frac{1}{n}\sum_{i=1}^{n}[b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}]D_{i}^{2} \\
&= O_{p}\left\{s_{\eta}\left(\sqrt{\frac{\log p}{n}}+\sqrt{\frac{\log n}{p}}\right)\left(M+\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right)\right\}.
\end{align*}

Under the condition that $n,p\rightarrow\infty$ and $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, we have

\[
L_{1}=L_{11}+L_{12}+L_{13}=o_{p}(1). \tag{A61}
\]

For $L_{2}$, we decompose it into

\begin{align*}
L_{2} &= \frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\hat{\bm{w}}^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i}-E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}(\bm{w}^{*})^{\mathrm{T}}\bm{M}_{i}D_{i}] \\
&= \frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i} \\
&\quad+\frac{1}{n}\sum_{i=1}^{n}[b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}](\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i} \\
&\quad+\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i}-E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}(\bm{w}^{*})^{\mathrm{T}}\bm{M}_{i}D_{i}] \\
&= L_{21}+L_{22}+L_{23}.
\end{align*}

We apply Hölder's inequality to $L_{21}$ and get

\begin{align*}
L_{21} &= \frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i} \tag{A62} \\
&\leq \left[\frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})\{(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}^{2}\right]^{1/2}\left\{\frac{1}{n}\sum_{i=1}^{n}b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})D_{i}^{2}\right\}^{1/2} \\
&\lesssim \left\{(s_{\eta}\vee s_{w})\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\}^{1/2}\left(M+\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right)^{1/2},
\end{align*}

where the last inequality in (A62) follows from equation (A20) in Lemma 9 and from arguments similar to those used in bounding $L_{12}+L_{13}$.
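The Hölder step in (A62) is the Cauchy-Schwarz case, and it relies on $b''\geq 0$, which lets the weight be split evenly between the two factors. A sketch with generic sequences follows; the symbols $\omega_{i}$, $a_{i}$ and $d_{i}$ are introduced here for illustration and stand for $b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})$, $(\hat{\bm{w}}-\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}$ and $D_{i}$ respectively.

```latex
% Weighted Cauchy-Schwarz with nonnegative weights \omega_i (illustrative notation).
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}\omega_{i}a_{i}d_{i}
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl(\omega_{i}^{1/2}a_{i}\bigr)\bigl(\omega_{i}^{1/2}d_{i}\bigr) \\
  &\le \left(\frac{1}{n}\sum_{i=1}^{n}\omega_{i}a_{i}^{2}\right)^{1/2}
       \left(\frac{1}{n}\sum_{i=1}^{n}\omega_{i}d_{i}^{2}\right)^{1/2},
  \qquad \omega_{i}\ge 0 .
\end{align*}
```

The same splitting underlies the bound on $L_{22}$, with the weight $b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}$ in place of $b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})$.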

For $L_{22}$, we apply Assumption 2(3) that $|b''(t_{1})-b''(t)|\leq B|t_{1}-t|b''(t)$ with $t_{1}=\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i}$ and $t=(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}$. Then we apply Hölder's inequality and have

\begin{align*}
L_{22} &= \frac{1}{n}\sum_{i=1}^{n}[b''(\hat{\bm{\eta}}^{\mathrm{T}}\dot{\bm{Z}}_{i})-b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}](\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i} \tag{A63} \\
&\leq \frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i} \\
&\leq \left[\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{(\hat{\bm{\eta}}-\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}^{2}\right]^{1/2}\left[\frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}\{(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}\}^{2}D_{i}^{2}\right]^{1/2} \\
&\lesssim \left\{s_{\eta}\left(\frac{\log p}{n}+\frac{\log n}{p}\right)\right\}^{1/2}\left(M+\frac{1}{\sqrt{n}}+\sqrt{\frac{\log n}{p}}\right)^{3/2},
\end{align*}

where the last inequality in (A63) follows from the result (A18) in Lemma 9 and from arguments similar to those used in bounding $L_{11}$.

For $L_{23}$, we follow arguments similar to those used in bounding $L_{11}$ and get

\begin{align*}
L_{23} &= \frac{1}{n}\sum_{i=1}^{n}b''\{(\bm{\eta}^{*})^{\mathrm{T}}\dot{\bm{Z}}_{i}\}(\bm{w}^{*})^{\mathrm{T}}\dot{\bm{M}}_{i}D_{i}-E[b''\{(\bm{\eta}^{*})^{\mathrm{T}}\bm{Z}_{i}\}(\bm{w}^{*})^{\mathrm{T}}\bm{M}_{i}D_{i}] \tag{A64} \\
&\lesssim \frac{1}{n}+\frac{\log n}{p}.
\end{align*}

Under the condition that $n,p\rightarrow\infty$ and $(s_{w}\vee s_{\eta})(n^{-1/2}\log p+p^{-1}n^{1/2}\log n)=o_{p}(1)$, together with (A62), (A63) and (A64), we get

\[
L_{2}=L_{21}+L_{22}+L_{23}=o_{p}(1). \tag{A65}
\]

Combining (A61) and (A65), we have

\[
\hat{I}_{\theta\mid\bm{\zeta}}-I_{\theta\mid\bm{\zeta}}^{*}=o_{p}(1).
\]

This completes the proof of Lemma 13.


References

  • Ahn and Horenstein (2013) Seung C. Ahn and Alex R. Horenstein. Eigenvalue ratio test for the number of factors. Econometrica, 81(3):1203–1227, 2013.
  • Anderson and Amemiya (1988) T. W. Anderson and Yasuo Amemiya. The Asymptotic Normal Distribution of Estimators in Factor Analysis under General Conditions. The Annals of Statistics, 16(2):759–771, 1988.
  • Ather and Poynter (2018) Jennifer L. Ather and Matthew E. Poynter. Serum Amyloid A3 is required for normal weight and immunometabolic function in mice. PLOS ONE, 13(2):1–14, 2018.
  • Bach (2010) Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
  • Bai (2003) Jushan Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171, 2003.
  • Bai and Li (2012) Jushan Bai and Kunpeng Li. Statistical analysis of factor models of high dimension. The Annals of Statistics, 40(1):436–465, 2012.
  • Bai and Li (2016) Jushan Bai and Kunpeng Li. Maximum likelihood estimation and inference for approximate factor models of high dimension. Review of Economics and Statistics, 98(2):298–309, 2016.
  • Bai and Ng (2002) Jushan Bai and Serena Ng. Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221, 2002.
  • Bai and Ng (2006) Jushan Bai and Serena Ng. Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica, 74(4):1133–1150, 2006.
  • Belloni et al. (2012) Alexandre Belloni, Daniel Chen, Victor Chernozhukov, and Christian Hansen. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429, 2012.
  • Belloni et al. (2014) Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608–650, 2014.
  • Belloni et al. (2016) Alexandre Belloni, Victor Chernozhukov, and Ying Wei. Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics, 34(4):606–619, 2016.
  • Brown (2015) Timothy A Brown. Confirmatory factor analysis for applied research. Guilford publications, 2015.
  • Bühlmann et al. (2014) Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014.
  • Burgess et al. (2017) Stephen Burgess, Dylan S Small, and Simon G Thompson. A review of instrumental variable estimators for Mendelian randomization. Statistical Methods in Medical Research, 26(5):2333–2355, 2017.
  • Cai et al. (2023) T. Tony Cai, Zijian Guo, and Rong Ma. Statistical inference for high-dimensional generalized linear models with binary outcomes. Journal of the American Statistical Association, 118(542):1319–1332, 2023.
  • Cattell (1966) Raymond B Cattell. The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276, 1966.
  • Ćevid et al. (2020) Domagoj Ćevid, Peter Bühlmann, and Nicolai Meinshausen. Spectral deconfounding via perturbed sparse linear models. Journal of Machine Learning Research, 21(232):1–41, 2020.
  • Chen and Li (2021) Yunxiao Chen and Xiaoou Li. Determining the number of factors in high-dimensional generalized latent factor models. Biometrika, 109(3):769–782, 2021.
  • Chen et al. (2020) Yunxiao Chen, Xiaoou Li, and Siliang Zhang. Structured latent factor analysis for large-scale data: Identifiability, estimability, and their implications. Journal of the American Statistical Association, 115(532):1756–1770, 2020.
  • Chernozhukov et al. (2018) Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
  • Costello and Osborne (2005) Anna B Costello and Jason Osborne. Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical assessment, research, and evaluation, 10(1):7, 2005.
  • Dinno (2009) Alexis Dinno. Exploring the sensitivity of Horn’s parallel analysis to the distributional form of random data. Multivariate Behavioral Research, 44(3):362–388, 2009.
  • Dobriban (2020) Edgar Dobriban. Permutation methods for factor analysis and PCA. The Annals of Statistics, 48(5):2824–2847, 2020.
  • Fan et al. (2013) Jianqing Fan, Yuan Liao, and Martina Mincheva. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society. Series B, Statistical methodology, 75(4), 2013.
  • Fan et al. (2014) Jianqing Fan, Fang Han, and Han Liu. Challenges of Big Data analysis. National Science Review, 1(2):293–314, 2014.
  • Fan et al. (2020) Jianqing Fan, Yuan Ke, and Kaizheng Wang. Factor-adjusted regularized model selection. Journal of Econometrics, 216(1):71–85, 2020.
  • Fan et al. (2023) Jianqing Fan, Zhipeng Lou, and Mengxin Yu. Are latent factor regression and sparse regression adequate? Journal of the American Statistical Association, in press, 2023.
  • Fewell et al. (2007) Zoe Fewell, George Davey Smith, and Jonathan A. C. Sterne. The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology, 166(6):646–655, 2007.
  • Gagnon-Bartsch and Speed (2011) Johann Gagnon-Bartsch and Terence Speed. Using control genes to correct for unwanted variation in microarray data. Biostatistics (Oxford, England), 13:539–552, 2011.
  • Guo et al. (2018) Zijian Guo, Hyunseung Kang, T. Tony Cai, and Dylan S. Small. Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal Statistical Society Series B, 80(4):793–815, 2018.
  • Guo et al. (2022) Zijian Guo, Domagoj Ćevid, and Peter Bühlmann. Doubly debiased lasso: High-dimensional inference under hidden confounding. The Annals of Statistics, 50(3):1320–1347, 2022.
  • Hayton et al. (2004) James C Hayton, David G Allen, and Vida Scarpello. Factor retention decisions in exploratory factor analysis: A tutorial on parallel analysis. Organizational Research Methods, 7(2):191–205, 2004.
  • Horn (1965) John L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185, 1965.
  • Jang et al. (2018) Ji-Su Jang, Jun-Ho Lee, Nam-Chul Jung, So-Yeon Choi, Soo-Yeoun Park, Ji-Young Yoo, Jie-Young Song, Han Geuk Seo, Hyun Soo Lee, and Dae-Seog Lim. RSAD2 is necessary for mouse dendritic cell maturation via the IRF7-mediated signaling pathway. Cell Death & Disease, 9(8):823, 2018.
  • Jankova and van de Geer (2018) Jana Jankova and Sara van de Geer. Semiparametric efficiency bounds for high-dimensional models. The Annals of Statistics, 46(5):2336–2359, 2018.
  • Javanmard and Montanari (2014) Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(1):2869–2909, 2014.
  • Kang et al. (2016) Hyunseung Kang, Anru Zhang, T. Tony Cai, and Dylan S. Small. Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American Statistical Association, 111(513):132–144, 2016.
  • Kneip and Sarda (2011) Alois Kneip and Pascal Sarda. Factor models and variable selection in high-dimensional regression analysis. The Annals of Statistics, 39(5):2410–2447, 2011.
  • Lam and Yao (2012) Clifford Lam and Qiwei Yao. Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics, 40(2):694–726, 2012.
  • Lazar et al. (2013) Cosmin Lazar, Stijn Meganck, Jonatan Taminau, David Steenhoff, Alain Coletta, Colin Molter, David Y Weiss-Solís, Robin Duque, Hugues Bersini, and Ann Nowé. Batch effect removal methods for microarray gene expression data integration: a survey. Briefings in Bioinformatics, 14(4):469–490, 2013.
  • Leek and Storey (2007) Jeffrey T Leek and John D Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.
  • Leek and Storey (2008) Jeffrey T Leek and John D Storey. A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105(48):18718–18723, 2008.
  • Lin and Ying (1994) Dan Yu Lin and Zhiliang Ying. Semiparametric analysis of the additive risk model. Biometrika, 81(1):61–71, 1994.
  • Lin and Lv (2013) Wei Lin and Jinchi Lv. High-dimensional sparse additive hazards regression. Journal of the American Statistical Association, 108(501):247–264, 2013.
  • Lin et al. (2016) Zhixiang Lin, Can Yang, Ying Zhu, John Duchi, Yao Fu, Yong Wang, Bai Jiang, Mahdi Zamanighomi, Xuming Xu, Mingfeng Li, et al. Simultaneous dimension reduction and adjustment for confounding variation. Proceedings of the National Academy of Sciences, 113(51):14662–14667, 2016.
  • Listgarten et al. (2010) Jennifer Listgarten, Carl Kadie, Eric E. Schadt, and David Heckerman. Correction for hidden confounders in the genetic analysis of gene expression. Proceedings of the National Academy of Sciences, 107(38):16465–16470, 2010.
  • Liu et al. (2023) Wei Liu, Huazhen Lin, Shurong Zheng, and Jin Liu. Generalized factor model for ultra-high dimensional correlated variables with mixed types. Journal of the American Statistical Association, 118(542):1385–1401, 2023.
  • Loh and Wainwright (2015) Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19):559–616, 2015.
  • Ma et al. (2021) Rong Ma, T. Tony Cai, and Hongzhe Li. Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, 116(534):984–998, 2021.
  • Müller-Berghaus et al. (2004) J. Müller-Berghaus, K. Kern, A. Paschen, X D Nguyen, H. Klüter, G. Morahan, and D. Schadendorf. Deficient IL-12p70 secretion by dendritic cells based on IL12B promoter genotype. Genes & Immunity, 5(5):431–434, 2004.
  • Nemetz et al. (1999) A. Nemetz, M. Pilar Nosti-Escanilla, Tamás Molnár, Adorján Köpe, Ágota Kovács, János Fehér, Zsolt Tulassay, Ferenc Nagy, M. Asunción García-González, and A. S. Peña. IL1B gene polymorphisms influence the course and severity of inflammatory bowel disease. Immunogenetics, 49(6):527–531, 1999.
  • Ning and Liu (2017) Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158–195, 2017.
  • Owen and Wang (2016) Art B. Owen and Jingshu Wang. Bi-cross-validation for factor analysis. Statistical Science, 31(1):119–139, 2016.
  • Paul et al. (2008) Debashis Paul, Eric Bair, Trevor Hastie, and Robert Tibshirani. “Preconditioning” for feature selection and regression in high-dimensional problems. The Annals of Statistics, 36(4):1595–1618, 2008.
  • Peng et al. (2010) Jie Peng, Ji Zhu, Anna Bergamaschi, Wonshik Han, Dong-Young Noh, Jonathan R. Pollack, and Pei Wang. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics, 4(1):53–77, 2010.
  • Peres-Neto et al. (2005) Pedro R. Peres-Neto, Donald A. Jackson, and Keith M. Somers. How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Computational Statistics & Data Analysis, 49(4):974–997, 2005.
  • Price et al. (2006) Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.
  • Quadeer et al. (2014) Ahmed A Quadeer, Raymond HY Louie, Karthik Shekhar, Arup K Chakraborty, I-Ming Hsing, and Matthew R McKay. Statistical linkage analysis of substitutions in patient-derived sequences of genotype 1a hepatitis C virus nonstructural protein 3 exposes targets for immunogen design. Journal of Virology, 88(13):7628–7644, 2014.
  • Ren et al. (2015) Zhao Ren, Tingni Sun, Cun-Hui Zhang, and Harrison H. Zhou. Asymptotic normality and optimalities in estimation of large Gaussian graphical models. The Annals of Statistics, 43(3):991–1026, 2015.
  • Shalek et al. (2014) Alex K. Shalek, Rahul Satija, Joe Shuga, John J. Trombetta, Dave Gennert, Diana Lu, Peilin Chen, Rona S. Gertner, Jellert T. Gaublomme, Nir Yosef, Schraga Schwartz, Brian Fowler, Suzanne Weaver, Jing Wang, Xiaohui Wang, Ruihua Ding, Raktima Raychowdhury, Nir Friedman, Nir Hacohen, Hongkun Park, Andrew P. May, and Aviv Regev. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature, 510(7505):363–369, 2014.
  • Shi et al. (2021) Chengchun Shi, Rui Song, Wenbin Lu, and Runze Li. Statistical inference for high-dimensional models via recursive online-score estimation. Journal of the American Statistical Association, 116(535):1307–1318, 2021.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
  • Toyoshima et al. (2019) Yujiro Toyoshima, Hidemitsu Kitamura, Huihui Xiang, Yosuke Ohno, Shigenori Homma, Hideki Kawamura, Norihiko Takahashi, Toshiya Kamiyama, Mishie Tanino, and Akinobu Taketomi. IL6 modulates the immune status of the tumor microenvironment to facilitate metastatic colonization of colorectal cancer cells. Cancer Immunology Research, 7(12):1944–1957, 2019.
  • van de Geer et al. (2014) Sara van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
  • Wainwright (2019) Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
  • Wang (2022) Fa Wang. Maximum likelihood estimation and inference for high dimensional generalized factor models with application to factor-augmented regressions. Journal of Econometrics, 229(1):180–200, 2022.
  • Wang and Fan (2017) Weichen Wang and Jianqing Fan. Asymptotics of empirical eigenstructure for high dimensional spiked covariance. The Annals of Statistics, 45(3):1342, 2017.
  • Wang and Blei (2019) Yixin Wang and David M. Blei. The blessings of multiple causes. Journal of the American Statistical Association, 114(528):1574–1596, 2019.
  • Windmeijer et al. (2019) Frank Windmeijer, Helmut Farbmacher, Neil Davies, and George Davey Smith. On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association, 114(527):1339–1350, 2019.
  • Zhang and Zhang (2014) Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 76(1):217–242, 2014.
  • Zhang and Cheng (2017) Xianyang Zhang and Guang Cheng. Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768, 2017.
  • Zhu et al. (2020) Yunzhang Zhu, Xiaotong Shen, and Wei Pan. On high-dimensional constrained maximum likelihood inference. Journal of the American Statistical Association, 115(529):217–230, 2020.
  • Zwick and Velicer (1986) William R. Zwick and Wayne F. Velicer. Comparison of five rules for determining the number of components to retain. Psychological Bulletin, 99(3):432–442, 1986.