
High-dimensional Simultaneous Inference on Non-Gaussian VAR Model via De-biased Estimator

Linbo Liu and Danna Zhang
Department of Mathematics, UC San Diego
(October 26, 2021)
Abstract

Simultaneous inference for high-dimensional non-Gaussian time series is a challenging problem. Such tasks require not only robust estimation of the coefficients of the random process, but also the derivation of the limiting distribution for sums of dependent random variables. In this paper, we propose a multiplier bootstrap procedure to conduct simultaneous inference for the transition coefficients in high-dimensional non-Gaussian vector autoregressive (VAR) models. This bootstrap-assisted procedure allows the dimension of the time series to grow exponentially fast in the number of observations. As a test statistic, a de-biased estimator is constructed for simultaneous inference. Unlike the traditional de-biased/de-sparsifying Lasso estimator, a robust convex loss function and a normalizing weight function are exploited to avoid any unfavorable behavior at the tail of the distribution. We develop Gaussian approximation theory for the VAR model to derive the asymptotic distribution of the de-biased estimator and propose a multiplier bootstrap-assisted procedure to obtain critical values under very mild moment conditions on the innovations. As an important tool in the convergence analysis of various estimators, we establish a Bernstein-type concentration inequality for bounded functionals of VAR processes. Numerical experiments verify the validity and efficiency of the proposed method.

1 Introduction

High-dimensional statistics has become increasingly important due to the rapid development of information technology in the past decade. In this paper, we are primarily interested in conducting simultaneous inference, via a de-biased M-estimator, on the transition matrices in a high-dimensional vector autoregressive model with non-Gaussian innovations. An extensive body of work exists on estimation and inference for the coefficient vector in the linear regression setting, and we refer readers to Bühlmann and Van De Geer, (2011) for an overview of recent developments in high-dimensional statistical techniques. The M-estimator is one of the most popular tools among them, having proved successful in signal estimation (Negahban et al., (2012)), support recovery (Loh and Wainwright, (2017)), variable selection (Zou, (2006)) and robust estimation with heavy-tailed noise using nonconvex loss functions (Loh, (2017)). As a penalized M-estimator, the Lasso (Tibshirani, (1996)) also plays an important role in estimating transition coefficients in high-dimensional VAR models beyond linear regression; see for example Hsu et al., (2008), Nardi and Rinaldo, (2011), Basu and Michailidis, (2015) among others. Another line of work achieves such estimation tasks via the Dantzig selector (Candes and Tao, (2007)). Han et al., (2015) proposed a new approach to estimating the transition matrix via a Dantzig-type estimator, solving a linear programming problem. They remarked that this estimation procedure enjoys many advantages, including computational efficiency and weaker assumptions on the transition matrix. However, the aforementioned literature mainly discusses the scenario where Gaussian or sub-Gaussian noise is present.

To deal with heavy-tailed errors, regularized robust methods have been widely studied. For instance, Li and Zhu, (2008) proposed an $\ell_{1}$-regularized quantile regression method in the low-dimensional setting and devised an algorithm to efficiently solve the proposed optimization problem. Wu and Liu, (2009) studied penalized quantile regression from the perspective of variable selection. However, quantile regression and least absolute deviation regression can differ significantly from the mean regression function, especially when the distribution of the noise is asymmetric. To overcome this issue, Fan et al., (2017) developed the robust approximation Lasso (RA-Lasso) estimator based on a penalized Huber loss and proved the feasibility of RA-Lasso for estimating the high-dimensional mean regression. Beyond the linear regression setting, Zhang, (2019) also used the Huber loss to obtain consistent estimates of the mean vector and covariance matrix for high-dimensional time series. Robust estimation of the transition coefficients was studied in Liu and Zhang, (2021) via two types of approaches: Lasso-based and Dantzig-based estimators.

Besides estimation, recent research effort has also turned to high-dimensional statistical inference, such as performing multiple hypothesis testing and constructing simultaneous confidence intervals, both for regression coefficients and for mean vectors of random processes. To tackle the high dimensionality, the idea of low-dimensional projection has been exploited in much of the popular literature. For instance, Javanmard and Montanari, (2014), Van de Geer et al., (2014) and Zhang and Zhang, (2014) constructed the de-sparsifying Lasso by inverting the Karush-Kuhn-Tucker (KKT) condition and derived the asymptotic distribution for the projection of high-dimensional parameters onto a fixed-dimensional space. As an extension of these techniques, Loh, (2018) developed the asymptotic theory of the one-step estimator, allowing the presence of non-Gaussian noise. Employing Gaussian approximation theory (Chernozhukov et al., (2013)), Zhang and Cheng, (2017) proposed a bootstrap-assisted procedure to conduct simultaneous statistical inference, which, as a significant improvement, allows the number of tests to greatly surpass the number of observations. Although a huge body of work exists on inference for regression coefficients, there has been limited research on generalizing these theoretical properties to time series, perhaps due to the technical difficulty of extending Gaussian approximation results to dependent random variables. Zhang and Wu, (2017) adopted the framework of functional dependence measures (Wu, (2005)) to account for temporal dependence and provided Gaussian approximation results for general time series. As an application, they also showed how to construct simultaneous confidence intervals for mean vectors of high-dimensional random processes with asymptotically correct coverage probabilities.

In this paper, we consider simultaneous inference of the transition coefficients in possibly non-Gaussian vector autoregressive (VAR) models with lag $d$:

X_{i}=A_{1}X_{i-1}+A_{2}X_{i-2}+\dots+A_{d}X_{i-d}+\varepsilon_{i},\quad i=1,\dots,n,

where $X_{i}\in\mathbb{R}^{p}$ is the time series, $A_{i}\in\mathbb{R}^{p\times p}$, $i=1,\dots,d$, are the transition matrices, and $\varepsilon_{i}\in\mathbb{R}^{p}$ are the innovation vectors. We allow the dimension $p$ to exceed the number of observations $n$, and even $\log p=o(n^{b})$ for some $b>0$, as is commonly assumed in the high-dimensional regime. Unlike much other work, we do not impose Gaussianity or sub-Gaussianity assumptions on the noise terms $\varepsilon_{i}$.

We are particularly interested in the following simultaneous testing problem:

H_{0}:A_{i}=A_{i}^{0},\quad\text{for all }i=1,\dots,d

versus the alternative hypothesis

H_{1}:A_{i}\neq A_{i}^{0},\quad\text{for some }i=1,\dots,d.

It is worth mentioning that the above problem still involves $p^{2}$ null hypotheses to verify even when the lag $d=1$. We propose to build a de-biased estimator $\check{\beta}$ from some consistent pilot estimator $\widehat{\beta}$ (for example, the one provided in Liu and Zhang, (2021)). There are a few challenges in proving the feasibility of the de-biased estimator as well as its theoretical guarantees: (i) VAR models display temporal dependence across observations, which makes the majority of probabilistic tools, such as the classic Bernstein inequality and Gaussian approximation, inapplicable. (ii) Fat-tailed innovations $\varepsilon_{i}$ imply fat-tailed $X_{i}$ in the VAR model, whereas robust methods for linear regression can assume heavy-tailed $\varepsilon_{i}$ while $X_{i}$ remains sub-Gaussian (Fan et al., (2017) and Zhang and Cheng, (2017)). (iii) We want our simultaneous inference procedure to work in the ultra-high dimensional regime, where $p$ can grow exponentially fast in $n$. These challenges inspire us to establish a new Bernstein-type inequality (section 3) and Gaussian approximation results (section 4) under the framework of the VAR model. Also, we adopt the definition of the spectral decay index to capture the dependence among time series data, as in Liu and Zhang, (2021).

The paper is organized as follows. In section 2, we first present more details and some preparatory definitions for VAR models and propose the test statistic for simultaneous inference via the de-biased estimator, which is constructed through a robust loss function and a weight function on $X_{i}$. The main result, delivering critical values for this test statistic by multiplier bootstrap, is given in section 2.4. In section 3, we complete the required estimation results by establishing a Bernstein inequality. A thorough discussion of Gaussian approximation and its derivation under the VAR model is presented in section 4. Numerical experiments are conducted in section 5 to assess the empirical performance of the multiplier bootstrap procedure.

Finally, we introduce some notation. For a vector $\beta=(\beta_{1},\dots,\beta_{p})^{\top}$, let $|\beta|_{1}=\sum_{i}|\beta_{i}|$, $|\beta|_{2}=(\sum_{i}\beta_{i}^{2})^{1/2}$ and $|\beta|_{\infty}=\max_{i}|\beta_{i}|$ be its $\ell_{1},\ell_{2},\ell_{\infty}$ norms respectively. For a matrix $A=(a_{ij})_{1\leq i,j\leq p}$, let $\lambda_{i}$, $i=1,\dots,p$, be its eigenvalues and $\lambda_{\max}(A),\lambda_{\min}(A)$ be its maximum and minimum eigenvalues respectively. Also let $\rho(A)=\max_{i}|\lambda_{i}|$ be the spectral radius. Denote $\|A\|_{1}=\max_{j}\sum_{i}|a_{ij}|$, $\|A\|_{\infty}=\max_{i}\sum_{j}|a_{ij}|$, and the spectral norm $\|A\|=\|A\|_{2}=\sup_{|x|_{2}\neq 0}|Ax|_{2}/|x|_{2}$. Moreover, let $\|A\|_{\max}=\max_{i,j}|a_{ij}|$ be the entry-wise maximum norm. For a random variable $X$ and $q>0$, define $\|X\|_{q}=(\mathbb{E}|X|^{q})^{1/q}$. For two real numbers $x,y$, set $x\lor y=\max(x,y)$. For two sequences of positive numbers $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}\lesssim b_{n}$ if there exists some constant $C>0$ such that $a_{n}/b_{n}\leq C$ as $n\to\infty$, and write $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$. We use $c_{0},c_{1},\dots$ and $C_{0},C_{1},\dots$ to denote universal positive constants whose values may vary in different contexts. Throughout the paper, we consider the high-dimensional regime, allowing the dimension $p$ to grow with the sample size $n$; that is, we assume $p=p_{n}\to\infty$ as $n\to\infty$.

2 Main Results

2.1 Vector autoregressive model

Consider a VAR(d) model:

X_{i}=A_{1}X_{i-1}+A_{2}X_{i-2}+\dots+A_{d}X_{i-d}+\varepsilon_{i},\quad i=1,\dots,n, \qquad (2.1)

where $X_{i}=(X_{i1},X_{i2},\dots,X_{ip})^{\top}\in\mathbb{R}^{p}$ is the random process of interest, $A_{i}\in\mathbb{R}^{p\times p}$, $i=1,\dots,d$, are the transition matrices and $\varepsilon_{i}$, $i\in\mathbb{Z}$, are i.i.d. innovation vectors with zero mean and symmetric distribution, i.e. $\varepsilon_{i}=-\varepsilon_{i}$ in distribution for all $i\in\mathbb{Z}$. By a rearrangement of variables, VAR($d$) models can be reformulated as VAR(1) models (see Liu and Zhang, (2021)). Therefore, without loss of generality, we shall work with VAR(1) models:

X_{i}=AX_{i-1}+\varepsilon_{i},\quad i=1,\dots,n. \qquad (2.2)

This type of random process has a wide range of applications, including financial development (Shan, (2005)), economics (Juselius, (2006)) and exchange rate dynamics (Wu and Zhou, (2010)).

To ensure stationarity, we assume throughout the paper that the spectral radius satisfies $\rho(A)<1$, which is the necessary and sufficient condition for a VAR(1) model to be stationary. However, the more restrictive condition $\|A\|<1$ is assumed in much of the earlier work; see for example Han et al., (2015), Loh and Wainwright, (2012) and Negahban and Wainwright, (2011). For a non-symmetric matrix $A$, it can happen that $\|A\|\geq 1$ while $\rho(A)<1$. To fill the gap between $\rho(A)$ and $\|A\|$, Basu and Michailidis, (2015) proposed stability measures for high-dimensional time series to capture temporal and cross-sectional dependence via the spectral density function

f_{X}(\theta)=\frac{1}{2\pi}\sum_{\ell=-\infty}^{\infty}\Gamma_{X}(\ell)\mathrm{e}^{-i\ell\theta},\quad\theta\in[-\pi,\pi],

where $\Gamma_{X}(\ell)=\mathrm{Cov}(X_{i},X_{i+\ell})$, $i,\ell\in\mathbb{Z}$, is the autocovariance function of the process $\{X_{i}\}$. In more recent work, Liu and Zhang, (2021) defined the spectral decay index to connect $\rho(A)$ with $\|A\|$ from a different point of view. In this paper, we adopt the framework of the spectral decay index in Liu and Zhang, (2021).

Definition 2.1.

For any matrix $A\in\mathbb{R}^{p\times p}$ such that $\rho(A)<1$, define the spectral decay index as

\tau=\min\{t\in\mathbb{Z}^{+}:\|A^{t}\|_{\infty}<\rho\} \qquad (2.3)

for some constant $0<\rho<1$.

Remark 2.2.

Note that in (2.3) we use the $L_{\infty}$ norm, while the spectral norm is used in Liu and Zhang, (2021). However, the spectral decay index shares many properties even when defined via different matrix norms. Some of them are summarized as follows. For any matrix $A$ with $\rho(A)<1$, a finite spectral decay index $\tau$ exists. In general, $\tau$ may not be of constant order as the dimension $p$ increases; technically, we should write $\tau=\tau_{p}$ to capture the dependence on $p$. However, in the rest of the paper we simply write $\tau$ for ease of notation. For more analysis of the spectral decay index, see section 2 of Liu and Zhang, (2021).
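To make the definition concrete, the following minimal sketch computes $\tau$ numerically, assuming $\rho(A)<1$ so that a finite $\tau$ exists; the function name and the iteration cap are hypothetical.

```python
import numpy as np

def spectral_decay_index(A, rho=0.9, t_max=10_000):
    """Smallest t >= 1 with ||A^t||_inf < rho, as in Definition 2.1."""
    At = np.eye(A.shape[0])
    for t in range(1, t_max + 1):
        At = At @ A                              # A^t
        if np.abs(At).sum(axis=1).max() < rho:   # ||A^t||_inf = max absolute row sum
            return t
    raise ValueError("tau exceeds t_max; check that rho(A) < 1")
```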

Next, we are interested in building estimators of $A$ for which we can establish an asymptotic distribution theory. This allows one to conduct statistical inference, such as constructing simultaneous confidence intervals. There has been some work on robust estimation alone. Liu and Zhang, (2021) provide both a Lasso-type estimator and a Dantzig-type estimator that consistently estimate the transition coefficient matrix $A$ given $\{X_{i}\}$, under very mild moment conditions on $X_{i}$ and $\varepsilon_{i}$. However, both the Lasso-type and Dantzig-type estimators are biased for the transition matrix and are thus insufficient for tasks like statistical inference. Therefore, one needs a more refined method to establish asymptotic distributional results. In the following sections, we construct a de-biased estimator based on an existing one and derive its limiting distribution.

Note that, unlike much other existing work (Han et al., (2015), Basu and Michailidis, (2015), etc.), we do not assume $\varepsilon_{i}$ to be Gaussian or sub-Gaussian. Instead, the innovations $\varepsilon_{i}$ may have only some finite moments, which renders the standard techniques for estimation and inference invalid.

2.2 De-biased estimator

In this section, we construct a de-biased estimator using the techniques introduced in Bickel, (1975). To fix ideas, let $a_{j}^{\top}$ be the $j$-th row of $A$ and $\beta^{*}=\mathrm{Vec}(A)=(a_{1}^{\top},a_{2}^{\top},\dots,a_{p}^{\top})^{\top}\in\mathbb{R}^{p^{2}}$. Suppose we are given a consistent, possibly biased, estimator $\widehat{\beta}$ of $\beta^{*}$, i.e. $|\widehat{\beta}-\beta^{*}|=o(1)$ (for example, the Lasso-type or Dantzig-type estimators in Liu and Zhang, (2021)). Define a loss function $L_{n}:\mathbb{R}^{p^{2}}\to\mathbb{R}$ as

L_{n}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{p}\ell(X_{ik}-X_{i-1}^{\top}\beta_{k})w(X_{i-1}), \qquad (2.4)

where $\beta=(\beta_{1}^{\top},\dots,\beta_{p}^{\top})^{\top}$ with $\beta_{k}\in\mathbb{R}^{p}$ for $1\leq k\leq p$, the weight function is

w(x)=\min\Big\{1,\frac{T^{3}}{|x|_{\infty}^{3}}\Big\}

for some threshold $T>0$ to be determined later, and the robust loss function $\ell(x)$ satisfies:

  • (i)

    $\ell(x)$ is a thrice-differentiable, convex, even function.

  • (ii)

    For some constant $C>0$, $|\ell^{\prime}|,|\ell^{\prime\prime}|,|\ell^{(3)}|\leq C$.

We give two examples of such loss functions from Pan et al., (2021) that satisfy the above conditions.

Example 2.3 (Smoothed Huber loss I).
\ell(x)=\begin{cases}x^{2}/2-|x|^{3}/6&\text{if }|x|\leq 1,\\ |x|/2-1/6&\text{if }|x|>1.\end{cases}
Example 2.4 (Smoothed Huber loss II).
\ell(x)=\begin{cases}x^{2}/2-x^{4}/24&\text{if }|x|\leq\sqrt{2},\\ (2\sqrt{2}/3)|x|-1/2&\text{if }|x|>\sqrt{2}.\end{cases}

Direct calculation shows that both losses are everywhere twice differentiable and almost everywhere thrice differentiable. Also, the derivatives of the first three orders are bounded in magnitude. We mention that generalizations to other loss functions that do not satisfy the differentiability conditions (for example, the Huber loss) may be derived with more refined arguments, but are omitted in this paper.
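To make these objects concrete, the following is a minimal sketch of the smoothed Huber loss I of Example 2.3, its derivative $\psi=\ell^{\prime}$, the second derivative $\psi^{\prime}$, and the weight function $w$; the function names are our own illustrative choices.

```python
import numpy as np

def loss(x):
    """Smoothed Huber loss I (Example 2.3)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, x**2 / 2 - np.abs(x)**3 / 6, np.abs(x) / 2 - 1 / 6)

def psi(x):
    """psi = loss': x - x|x|/2 for |x| <= 1, sign(x)/2 otherwise; bounded by 1/2."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, x - x * np.abs(x) / 2, np.sign(x) / 2)

def psi_prime(x):
    """psi' = 1 - |x| for |x| <= 1 and 0 otherwise; bounded by 1."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, 1 - np.abs(x), 0.0)

def weight(rows, T):
    """w(x) = min(1, T^3 / |x|_inf^3), applied to each row of `rows`."""
    return np.minimum(1.0, T**3 / np.abs(rows).max(axis=1)**3)
```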

Denote by $\psi(x)=\ell^{\prime}(x)$ the derivative of $\ell(x)$; then $\psi(x)$ is twice differentiable by condition (i) and $|\psi(x)|\leq C$ for all $x\in\mathbb{R}$ by condition (ii). Let $\mu=(\mu_{1},\dots,\mu_{p})^{\top}\in\mathbb{R}^{p}$ with $\mu_{k}=\mathbb{E}[\psi^{\prime}(\varepsilon_{ik})]$ and $\mu^{-1}=(\mu_{1}^{-1},\dots,\mu_{p}^{-1})^{\top}$. Let $\widehat{\mu}=(\widehat{\mu}_{1},\dots,\widehat{\mu}_{p})^{\top}$ be the estimate of $\mu$ with $\widehat{\mu}_{k}=\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\widehat{\varepsilon}_{ik})$, where $\widehat{\varepsilon}_{ik}=X_{ik}-X_{i-1}^{\top}\widehat{\beta}_{k}$. Let $\Sigma_{x}=\mathbb{E}[X_{i}X_{i}^{\top}w(X_{i})]\in\mathbb{R}^{p\times p}$ be the weighted covariance matrix and $\Omega_{x}=\Sigma_{x}^{-1}\in\mathbb{R}^{p\times p}$ the weighted precision matrix. Denote by $\widehat{\Sigma}_{x}=n^{-1}\sum_{i=1}^{n}X_{i-1}X_{i-1}^{\top}w(X_{i-1})$ the weighted sample covariance. Furthermore, suppose that $\widehat{\Omega}_{x}$ is a suitable approximation of the weighted precision matrix $\Omega_{x}$ (e.g., the CLIME estimator introduced by Cai et al., (2011)), as will be discussed in section 2.3. To ensure the validity of such an estimator, row sparsity of $\Omega_{x}$ is assumed throughout, owing to the high dimensionality. Now we introduce a few more notations:

\Sigma=\mathrm{diag}(\mu)\otimes\Sigma_{x}=\begin{bmatrix}\mu_{1}\Sigma_{x}&0&\dots&0\\ 0&\mu_{2}\Sigma_{x}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\mu_{p}\Sigma_{x}\end{bmatrix}\in\mathbb{R}^{p^{2}\times p^{2}}, \qquad (2.5)

and analogously,

\Omega=\Sigma^{-1}=\mathrm{diag}(\mu^{-1})\otimes\Omega_{x};\qquad\widehat{\Sigma}=\mathrm{diag}(\widehat{\mu})\otimes\widehat{\Sigma}_{x},\quad\widehat{\Omega}=\mathrm{diag}(\widehat{\mu}^{-1})\otimes\widehat{\Omega}_{x}. \qquad (2.6)

Following the one-step estimator in Bickel, (1975), we de-bias $\widehat{\beta}$ by adding a term involving the gradient of the loss function $L_{n}$:

\check{\beta}=\widehat{\beta}+\widehat{\Omega}\,\nabla L_{n}(\widehat{\beta}). \qquad (2.7)

To briefly explain the presence of $\widehat{\Omega}$, consider the Taylor expansion of $\nabla L_{n}(\widehat{\beta})$ around $\nabla L_{n}(\beta^{*})$. Write

\begin{aligned}\sqrt{n}(\check{\beta}-\beta^{*})&=\sqrt{n}(\widehat{\beta}-\beta^{*})+\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})-\sqrt{n}\,\widehat{\Omega}\big(\nabla L_{n}(\widehat{\beta})-\nabla L_{n}(\beta^{*})\big)\\ &=\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})+\sqrt{n}\big[(\widehat{\beta}-\beta^{*})-\widehat{\Omega}\,\nabla^{2}L_{n}(\beta^{*})(\widehat{\beta}-\beta^{*})+R\big]\\ &=\underbrace{\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})}_{A}+\underbrace{\sqrt{n}\big(I_{p^{2}}-\widehat{\Omega}\,\nabla^{2}L_{n}(\beta^{*})\big)(\widehat{\beta}-\beta^{*})}_{\Delta}+\sqrt{n}R,\end{aligned} \qquad (2.8)

where the remainder term satisfies $\sqrt{n}R=o(1)$ under certain conditions. Moreover, we also need $\Delta$ to be negligible. As will be shown in the following sections,

|\Delta|_{\infty}\leq\sqrt{n}\Big(\|\Omega-\widehat{\Omega}\|_{1}\|\Sigma\|_{\max}+\|\nabla^{2}L_{n}(\beta^{*})-\Sigma\|_{\max}\|\widehat{\Omega}\|_{1}\Big)|\widehat{\beta}-\beta^{*}|_{1}. \qquad (2.9)

To this end, $\widehat{\Omega}$ needs to be a good approximation of the precision matrix $\Omega$, which motivates the construction of $\widehat{\Omega}$. More rigorous arguments are presented in the subsequent sections.
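The construction (2.7) is straightforward to implement once a pilot estimator and a precision estimate are available. Below is a minimal sketch, assuming the score convention in which the $k$-th block of $\nabla L_{n}(\beta)$ equals $n^{-1}\sum_{i}\psi(X_{ik}-X_{i-1}^{\top}\beta_{k})w(X_{i-1})X_{i-1}$; the function name is hypothetical, `Omega_x_hat` stands for any suitable estimate of $\Omega_{x}$ (e.g., CLIME), and `psi`, `psi_prime`, `weight` are as in the sketch above.

```python
import numpy as np

def debias(X, beta_hat, Omega_x_hat, psi, psi_prime, weight, T):
    """One-step de-biased estimator (2.7), returned as a p x p matrix.

    X:        (n+1) x p array holding X_0, ..., X_n row-wise.
    beta_hat: p x p pilot estimate of A (row k is beta_hat_k).
    """
    n = X.shape[0] - 1
    lagged, current = X[:-1], X[1:]                  # X_{i-1} and X_i for i = 1, ..., n
    w = weight(lagged, T)                            # w(X_{i-1})
    resid = current - lagged @ beta_hat.T            # eps_hat_{ik} = X_ik - X_{i-1}^T beta_hat_k
    mu_hat = psi_prime(resid).mean(axis=0)           # mu_hat_k = n^{-1} sum_i psi'(eps_hat_{ik})
    grad = (psi(resid) * w[:, None]).T @ lagged / n  # row k = k-th p-dimensional gradient block
    # Omega_hat = diag(mu_hat^{-1}) kron Omega_x_hat acts on the gradient block by block.
    return beta_hat + (grad @ Omega_x_hat.T) / mu_hat[:, None]
```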

Note that the estimator $\check{\beta}$ is closely related to the de-sparsifying Lasso estimator (Van de Geer et al., (2014) and Zhang and Zhang, (2014)), which is employed to conduct simultaneous inference for linear regression models in Zhang and Cheng, (2017). Indeed, $\check{\beta}$ reduces to the de-sparsifying Lasso estimator if the loss $\ell(x)$ in (2.4) is the squared error loss and the weight is $w(x)\equiv 1$. Moreover, Loh, (2018) uses this one-step estimator to derive the limiting distribution of a high-dimensional vector restricted to a fixed number of coordinates, delivering a result that agrees with Bickel, (1975) for low-dimensional robust M-estimators. In contrast, we derive such conclusions simultaneously for all $p^{2}$ coordinates of $\beta^{*}$.

In the subsequent sections, we aim at obtaining a limiting distribution for $\check{\beta}$.

2.3 Estimation of the precision matrix

In this section, we mainly discuss the validity of using $\widehat{\Omega}$ as an approximation of $\Omega$. By the structure of $\Omega$, we first need to find a suitable estimator of the weighted precision matrix $\Omega_{x}$.

The estimation of a sparse inverse covariance matrix from a collection of observations $\{X_{i}\}$ plays a crucial role in establishing the asymptotic distribution. In the high-dimensional regime, one cannot obtain a suitable estimator of the precision matrix by simply inverting the sample covariance, as the sample covariance is not invertible when the number of features exceeds the number of observations. Various methodologies have been proposed for estimating the precision matrix, depending on the purpose; see for example the graphical Lasso (Yuan and Lin, (2007) and Friedman et al., (2008)) and nodewise regression (Meinshausen and Bühlmann, (2006)). From a different perspective, Cai et al., (2011) proposed the CLIME approach to sparse precision estimation, which shall be applied in this paper. For completeness, we reproduce the CLIME estimator in the following.

Suppose that the sparsity of each row of $\Omega_{x}$ is at most $s$, i.e., $s=\max_{1\leq i\leq p}|\{j:\Omega_{x,ij}\neq 0\}|$. We first obtain $\widehat{\Theta}$ by solving

\widehat{\Theta}=\operatorname*{arg\,min}_{\Theta}\sum_{i,j}|\Theta_{ij}|\quad\text{subject to: }\|\widehat{\Sigma}_{x}\Theta-I_{p}\|_{\max}\leq\lambda_{n},

for some regularization parameter $\lambda_{n}>0$. Note that the solution $\widehat{\Theta}$ may not be symmetric. To enforce symmetry, the CLIME estimator $\widehat{\Omega}_{x}$ is defined as

\widehat{\Omega}_{x}=(\widehat{\omega}_{ij}),\quad\text{where }\widehat{\omega}_{ij}=\widehat{\omega}_{ji}=\widehat{\Theta}_{ij}\mathbb{I}\{|\widehat{\Theta}_{ij}|\leq|\widehat{\Theta}_{ji}|\}+\widehat{\Theta}_{ji}\mathbb{I}\{|\widehat{\Theta}_{ij}|>|\widehat{\Theta}_{ji}|\}. \qquad (2.10)

For more analysis of the CLIME estimator, see Cai et al., (2011). Next, we present the convergence theorem for the CLIME estimator.
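The CLIME program decouples across the columns of $\widehat{\Theta}$, and each column solves a linear program. A minimal sketch under the standard variable-splitting formulation $\theta=u-v$ with $u,v\geq 0$ is given below; the function name is hypothetical, and this is one convenient solver, not necessarily the implementation used in Cai et al., (2011).

```python
import numpy as np
from scipy.optimize import linprog

def clime(Sigma_hat, lam):
    """CLIME estimate of the precision matrix, symmetrized as in (2.10)."""
    p = Sigma_hat.shape[0]
    Theta = np.zeros((p, p))
    # Constraint |Sigma_hat (u - v) - e_j|_inf <= lam, written as two one-sided inequalities.
    A_ub = np.block([[Sigma_hat, -Sigma_hat], [-Sigma_hat, Sigma_hat]])
    c = np.ones(2 * p)                               # objective: |theta|_1 = sum(u) + sum(v)
    for j in range(p):
        e = np.zeros(p); e[j] = 1.0
        b_ub = np.concatenate([lam + e, lam - e])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        Theta[:, j] = res.x[:p] - res.x[p:]
    # Keep the smaller-magnitude entry of each (i, j)/(j, i) pair, as in (2.10).
    return np.where(np.abs(Theta) <= np.abs(Theta.T), Theta, Theta.T)
```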

Theorem 2.5.

Let $\tau$ be defined as in definition 2.1 and $\gamma=\max_{t=0,1,\dots,\tau-1}\|A^{t}\|$. Choose $\lambda_{n}\asymp\|\Omega_{x}\|_{1}\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}$; then with probability at least $1-4p^{-c_{0}}$ for some constant $c_{0}>0$,

\|\widehat{\Omega}_{x}-\Omega_{x}\|_{\max}\lesssim\|\Omega_{x}\|_{1}\lambda_{n}\quad\text{and}\quad\|\widehat{\Omega}_{x}-\Omega_{x}\|_{1}\lesssim\|\Omega_{x}\|_{1}s\lambda_{n}.
Remark 2.6.

Theorem 2.5 is a direct application of Theorem 6 of Cai et al., (2011). Note that if we assume the eigenvalue condition $0<c\leq\lambda_{\min}(\Sigma_{x})\leq\lambda_{\max}(\Sigma_{x})\leq C$ on $\Sigma_{x}$, then $\|\Omega_{x}\|_{2}\leq 1/\lambda_{\min}(\Sigma_{x})=O(1)$. Therefore, by the sparsity condition on $\Omega_{x}$, each row of $\Omega_{x}$ has at most $s$ nonzero entries, so its $\ell_{1}$ norm is at most $\sqrt{s}$ times its $\ell_{2}$ norm, and we immediately have $\|\Omega_{x}\|_{1}=O(\sqrt{s})$. Suppose the scaling condition $s\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}=o(1)$ holds; then the CLIME estimator $\widehat{\Omega}_{x}$ defined in (2.10) is consistent in estimating the weighted precision matrix of the VAR(1) model (2.2).

The following theorem shows that $\Omega-\widehat{\Omega}$ enjoys the same convergence rates as in the previous theorem.

Theorem 2.7.

Let $\widehat{\Omega}_{x}$ be the CLIME estimator defined above. Assume that $\mu_{k}>c_{1}>0$ for all $1\leq k\leq p$; then with probability at least $1-6p^{-c}$,

\|\Omega-\widehat{\Omega}\|_{\max}\lesssim\|\Omega_{x}\|_{1}\lambda_{n}\quad\text{and}\quad\|\Omega-\widehat{\Omega}\|_{1}\lesssim\|\Omega_{x}\|_{1}s\lambda_{n}.

The above theorem is built upon two facts: $\widehat{\Omega}_{x}$ approximates $\Omega_{x}$ and $\widehat{\mu}$ approximates $\mu$. The result regarding the latter approximation will be given in Lemma 3.7.

2.4 Simultaneous inference

In this section, consider the following hypothesis testing problem:

H_{0}:A_{ij}=A^{0}_{ij},\quad\text{for all }i,j=1,\dots,p\quad(\text{or equivalently, }\beta^{*}_{j}=\beta^{0}_{j},\quad\text{for all }j=1,\dots,p^{2})

versus the alternative hypothesis $H_{1}:A_{ij}\neq A^{0}_{ij}$ for some $i,j=1,\dots,p$. Instead of projecting the explanatory variables onto a subspace of fixed dimension (Javanmard and Montanari, (2014), Zhang and Zhang, (2014), Van de Geer et al., (2014) and Loh, (2018)), we allow the number of tests to grow exponentially in the sample size $n$. Zhang and Cheng, (2017) presented a closely related work, where the number of tests is also allowed to grow with $p$. However, their simultaneous inference procedure is conducted in the linear regression setting with independent observations.

Employing the de-biased estimator $\check{\beta}$ defined in (2.7), we propose to use the test statistic

\sqrt{n}|\check{\beta}-\beta^{0}|_{\infty}, \qquad (2.11)

where $\check{\beta}$ is defined in (2.7). In the next several theorems, we elaborate a multiplier bootstrap method to obtain the critical value of the test statistic, which requires a few scaling and moment assumptions. Recall definition 2.1 for $\tau$ and theorem 2.5 for the definition of $\gamma$. Also recall that $s=\max_{1\leq i\leq p}|\{j:\Omega_{x,ij}\neq 0\}|$.

Assumptions

  • (A1)

    $\sqrt{n}T^{3}|\widehat{\beta}-\beta^{*}|_{1}^{2}=o(1)$.

  • (A2)

    $\|\Omega_{x}\|_{1}^{2}s\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}=o(1)$.

  • (A3)

    $s\gamma\tau^{2}T^{2}(\log p)^{3/2}|\widehat{\beta}-\beta^{*}|_{1}=o(1)$.

  • (A4)

    $sT^{2}(\log(pn))^{7}/n\lesssim n^{-c}$.

  • (A5)

    $(\log p)^{3/2}(\log n)^{1/2}T\sqrt{s\tau}\gamma/n^{1/4}=o(1)$.

Additionally, throughout the paper we assume that for some constant $C>0$, $\mathbb{E}[X_{ik}^{2}]\leq C$ and $\mathbb{E}[\varepsilon_{ik}^{2}]\leq C$ for all $1\leq k\leq p$. We also suppose that $\|\Sigma_{x}\|_{\max}=O(1)$ and $0<c\leq\lambda_{\min}(\Sigma_{x})\leq\lambda_{\max}(\Sigma_{x})\leq C$. Thus, $\|\Omega_{x}\|_{2}\leq 1/\lambda_{\min}(\Sigma_{x})=O(1)$ and $\|\Omega_{x}\|_{1}=O(\sqrt{s})$, where the row sparsity is $s=\max_{1\leq i\leq p}|\{j:\Omega_{x,ij}\neq 0\}|$.

Theorem 2.8.

Let $\zeta_{1}=\gamma\tau^{2}T^{2}(\log p)^{3/2}|\widehat{\beta}-\beta^{*}|_{1}+\sqrt{n}T^{3}|\widehat{\beta}-\beta^{*}|_{1}^{2}+s\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}$. Suppose assumptions (A1)–(A3) hold, and additionally assume that $\zeta_{1}\sqrt{1\lor\log(p/\zeta_{1})}=o(1)$. Then we have

\mathbb{P}\Big(|\sqrt{n}(\check{\beta}-\beta^{*})-\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})|_{\infty}>\zeta_{1}\Big)<\zeta_{2},

where $\zeta_{2}=o(1)$.

Theorem 2.8 rigorously verifies that $\sqrt{n}R=o(1)$ and $\Delta=o(1)$ in (2.8), given the proposed construction of $\widehat{\Omega}$, and suggests performing further analysis on $\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})$. To derive the limiting distribution, we shall use the Gaussian approximation technique, since the classic central limit theorem fails in the high-dimensional setting.

Gaussian approximation was initially developed for high-dimensional independent random variables in Chernozhukov et al., (2013) and further generalized to high-dimensional time series in Zhang and Wu, (2017). Zhang and Cheng, (2017) and Loh, (2018) applied the GA technique of Chernozhukov et al., (2013) to derive asymptotic distributions in the linear regression setting. However, data generated from a VAR model exhibit temporal dependence, which renders the aforementioned techniques inapplicable. Although Zhang and Wu, (2017) established GA results for general time series using the dependence adjusted norm, a direct application of their theorems does not yield the desired conclusion in the ultra-high dimensional setting. This leads us to derive a new GA theorem with a better convergence rate, which is achievable thanks to the structure of the VAR model.

The next theorem establishes a Gaussian approximation (GA) result for the term $\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})$. For a more detailed description of the Gaussian approximation procedure, see section 4.

Theorem 2.9.

Denote $D=(D_{jk})_{1\leq j,k\leq p}\in\mathbb{R}^{p^{2}\times p^{2}}$ with

D_{jk}=\frac{\Omega_{x}\,\mathbb{E}[\psi(\varepsilon_{ij})\psi(\varepsilon_{ik})]\,\mathbb{E}[X_{i}X_{i}^{\top}w^{2}(X_{i})]\,\Omega_{x}^{\top}}{\mu_{j}\mu_{k}}\in\mathbb{R}^{p\times p}.

Under Assumptions (A4) and (A5), we have the following Gaussian approximation result:

\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\Big(|\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})|_{\infty}\leq t\Big)-\mathbb{P}\Big(\Big|\sum_{i=1}^{n}z_{i}/\sqrt{n}\Big|_{\infty}\leq t\Big)\bigg|=o(1),

where the $z_{i}=(z_{i1},\dots,z_{ip^{2}})^{\top}$ are independent mean-zero Gaussian vectors with $\mathbb{E}z_{i}z_{i}^{\top}=D$.

Remark 2.10.

The above GA result allows the ultra-high dimensional regime, where $p$ grows as fast as $O(\mathrm{e}^{n^{b}})$ for some $0<b<1$.

Since the covariance matrix $D$ of the Gaussian analogue $z_{i}$ is not accessible from the observations $\{X_{i}\}$, we need a suitable estimate of $D$ before performing the multiplier bootstrap. The next theorem delivers a consistent estimator for this purpose.

Theorem 2.11.
\widehat{D}_{jk}=\frac{\widehat{\Omega}_{x}\Big(\frac{1}{n}\sum_{i=1}^{n}\psi(\widehat{\varepsilon}_{ij})\psi(\widehat{\varepsilon}_{ik})\Big)\Big(\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}^{\top}w^{2}(X_{i})\Big)\widehat{\Omega}_{x}^{\top}}{\widehat{\mu}_{j}\widehat{\mu}_{k}}\in\mathbb{R}^{p\times p}, \qquad (2.12)

where $\widehat{\Omega}_{x}$ is the CLIME estimator of $\Omega_{x}$. Under assumptions (A1)–(A5), assuming additionally that $\|\Omega_{x}\|_{1}=O(\sqrt{s})$ and that $\mu_{k}>C>0$ for all $1\leq k\leq p$ and some constant $C$, we have with probability at least $1-12p^{-c}$,

\|\widehat{D}-D\|_{\max}\lesssim s\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}+|\widehat{\beta}-\beta^{*}|_{1}.

Indeed, under the scaling assumptions, $\|\widehat{D}-D\|_{\max}=o(1)$.
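For concreteness, a minimal sketch of assembling the blocks $\widehat{D}_{jk}$ of (2.12) is given below; the helper name and array layout are hypothetical, with `X` holding $X_{0},\dots,X_{n}$ row-wise, `resid` the fitted residuals $\widehat{\varepsilon}_{ik}$, and `psi`, `weight`, `mu_hat` as in the earlier sketches.

```python
import numpy as np

def D_hat(X, resid, Omega_x_hat, mu_hat, psi, weight, T):
    """Plug-in estimate (2.12) of D, returned as a (p^2 x p^2) array of p x p blocks."""
    n, p = resid.shape
    w2 = weight(X[1:], T) ** 2                      # w^2(X_i) for i = 1, ..., n
    S = (X[1:] * w2[:, None]).T @ X[1:] / n         # (1/n) sum_i X_i X_i^T w^2(X_i)
    M = Omega_x_hat @ S @ Omega_x_hat.T             # shared middle factor of every block
    C = psi(resid).T @ psi(resid) / n               # C[j, k] = (1/n) sum_i psi(e_ij) psi(e_ik)
    D = np.empty((p * p, p * p))
    for j in range(p):
        for k in range(p):
            D[j*p:(j+1)*p, k*p:(k+1)*p] = C[j, k] * M / (mu_hat[j] * mu_hat[k])
    return D
```

With these preparatory results, we are ready to present the main theorem of this paper, which describes a procedure for finding the critical value of $\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}$ using the bootstrap.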

Theorem 2.12.

Denote

W=|\widehat{D}^{1/2}\eta|_{\infty},

where $\eta\sim N(0,I_{p^{2}})$ is independent of $(X_{i})_{i=1}^{n}$ and $\widehat{D}$ is defined in (2.12). Let the bootstrap critical value be $c(\alpha)=\inf\{t\in\mathbb{R}:\mathbb{P}(W\leq t\,|\,\boldsymbol{X})\geq 1-\alpha\}$. Let assumptions (A1)–(A5) and the assumptions in theorem 2.8 hold. Denote $v=c(s\gamma\tau^{2}T^{2}(\log p)^{3/2}/\sqrt{n}+|\widehat{\beta}-\beta^{*}|_{1})$ for some constant $c$. Assume that $\pi(v)=Cv^{1/3}(1\lor\log(p/v))^{2/3}=o(1)$; then we have

\sup_{\alpha\in(0,1)}\bigg|\mathbb{P}\Big(\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}>c(\alpha)\Big)-\alpha\bigg|=o(1).

This result not only identifies the asymptotic distribution, but also provides an accurate critical value $c(\alpha)$ via the multiplier bootstrap. Under the null hypothesis $H_{0}$, we have $\sqrt{n}|\check{\beta}-\beta^{0}|_{\infty}=\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}$. This verifies the validity of (2.11) as a test statistic for simultaneous inference.
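In practice, the conditional law of $W$ given the data is approximated by Monte Carlo over the Gaussian multiplier $\eta$. A minimal sketch, assuming $\widehat{D}$ has been assembled as above (the function name is hypothetical; eigenvalues are clipped at zero because the plug-in $\widehat{D}$ need not be exactly positive semidefinite):

```python
import numpy as np

def bootstrap_critical_value(D_hat, alpha, B=2000, seed=None):
    """(1 - alpha)-quantile of W = |D_hat^{1/2} eta|_inf over B multiplier draws."""
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(D_hat)               # D_hat is symmetric by construction
    root = vecs * np.sqrt(np.clip(vals, 0.0, None))  # square root via eigendecomposition
    eta = rng.standard_normal((B, D_hat.shape[0]))   # eta_b ~ N(0, I_{p^2})
    W = np.abs(eta @ root.T).max(axis=1)             # W_b = |D_hat^{1/2} eta_b|_inf
    return np.quantile(W, 1.0 - alpha)

# The test rejects H_0 at level alpha iff sqrt(n) * |beta_check - beta_0|_inf > c(alpha).
```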

3 Estimation

Several estimation tasks are needed as preparatory results for proving Theorem 2.12. For instance, Theorem 2.12 requires an estimate of the theoretical covariance matrix $D$ of the Gaussian analogue $Z$, as stated in Theorem 2.11. Besides, the convergence of the CLIME estimator (section 2.3) depends on the convergence of the corresponding sample covariance matrix. These problems require us to develop a new estimation theory that delivers convergence even in the ultra-high dimensional regime.

The success of high-dimensional estimation relies heavily on probability concentration inequalities, among which Bernstein-type inequalities are especially important. The celebrated Bernstein inequality (Bernstein, (1946)) provides an exponential concentration inequality for sums of independent random variables that are uniformly bounded. Later works relaxed the uniform boundedness condition and extended the validity of the Bernstein inequality to independent random variables with finite exponential moments; see for example Massart, (2007) and Wainwright, (2019).

Despite the extensive body of work on concentration inequalities for independent random variables, the literature remains quiet when it comes to establishing exponential-type tail concentration results for random processes. Related existing work includes a Bernstein inequality for sums of strongly mixing processes (Merlevède et al., (2009)) and a Bernstein inequality under functional dependence measures (Zhang, (2019)). In a more recent work, Liu and Zhang, (2021) established a sharp Bernstein inequality for the VAR model using the definition of the spectral decay index, which improved the existing rate by a factor of $(\log n)^{2}$. In this paper, we derive another Bernstein inequality for the VAR model under a slightly different condition from Liu and Zhang, (2021). Before presenting the main results, recall the definition of $\tau$ in definition 2.1.

Lemma 3.1.

Let $\{X_{i}\}_{i=0}^{n}$ be generated by a VAR(1) model. Suppose $G:\mathbb{R}^{p}\to\mathbb{R}$ satisfies

|G(X)-G(Y)|\leq|X-Y|_{\infty}, \qquad (3.1)

and that $|G(x)|\leq B$ for all $x\in\mathbb{R}^{p}$. Assume that $\mathbb{E}[|\varepsilon_{ij}|^{2}]\leq\sigma^{2}$ for all $j=1,\dots,p$. Then there exist constants $C_{1},C_{2},C_{3},C_{4}>0$ depending only on $\rho$ and $\sigma$ such that

\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}G(X_{i-1})-\mathbb{E}[G(X_{i-1})]\Big|\geq x\bigg)\leq 2\exp\bigg\{-\frac{nx^{2}}{C_{3}n^{-1}\gamma^{2}\tau^{3}+C_{4}\tau Bx}\bigg\}+2\exp\bigg\{-\frac{nx^{2}}{(1+C_{1}B^{-2})\gamma^{2}\tau^{4}B^{2}(\log p)^{2}(n^{-1}\tau\log p+1)+C_{2}\tau^{2}B(\log p)x}\bigg\}.

Specifically, under assumption (A2), we see that $\tau(\log p)/n\to 0$. So for sufficiently large $B>0$, we have

\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}G(X_{i-1})-\mathbb{E}[G(X_{i-1})]\Big|\geq x\bigg)\leq 4\exp\bigg\{-\frac{nx^{2}}{C_{1}^{\prime}\gamma^{2}\tau^{4}B^{2}(\log p)^{2}+C_{2}^{\prime}\tau^{2}B(\log p)x}\bigg\}, \qquad (3.2)

for some positive constants $C_{1}^{\prime},C_{2}^{\prime}$ depending only on $\rho$ and $\sigma$.

Remark 3.2.

Note that the Lipschitz condition (3.1) is slightly different from that in Liu and Zhang, (2021), where it is instead assumed that

|G(x)-G(y)|\leq g^{\top}|x-y|, \qquad (3.3)

for some vector $g\in\mathbb{R}^{p}$. Since condition (3.1) is weaker than (3.3), an additional $(\log p)$ factor appears in the denominator of the right-hand side of (3.2). For a more detailed comparison of different versions of Bernstein inequalities, we refer readers to Liu and Zhang, (2021) and the references therein.

With a minor modification of the proof of Lemma 3.1, we obtain the following version of the Bernstein inequality, which includes a bounded function of the latest innovation $\varepsilon_{i}$ as a multiplier.

Corollary 3.3.

Let $\{X_{i}\}_{i=0}^{n}$ be generated by a VAR(1) model. Suppose $|h(x)|\leq 1$ and $G:\mathbb{R}^{p}\to\mathbb{R}$ satisfies

|G(X)G(Y)||XY|,|G(X)-G(Y)|\leq|X-Y|_{\infty},

and that $|G(x)|\leq B$ for all $x\in\mathbb{R}^{p}$. Assume that $\mathbb{E}[|\varepsilon_{ij}|^{2}]\leq\sigma^{2}$ for all $j=1,\dots,p$. Then there exist constants $C_{1},C_{2},C_{3},C_{4}>0$ depending only on $\rho$ and $\sigma$ such that

\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}h(\varepsilon_{i})G(X_{i-1})-\mathbb{E}[h(\varepsilon_{i})G(X_{i-1})]\Big|\geq x\bigg)\leq 2\exp\bigg\{-\frac{nx^{2}}{C_{3}n^{-1}\gamma^{2}\tau^{3}+C_{4}\tau Bx}\bigg\}+2\exp\bigg\{-\frac{nx^{2}}{(1+C_{1}B^{-2})\gamma^{2}\tau^{4}B^{2}(\log p)^{2}(n^{-1}\tau\log p+1)+C_{2}\tau^{2}B(\log p)x}\bigg\}.

Specifically, under assumption (A2), we see that $\tau(\log p)/n\to 0$. So for sufficiently large $B>0$, we have

\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}h(\varepsilon_{i})G(X_{i-1})-\mathbb{E}[h(\varepsilon_{i})G(X_{i-1})]\Big|\geq x\bigg)\leq 4\exp\bigg\{-\frac{nx^{2}}{C_{1}^{\prime}\gamma^{2}\tau^{4}B^{2}(\log p)^{2}+C_{2}^{\prime}\tau^{2}B(\log p)x}\bigg\},

for some positive constants $C_{1}^{\prime},C_{2}^{\prime}$ depending only on $\rho$ and $\sigma$.

Remark 3.4.

Since the additional term $h(\varepsilon_{i})$ is independent of $G(X_{i-1})$, the proof of Lemma 3.1 applies directly without any extra technical difficulty.

Equipped with our new Bernstein inequalities, several estimation results follow immediately. The next theorem, regarding the estimation of $\Sigma_{x}$, is essential when we prove the convergence rate of the CLIME estimator in section 2.3.

Theorem 3.5 (Estimation of $\Sigma_{x}$).

Let $\widehat{\Sigma}_{x}=n^{-1}\sum_{i=1}^{n}X_{i-1}X_{i-1}^{\top}w(X_{i-1})$ and $\Sigma_{x}=\mathbb{E}[X_{i}X_{i}^{\top}w(X_{i})]$. Then with probability at least $1-4p^{-c_{0}}$ for some constant $c_{0}>0$, it holds that

\|\widehat{\Sigma}_{x}-\Sigma_{x}\|_{\max}\lesssim\gamma\tau^{2}T^{2}n^{-1/2}(\log p)^{3/2}.

We see that the convergence rate of the CLIME estimator in Theorem 2.5 essentially inherits the rate in Theorem 3.5, with an additional factor $\|\Omega_{x}\|_{1}$. The following theorem plays an important role in verifying that $\Delta$ in (2.9) is indeed negligible.

Theorem 3.6 (Estimation of $\Sigma$ by $\nabla^{2}L_{n}(\beta^{*})$).

Assume that $\mathbb{E}[\varepsilon_{ik}^{2}]\leq\sigma^{2}$ for all $1\leq k\leq p$. Then for some constant $c_{1}>0$, with probability at least $1-4p^{-c_{1}}$, it holds that

\|\nabla^{2}L_{n}(\beta^{*})-\Sigma\|_{\max}\lesssim\gamma\tau^{2}T^{2}n^{-1/2}(\log p)^{3/2}.

While the last two theorems make use of Lemma 3.1, the next estimation result, for $\mu$, directly applies the concentration inequality in Liu and Zhang, (2021), thanks to the stronger conditions that $\widehat{\mu}$ satisfies.

Lemma 3.7.

Suppose that $\beta_{k}^{*}$ lies in a bounded $\ell_{1}$-norm ball for all $1\leq k\leq p$ and that $\mathbb{E}[X_{ij}^{2}]\leq C$ for some constant $C>0$ and all $1\leq j\leq p$. Then we have

\mathbb{P}\bigg(|\widehat{\mu}-\mu|_{\infty}\geq\gamma\tau^{2}\sqrt{\frac{\log p}{n}}+|\widehat{\beta}-\beta^{*}|_{1}\bigg)\leq 2p^{-c},

for some positive constant $c$.

4 Gaussian Approximation

Conducting simultaneous inference for high-dimensional data is a hard task, since the central limit theorem fails when the dimension of the random vectors grows as a function of the number of observations $n$, or even exceeds $n$. As an alternative to the central limit theorem, Chernozhukov et al., (2013) proposed a Gaussian approximation theorem, which states that under certain conditions, the distribution of the maximum of a sum of independent high-dimensional random vectors can be approximated by that of the maximum of a sum of Gaussian random vectors with the same covariance matrices as the original vectors. Their Gaussian approximation results allow the ultra-high dimensional case, where the dimension $p$ grows exponentially in $n$. Moreover, they also proved that the Gaussian multiplier bootstrap yields a high-quality approximation of the distribution of the original maximum and showcased a wide range of applications, such as high-dimensional estimation, multiple hypothesis testing, and adaptive specification testing. It is worth noticing that the results of Chernozhukov et al., (2013) are only applicable when the sequence of random vectors is independent.

Zhang and Wu, (2017) generalized Gaussian approximation results to general high-dimensional stationary time series, using the framework of functional dependence measures (Wu, (2005)). We specifically mention that a direct application of the Gaussian approximation from Zhang and Wu, (2017) cannot deliver the desired conclusion in the ultra-high dimensional regime, because the dependence measures capture the dependence structure of the VAR model only coarsely. In what follows, we use a refined argument to establish a new Gaussian approximation result for the VAR model.

By Theorem 2.8, $\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}$ can be approximated by $\sqrt{n}|\Omega\nabla L_{n}(\beta^{*})|_{\infty}$. Hence, we shall build a GA result for $\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})$. Observe that $\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})\in\mathbb{R}^{p^{2}}$ can be written as

\bigg(\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\Omega_{x}}{\mu_{1}}\psi(\varepsilon_{i1})X_{i-1}w(X_{i-1})\Big)^{\top},\dots,\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\Omega_{x}}{\mu_{p}}\psi(\varepsilon_{ip})X_{i-1}w(X_{i-1})\Big)^{\top}\bigg)^{\top},

so it suffices to establish the GA result for each sub-vector

\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\Omega_{x}}{\mu_{k}}\psi(\varepsilon_{ik})X_{i-1}w(X_{i-1}),\quad k=1,\dots,p.

Fix $1\leq k\leq p$ and denote $\Theta_{k}=\Omega_{x}\mu_{k}^{-1}$. Let $X_{i,m}=\sum_{l=0}^{m}A^{l}\varepsilon_{i-l}$ be the $m$-approximation of $X_{i}$, with $m$ to be determined later. Let $Y_{i}=\psi(\varepsilon_{ik})\Theta_{k}X_{i-1}w(X_{i-1})$ be the quantity for which we will establish the Gaussian approximation and denote $T_{Y}=\sum_{i=1}^{n}Y_{i}$. Analogously, let $Y_{i,m}=\psi(\varepsilon_{ik})\Theta_{k}X_{i-1,m}w(X_{i-1,m})$ be the $m$-approximation of $Y_{i}$ and write $T_{Y,m}=\sum_{i=1}^{n}Y_{i,m}$. For simplicity, assume $n=(m+M)w$, where $M\to\infty$, $m\to\infty$, $w\to\infty$ and $m/M\to 0$. Divide the interval $[1,n]$ into alternating large blocks $L_{b}=[(b-1)(M+m)+1,\,bM+(b-1)m]$ with $M$ points and small blocks $S_{b}=[bM+(b-1)m+1,\,b(M+m)]$ with $m$ points, for $1\leq b\leq w$. Denote

\xi_{b}=\sum_{i\in L_{b}}Y_{i,m}/\sqrt{M},\quad T_{Y,S}=\sum_{b=1}^{w}\sum_{i\in S_{b}}Y_{i,m},\quad T_{Y,L}=\sum_{b=1}^{w}\sum_{i\in L_{b}}Y_{i,m},
Z=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}U_{i},\quad\text{where }U_{i}\sim N\big(0,\,\mu_{k}^{-2}\mathbb{E}[\psi^{2}(\varepsilon_{ik})]\,\Omega_{x}\mathbb{E}[X_{i}X_{i}^{\top}w^{2}(X_{i})]\Omega_{x}^{\top}\big).

Note that the $Y_{i,m}$ from different large blocks $L_{b}$ are independent, i.e. the block sums $\sum_{i\in L_{b}}Y_{i,m}$ are independent across $b=1,\dots,w$. The main result of this section is presented as follows.

Theorem 4.1.

Suppose $\mathbb{E}[\varepsilon_{ik}^{2}]\leq\sigma^{2}$ for all $1\leq k\leq p$ and the odd function $\psi(\cdot)$ satisfies $|\psi(\cdot)|\leq C$ and $|\psi^{\prime}(\cdot)|\leq C$. Suppose the scaling condition $sT^{2}(\log(pn))^{7}/n\leq c_{1}n^{-c_{2}}$ holds. Then for any $\eta>0$, the Gaussian approximation bound

\mathcal{H}:=\sup_{t\in\mathbb{R}}\Big|\mathbb{P}\big(|T_{Y}/\sqrt{n}|_{\infty}\leq t\big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\Big|\lesssim f_{1}(\eta/2,m)+f_{2}(\eta/2,m)+\eta\sqrt{\log p}+\eta\sqrt{\log(1/\eta)}+cn^{-c^{\prime}} \qquad (4.1)

holds for some $c,c^{\prime}>0$, where $f_{1}$ and $f_{2}$ are defined in (A.4) in the appendix.

This theorem gives an upper bound on the supremum of the difference between the distribution of the maximum of the sum of the $Y_{i}$ and that of the maximum of the sum of Gaussian vectors $U_{i}$ with the same covariance. We now present an outline of the proof; the complete proof is left to the appendix.

First, we show that the sums of $Y_{i,m}$ over the small blocks are negligible, so $T_{Y,m}\approx T_{Y,L}$. Next, we prove that the sum of the $Y_{i}$ can be approximated by its $m$-approximation, that is, $T_{Y}\approx T_{Y,m}\approx T_{Y,L}$. Since $T_{Y,L}$ is a sum of the independent random vectors $\{\sum_{i\in L_{b}}Y_{i,m}\}_{b=1}^{w}$, the GA theorem from Chernozhukov et al., (2013) can be applied.
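To illustrate the blocking scheme underlying the proof, here is a minimal sketch of the large/small block index construction, assuming $n=(M+m)w$ exactly; the function name is hypothetical.

```python
import numpy as np

def blocks(n, M, m):
    """1-based index sets of the large blocks L_b (M points) and small blocks S_b (m points)."""
    w = n // (M + m)
    L, S = [], []
    for b in range(1, w + 1):
        start = (b - 1) * (M + m) + 1
        L.append(np.arange(start, start + M))           # L_b = [(b-1)(M+m)+1, bM+(b-1)m]
        S.append(np.arange(start + M, start + M + m))   # S_b = [bM+(b-1)m+1, b(M+m)]
    return L, S
```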

5 Numerical Experiments

In this section, we evaluate the performance of the proposed bootstrap-assisted procedure for simultaneous inference. We consider the model (2.2), where the $\varepsilon_{ij}$ are i.i.d. Student's $t$-distributed with $df=5$ or $10$. Let $s=\lfloor\log p\rfloor$. We pick $n=30$ and $p=10$ in the numerical setup. For the true transition matrix $A=(a_{ij})$, we consider the following designs.

  • (1)

    Banded: $A=(\lambda^{|i-j|}\mathbf{1}\{|i-j|\leq s\})$ with $\lambda=0.5$.

  • (2)

    Block diagonal: $A=\mathrm{diag}\{A_{i}\}$, where each $A_{i}\in\mathbb{R}^{s\times s}$ has $\lambda_{i}$ on the diagonal and $\lambda_{i}^{2}$ on the superdiagonal, with $\lambda_{i}\sim\mathrm{Unif}(-0.8,0.8)$.

The design in (1) is further scaled by $2\rho(A)$ to ensure that $\rho(A)<1$. Hence sparse symmetric matrices are generated in (1) and sparse asymmetric matrices are constructed in (2). We draw quantile-quantile (Q-Q) plots of the data quantiles of $\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}$ versus the data quantiles of $W$ defined in Theorem 2.12, computed from $m=100$ replicates. The Q-Q plots are shown in figure 1 and figure 2 for the banded and block diagonal designs respectively.
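For reproducibility, a minimal sketch of the data-generating process under the two designs follows; the helper names are hypothetical, and design (2) assumes $s$ divides $p$, which holds for $p=10$ and $s=\lfloor\log 10\rfloor=2$.

```python
import numpy as np
from scipy.linalg import block_diag

def banded_design(p, lam=0.5):
    """Design (1): A_ij = lam^{|i-j|} 1{|i-j| <= s}, then scaled so that rho(A) = 1/2."""
    s = int(np.floor(np.log(p)))
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    A = np.where(d <= s, lam ** d, 0.0)
    return A / (2 * np.max(np.abs(np.linalg.eigvals(A))))

def block_diagonal_design(p, seed=None):
    """Design (2): s x s blocks with lam_i on the diagonal and lam_i^2 on the superdiagonal."""
    rng = np.random.default_rng(seed)
    s = int(np.floor(np.log(p)))
    mats = []
    for _ in range(p // s):
        lam = rng.uniform(-0.8, 0.8)
        mats.append(np.diag(np.full(s, lam)) + np.diag(np.full(s - 1, lam**2), k=1))
    return block_diag(*mats)

def simulate_var1(A, n, df=5, burn=200, seed=None):
    """Draw X_1, ..., X_n from X_i = A X_{i-1} + eps_i with i.i.d. t(df) innovations."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    X, out = np.zeros(p), []
    for i in range(burn + n):
        X = A @ X + rng.standard_t(df, size=p)
        if i >= burn:
            out.append(X.copy())
    return np.asarray(out)
```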

Figure 1: The Q-Q plot of the banded design.
Figure 2: The Q-Q plot of the block diagonal design.

Appendix A Proofs of Results in Section 2

Before proceeding with the proofs, we state a helpful lemma that is used repeatedly throughout the paper and present its proof. This simple lemma is an application of the triangle inequality to the product of two matrices.

Lemma A.1.

Let $A,B$ and $\widehat{A},\widehat{B}$ be $p\times p$ symmetric matrices with $\|A-\widehat{A}\|_{1}=o(1)$. Suppose $\|A\|_{1}=O(1)$ and $\|B\|_{1}=O(1)$. Then $\|AB-\widehat{A}\widehat{B}\|_{\max}\lesssim\|A-\widehat{A}\|_{\max}+\|B-\widehat{B}\|_{\max}$.

Proof of Lemma A.1.

Since $\|A\|_{1}=O(1)$ and $\|A-\widehat{A}\|_{1}=o(1)$, we have $\|\widehat{A}\|_{1}\leq\|A-\widehat{A}\|_{1}+\|A\|_{1}=O(1)$. Hence, by the triangle inequality,

\|AB-\widehat{A}\widehat{B}\|_{\max}\leq\|(A-\widehat{A})B\|_{\max}+\|\widehat{A}(B-\widehat{B})\|_{\max}\leq\|B\|_{1}\|A-\widehat{A}\|_{\max}+\|\widehat{A}\|_{1}\|B-\widehat{B}\|_{\max}\lesssim\|A-\widehat{A}\|_{\max}+\|B-\widehat{B}\|_{\max}. ∎

Proof of Theorem 2.5.

By Theorem 3.5, with probability at least $1-4p^{-c_{0}}$,

\|\widehat{\Sigma}_{x}-\Sigma_{x}\|_{\max}\leq\lambda_{n}.

By Theorem 6 of Cai et al., (2011), we have the desired result. ∎

Proof of Theorem 2.7.

Recall that

\Omega=\mathrm{diag}(\mu^{-1})\otimes\Omega_{x}=\begin{bmatrix}\mu_{1}^{-1}\Omega_{x}&0&\dots&0\\ 0&\mu_{2}^{-1}\Omega_{x}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\mu_{p}^{-1}\Omega_{x}\end{bmatrix} \qquad (A.1)

and

\widehat{\Omega}=\mathrm{diag}(\widehat{\mu}^{-1})\otimes\widehat{\Omega}_{x}=\begin{bmatrix}\widehat{\mu}_{1}^{-1}\widehat{\Omega}_{x}&0&\dots&0\\ 0&\widehat{\mu}_{2}^{-1}\widehat{\Omega}_{x}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\widehat{\mu}_{p}^{-1}\widehat{\Omega}_{x}\end{bmatrix} \qquad (A.2)

For $1\leq k\leq p$, consider

\|\widehat{\Omega}_{x}\widehat{\mu}_{k}^{-1}-\Omega_{x}\mu_{k}^{-1}\|_{\max}\leq\|\widehat{\Omega}_{x}-\Omega_{x}\|_{\max}|\widehat{\mu}_{k}^{-1}|+\|\Omega_{x}\|_{\max}|\widehat{\mu}_{k}^{-1}-\mu_{k}^{-1}|\lesssim\|\Omega_{x}\|_{1}\lambda_{n}+\|\Omega_{x}\|_{\max}\frac{|\mu_{k}-\widehat{\mu}_{k}|}{\mu_{k}\widehat{\mu}_{k}}\lesssim\|\Omega_{x}\|_{1}\lambda_{n},

with probability no less than $1-6p^{-c^{\prime}}$ by theorem 2.5 and lemma 3.7. Taking a union bound over all $k$ yields

\|\Omega-\widehat{\Omega}\|_{\max}=\max_{1\leq k\leq p}\|\mu_{k}^{-1}\Omega_{x}-\widehat{\mu}_{k}^{-1}\widehat{\Omega}_{x}\|_{\max}\lesssim\|\Omega_{x}\|_{1}\lambda_{n},

with probability at least $1-6p^{-(c^{\prime}-1)}$. Replacing the max-norm by the $L_{1}$-norm delivers

\|\Omega-\widehat{\Omega}\|_{1}\lesssim\|\Omega_{x}\|_{1}s\lambda_{n}. ∎

The next lemma provides a high-probability bound on $|\nabla L_{n}(\beta^{*})|_{\infty}$, which will be used in the proof of Theorem 2.8.

Lemma A.2.

Suppose that $\mathbb{E}[\varepsilon_{ij}^{2}]\leq C$ for all $1\leq j\leq p$. Then it holds that

\mathbb{P}\big(|\nabla L_{n}(\beta^{*})|_{\infty}\gtrsim\gamma\tau^{2}T(\log p)^{3/2}/\sqrt{n}\big)\leq 4p^{-c},

for some constant $c>0$.

Proof of Lemma A.2.

We shall apply Corollary 3.3. Consider the first coordinate $\nabla L_{n1}(\beta^{*})$ of $\nabla L_{n}(\beta^{*})$. In Corollary 3.3, let $h(\varepsilon_{i})=\psi(\varepsilon_{i1})$ and $G(X_{i})=X_{i1}w(X_{i})$. Observe that $\mathbb{E}[\nabla L_{n}(\beta^{*})]=0$. By Corollary 3.3,

\mathbb{P}(|\nabla L_{n1}(\beta^{*})|\geq x)=\mathbb{P}(|\nabla L_{n1}(\beta^{*})-\mathbb{E}[\nabla L_{n1}(\beta^{*})]|\geq x)=\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}h(\varepsilon_{i})G(X_{i-1})-\mathbb{E}[h(\varepsilon_{i})G(X_{i-1})]\Big|\geq x\bigg)\leq 4\exp\bigg\{-\frac{nx^{2}}{C_{1}\gamma^{2}\tau^{4}T^{2}(\log p)^{2}+C_{2}\tau^{2}T(\log p)x}\bigg\}.

Choosing $x=c^{\prime}\gamma\tau^{2}T(\log p)^{3/2}/\sqrt{n}$, we get

\mathbb{P}\big(|\nabla L_{n1}(\beta^{*})|\geq c^{\prime}\gamma\tau^{2}T(\log p)^{3/2}/\sqrt{n}\big)\leq 4p^{-c},

for some constant c>0c>0. Take sufficiently large cc^{\prime} such that c>1c>1, so by a union bound we obtain

\mathbb{P}\big(|\nabla L_{n}(\beta^{*})|_{\infty}\geq c^{\prime}\gamma\tau^{2}T(\log p)^{3/2}/\sqrt{n}\big)\leq 4p^{-c^{\prime\prime}},

where $c^{\prime\prime}=c-1>0$. ∎

Proof of Theorem 2.8.

By Taylor expansion, we write

\begin{aligned}\sqrt{n}(\check{\beta}-\beta^{*})&=\sqrt{n}(\widehat{\beta}-\beta^{*})+\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})-\sqrt{n}\,\widehat{\Omega}\big(\nabla L_{n}(\widehat{\beta})-\nabla L_{n}(\beta^{*})\big)\\ &=\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})+\sqrt{n}\big[(\widehat{\beta}-\beta^{*})-\widehat{\Omega}\,\nabla^{2}L_{n}(\beta^{*})(\widehat{\beta}-\beta^{*})+R\big]\\ &=\underbrace{\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})}_{A}+\underbrace{\sqrt{n}\big(I_{p^{2}}-\widehat{\Omega}\,\nabla^{2}L_{n}(\beta^{*})\big)(\widehat{\beta}-\beta^{*})}_{\Delta}+\sqrt{n}R,\end{aligned}

where the remainder

R=12ni=1n(ψ′′(zi1)(Xi1(β^1β1))2Xi1w(Xi1),,ψ′′(zip)(Xi1(β^pβp))2Xi1w(Xi1)),R=\frac{1}{2n}\sum_{i=1}^{n}\Big{(}\psi^{\prime\prime}(z_{i1})\big{(}X^{\top}_{i-1}(\widehat{\beta}_{1}-\beta^{*}_{1})\big{)}^{2}X_{i-1}^{\top}w(X_{i-1}),\dots,\psi^{\prime\prime}(z_{ip})\big{(}X^{\top}_{i-1}(\widehat{\beta}_{p}-\beta^{*}_{p})\big{)}^{2}X_{i-1}^{\top}w(X_{i-1})\Big{)}^{\top},

where $z_{ik}=X_{ik}-X_{i-1}^{\top}\widetilde{\beta}_{k}$ for some $\widetilde{\beta}$ lying between $\beta^{*}$ and $\widehat{\beta}$. Now we analyze the terms $R,\Delta,A$ respectively. First, $\sqrt{n}|R|_{\infty}=O_{\mathbb{P}}(\sqrt{n}T^{3}|\widehat{\beta}-\beta^{*}|_{1}^{2})=o(1)$ by assumption (A1). To analyze $\Delta$, denote $H=\nabla^{2}L_{n}(\beta^{*})$. Then we write

\Delta=\sqrt{n}\big(I_{p^{2}}-\widehat{\Omega}H\big)(\widehat{\beta}-\beta^{*})=\sqrt{n}\big(\Omega\Sigma-\widehat{\Omega}H\big)(\widehat{\beta}-\beta^{*}).

Thus, by Theorem 3.6 and Theorem 2.7, with probability tending to 1,

|\Delta|_{\infty} \leq\sqrt{n}\|\Omega\Sigma-\widehat{\Omega}H\|_{\max}|\widehat{\beta}-\beta^{*}|_{1}\leq\sqrt{n}\Big(\|\Omega-\widehat{\Omega}\|_{1}\|\Sigma\|_{\max}+\|H-\Sigma\|_{\max}\|\widehat{\Omega}\|_{1}\Big)|\widehat{\beta}-\beta^{*}|_{1}
\lesssim\sqrt{n}\|\Omega_{x}\|_{1}\lambda_{n}|\widehat{\beta}-\beta^{*}|_{1}\asymp s\gamma\tau^{2}T^{2}(\log p)^{3/2}|\widehat{\beta}-\beta^{*}|_{1}=o(1)

by assumption (A3). Finally, by Lemma A.2 and Theorem 2.7, with probability tending to 1, it holds that

|A-\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty} \leq\|\widehat{\Omega}-\Omega\|_{1}|\sqrt{n}\nabla L_{n}(\beta^{*})|_{\infty}\lesssim\|\Omega_{x}\|_{1}^{2}s\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}
\asymp s^{2}\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}.

Therefore,

|\sqrt{n}(\check{\beta}-\beta^{*})-\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}\leq|\sqrt{n}(\check{\beta}-\beta^{*})-A|_{\infty}+|A-\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}\leq\zeta_{1},

where

\zeta_{1}=s\gamma\tau^{2}T^{2}(\log p)^{3/2}|\widehat{\beta}-\beta^{*}|_{1}+\sqrt{n}T^{3}|\widehat{\beta}-\beta^{*}|_{1}^{2}+s^{2}\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}. ∎
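To make the construction analyzed in this proof concrete, here is a minimal sketch of the one-step de-biasing map. It assumes a pilot estimate B_hat (e.g., a robust Lasso fit, with row k holding the coefficients of equation k), an estimate Omega_hat of the inverse Hessian, and the sign convention of the Taylor expansion above; all names are illustrative.

    import numpy as np

    def huber_psi(x, c=1.345):
        return np.clip(x, -c, c)

    def debias(X, B_hat, Omega_hat, w, psi=huber_psi):
        # One-step de-biased estimator (sketch): for each equation k,
        # B_check[k] = B_hat[k] + Omega_hat @ (1/n) sum_i psi(eps_hat_ik) w(X_{i-1}) X_{i-1},
        # mirroring the correction term in the Taylor expansion above; the sign
        # convention is an assumption of this sketch.
        n, p = X.shape[0] - 1, X.shape[1]
        weights = np.array([w(x) for x in X[:-1]])        # w(X_{i-1}), length n
        B_check = B_hat.copy()
        for k in range(p):
            resid = X[1:, k] - X[:-1] @ B_hat[k]          # eps_hat_{ik}
            score = (psi(resid) * weights) @ X[:-1] / n   # (1/n) sum_i psi * w * X_{i-1}
            B_check[k] = B_hat[k] + Omega_hat @ score
        return B_check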

Proof of Theorem 2.9.

The proof of Theorem 4.1 can be easily generalized to the p^{2}-dimensional setting, so it still holds for |\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}. By Theorem 4.1, we have for any \eta>0,

\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\big(|\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}\leq t\big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\bigg|
\lesssim f_{1}(\eta/2,m)+f_{2}(\eta/2,m)+\eta\sqrt{\log p}+\eta\sqrt{\log(1/\eta)}+cn^{-c^{\prime}}, (A.3)

where

f_{1}(x,m)=\frac{c_{1}sp^{3}\gamma^{2}\rho^{m/\tau}}{x^{2}}\quad\text{and}\quad f_{2}(x,m)=2p\exp\bigg\{-\frac{nx^{2}}{2\sqrt{s}TM\sqrt{n}x+4m\omega sT^{2}\sigma^{2}}\bigg\}. (A.4)

Now choose \eta\asymp(\log p)T\sqrt{s\tau}\gamma/n^{1/4}=o(1), \omega\asymp n^{1/2}, M\asymp n^{1/2}, and m=c\tau\log p in (A.3) for some constant c>0. For sufficiently large c, basic algebra shows that

f_{1}(\eta/2,m)\lesssim\frac{s\gamma^{2}}{p^{c-3}\eta^{2}}\asymp\frac{n^{1/2}}{p^{c-3}T^{2}\tau(\log p)^{2}}=o(1), (A.5)

since the order of p^{c-3} dominates that of n^{1/2}. Moreover,

f_{2}(\eta/2,m)\leq 2p\exp\bigg\{-\frac{c_{1}\gamma^{2}\log p}{c_{2}\gamma+c_{3}}\bigg\}\leq 2p\exp\{-c_{4}\log p\}=o(1), (A.6)

for a proper choice of the constants c_{1},c_{2},c_{3}. Also, by assumption (A5), \eta\sqrt{\log p}=o(1) and

\eta\sqrt{\log(1/\eta)}\lesssim\frac{T\sqrt{s\tau}\gamma\log p}{n^{1/4}}\sqrt{\log n}=o(1).

Thus the proof is completed. ∎

Proof of Theorem 2.11.

First, we collect several useful results.

  • (i)

    With probability at least 1-4p^{-c_{1}}, \|\Omega_{x}-\widehat{\Omega}_{x}\|_{1}\leq\|\Omega_{x}\|_{1}^{2}s\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2} and \|\Omega_{x}-\widehat{\Omega}_{x}\|_{\max}\leq\|\Omega_{x}\|_{1}^{2}\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2} by Theorem 2.5. Therefore, \|\Omega_{x}-\widehat{\Omega}_{x}\|_{1}=o(1) and \|\Omega_{x}-\widehat{\Omega}_{x}\|_{\max}=o(1) by assumption (A2).

  • (ii)

    With probability at least 1-2p^{-c_{2}}, |\mu-\widehat{\mu}|_{\infty}\lesssim\gamma\tau^{2}\sqrt{\frac{\log p}{n}}+|\widehat{\beta}-\beta^{*}|_{1}=o(1) by Lemma 3.7, where the order comes from assumptions (A1) and (A2).

  • (iii)

    Similar to the proof of Lemma 3.7, we have with probability at least 1-2p^{-c_{3}},

    \Big|\frac{1}{n}\sum_{i=1}^{n}\psi(\widehat{\varepsilon}_{ij})\psi(\widehat{\varepsilon}_{ik})-\mathbb{E}[\psi(\varepsilon_{ij})\psi(\varepsilon_{ik})]\Big|_{\infty}\lesssim\gamma\tau^{2}\sqrt{\frac{\log p}{n}}+|\widehat{\beta}-\beta^{*}|_{1}=o(1).
  • (iv)

    Similar to the proof of Theorem 3.5, we have with probability at least 1-4p^{-c_{4}},

    \Big\|\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}^{\top}w^{2}(X_{i})-\mathbb{E}[X_{i}X_{i}^{\top}w^{2}(X_{i})]\Big\|_{\max}\lesssim\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}=o(1).

Repeatedly using Lemma A.1, we get

\|\widehat{D}-D\|_{\max} \lesssim\max_{1\leq j,k\leq p}\Big|\frac{1}{\mu_{j}\mu_{k}}-\frac{1}{\widehat{\mu}_{j}\widehat{\mu}_{k}}\Big|+\Big|\frac{1}{n}\sum_{i=1}^{n}\psi(\widehat{\varepsilon}_{ij})\psi(\widehat{\varepsilon}_{ik})-\mathbb{E}[\psi(\varepsilon_{ij})\psi(\varepsilon_{ik})]\Big|_{\infty}
+2\|\Omega_{x}-\widehat{\Omega}_{x}\|_{\max}+\Big\|\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}^{\top}w^{2}(X_{i})-\mathbb{E}[X_{i}X_{i}^{\top}w^{2}(X_{i})]\Big\|_{\max}
\lesssim\gamma\tau^{2}\sqrt{\frac{\log p}{n}}+|\widehat{\beta}-\beta^{*}|_{1}+\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}
\lesssim\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}+|\widehat{\beta}-\beta^{*}|_{1}

with probability at least 1-12p^{-c}, where c=\min_{1\leq i\leq 4}c_{i}. ∎

Proof of Theorem 2.12.

By Theorem 2.8, we see that

\mathbb{P}\bigg(|\sqrt{n}(\check{\beta}-\beta^{*})-\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}>\zeta_{1}\bigg)<\zeta_{2},

where \zeta_{1}\sqrt{1\lor\log(p/\zeta_{1})}=o(1) and \zeta_{2}=o(1). Define \pi(v)=Cv^{1/3}(1\lor\log(p/v))^{2/3} for some constant C>0 and

\Gamma=\|\widehat{D}-D\|_{\max}.

Let c_{z}(\alpha)=\inf\{t\in\mathbb{R}:\mathbb{P}(|\sum_{i=1}^{n}z_{i}/\sqrt{n}|_{\infty}\leq t)\geq 1-\alpha\}, where the sequence \{z_{i}\} is defined in Theorem 2.9. From the proof of Lemma 3.2 in Chernozhukov et al., (2013), we have

\mathbb{P}\Big(c(\alpha)\leq c_{z}(\alpha+\pi(v))\Big) \geq 1-\mathbb{P}(\Gamma>v), (A.7)
\mathbb{P}\Big(c_{z}(\alpha)\leq c(\alpha+\pi(v))\Big) \geq 1-\mathbb{P}(\Gamma>v). (A.8)

Therefore, by Theorem 2.9, (A.7) and (A.8), we have for every v>0,

\sup_{\alpha\in(0,1)}\bigg|\mathbb{P}\Big(|\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}>c(\alpha)\Big)-\alpha\bigg| \lesssim\sup_{\alpha\in(0,1)}\bigg|\mathbb{P}\Big(\big|\sum_{i=1}^{n}z_{i}/\sqrt{n}\big|_{\infty}>c(\alpha)\Big)-\alpha\bigg|+o(1)
\lesssim\pi(v)+\mathbb{P}(\Gamma>v)+o(1).

Furthermore, following the same spirit as the proof of Theorem 3.2 in Chernozhukov et al., (2013), we see that

\sup_{\alpha\in(0,1)}\bigg|\mathbb{P}\Big(\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}>c(\alpha)\Big)-\alpha\bigg|\lesssim\pi(v)+\mathbb{P}(\Gamma>v)+\zeta_{1}\sqrt{1\lor\log(p/\zeta_{1})}+\zeta_{2}+o(1).

Since \zeta_{1}\sqrt{1\lor\log(p/\zeta_{1})}=o(1) and \zeta_{2}=o(1) by Theorem 2.8, it remains to choose v>0 such that \pi(v)=o(1) and \mathbb{P}(\Gamma>v)=o(1). Let v\asymp s\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}+|\widehat{\beta}-\beta^{*}|_{1}. Then \mathbb{P}(\Gamma>v)=o(1) and \pi(v)=o(1) follow from Theorem 2.11 and the scaling assumption. ∎
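In practice the critical value c(\alpha) used in this proof is computed by the multiplier bootstrap. The sketch below assumes that scores is an n\times q array whose i-th row approximates the estimated influence term \widehat{\Omega}\psi(\widehat{\varepsilon}_{i})X_{i-1}w(X_{i-1}); the function name and its defaults are illustrative.

    import numpy as np

    def bootstrap_critical_value(scores, alpha=0.05, B=2000, seed=0):
        # Multiplier bootstrap (sketch): the (1 - alpha) quantile of the sup-norm
        # of the Gaussian-multiplier-weighted score sum approximates c(alpha).
        rng = np.random.default_rng(seed)
        n = scores.shape[0]
        stats = np.empty(B)
        for b in range(B):
            e = rng.standard_normal(n)                     # i.i.d. N(0, 1) multipliers
            stats[b] = np.abs(e @ scores).max() / np.sqrt(n)
        return np.quantile(stats, 1 - alpha)

Rejecting whenever \sqrt{n}|\check{\beta}-\beta_{0}|_{\infty} exceeds this quantile then gives a simultaneous level-\alpha test, which is the content of Theorem 2.12.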

Appendix B Proofs of Results in Section 3

Proof of Lemma 3.1.

Define the filtration \{\mathcal{F}_{i}\} with \mathcal{F}_{i}=\sigma(\varepsilon_{i},\varepsilon_{i-1},\dots), and let P_{j}(\cdot)=\mathbb{E}(\cdot|\mathcal{F}_{j})-\mathbb{E}(\cdot|\mathcal{F}_{j-1}) be the associated projection operator. Since G(X_{i}) is \mathcal{F}_{i}-measurable, P_{j}(G(X_{i}))=0 for j\geq i+1. We can write

\sum_{i=1}^{n}\big(G(X_{i})-\mathbb{E}G(X_{i})\big)=\sum_{j=-\infty}^{n}\bigg(\sum_{i=1}^{n}P_{j}(G(X_{i}))\bigg)=:\sum_{j=-\infty}^{n}L_{j},

where L_{j}=\sum_{i=1}^{n}P_{j}(G(X_{i})). By the Markov inequality, for \lambda>0, we have

\mathbb{P}\bigg(\sum_{i=1}^{n}G(X_{i})-\mathbb{E}G(X_{i})\geq 2x\bigg) \leq\mathbb{P}\bigg(\sum_{j=-\infty}^{-s}L_{j}\geq x\bigg)+\mathbb{P}\bigg(\sum_{j=-s+1}^{n}L_{j}\geq x\bigg)
\leq\mathrm{e}^{-\lambda x}\,\mathbb{E}\bigg[\exp\bigg\{\lambda\sum_{j=-\infty}^{-s}L_{j}\bigg\}\bigg]+\mathrm{e}^{-\lambda x}\,\mathbb{E}\bigg[\exp\bigg\{\lambda\sum_{j=-s+1}^{n}L_{j}\bigg\}\bigg], (B.1)

for some s>0 to be determined later. We shall bound the right-hand side of (B.1) with a suitable choice of \lambda>0. Observing that \{L_{j}\}_{j\leq n} is a sequence of martingale differences with respect to \{\mathcal{F}_{j}\}, we seek an upper bound on \mathbb{E}[\mathrm{e}^{\lambda L_{j}}|\mathcal{F}_{j-1}]. It follows that

|L_{j}| \leq\sum_{i=1\lor j}^{n}\min\Big\{\big|\mathbb{E}[G(X_{i})|\mathcal{F}_{j}]-\mathbb{E}[G(X_{i})|\mathcal{F}_{j-1}]\big|,2B\Big\}
\leq\sum_{i=1\lor j}^{n}\min\Big\{\|A^{i-j}\|_{\infty}\,\mathbb{E}\big[|\varepsilon_{j}-\varepsilon_{j}^{\prime}|_{\infty}\big|\mathcal{F}_{j}\big],2B\Big\}
\leq\sum_{i=1\lor j}^{n}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}, (B.2)

where \varepsilon^{\prime}_{j} is an i.i.d. copy of \varepsilon_{j} and \eta_{j}=\mathbb{E}\big[|\varepsilon_{j1}-\varepsilon_{j1}^{\prime}|\,\big|\,\mathcal{F}_{j}\big].

Denote s=\lfloor\tau\log p/\log(1/\rho)\rfloor+1, so that s is a positive integer and p\rho^{s/\tau}\leq 1. For -s<j\leq 0, we have

|L_{j}| \leq\sum_{i=0}^{\infty}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}
\leq\sum_{i=0}^{s-1}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}+\sum_{i=s}^{\infty}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}
\leq 2sB+\sum_{i=0}^{\infty}\min\Big\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\Big\}.

For 0<j\leq n, we also have

|L_{j}|\leq\sum_{i=j}^{\infty}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}\leq 2sB+\sum_{i=0}^{\infty}\min\Big\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\Big\}.

Basic algebra shows that

\mathbb{E}[|L_{j}|^{k}|\mathcal{F}_{j-1}] \overset{(1)}{\leq}\mathbb{E}\bigg[\Big(2sB+\sum_{i=0}^{\infty}\min\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\}\Big)^{k}\bigg]
\leq\mathbb{E}\bigg[2^{k}\Big((2sB)^{k}+\Big(\sum_{i=0}^{\infty}\min\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\}\Big)^{k}\Big)\bigg]
\leq 2^{k}\bigg[(2sB)^{k}+\Big(\sum_{i=0}^{\infty}\Big\|\min\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\}\Big\|_{k}\Big)^{k}\bigg], (B.3)

where (1) comes from the independence of \eta_{j} and \mathcal{F}_{j-1}. To analyze (B.3), we further compute

\Big\|\min\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\}\Big\|_{k} =\Big\|2B\,\mathbb{I}\Big(\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\geq 2B\Big)+\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\,\mathbb{I}\Big(\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\leq 2B\Big)\Big\|_{k}
\leq 2B\bigg(\mathbb{P}\Big(\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\geq 2B\Big)\bigg)^{1/k}+\mathbb{E}\Big[\Big(\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\Big)^{2}(2B)^{k-2}\Big]^{1/k}
\leq\Big(4\sigma^{2}\frac{\gamma^{2}}{\rho^{2}}\Big)^{1/k}\rho^{2i/(\tau k)}(2B)^{1-2/k}. (B.4)

Plugging (B.4) into (B.3) yields, for some constants C_{1},C_{2}>0, that

\mathbb{E}[|L_{j}|^{k}|\mathcal{F}_{j-1}] \leq 2^{k}\bigg[(2sB)^{k}+4\sigma^{2}\frac{\gamma^{2}}{\rho^{2}}(2B)^{k-2}\Big(\frac{1}{1-\rho^{2/(\tau k)}}\Big)^{k}\bigg]
\overset{(1)}{\leq}2^{k}\bigg[(2sB)^{k}+4\sigma^{2}\frac{\gamma^{2}}{\rho^{2}}(2B)^{k-2}\Big(\frac{\tau k}{2}\Big)^{k}\rho^{-2/\tau}\big(\log(1/\rho)\big)^{-k}\bigg]
\overset{(2)}{\leq}2^{k}\bigg[(2sB)^{k}+C_{1}\gamma^{2}B^{-2}C_{2}^{k}B^{k}\tau^{k}k!\bigg]
\leq\gamma^{2}(Bs\tau)^{k}k![4+C_{1}B^{-2}(2C_{2})^{k}]\leq\gamma^{2}(Bs\tau)^{k}k!(1+C_{1}B^{-2})(4+2C_{2})^{k}, (B.5)

where (1) uses the inequality 1-x\geq x\log(1/x) for x\in(0,1), and (2) uses Stirling's formula and the fact that \rho^{-2/\tau}\leq\rho^{-2}. Let \tilde{C}_{1}=1+C_{1}B^{-2} and \tilde{C}_{2}=4+2C_{2}. Then we obtain

\mathbb{E}\big[\mathrm{e}^{\lambda L_{j}}|\mathcal{F}_{j-1}\big] \leq 1+\sum_{k=2}^{\infty}\Big[\tilde{C}_{1}\gamma^{2}(\tilde{C}_{2}Bs\tau\lambda)^{k}\Big]=1+\frac{\tilde{C}_{1}\gamma^{2}\tilde{C}_{2}^{2}(Bs\tau)^{2}\lambda^{2}}{1-\tilde{C}_{2}Bs\tau\lambda}
\leq\exp\bigg\{\frac{\tilde{C}_{1}\gamma^{2}\tilde{C}_{2}^{2}(Bs\tau)^{2}\lambda^{2}}{1-\tilde{C}_{2}Bs\tau\lambda}\bigg\}. (B.6)

Furthermore,

\mathbb{E}\Big[\exp\Big\{\lambda\sum_{j=-s+1}^{n}L_{j}\Big\}\Big]\leq\exp\bigg\{\frac{\tilde{C}_{1}\gamma^{2}\tilde{C}_{2}^{2}(Bs\tau)^{2}(s+n)\lambda^{2}}{1-\tilde{C}_{2}Bs\tau\lambda}\bigg\}. (B.7)

Taking \lambda=x(\tilde{C}_{2}Bs\tau x+2\tilde{C}_{1}\gamma^{2}\tilde{C}_{2}^{2}(Bs\tau)^{2}(s+n))^{-1}, by (B.1) and (B.7) we have

\mathbb{P}\bigg(\sum_{j=-s+1}^{n}L_{j}\geq x\bigg)\leq\exp\bigg\{-\frac{x^{2}}{(1+C_{1}B^{-2})\gamma^{2}B^{2}\tau^{4}(\log p)^{2}(\tau\log p+n)+C_{4}\tau^{2}B(\log p)x}\bigg\}. (B.8)

Similarly, for j\leq-s, since p\leq\rho^{-s/\tau},

|L_{j}| \leq\sum_{i=0}^{\infty}\min\Big\{\rho^{-1}\gamma\rho^{(i-j-s)/\tau}\eta_{j},2B\Big\}.

By the same argument, we immediately have

\mathbb{P}\bigg(\sum_{j=-\infty}^{-s}L_{j}\geq x\bigg)\leq\exp\bigg\{-\frac{x^{2}}{C_{3}\gamma^{2}\tau^{3}+C_{4}\tau Bx}\bigg\}, (B.9)

where C_{3}=32\mathrm{e}^{2}\sigma^{2}(2\pi)^{-1/2}[\rho^{2}\log(1/\rho)]^{-3} and C_{4}=8\mathrm{e}[\log(1/\rho)]^{-1}. By (B.8), (B.9) and a symmetrization argument, we complete the proof. ∎
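The Bernstein-type inequality just established can be sanity-checked by simulation. The sketch below estimates tail probabilities of a centered sum of a bounded functional of a VAR(1) path; the functional G, the Laplace innovations, and all constants are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, reps, a = 400, 5, 2000, 0.4

    def G(x):
        # a bounded (|G| <= B = 1) Lipschitz functional of the state
        return np.tanh(x[0])

    sums = np.empty(reps)
    for r in range(reps):
        X = np.zeros(p)
        s = 0.0
        for _ in range(n):
            X = a * X + rng.laplace(size=p)   # symmetric innovations, so E[G(X_i)] = 0
            s += G(X)
        sums[r] = s

    for x in (20.0, 40.0, 60.0):
        # the empirical tail decays roughly like exp(-x^2 / (v + b x)), as in Lemma 3.1
        print(x, np.mean(np.abs(sums) >= x))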

Proof of Corollary 3.3.

It follows from the proof of Lemma 3.1 without any extra technical difficulty. ∎

Proof of Theorem 3.5.

Let G_{jk}:\mathbb{R}^{p}\to\mathbb{R} be defined as G_{jk}(x)=\big(xx^{\top}w(x)\big)_{jk}=x_{j}x_{k}w(x) for j,k=1,\dots,p, and hence |G_{jk}(x)|\leq T. Let u(x)=w^{1/3}(x). Observe that

|G_{jk}(x)-G_{jk}(y)| \leq|x_{j}u(x)\,x_{k}u(x)-y_{j}u(y)\,y_{k}u(y)|u(x)+|y_{j}u(y)\,y_{k}u(y)||u(x)-u(y)|
\leq|x_{j}u(x)-y_{j}u(y)||x_{k}u(x)|+|x_{k}u(x)-y_{k}u(y)||y_{j}u(y)|+T^{2}|u(x)-u(y)|
\leq 3T|x-y|_{\infty}.

By Lemma 3.1, taking x=c\gamma\tau^{2}Tn^{-1/2}(\log p)^{3/2}, we have

\mathbb{P}\bigg(|\widehat{\Sigma}_{x,jk}-\Sigma_{x,jk}|\geq cTx\bigg)=\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}G_{jk}(X_{i-1})-\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}G_{jk}(X_{i-1})\Big|\geq cTx\bigg)\leq 4p^{-c_{1}}.

A union bound yields

\mathbb{P}\bigg(\|\widehat{\Sigma}_{x}-\Sigma_{x}\|_{\max}\geq cTx\bigg)\leq 4p^{-c_{0}},

where c_{0}=c_{1}-2>0, since the union bound runs over the p^{2} entries of \widehat{\Sigma}_{x}. ∎

Proof of Theorem 3.6.

Denote H=\nabla^{2}L_{n}(\beta^{*}). First we write

\|H-\Sigma\|_{\max}=\max_{1\leq k\leq p}\Big\|\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\varepsilon_{ik})X_{i-1}X_{i-1}^{\top}w(X_{i-1})-\mathbb{E}[\psi^{\prime}(\varepsilon_{ik})X_{i-1}X_{i-1}^{\top}w(X_{i-1})]\Big\|_{\max}.

Using Corollary 3.3, it follows from the same argument as in the proof of Theorem 3.5 that, for some constant c_{1}>1 and each fixed 1\leq k\leq p, with probability at least 1-4p^{-c_{1}},

\Big\|\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\varepsilon_{ik})X_{i-1}X_{i-1}^{\top}w(X_{i-1})-\mathbb{E}[\psi^{\prime}(\varepsilon_{ik})X_{i-1}X_{i-1}^{\top}w(X_{i-1})]\Big\|_{\max}\lesssim\gamma\tau^{2}T^{2}n^{-1/2}(\log p)^{3/2}.

Finally, a union bound over 1kp1\leq k\leq p yields the conclusion. ∎

Proof of Lemma 3.7.

The strategy is to consider each component of \widehat{\mu} and take a union bound. Observe that

|\widehat{\mu}_{k}-\mu_{k}|\leq\bigg|\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\widehat{\varepsilon}_{ik})-\mathbb{E}[\psi^{\prime}(\widehat{\varepsilon}_{ik})]\bigg|+|\mathbb{E}\psi^{\prime}(\widehat{\varepsilon}_{ik})-\mathbb{E}\psi^{\prime}(\varepsilon_{ik})|,\quad k=1,2,\dots,p.

Since |\psi^{\prime\prime}| is bounded, by the mean value theorem we have that, for some \xi between x and y,

|\psi^{\prime}(x)-\psi^{\prime}(y)|=|\psi^{\prime\prime}(\xi)(x-y)|\lesssim|x-y|.

So it can be verified that \psi^{\prime}(X_{ik}-X_{i-1}^{\top}\widehat{\beta}_{k}) satisfies the conditions in Corollary 2.5 of Liu and Zhang, (2021). By that corollary, it holds that

\bigg|\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\widehat{\varepsilon}_{ik})-\mathbb{E}[\psi^{\prime}(\widehat{\varepsilon}_{ik})]\bigg|\lesssim\gamma\tau^{2}\sqrt{\frac{\log p}{n}}

with probability at least 1-2p^{-c} for some positive constant c. Moreover, by the Lipschitz property of \psi^{\prime} above,

\max_{1\leq k\leq p}|\mathbb{E}\psi^{\prime}(\widehat{\varepsilon}_{ik})-\mathbb{E}\psi^{\prime}(\varepsilon_{ik})|\lesssim\max_{1\leq k\leq p}\mathbb{E}\big[|X_{i-1}^{\top}(\widehat{\beta}_{k}-\beta^{*}_{k})|\big]\lesssim|\widehat{\beta}-\beta^{*}|_{1},

where the last inequality comes from the fact that X_{i-1} has bounded second moment. ∎
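As a small illustration of the plug-in quantity \widehat{\mu} analyzed here: for the Huber score, \psi^{\prime}(x)=\mathbb{I}(|x|\leq c), so \widehat{\mu}_{k} is simply the fraction of residuals falling inside the clipping region. The helper below is a sketch under that choice of \psi; the names are illustrative.

    import numpy as np

    def mu_hat(X, B_hat, c=1.345):
        # Plug-in estimate of mu_k = E[psi'(eps_ik)] from residuals (sketch),
        # with B_hat the p x p pilot coefficient matrix (row k = equation k).
        resid = X[1:] - X[:-1] @ B_hat.T      # n x p matrix of eps_hat_{ik}
        return (np.abs(resid) <= c).mean(axis=0)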

Appendix C Proofs of Results in Section 4

Before proving Theorem 4.1, we first state and prove the corresponding lemmas in the outline given at the end of Section 4.

Lemma C.1.

Suppose \mathbb{E}[\varepsilon_{ik}^{2}]\leq\sigma^{2} for all 1\leq k\leq p and the odd function \psi(\cdot) satisfies |\psi(\cdot)|\leq C and |\psi^{\prime}(\cdot)|\leq C. Then we have

\mathbb{P}\bigg(\big|(T_{Y}-T_{Y,m})/\sqrt{n}\big|_{\infty}\geq x\bigg)\leq\frac{c_{1}sp^{3}\gamma^{2}\rho^{m/\tau}}{x^{2}}=:f_{1}(x,m),

for some constant c_{1}>0.

Proof of Lemma C.1.

Let D_{i}=Y_{i}-Y_{i,m}. By Markov's inequality, we have

\mathbb{P}\bigg(\sum_{i=1}^{n}D_{ij}/\sqrt{n}\geq x\bigg)\leq\frac{\mathbb{E}\big[\big(\sum_{i=1}^{n}D_{ij}/\sqrt{n}\big)^{2}\big]}{x^{2}}. (C.1)

Notice that the martingale differences \{D_{ij}\}_{i=1}^{n} satisfy

|D_{ij}|\lesssim\sqrt{s}|X_{i}-X_{i,m}|_{\infty}.

Thus,

\|D_{ij}\|_{2}\lesssim\sqrt{s}\big\||X_{i}-X_{i,m}|_{\infty}\big\|_{2}\leq\sqrt{s}\sum_{l=m+1}^{\infty}\|A^{l}\|_{\infty}\big\||\varepsilon_{i-l}|_{\infty}\big\|_{2}\lesssim\sqrt{s}p\gamma\rho^{m/\tau}.

By the Burkholder inequality (Burkholder, (1973)), we have

\mathbb{E}\bigg[\bigg(\sum_{i=1}^{n}D_{ij}/\sqrt{n}\bigg)^{2}\bigg] \lesssim\mathbb{E}[|D_{ij}|^{2}]\lesssim sp^{2}\gamma^{2}\rho^{2m/\tau}. (C.2)

Hence, by (C.1),

\mathbb{P}\bigg(\sum_{i=1}^{n}D_{ij}/\sqrt{n}\geq x\bigg)\leq\frac{c_{1}^{\prime}sp^{2}\gamma^{2}\rho^{2m/\tau}}{x^{2}}.

Finally, a symmetrization argument and a union bound over the coordinates, together with \rho^{2m/\tau}\leq\rho^{m/\tau}, give the desired result. ∎
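The m-approximation Y_{i,m} behind this lemma couples the process with a version driven only by the m most recent innovations, and the coupling error decays like \rho^{m/\tau}. This geometric decay is easy to see numerically; the diagonal transition matrix and all constants below are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    p, a, N = 10, 0.5, 200
    A = a * np.eye(p)

    def terminal_state(eps):
        # run the VAR(1) recursion X_i = A X_{i-1} + eps_i from zero
        X = np.zeros(p)
        for e in eps:
            X = A @ X + e
        return X

    eps = rng.standard_normal((N, p))
    for m in (2, 5, 10, 20):
        eps_m = eps.copy()
        eps_m[:-m] = rng.standard_normal((N - m, p))   # redraw innovations older than m
        gap = np.abs(terminal_state(eps) - terminal_state(eps_m)).max()
        print(m, gap)    # decays geometrically in m, like rho^{m/tau} (= a^m here)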

Lemma C.2.

Under the assumptions in Lemma C.1, it holds that

\mathbb{P}\big(|T_{Y,S}|_{\infty}/\sqrt{n}\geq x\big)\leq 2p\exp\bigg\{-\frac{nx^{2}}{C_{1}\sqrt{s}T\sqrt{n}x+C_{2}m\omega sT^{2}\sigma^{2}}\bigg\}=:f_{2}(x,m).
Proof of Lemma C.2.

By the property of \psi(\cdot) and the mean value theorem, we have |\psi(x)|\leq C|x|. Consider the first coordinate (T_{Y,S})_{1} of T_{Y,S}. We can write (T_{Y,S})_{1}=\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}, where r=m\omega. Observe that \{Y_{i,m,1}\} is a martingale difference sequence adapted to the filtration \{\mathcal{F}_{i}=\sigma(\varepsilon_{i},\varepsilon_{i-1},\dots)\} and that |Y_{i,m,1}|\leq|\psi(\varepsilon_{ik})|\sqrt{s}T\leq C\sqrt{s}T. We shall establish a Bernstein-type inequality for the sum of martingale differences (T_{Y,S})_{1}:

\mathbb{P}((T_{Y,S})_{1}\geq x)\leq\mathrm{e}^{-\lambda x}\,\mathbb{E}\mathrm{e}^{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}},\quad\text{for any }\lambda>0. (C.3)

We now bound \mathbb{E}\mathrm{e}^{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}} from above. By the tower property,

\mathbb{E}\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}\Big\} =\mathbb{E}\bigg[\mathbb{E}\Big[\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}\Big\}\Big|\mathcal{F}_{j_{r}-1}\Big]\bigg]
=\mathbb{E}\bigg[\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r-1}}Y_{i,m,1}\Big\}\mathbb{E}\big[\mathrm{e}^{\lambda Y_{j_{r},m,1}}\big|\mathcal{F}_{j_{r}-1}\big]\bigg] (C.4)

Now, consider

\mathbb{E}[\mathrm{e}^{\lambda Y_{j_{r},m,1}}|\mathcal{F}_{j_{r}-1}] =1+\mathbb{E}\Big[\sum_{t=2}^{\infty}\frac{(\lambda Y_{j_{r},m,1})^{t}}{t!}\Big|\mathcal{F}_{j_{r}-1}\Big]\leq 1+\mathbb{E}\Big[\lambda^{2}T^{2}s\psi^{2}(\varepsilon_{j_{r}k})\sum_{t=0}^{\infty}(\lambda T\sqrt{s}C)^{t}\Big]
\overset{(1)}{\leq}1+\frac{C\lambda^{2}T^{2}s\sigma^{2}}{1-C\lambda T\sqrt{s}}\leq\exp\bigg\{\frac{C\lambda^{2}T^{2}s\sigma^{2}}{1-C\lambda T\sqrt{s}}\bigg\}, (C.5)

where inequality (1) makes use of the fact that \psi^{2}(\varepsilon_{j_{r},k})\leq C^{2}\varepsilon_{j_{r},k}^{2}. Plugging (C.5) into (C.4), we obtain

\mathbb{E}\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}\Big\}\leq\exp\bigg\{\frac{C\lambda^{2}T^{2}s\sigma^{2}}{1-C\lambda T\sqrt{s}}\bigg\}\mathbb{E}\bigg[\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r-1}}Y_{i,m,1}\Big\}\bigg] (C.6)

Iterating this procedure yields

\mathbb{E}\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}\Big\}\leq\exp\bigg\{\frac{Cm\omega\lambda^{2}T^{2}s\sigma^{2}}{1-C\lambda T\sqrt{s}}\bigg\} (C.7)

Choose \lambda=x(CT\sqrt{s}x+2Cm\omega T^{2}s\sigma^{2})^{-1}; then by (C.3) and (C.7) we have

\mathbb{P}((T_{Y,S})_{1}\geq x)\leq\exp\bigg\{-\frac{x^{2}}{C_{1}T\sqrt{s}x+C_{2}m\omega T^{2}s\sigma^{2}}\bigg\}.

A symmetrization argument and a union bound deliver the desired result. ∎
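The decomposition into T_{Y,L} and T_{Y,S} rests on an alternating big/small block partition of the time indices: large blocks of length M are kept for the Gaussian approximation, while small blocks of length m separate them and are discarded. A minimal sketch of this bookkeeping follows, with M\asymp n^{1/2} and m\asymp\tau\log p as in the proof of Theorem 2.9; the function name is illustrative.

    import numpy as np

    def block_indices(n, M, m):
        # Alternating big/small block partition (sketch): large blocks feed
        # T_{Y,L}; the small separating blocks form T_{Y,S}.
        large, small, i = [], [], 0
        while i < n:
            large.append(np.arange(i, min(i + M, n)))
            i += M
            if i < n:
                small.append(np.arange(i, min(i + m, n)))
                i += m
        return large, small

    large, small = block_indices(n=1000, M=32, m=8)
    print(len(large), "large blocks,", len(small), "small blocks")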

Lemma C.3.

Suppose the scaling condition sT^{2}(\log(pn))^{7}/n\leq c_{3}n^{-c_{4}} holds, and assume that \mathbb{E}[X_{ik}^{2}]\leq C^{\prime} for all 1\leq k\leq p. Then we have the following Gaussian approximation result:

\mathcal{U}:=\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\big(|T_{Y,L}/\sqrt{n}|_{\infty}\leq t\big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\bigg|\leq cn^{-c^{\prime}}

for some constants c,c>0c,c^{\prime}>0.

Proof of Lemma C.3.

Recall that \xi_{b}=\sum_{i\in L_{b}}Y_{i,m}/\sqrt{M}; thus

\mathcal{U}=\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\Big(\Big|\frac{1}{\sqrt{\omega}}\sum_{b=1}^{\omega}\xi_{b}\Big|_{\infty}\leq t\Big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\bigg|.

Observe that \xi_{1},\xi_{2},\dots,\xi_{\omega} are independent random variables. We shall apply Corollary 2.1 of Chernozhukov et al., (2013) by verifying condition (E.1) therein. For completeness, the conditions are stated below.

  • (i)

    c_{1}\leq\mathbb{E}[\xi_{bj}^{2}]\leq c_{2} for all 1\leq j\leq p.

  • (ii)

    \max_{k=1,2}\mathbb{E}[|\xi_{bj}|^{2+k}/B_{n}^{k}]+\mathbb{E}[\exp(|\xi_{bj}/B_{n}|)]\leq 4 for some B_{n}>0 and all 1\leq j\leq p.

  • (iii)

    B_{n}^{2}(\log(pn))^{7}/n\leq c_{3}n^{-c_{4}}.

To verify condition (i), we see that

\mathbb{E}[\xi_{bj}^{2}]\leq c\,\sigma^{2}\mathbb{E}\big[\mathbb{E}[\Omega_{x,j}^{\top}X_{i}w(X_{i})|\varepsilon_{i-m},\dots,\varepsilon_{i}]^{2}\big]\leq c\,\mathbb{E}[(\Omega_{x,j}^{\top}X_{i}w(X_{i}))^{2}]\leq c\,\Omega_{x,j}^{\top}\Sigma_{x}\Omega_{x,j}\leq c\,\Omega_{x,jj},

where \Omega_{x,j} is the j-th row of \Omega_{x} and \Omega_{x,jj} is the j-th diagonal entry of \Omega_{x}. Now we check condition (ii). By Theorem 3.2 of Burkholder, (1973), we have for k\geq 2,

\mathbb{E}[|\xi_{bj}|^{k}]\leq 18k^{k}\mathbb{E}\big[|Y_{ij,m}|^{k}\big]\lesssim k!\,\mathrm{e}^{k}\mathbb{E}\big[|Y_{ij,m}|^{2}\big](\sqrt{s}T)^{k-2}\lesssim k!\,\mathrm{e}^{k}(\sqrt{s}T)^{k-2}.

Therefore, taking B_{n}=C\sqrt{s}T for sufficiently large C>0, we have

\mathbb{E}[\exp(|\xi_{bj}/B_{n}|)]\leq 1+C_{1}\sum_{k=1}^{\infty}(\mathrm{e}/C)^{k}<2.

Moreover, for a suitable choice of C>0,

\max_{k=1,2}\mathbb{E}[|\xi_{bj}|^{2+k}/B_{n}^{k}]<2.

Hence, condition (ii) is satisfied. Condition (iii) is guaranteed by the scaling assumption. ∎
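The conclusion of Lemma C.3 can be visualized with surrogate block sums: independent, exponential-tailed vectors standing in for \xi_{b}. The sketch below compares the distribution of the maximum of their normalized sum with the corresponding Gaussian maximum; the Laplace surrogate and all sizes are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    w, p, reps = 60, 40, 3000
    xi = rng.laplace(size=(reps, w, p))                    # surrogate block sums xi_b
    T = np.abs(xi.sum(axis=1)).max(axis=1) / np.sqrt(w)    # |w^{-1/2} sum_b xi_b|_inf

    Z = np.sqrt(2.0) * rng.standard_normal((reps, p))      # Var of Laplace(0, 1) is 2
    maxZ = np.abs(Z).max(axis=1)

    for q in (0.5, 0.9, 0.95):
        t = np.quantile(maxZ, q)
        print(q, (T <= t).mean(), (maxZ <= t).mean())      # the two CDFs nearly agree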

Now, we are ready to give the proof of Theorem 4.1.

Proof of Theorem 4.1.

By triangle inequality,

\mathcal{H} \leq\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\big(|T_{Y}/\sqrt{n}|_{\infty}\leq t\big)-\mathbb{P}\big(|T_{Y,L}/\sqrt{n}|_{\infty}\leq t\big)\bigg|+\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\big(|T_{Y,L}/\sqrt{n}|_{\infty}\leq t\big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\bigg|
=:I+II. (C.8)

For any η>0\eta>0, elementary calculation shows that

I \leq\mathbb{P}\big(\big|(T_{Y}-T_{Y,L})/\sqrt{n}\big|_{\infty}>\eta\big)+\sup_{t\in\mathbb{R}}\mathbb{P}\Big(\big||T_{Y,L}/\sqrt{n}|_{\infty}-t\big|\leq\eta\Big)
\leq\mathbb{P}\Big(\big|(T_{Y}-T_{Y,m})/\sqrt{n}\big|_{\infty}>\frac{\eta}{2}\Big)+\mathbb{P}\Big(\big|T_{Y,S}/\sqrt{n}\big|_{\infty}>\frac{\eta}{2}\Big)+\sup_{t\in\mathbb{R}}\mathbb{P}\Big(\big||T_{Y,L}/\sqrt{n}|_{\infty}-t\big|\leq\eta\Big) (C.9)

By Lemmas C.1 and C.2,

\mathbb{P}\Big(\big|(T_{Y}-T_{Y,m})/\sqrt{n}\big|_{\infty}>\frac{\eta}{2}\Big)\leq f_{1}(\eta/2,m)\quad\text{and}\quad\mathbb{P}\Big(\big|T_{Y,S}/\sqrt{n}\big|_{\infty}>\frac{\eta}{2}\Big)\leq f_{2}(\eta/2,m). (C.10)

By Lemma C.3 and Theorem 3 of Chernozhukov et al., (2015), we obtain that

\sup_{t\in\mathbb{R}}\mathbb{P}\Big(\big||T_{Y,L}/\sqrt{n}|_{\infty}-t\big|\leq\eta\Big) \leq\sup_{t\in\mathbb{R}}\mathbb{P}\Big(\big||Z|_{\infty}-t\big|\leq\eta\Big)+\mathcal{U}
\lesssim\eta\sqrt{\log p}+\eta\sqrt{\log(1/\eta)}+cn^{-c^{\prime}}, (C.11)

and that

II=\mathcal{U}\leq cn^{-c^{\prime}}. (C.12)

By (C.9), (C.10), (C.11) and (C.12), we obtain the inequality stated in the theorem. ∎

References

  • Basu and Michailidis, (2015) Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567.
  • Bernstein, (1946) Bernstein, S. (1946). The theory of probabilities.
  • Bickel, (1975) Bickel, P. J. (1975). One-step Huber estimates in the linear model. Journal of the American Statistical Association, 70(350):428–434.
  • Bühlmann and Van De Geer, (2011) Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
  • Burkholder, (1973) Burkholder, D. L. (1973). Distribution function inequalities for martingales. The Annals of Probability, pages 19–42.
  • Cai et al., (2011) Cai, T., Liu, W., and Luo, X. (2011). A constrained 1\ell_{1} minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607.
  • Candes and Tao, (2007) Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
  • Chernozhukov et al., (2015) Chernozhukov, V., Chetverikov, D., and Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields, 162(1):47–70.
  • Chernozhukov et al., (2013) Chernozhukov, V., Chetverikov, D., Kato, K., et al. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819.
  • Fan et al., (2017) Fan, J., Li, Q., and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society. Series B, Statistical methodology, 79(1):247.
  • Friedman et al., (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
  • Han et al., (2015) Han, F., Lu, H., and Liu, H. (2015). A direct estimation of high dimensional stationary vector autoregressions. Journal of Machine Learning Research.
  • Hsu et al., (2008) Hsu, N.-J., Hung, H.-L., and Chang, Y.-M. (2008). Subset selection for vector autoregressive processes using lasso. Computational Statistics & Data Analysis, 52(7):3645–3657.
  • Javanmard and Montanari, (2014) Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909.
  • Juselius, (2006) Juselius, K. (2006). The cointegrated VAR model: methodology and applications. Oxford University Press.
  • Li and Zhu, (2008) Li, Y. and Zhu, J. (2008). L1-norm quantile regression. Journal of Computational and Graphical Statistics, 17(1):163–185.
  • Liu and Zhang, (2021) Liu, L. and Zhang, D. (2021). Robust estimation of high-dimensional vector autoregressive models. arXiv preprint arXiv:2109.10354.
  • Loh, (2017) Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866–896.
  • Loh, (2018) Loh, P.-L. (2018). Scale calibration for high-dimensional robust regression. arXiv preprint arXiv:1811.02096.
  • Loh and Wainwright, (2012) Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, pages 1637–1664.
  • Loh and Wainwright, (2017) Loh, P.-L. and Wainwright, M. J. (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455–2482.
  • Massart, (2007) Massart, P. (2007). Concentration inequalities and model selection, volume 6. Springer.
  • Meinshausen and Bühlmann, (2006) Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
  • Merlevède et al., (2009) Merlevède, F., Peligrad, M., Rio, E., et al. (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In High dimensional probability V: the Luminy volume, pages 273–292. Institute of Mathematical Statistics.
  • Nardi and Rinaldo, (2011) Nardi, Y. and Rinaldo, A. (2011). Autoregressive process modeling via the lasso procedure. Journal of Multivariate Analysis, 102(3):528–549.
  • Negahban and Wainwright, (2011) Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069–1097.
  • Negahban et al., (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.
  • Pan et al., (2021) Pan, X., Sun, Q., and Zhou, W.-X. (2021). Iteratively reweighted \ell_{1}-penalized robust regression. Electronic Journal of Statistics, 15(1):3287–3348.
  • Shan, (2005) Shan, J. (2005). Does financial development 'lead' economic growth? A vector auto-regression appraisal. Applied Economics, 37(12):1353–1367.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
  • Van de Geer et al., (2014) Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.
  • Wu, (2005) Wu, W. B. (2005). Nonlinear system theory: Another look at dependence. Proceedings of the National Academy of Sciences, 102(40):14150–14154.
  • Wu and Liu, (2009) Wu, Y. and Liu, Y. (2009). Variable selection in quantile regression. Statistica Sinica, pages 801–817.
  • Wu and Zhou, (2010) Wu, Y. and Zhou, X. (2010). VAR models: estimation, inferences, and applications. In Handbook of Quantitative Finance and Risk Management, pages 1391–1398. Springer.
  • Yuan and Lin, (2007) Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.
  • Zhang and Zhang, (2014) Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.
  • Zhang, (2019) Zhang, D. (2019). Robust estimation of the mean and covariance matrix for high dimensional time series. Statistica Sinica, to appear.
  • Zhang and Wu, (2017) Zhang, D. and Wu, W. B. (2017). Gaussian approximation for high dimensional time series. The Annals of Statistics, 45(5):1895–1919.
  • Zhang and Cheng, (2017) Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768.
  • Zou, (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.