
High-dimensional Simultaneous Inference on Non-Gaussian VAR Model via De-biased Estimator

Linbo Liu and Danna Zhang
Department of Mathematics, UC San Diego
(October 26, 2021)
Abstract

Simultaneous inference for high-dimensional non-Gaussian time series is a challenging problem. Such tasks require not only robust estimation of the coefficients of the random process, but also the derivation of the limiting distribution for sums of dependent random variables. In this paper, we propose a multiplier bootstrap procedure to conduct simultaneous inference for the transition coefficients in high-dimensional non-Gaussian vector autoregressive (VAR) models. This bootstrap-assisted procedure allows the dimension of the time series to grow exponentially fast in the number of observations. As a test statistic, a de-biased estimator is constructed for simultaneous inference. Unlike the traditional de-biased/de-sparsifying Lasso estimator, a robust convex loss function and a normalizing weight function are exploited to avoid any unfavorable behavior at the tail of the distribution. We develop Gaussian approximation theory for the VAR model to derive the asymptotic distribution of the de-biased estimator and propose a multiplier bootstrap-assisted procedure to obtain critical values under very mild moment conditions on the innovations. As an important tool in the convergence analysis of various estimators, we establish a Bernstein-type concentration inequality for bounded functionals of VAR processes. Numerical experiments verify the validity and efficiency of the proposed method.

1 Introduction

High-dimensional statistics has become increasingly important due to the rapid development of information technology in the past decade. In this paper, we are primarily interested in conducting simultaneous inference, via a de-biased M-estimator, on the transition matrices in a high-dimensional vector autoregressive model with non-Gaussian innovations. An extensive body of work exists on estimation and inference for the coefficient vector in the linear regression setting, and we refer readers to Bühlmann and Van De Geer, (2011) for an overview of recent developments in high-dimensional statistical techniques. The M-estimator is one of the most popular tools among them, having proved successful in signal estimation (Negahban et al., (2012)), support recovery (Loh and Wainwright, (2017)), variable selection (Zou, (2006)) and robust estimation with heavy-tailed noise using nonconvex loss functions (Loh, (2017)). As a penalized M-estimator, the Lasso (Tibshirani, (1996)) also plays an important role in estimating transition coefficients in high-dimensional VAR models beyond linear regression; see for example Hsu et al., (2008), Nardi and Rinaldo, (2011), Basu and Michailidis, (2015) among others. Another line of work achieves such estimation tasks via the Dantzig selector (Candes and Tao, (2007)). Han et al., (2015) proposed a new approach to estimating the transition matrix via a Dantzig-type estimator, solving a linear programming problem. They remarked that this estimation procedure enjoys many advantages, including computational efficiency and weaker assumptions on the transition matrix. However, the aforementioned literature mainly discusses the scenario where Gaussian or sub-Gaussian noise is present.

To deal with heavy-tailed errors, regularized robust methods have been widely studied. For instance, Li and Zhu, (2008) proposed an $\ell_{1}$-regularized quantile regression method in the low-dimensional setting and devised an algorithm to efficiently solve the proposed optimization problem. Wu and Liu, (2009) studied penalized quantile regression from the perspective of variable selection. However, quantile regression and least absolute deviation regression can differ significantly from the mean regression function, especially when the distribution of the noise is asymmetric. To overcome this issue, Fan et al., (2017) developed the robust approximation Lasso (RA-Lasso) estimator based on a penalized Huber loss and proved the feasibility of RA-Lasso for estimating the high-dimensional mean regression. Beyond the linear regression setting, Zhang, (2019) also used the Huber loss to obtain consistent estimates of the mean vector and covariance matrix for high-dimensional time series. Robust estimation of the transition coefficients was studied in Liu and Zhang, (2021) via two types of approaches: Lasso-based and Dantzig-based estimators.

Besides estimation, recent research effort has also turned to high-dimensional statistical inference, such as performing multiple hypothesis testing and constructing simultaneous confidence intervals, both for regression coefficients and for mean vectors of random processes. To tackle the high dimensionality, the idea of low-dimensional projection has been exploited in much of the popular literature. For instance, Javanmard and Montanari, (2014), Van de Geer et al., (2014) and Zhang and Zhang, (2014) constructed the de-sparsifying Lasso by inverting the Karush-Kuhn-Tucker (KKT) condition and derived the asymptotic distribution for the projection of high-dimensional parameters onto a fixed-dimensional space. As an extension of these techniques, Loh, (2018) developed the asymptotic theory of the one-step estimator, allowing the presence of non-Gaussian noise. Employing Gaussian approximation theory (Chernozhukov et al., (2013)), Zhang and Cheng, (2017) proposed a bootstrap-assisted procedure to conduct simultaneous statistical inference, which, as a significant improvement, allows the number of tests to greatly surpass the number of observations. Although a huge body of work exists on inference for regression coefficients, there has been limited research on generalizing these theoretical properties to time series, perhaps due to the technical difficulty of extending Gaussian approximation results to dependent random variables. Zhang and Wu, (2017) adopted the framework of functional dependence measures (Wu, (2005)) to account for temporal dependence and provided Gaussian approximation results for general time series. As an application, they also showed how to construct simultaneous confidence intervals for mean vectors of high-dimensional random processes with asymptotically correct coverage probabilities.

In this paper, we consider simultaneous inference of the transition coefficients in possibly non-Gaussian vector autoregressive (VAR) models with lag $d$:

X_{i}=A_{1}X_{i-1}+A_{2}X_{i-2}+\dots+A_{d}X_{i-d}+\varepsilon_{i},\quad i=1,\dots,n,

where $X_{i}\in\mathbb{R}^{p}$ is the time series, $A_{i}\in\mathbb{R}^{p\times p}$, $i=1,\dots,d$, are the transition matrices, and $\varepsilon_{i}\in\mathbb{R}^{p}$ are the innovation vectors. We allow the dimension $p$ to exceed the number of observations $n$, and even $\log p=o(n^{b})$ for some $b>0$, as is commonly assumed in the high-dimensional regime. Unlike much other work, we do not impose Gaussianity or sub-Gaussianity assumptions on the noise terms $\varepsilon_{i}$.

We are particularly interested in the following simultaneous testing problem:

H_{0}:A_{i}=A_{i}^{0},\quad\text{for all }i=1,\dots,d

versus the alternative hypothesis

H_{1}:A_{i}\neq A_{i}^{0},\quad\text{for some }i=1,\dots,d.

It is worth mentioning that the above problem still involves $p^{2}$ null hypotheses to verify even when the lag $d=1$. We propose to build a de-biased estimator $\check{\beta}$ from some consistent pilot estimator $\widehat{\beta}$ (for example, the one provided in Liu and Zhang, (2021)). There are a few challenges in proving the feasibility of the de-biased estimator as well as its theoretical guarantees: (i) VAR models display temporal dependence across observations, which makes the majority of probabilistic tools, such as the classic Bernstein inequality and Gaussian approximation, inapplicable. (ii) Fat-tailed innovations $\varepsilon_{i}$ imply fat-tailed $X_{i}$ in the VAR model, whereas robust methods for linear regression can assume heavy-tailed $\varepsilon_{i}$ while $X_{i}$ remains sub-Gaussian (Fan et al., (2017) and Zhang and Cheng, (2017)). (iii) We want our simultaneous inference procedure to work in the ultra-high dimensional regime, where $p$ can grow exponentially fast in $n$. These challenges inspire us to establish a new Bernstein-type inequality (section 3) and Gaussian approximation results (section 4) under the framework of the VAR model. Also, we adopt the definition of the spectral decay index to capture the dependence among time series data, as in Liu and Zhang, (2021).

The paper is organized as follows. In section 2, we first present more details and some preparatory definitions for VAR models and propose the test statistic for simultaneous inference via the de-biased estimator, which is constructed through a robust loss function and a weight function on $X_{i}$. The main result, delivering critical values for this test statistic by multiplier bootstrap, is given in section 2.4. In section 3, we complete the required estimation results by establishing a Bernstein inequality. A thorough discussion of Gaussian approximation and its derivation under the VAR model is presented in section 4. Numerical experiments are conducted in section 5 to assess the empirical performance of the multiplier bootstrap procedure.

Finally, we introduce some notation. For a vector $\beta=(\beta_{1},\dots,\beta_{p})^{\top}$, let $|\beta|_{1}=\sum_{i}|\beta_{i}|$, $|\beta|_{2}=(\sum_{i}\beta_{i}^{2})^{1/2}$ and $|\beta|_{\infty}=\max_{i}|\beta_{i}|$ be its $\ell_{1},\ell_{2},\ell_{\infty}$ norms respectively. For a matrix $A=(a_{ij})_{1\leq i,j\leq p}$, let $\lambda_{i}$, $i=1,\dots,p$, be its eigenvalues and $\lambda_{\max}(A),\lambda_{\min}(A)$ be its maximum and minimum eigenvalues respectively. Also let $\rho(A)=\max_{i}|\lambda_{i}|$ be the spectral radius. Denote $\|A\|_{1}=\max_{j}\sum_{i}|a_{ij}|$, $\|A\|_{\infty}=\max_{i}\sum_{j}|a_{ij}|$, and the spectral norm $\|A\|=\|A\|_{2}=\sup_{|x|_{2}\neq 0}|Ax|_{2}/|x|_{2}$. Moreover, let $\|A\|_{\max}=\max_{i,j}|a_{ij}|$ be the entry-wise maximum norm. For a random variable $X$ and $q>0$, define $\|X\|_{q}=(\mathbb{E}|X|^{q})^{1/q}$. For two real numbers $x,y$, set $x\lor y=\max(x,y)$. For two sequences of positive numbers $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}\lesssim b_{n}$ if there exists some constant $C>0$ such that $a_{n}/b_{n}\leq C$ as $n\to\infty$, and write $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$. We use $c_{0},c_{1},\dots$ and $C_{0},C_{1},\dots$ to denote universal positive constants whose values may vary in different contexts. Throughout the paper, we consider the high-dimensional regime, allowing the dimension $p$ to grow with the sample size $n$; that is, we assume $p=p_{n}\to\infty$ as $n\to\infty$.

2 Main Results

2.1 Vector autoregressive model

Consider a VAR(d) model:

X_{i}=A_{1}X_{i-1}+A_{2}X_{i-2}+\dots+A_{d}X_{i-d}+\varepsilon_{i},\quad i=1,\dots,n, \qquad (2.1)

where $X_{i}=(X_{i1},X_{i2},\dots,X_{ip})^{\top}\in\mathbb{R}^{p}$ is the random process of interest, $A_{i}\in\mathbb{R}^{p\times p}$, $i=1,\dots,d$, are the transition matrices and $\varepsilon_{i}$, $i\in\mathbb{Z}$, are i.i.d. innovation vectors with zero mean and symmetric distribution, i.e. $\varepsilon_{i}=-\varepsilon_{i}$ in distribution for all $i\in\mathbb{Z}$. By a rearrangement of variables, VAR($d$) models can be reformulated as VAR(1) models (see Liu and Zhang, (2021)). Therefore, without loss of generality, we shall work with VAR(1) models:

X_{i}=AX_{i-1}+\varepsilon_{i},\quad i=1,\dots,n. \qquad (2.2)

This type of random process has a wide range of applications, including financial development (Shan, (2005)), economics (Juselius, (2006)) and exchange rate dynamics (Wu and Zhou, (2010)).

To ensure stationarity, we assume throughout the paper that the spectral radius satisfies $\rho(A)<1$, which is the necessary and sufficient condition for a VAR(1) model to be stationary. However, the more restrictive condition $\|A\|<1$ is assumed in much of the earlier work; see for example Han et al., (2015), Loh and Wainwright, (2012) and Negahban and Wainwright, (2011). For a non-symmetric matrix $A$, it can happen that $\|A\|\geq 1$ while $\rho(A)<1$. To fill the gap between $\rho(A)$ and $\|A\|$, Basu and Michailidis, (2015) proposed stability measures for high-dimensional time series to capture temporal and cross-sectional dependence via the spectral density function

f_{X}(\theta)=\frac{1}{2\pi}\sum_{\ell=-\infty}^{\infty}\Gamma_{X}(\ell)\mathrm{e}^{-i\ell\theta},\quad\theta\in[-\pi,\pi],

where $\Gamma_{X}(\ell)=\mathrm{Cov}(X_{i},X_{i+\ell})$, $i,\ell\in\mathbb{Z}$, is the autocovariance function of the process $\{X_{i}\}$. In more recent work, Liu and Zhang, (2021) defined the spectral decay index to connect $\rho(A)$ with $\|A\|$ from a different point of view. In this paper, we adopt the framework of the spectral decay index in Liu and Zhang, (2021).

Definition 2.1.

For any matrix $A\in\mathbb{R}^{p\times p}$ such that $\rho(A)<1$, define the spectral decay index as

\tau=\min\{t\in\mathbb{Z}^{+}:\|A^{t}\|_{\infty}<\rho\} \qquad (2.3)

for some constant $0<\rho<1$.

Remark 2.2.

Note that in (2.3) we use the $L_{\infty}$ norm, while the spectral norm is used in Liu and Zhang, (2021). However, the spectral decay index shares many properties even when defined via different matrix norms. Some of them are summarized as follows. For any matrix $A$ with $\rho(A)<1$, a finite spectral decay index $\tau$ exists. In general, $\tau$ may not be of constant order as the dimension $p$ increases; technically, we should write $\tau=\tau_{p}$ to capture the dependence on $p$. However, in the rest of the paper we simply write $\tau$ for ease of notation. For more analysis of the spectral decay index, see section 2 of Liu and Zhang, (2021).
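To make the definition concrete, the following minimal sketch computes $\tau$ numerically, assuming $\rho(A)<1$ so that a finite $\tau$ exists; the function name and the iteration cap are hypothetical.

```python
import numpy as np

def spectral_decay_index(A, rho=0.9, t_max=10_000):
    """Smallest t >= 1 with ||A^t||_inf < rho, as in Definition 2.1."""
    At = np.eye(A.shape[0])
    for t in range(1, t_max + 1):
        At = At @ A                              # A^t
        if np.abs(At).sum(axis=1).max() < rho:   # ||A^t||_inf = max absolute row sum
            return t
    raise ValueError("tau exceeds t_max; check that rho(A) < 1")
```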

Next, we are interested in building estimators of $A$ for which we can establish an asymptotic distribution theory. This allows one to conduct statistical inference, such as constructing simultaneous confidence intervals. There has been some work on robust estimation alone. Liu and Zhang, (2021) provide both a Lasso-type estimator and a Dantzig-type estimator that consistently estimate the transition coefficient matrix $A$ given $\{X_{i}\}$, under very mild moment conditions on $X_{i}$ and $\varepsilon_{i}$. However, both the Lasso-type and Dantzig-type estimators are biased for the transition matrix and are thus insufficient for tasks like statistical inference. Therefore, one needs a more refined method to establish asymptotic distributional results. In the following sections, we construct a de-biased estimator based on an existing one and derive its limiting distribution.

Note that, unlike much other existing work (Han et al., (2015), Basu and Michailidis, (2015), etc.), we do not assume $\varepsilon_{i}$ to be Gaussian or sub-Gaussian. Instead, the innovations $\varepsilon_{i}$ may have only some finite moments, which renders the standard techniques for estimation and inference invalid.

2.2 De-biased estimator

In this section, we construct a de-biased estimator using the techniques introduced in Bickel, (1975). To fix ideas, let $a_{j}^{\top}$ be the $j$-th row of $A$ and $\beta^{*}=\mathrm{Vec}(A)=(a_{1}^{\top},a_{2}^{\top},\dots,a_{p}^{\top})^{\top}\in\mathbb{R}^{p^{2}}$. Suppose we are given a consistent, possibly biased, estimator $\widehat{\beta}$ of $\beta^{*}$, i.e. $|\widehat{\beta}-\beta^{*}|=o(1)$ (for example, the Lasso-type or Dantzig-type estimators in Liu and Zhang, (2021)). Define a loss function $L_{n}:\mathbb{R}^{p^{2}}\to\mathbb{R}$ as

L_{n}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{p}\ell(X_{ik}-X_{i-1}^{\top}\beta_{k})w(X_{i-1}), \qquad (2.4)

where $\beta=(\beta_{1}^{\top},\dots,\beta_{p}^{\top})^{\top}$ with $\beta_{k}\in\mathbb{R}^{p}$ for $1\leq k\leq p$, the weight function is

w(x)=\min\Big\{1,\frac{T^{3}}{|x|_{\infty}^{3}}\Big\}

for some threshold $T>0$ to be determined later, and the robust loss function $\ell(x)$ satisfies:

  • (i)

    $\ell(x)$ is a thrice-differentiable, convex, even function.

  • (ii)

    For some constant $C>0$, $|\ell^{\prime}|,|\ell^{\prime\prime}|,|\ell^{(3)}|\leq C$.

We give two examples of such loss functions from Pan et al., (2021) that satisfy the above conditions.

Example 2.3 (Smoothed Huber loss I).
\ell(x)=\begin{cases}x^{2}/2-|x|^{3}/6&\text{if }|x|\leq 1,\\ |x|/2-1/6&\text{if }|x|>1.\end{cases}
Example 2.4 (Smoothed Huber loss II).
\ell(x)=\begin{cases}x^{2}/2-x^{4}/24&\text{if }|x|\leq\sqrt{2},\\ (2\sqrt{2}/3)|x|-1/2&\text{if }|x|>\sqrt{2}.\end{cases}

Direct calculation shows that both losses are everywhere twice differentiable and almost everywhere thrice differentiable. Also, the derivatives of the first three orders are bounded in magnitude. We mention that generalizations to other loss functions that do not satisfy the differentiability conditions (for example, the Huber loss) may be derived with more refined arguments, but are omitted in this paper.
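To make these objects concrete, the following is a minimal sketch of the smoothed Huber loss I of Example 2.3, its derivative $\psi=\ell^{\prime}$, the second derivative $\psi^{\prime}$, and the weight function $w$; the function names are our own illustrative choices.

```python
import numpy as np

def loss(x):
    """Smoothed Huber loss I (Example 2.3)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, x**2 / 2 - np.abs(x)**3 / 6, np.abs(x) / 2 - 1 / 6)

def psi(x):
    """psi = loss': x - x|x|/2 for |x| <= 1, sign(x)/2 otherwise; bounded by 1/2."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, x - x * np.abs(x) / 2, np.sign(x) / 2)

def psi_prime(x):
    """psi' = 1 - |x| for |x| <= 1 and 0 otherwise; bounded by 1."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, 1 - np.abs(x), 0.0)

def weight(rows, T):
    """w(x) = min(1, T^3 / |x|_inf^3), applied to each row of `rows`."""
    return np.minimum(1.0, T**3 / np.abs(rows).max(axis=1)**3)
```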

Denote by $\psi(x)=\ell^{\prime}(x)$ the derivative of $\ell(x)$; then $\psi(x)$ is twice differentiable by condition (i) and $|\psi(x)|\leq C$ for all $x\in\mathbb{R}$ by condition (ii). Let $\mu=(\mu_{1},\dots,\mu_{p})^{\top}\in\mathbb{R}^{p}$ with $\mu_{k}=\mathbb{E}[\psi^{\prime}(\varepsilon_{ik})]$ and $\mu^{-1}=(\mu_{1}^{-1},\dots,\mu_{p}^{-1})^{\top}$. Let $\widehat{\mu}=(\widehat{\mu}_{1},\dots,\widehat{\mu}_{p})^{\top}$ be the estimate of $\mu$ with $\widehat{\mu}_{k}=\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\widehat{\varepsilon}_{ik})$, where $\widehat{\varepsilon}_{ik}=X_{ik}-X_{i-1}^{\top}\widehat{\beta}_{k}$. Let $\Sigma_{x}=\mathbb{E}[X_{i}X_{i}^{\top}w(X_{i})]\in\mathbb{R}^{p\times p}$ be the weighted covariance matrix and $\Omega_{x}=\Sigma_{x}^{-1}\in\mathbb{R}^{p\times p}$ the weighted precision matrix. Denote by $\widehat{\Sigma}_{x}=n^{-1}\sum_{i=1}^{n}X_{i-1}X_{i-1}^{\top}w(X_{i-1})$ the weighted sample covariance. Furthermore, suppose that $\widehat{\Omega}_{x}$ is a suitable approximation of the weighted precision matrix $\Omega_{x}$ (e.g., the CLIME estimator introduced by Cai et al., (2011)), as will be discussed in section 2.3. To ensure the validity of such an estimator, row sparsity of $\Omega_{x}$ is assumed throughout, owing to the high dimensionality. Now we introduce a few more notations:

\Sigma=\mathrm{diag}(\mu)\otimes\Sigma_{x}=\begin{bmatrix}\mu_{1}\Sigma_{x}&0&\dots&0\\ 0&\mu_{2}\Sigma_{x}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\mu_{p}\Sigma_{x}\end{bmatrix}\in\mathbb{R}^{p^{2}\times p^{2}}, \qquad (2.5)

and analogously,

\Omega=\Sigma^{-1}=\mathrm{diag}(\mu^{-1})\otimes\Omega_{x};\qquad\widehat{\Sigma}=\mathrm{diag}(\widehat{\mu})\otimes\widehat{\Sigma}_{x},\quad\widehat{\Omega}=\mathrm{diag}(\widehat{\mu}^{-1})\otimes\widehat{\Omega}_{x}. \qquad (2.6)

Following the one-step estimator in Bickel, (1975), we de-bias $\widehat{\beta}$ by adding a term involving the gradient of the loss function $L_{n}$:

\check{\beta}=\widehat{\beta}+\widehat{\Omega}\,\nabla L_{n}(\widehat{\beta}). \qquad (2.7)

To briefly explain the presence of $\widehat{\Omega}$, consider the Taylor expansion of $\nabla L_{n}(\widehat{\beta})$ around $\nabla L_{n}(\beta^{*})$. Write

\begin{aligned}\sqrt{n}(\check{\beta}-\beta^{*})&=\sqrt{n}(\widehat{\beta}-\beta^{*})+\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})-\sqrt{n}\,\widehat{\Omega}\big(\nabla L_{n}(\widehat{\beta})-\nabla L_{n}(\beta^{*})\big)\\ &=\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})+\sqrt{n}\big[(\widehat{\beta}-\beta^{*})-\widehat{\Omega}\,\nabla^{2}L_{n}(\beta^{*})(\widehat{\beta}-\beta^{*})+R\big]\\ &=\underbrace{\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})}_{A}+\underbrace{\sqrt{n}\big(I_{p^{2}}-\widehat{\Omega}\,\nabla^{2}L_{n}(\beta^{*})\big)(\widehat{\beta}-\beta^{*})}_{\Delta}+\sqrt{n}R,\end{aligned} \qquad (2.8)

where the remainder term satisfies $\sqrt{n}R=o(1)$ under certain conditions. Moreover, we also need $\Delta$ to be negligible. As will be shown in the following sections,

|\Delta|_{\infty}\leq\sqrt{n}\Big(\|\Omega-\widehat{\Omega}\|_{1}\|\Sigma\|_{\max}+\|\nabla^{2}L_{n}(\beta^{*})-\Sigma\|_{\max}\|\widehat{\Omega}\|_{1}\Big)|\widehat{\beta}-\beta^{*}|_{1}. \qquad (2.9)

To this end, $\widehat{\Omega}$ needs to be a good approximation of the precision matrix $\Omega$, which motivates the construction of $\widehat{\Omega}$. More rigorous arguments are presented in the subsequent sections.
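The construction (2.7) is straightforward to implement once a pilot estimator and a precision estimate are available. Below is a minimal sketch, assuming the score convention in which the $k$-th block of $\nabla L_{n}(\beta)$ equals $n^{-1}\sum_{i}\psi(X_{ik}-X_{i-1}^{\top}\beta_{k})w(X_{i-1})X_{i-1}$; the function name is hypothetical, `Omega_x_hat` stands for any suitable estimate of $\Omega_{x}$ (e.g., CLIME), and `psi`, `psi_prime`, `weight` are as in the sketch above.

```python
import numpy as np

def debias(X, beta_hat, Omega_x_hat, psi, psi_prime, weight, T):
    """One-step de-biased estimator (2.7), returned as a p x p matrix.

    X:        (n+1) x p array holding X_0, ..., X_n row-wise.
    beta_hat: p x p pilot estimate of A (row k is beta_hat_k).
    """
    n = X.shape[0] - 1
    lagged, current = X[:-1], X[1:]                  # X_{i-1} and X_i for i = 1, ..., n
    w = weight(lagged, T)                            # w(X_{i-1})
    resid = current - lagged @ beta_hat.T            # eps_hat_{ik} = X_ik - X_{i-1}^T beta_hat_k
    mu_hat = psi_prime(resid).mean(axis=0)           # mu_hat_k = n^{-1} sum_i psi'(eps_hat_{ik})
    grad = (psi(resid) * w[:, None]).T @ lagged / n  # row k = k-th p-dimensional gradient block
    # Omega_hat = diag(mu_hat^{-1}) kron Omega_x_hat acts on the gradient block by block.
    return beta_hat + (grad @ Omega_x_hat.T) / mu_hat[:, None]
```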

Note that the estimator $\check{\beta}$ is closely related to the de-sparsifying Lasso estimator (Van de Geer et al., (2014) and Zhang and Zhang, (2014)), which is employed to conduct simultaneous inference for linear regression models in Zhang and Cheng, (2017). Indeed, $\check{\beta}$ reduces to the de-sparsifying Lasso estimator if the loss $\ell(x)$ in (2.4) is the squared error loss and the weight is $w(x)\equiv 1$. Moreover, Loh, (2018) uses this one-step estimator to derive the limiting distribution of a high-dimensional vector restricted to a fixed number of coordinates, delivering a result that agrees with Bickel, (1975) for low-dimensional robust M-estimators. In contrast, we derive such conclusions simultaneously for all $p^{2}$ coordinates of $\beta^{*}$.

In the subsequent sections, we aim at obtaining a limiting distribution for $\check{\beta}$.

2.3 Estimation of the precision matrix

In this section, we mainly discuss the validity of using $\widehat{\Omega}$ as an approximation of $\Omega$. By the structure of $\Omega$, we first need to find a suitable estimator of the weighted precision matrix $\Omega_{x}$.

The estimation of a sparse inverse covariance matrix from a collection of observations $\{X_{i}\}$ plays a crucial role in establishing the asymptotic distribution. In the high-dimensional regime, one cannot obtain a suitable estimator of the precision matrix by simply inverting the sample covariance, as the sample covariance is not invertible when the number of features exceeds the number of observations. Various methodologies have been proposed for estimating the precision matrix, depending on the purpose; see for example the graphical Lasso (Yuan and Lin, (2007) and Friedman et al., (2008)) and nodewise regression (Meinshausen and Bühlmann, (2006)). From a different perspective, Cai et al., (2011) proposed the CLIME approach to sparse precision estimation, which shall be applied in this paper. For completeness, we reproduce the CLIME estimator in the following.

Suppose that the sparsity of each row of $\Omega_{x}$ is at most $s$, i.e., $s=\max_{1\leq i\leq p}|\{j:\Omega_{x,ij}\neq 0\}|$. We first obtain $\widehat{\Theta}$ by solving

\widehat{\Theta}=\operatorname*{arg\,min}_{\Theta}\sum_{i,j}|\Theta_{ij}|\quad\text{subject to: }\|\widehat{\Sigma}_{x}\Theta-I_{p}\|_{\max}\leq\lambda_{n},

for some regularization parameter $\lambda_{n}>0$. Note that the solution $\widehat{\Theta}$ may not be symmetric. To enforce symmetry, the CLIME estimator $\widehat{\Omega}_{x}$ is defined as

\widehat{\Omega}_{x}=(\widehat{\omega}_{ij}),\quad\text{where }\widehat{\omega}_{ij}=\widehat{\omega}_{ji}=\widehat{\Theta}_{ij}\mathbb{I}\{|\widehat{\Theta}_{ij}|\leq|\widehat{\Theta}_{ji}|\}+\widehat{\Theta}_{ji}\mathbb{I}\{|\widehat{\Theta}_{ij}|>|\widehat{\Theta}_{ji}|\}. \qquad (2.10)

For more analysis of the CLIME estimator, see Cai et al., (2011). Next, we present the convergence theorem for the CLIME estimator.
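The CLIME program decouples across the columns of $\widehat{\Theta}$, and each column solves a linear program. A minimal sketch under the standard variable-splitting formulation $\theta=u-v$ with $u,v\geq 0$ is given below; the function name is hypothetical, and this is one convenient solver, not necessarily the implementation used in Cai et al., (2011).

```python
import numpy as np
from scipy.optimize import linprog

def clime(Sigma_hat, lam):
    """CLIME estimate of the precision matrix, symmetrized as in (2.10)."""
    p = Sigma_hat.shape[0]
    Theta = np.zeros((p, p))
    # Constraint |Sigma_hat (u - v) - e_j|_inf <= lam, written as two one-sided inequalities.
    A_ub = np.block([[Sigma_hat, -Sigma_hat], [-Sigma_hat, Sigma_hat]])
    c = np.ones(2 * p)                               # objective: |theta|_1 = sum(u) + sum(v)
    for j in range(p):
        e = np.zeros(p); e[j] = 1.0
        b_ub = np.concatenate([lam + e, lam - e])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        Theta[:, j] = res.x[:p] - res.x[p:]
    # Keep the smaller-magnitude entry of each (i, j)/(j, i) pair, as in (2.10).
    return np.where(np.abs(Theta) <= np.abs(Theta.T), Theta, Theta.T)
```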

Theorem 2.5.

Let $\tau$ be defined as in definition 2.1 and $\gamma=\max_{t=0,1,\dots,\tau-1}\|A^{t}\|$. Choose $\lambda_{n}\asymp\|\Omega_{x}\|_{1}\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}$; then with probability at least $1-4p^{-c_{0}}$ for some constant $c_{0}>0$,

\|\widehat{\Omega}_{x}-\Omega_{x}\|_{\max}\lesssim\|\Omega_{x}\|_{1}\lambda_{n}\quad\text{and}\quad\|\widehat{\Omega}_{x}-\Omega_{x}\|_{1}\lesssim\|\Omega_{x}\|_{1}s\lambda_{n}.
Remark 2.6.

Theorem 2.5 is a direct application of Theorem 6 of Cai et al., (2011). Note that if we assume the eigenvalue condition $0<c\leq\lambda_{\min}(\Sigma_{x})\leq\lambda_{\max}(\Sigma_{x})\leq C$ on $\Sigma_{x}$, then $\|\Omega_{x}\|_{2}\leq 1/\lambda_{\min}(\Sigma_{x})=O(1)$. Therefore, by the sparsity condition on $\Omega_{x}$, each row of $\Omega_{x}$ has at most $s$ nonzero entries, so its $\ell_{1}$ norm is at most $\sqrt{s}$ times its $\ell_{2}$ norm, and we immediately have $\|\Omega_{x}\|_{1}=O(\sqrt{s})$. Suppose the scaling condition $s\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}=o(1)$ holds; then the CLIME estimator $\widehat{\Omega}_{x}$ defined in (2.10) is consistent in estimating the weighted precision matrix of the VAR(1) model (2.2).

The following theorem shows that $\Omega-\widehat{\Omega}$ enjoys the same convergence rates as in the previous theorem.

Theorem 2.7.

Let $\widehat{\Omega}_{x}$ be the CLIME estimator defined above. Assume that $\mu_{k}>c_{1}>0$ for all $1\leq k\leq p$; then with probability at least $1-6p^{-c}$,

\|\Omega-\widehat{\Omega}\|_{\max}\lesssim\|\Omega_{x}\|_{1}\lambda_{n}\quad\text{and}\quad\|\Omega-\widehat{\Omega}\|_{1}\lesssim\|\Omega_{x}\|_{1}s\lambda_{n}.

The above theorem is built upon two facts: $\widehat{\Omega}_{x}$ approximates $\Omega_{x}$ and $\widehat{\mu}$ approximates $\mu$. The result regarding the latter approximation will be given in Lemma 3.7.

2.4 Simultaneous inference

In this section, consider the following hypothesis testing problem:

H_{0}:A_{ij}=A^{0}_{ij},\quad\text{for all }i,j=1,\dots,p\quad(\text{or equivalently, }\beta^{*}_{j}=\beta^{0}_{j},\quad\text{for all }j=1,\dots,p^{2})

versus the alternative hypothesis $H_{1}:A_{ij}\neq A^{0}_{ij}$ for some $i,j=1,\dots,p$. Instead of projecting the explanatory variables onto a subspace of fixed dimension (Javanmard and Montanari, (2014), Zhang and Zhang, (2014), Van de Geer et al., (2014) and Loh, (2018)), we allow the number of tests to grow exponentially in the sample size $n$. Zhang and Cheng, (2017) presented a closely related work, where the number of tests is also allowed to grow with $p$. However, their simultaneous inference procedure is conducted in the linear regression setting with independent observations.

Employing the de-biased estimator $\check{\beta}$ defined in (2.7), we propose to use the test statistic

\sqrt{n}|\check{\beta}-\beta^{0}|_{\infty}, \qquad (2.11)

where $\check{\beta}$ is defined in (2.7). In the next several theorems, we elaborate a multiplier bootstrap method to obtain the critical value of the test statistic, which requires a few scaling and moment assumptions. Recall definition 2.1 for $\tau$ and theorem 2.5 for the definition of $\gamma$. Also recall that $s=\max_{1\leq i\leq p}|\{j:\Omega_{x,ij}\neq 0\}|$.

Assumptions

  • (A1)

    $\sqrt{n}T^{3}|\widehat{\beta}-\beta^{*}|_{1}^{2}=o(1)$.

  • (A2)

    $\|\Omega_{x}\|_{1}^{2}s\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}=o(1)$.

  • (A3)

    $s\gamma\tau^{2}T^{2}(\log p)^{3/2}|\widehat{\beta}-\beta^{*}|_{1}=o(1)$.

  • (A4)

    $sT^{2}(\log(pn))^{7}/n\lesssim n^{-c}$.

  • (A5)

    $(\log p)^{3/2}(\log n)^{1/2}T\sqrt{s\tau}\gamma/n^{1/4}=o(1)$.

Additionally, throughout the paper we assume that for some constant $C>0$, $\mathbb{E}[X_{ik}^{2}]\leq C$ and $\mathbb{E}[\varepsilon_{ik}^{2}]\leq C$ for all $1\leq k\leq p$. We also suppose that $\|\Sigma_{x}\|_{\max}=O(1)$ and $0<c\leq\lambda_{\min}(\Sigma_{x})\leq\lambda_{\max}(\Sigma_{x})\leq C$. Thus, $\|\Omega_{x}\|_{2}\leq 1/\lambda_{\min}(\Sigma_{x})=O(1)$ and $\|\Omega_{x}\|_{1}=O(\sqrt{s})$, where the row sparsity is $s=\max_{1\leq i\leq p}|\{j:\Omega_{x,ij}\neq 0\}|$.

Theorem 2.8.

Let $\zeta_{1}=\gamma\tau^{2}T^{2}(\log p)^{3/2}|\widehat{\beta}-\beta^{*}|_{1}+\sqrt{n}T^{3}|\widehat{\beta}-\beta^{*}|_{1}^{2}+s\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}$. Suppose assumptions (A1)–(A3) hold, and additionally assume that $\zeta_{1}\sqrt{1\lor\log(p/\zeta_{1})}=o(1)$. Then we have

\mathbb{P}\Big(|\sqrt{n}(\check{\beta}-\beta^{*})-\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})|_{\infty}>\zeta_{1}\Big)<\zeta_{2},

where $\zeta_{2}=o(1)$.

Theorem 2.8 rigorously verifies that $\sqrt{n}R=o(1)$ and $\Delta=o(1)$ in (2.8), given the proposed construction of $\widehat{\Omega}$, and suggests performing further analysis on $\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})$. To derive the limiting distribution, we shall use the Gaussian approximation technique, since the classic central limit theorem fails in the high-dimensional setting.

Gaussian approximation was initially developed for high-dimensional independent random variables in Chernozhukov et al., (2013) and further generalized to high-dimensional time series in Zhang and Wu, (2017). Zhang and Cheng, (2017) and Loh, (2018) applied the GA technique of Chernozhukov et al., (2013) to derive asymptotic distributions in the linear regression setting. However, data generated from a VAR model exhibit temporal dependence, which renders the aforementioned techniques inapplicable. Although Zhang and Wu, (2017) established GA results for general time series using the dependence adjusted norm, a direct application of their theorems does not yield the desired conclusion in the ultra-high dimensional setting. This leads us to derive a new GA theorem with a better convergence rate, which is achievable thanks to the structure of the VAR model.

The next theorem establishes a Gaussian approximation (GA) result for the term $\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})$. For a more detailed description of the Gaussian approximation procedure, see section 4.

Theorem 2.9.

Denote $D=(D_{jk})_{1\leq j,k\leq p}\in\mathbb{R}^{p^{2}\times p^{2}}$ with

D_{jk}=\frac{\Omega_{x}\,\mathbb{E}[\psi(\varepsilon_{ij})\psi(\varepsilon_{ik})]\,\mathbb{E}[X_{i}X_{i}^{\top}w^{2}(X_{i})]\,\Omega_{x}^{\top}}{\mu_{j}\mu_{k}}\in\mathbb{R}^{p\times p}.

Under Assumptions (A4) and (A5), we have the following Gaussian approximation result:

\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\Big(|\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})|_{\infty}\leq t\Big)-\mathbb{P}\Big(\Big|\sum_{i=1}^{n}z_{i}/\sqrt{n}\Big|_{\infty}\leq t\Big)\bigg|=o(1),

where the $z_{i}=(z_{i1},\dots,z_{ip^{2}})^{\top}$ are independent mean-zero Gaussian vectors with $\mathbb{E}z_{i}z_{i}^{\top}=D$.

Remark 2.10.

The above GA result allows the ultra-high dimensional regime, where $p$ grows as fast as $O(\mathrm{e}^{n^{b}})$ for some $0<b<1$.

Since the covariance matrix $D$ of the Gaussian analogue $z_{i}$ is not accessible from the observations $\{X_{i}\}$, we need a suitable estimate of $D$ before performing the multiplier bootstrap. The next theorem delivers a consistent estimator for this purpose.

Theorem 2.11.
\widehat{D}_{jk}=\frac{\widehat{\Omega}_{x}\Big(\frac{1}{n}\sum_{i=1}^{n}\psi(\widehat{\varepsilon}_{ij})\psi(\widehat{\varepsilon}_{ik})\Big)\Big(\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}^{\top}w^{2}(X_{i})\Big)\widehat{\Omega}_{x}^{\top}}{\widehat{\mu}_{j}\widehat{\mu}_{k}}\in\mathbb{R}^{p\times p}, \qquad (2.12)

where $\widehat{\Omega}_{x}$ is the CLIME estimator of $\Omega_{x}$. Under assumptions (A1)–(A5), assuming additionally that $\|\Omega_{x}\|_{1}=O(\sqrt{s})$ and that $\mu_{k}>C>0$ for all $1\leq k\leq p$ and some constant $C$, we have with probability at least $1-12p^{-c}$,

\|\widehat{D}-D\|_{\max}\lesssim s\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}+|\widehat{\beta}-\beta^{*}|_{1}.

Indeed, under the scaling assumptions, $\|\widehat{D}-D\|_{\max}=o(1)$.
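For concreteness, a minimal sketch of assembling the blocks $\widehat{D}_{jk}$ of (2.12) is given below; the helper name and array layout are hypothetical, with `X` holding $X_{0},\dots,X_{n}$ row-wise, `resid` the fitted residuals $\widehat{\varepsilon}_{ik}$, and `psi`, `weight`, `mu_hat` as in the earlier sketches.

```python
import numpy as np

def D_hat(X, resid, Omega_x_hat, mu_hat, psi, weight, T):
    """Plug-in estimate (2.12) of D, returned as a (p^2 x p^2) array of p x p blocks."""
    n, p = resid.shape
    w2 = weight(X[1:], T) ** 2                      # w^2(X_i) for i = 1, ..., n
    S = (X[1:] * w2[:, None]).T @ X[1:] / n         # (1/n) sum_i X_i X_i^T w^2(X_i)
    M = Omega_x_hat @ S @ Omega_x_hat.T             # shared middle factor of every block
    C = psi(resid).T @ psi(resid) / n               # C[j, k] = (1/n) sum_i psi(e_ij) psi(e_ik)
    D = np.empty((p * p, p * p))
    for j in range(p):
        for k in range(p):
            D[j*p:(j+1)*p, k*p:(k+1)*p] = C[j, k] * M / (mu_hat[j] * mu_hat[k])
    return D
```

With these preparatory results, we are ready to present the main theorem of this paper, which describes a procedure for finding the critical value of $\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}$ using the bootstrap.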

Theorem 2.12.

Denote

W=|\widehat{D}^{1/2}\eta|_{\infty},

where $\eta\sim N(0,I_{p^{2}})$ is independent of $(X_{i})_{i=1}^{n}$ and $\widehat{D}$ is defined in (2.12). Let the bootstrap critical value be $c(\alpha)=\inf\{t\in\mathbb{R}:\mathbb{P}(W\leq t\,|\,\boldsymbol{X})\geq 1-\alpha\}$. Let assumptions (A1)–(A5) and the assumptions in theorem 2.8 hold. Denote $v=c(s\gamma\tau^{2}T^{2}(\log p)^{3/2}/\sqrt{n}+|\widehat{\beta}-\beta^{*}|_{1})$ for some constant $c$. Assume that $\pi(v)=Cv^{1/3}(1\lor\log(p/v))^{2/3}=o(1)$; then we have

\sup_{\alpha\in(0,1)}\bigg|\mathbb{P}\Big(\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}>c(\alpha)\Big)-\alpha\bigg|=o(1).

This result not only identifies the asymptotic distribution, but also provides an accurate critical value $c(\alpha)$ via the multiplier bootstrap. Under the null hypothesis $H_{0}$, we have $\sqrt{n}|\check{\beta}-\beta^{0}|_{\infty}=\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}$. This verifies the validity of (2.11) as a test statistic for simultaneous inference.
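In practice, the conditional law of $W$ given the data is approximated by Monte Carlo over the Gaussian multiplier $\eta$. A minimal sketch, assuming $\widehat{D}$ has been assembled as above (the function name is hypothetical; eigenvalues are clipped at zero because the plug-in $\widehat{D}$ need not be exactly positive semidefinite):

```python
import numpy as np

def bootstrap_critical_value(D_hat, alpha, B=2000, seed=None):
    """(1 - alpha)-quantile of W = |D_hat^{1/2} eta|_inf over B multiplier draws."""
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(D_hat)               # D_hat is symmetric by construction
    root = vecs * np.sqrt(np.clip(vals, 0.0, None))  # square root via eigendecomposition
    eta = rng.standard_normal((B, D_hat.shape[0]))   # eta_b ~ N(0, I_{p^2})
    W = np.abs(eta @ root.T).max(axis=1)             # W_b = |D_hat^{1/2} eta_b|_inf
    return np.quantile(W, 1.0 - alpha)

# The test rejects H_0 at level alpha iff sqrt(n) * |beta_check - beta_0|_inf > c(alpha).
```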

3 Estimation

Several estimation tasks are needed as preparatory results for proving Theorem 2.12. For instance, Theorem 2.12 requires an estimate of the theoretical covariance matrix $D$ of the Gaussian analogue $Z$, as stated in Theorem 2.11. Besides, the convergence of the CLIME estimator (section 2.3) depends on the convergence of the corresponding sample covariance matrix. These problems require us to develop a new estimation theory that delivers convergence even in the ultra-high dimensional regime.

The success of high-dimensional estimation relies heavily on probability concentration inequalities, among which Bernstein-type inequalities are especially important. The celebrated Bernstein inequality (Bernstein, (1946)) provides an exponential concentration inequality for sums of independent random variables that are uniformly bounded. Later works relaxed the uniform boundedness condition and extended the validity of the Bernstein inequality to independent random variables with finite exponential moments; see for example Massart, (2007) and Wainwright, (2019).

Despite the extensive body of work on concentration inequalities for independent random variables, the literature remains quiet when it comes to establishing exponential-type tail concentration results for random processes. Related existing work includes a Bernstein inequality for sums of strongly mixing processes (Merlevède et al., (2009)) and a Bernstein inequality under functional dependence measures (Zhang, (2019)). In a more recent work, Liu and Zhang, (2021) established a sharp Bernstein inequality for the VAR model using the definition of the spectral decay index, which improved the existing rate by a factor of $(\log n)^{2}$. In this paper, we derive another Bernstein inequality for the VAR model under a slightly different condition from Liu and Zhang, (2021). Before presenting the main results, recall the definition of $\tau$ in definition 2.1.

Lemma 3.1.

Let $\{X_{i}\}_{i=0}^{n}$ be generated by a VAR(1) model. Suppose $G:\mathbb{R}^{p}\to\mathbb{R}$ satisfies

|G(X)-G(Y)|\leq|X-Y|_{\infty}, \qquad (3.1)

and that $|G(x)|\leq B$ for all $x\in\mathbb{R}^{p}$. Assume that $\mathbb{E}[|\varepsilon_{ij}|^{2}]\leq\sigma^{2}$ for all $j=1,\dots,p$. Then there exist constants $C_{1},C_{2},C_{3},C_{4}>0$ depending only on $\rho$ and $\sigma$ such that

\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}G(X_{i-1})-\mathbb{E}[G(X_{i-1})]\Big|\geq x\bigg)\leq 2\exp\bigg\{-\frac{nx^{2}}{C_{3}n^{-1}\gamma^{2}\tau^{3}+C_{4}\tau Bx}\bigg\}+2\exp\bigg\{-\frac{nx^{2}}{(1+C_{1}B^{-2})\gamma^{2}\tau^{4}B^{2}(\log p)^{2}(n^{-1}\tau\log p+1)+C_{2}\tau^{2}B(\log p)x}\bigg\}.

Specifically, under assumption (A2), we see that $\tau(\log p)/n\to 0$. So for sufficiently large $B>0$, we have

\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}G(X_{i-1})-\mathbb{E}[G(X_{i-1})]\Big|\geq x\bigg)\leq 4\exp\bigg\{-\frac{nx^{2}}{C_{1}^{\prime}\gamma^{2}\tau^{4}B^{2}(\log p)^{2}+C_{2}^{\prime}\tau^{2}B(\log p)x}\bigg\}, \qquad (3.2)

for some positive constants $C_{1}^{\prime},C_{2}^{\prime}$ depending only on $\rho$ and $\sigma$.

Remark 3.2.

Note that the Lipschitz condition (3.1) is slightly different from that in Liu and Zhang, (2021), where it is instead assumed that

|G(x)-G(y)|\leq g^{\top}|x-y|, \qquad (3.3)

for some vector $g\in\mathbb{R}^{p}$. Since condition (3.1) is weaker than (3.3), an additional $(\log p)$ factor appears in the denominator of the right-hand side of (3.2). For a more detailed comparison of different versions of Bernstein inequalities, we refer readers to Liu and Zhang, (2021) and the references therein.

With a minor modification of the proof of Lemma 3.1, we obtain the following version of the Bernstein inequality, which includes a bounded function of the latest innovation $\varepsilon_{i}$ as a multiplier.

Corollary 3.3.

Let $\{X_{i}\}_{i=0}^{n}$ be generated by a VAR(1) model. Suppose $|h(x)|\leq 1$ and $G:\mathbb{R}^{p}\to\mathbb{R}$ satisfies

|G(X)G(Y)||XY|,|G(X)-G(Y)|\leq|X-Y|_{\infty},

and that $|G(x)|\leq B$ for all $x\in\mathbb{R}^{p}$. Assume that $\mathbb{E}[|\varepsilon_{ij}|^{2}]\leq\sigma^{2}$ for all $j=1,\dots,p$. Then there exist constants $C_{1},C_{2},C_{3},C_{4}>0$ depending only on $\rho$ and $\sigma$ such that

\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}h(\varepsilon_{i})G(X_{i-1})-\mathbb{E}[h(\varepsilon_{i})G(X_{i-1})]\Big|\geq x\bigg)\leq 2\exp\bigg\{-\frac{nx^{2}}{C_{3}n^{-1}\gamma^{2}\tau^{3}+C_{4}\tau Bx}\bigg\}+2\exp\bigg\{-\frac{nx^{2}}{(1+C_{1}B^{-2})\gamma^{2}\tau^{4}B^{2}(\log p)^{2}(n^{-1}\tau\log p+1)+C_{2}\tau^{2}B(\log p)x}\bigg\}.

Specifically, under assumption (A2), we see that $\tau(\log p)/n\to 0$. So for sufficiently large $B>0$, we have

\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}h(\varepsilon_{i})G(X_{i-1})-\mathbb{E}[h(\varepsilon_{i})G(X_{i-1})]\Big|\geq x\bigg)\leq 4\exp\bigg\{-\frac{nx^{2}}{C_{1}^{\prime}\gamma^{2}\tau^{4}B^{2}(\log p)^{2}+C_{2}^{\prime}\tau^{2}B(\log p)x}\bigg\},

for some positive constants $C_{1}^{\prime},C_{2}^{\prime}$ depending only on $\rho$ and $\sigma$.

Remark 3.4.

Since the additional term $h(\varepsilon_{i})$ is independent of $G(X_{i-1})$, the proof of Lemma 3.1 applies directly without any extra technical difficulty.

Equipped with our new Bernstein inequalities, several estimation results follow immediately. The next theorem, regarding the estimation of $\Sigma_{x}$, is essential when we prove the convergence rate of the CLIME estimator in section 2.3.

Theorem 3.5 (Estimation of $\Sigma_{x}$).

Let $\widehat{\Sigma}_{x}=n^{-1}\sum_{i=1}^{n}X_{i-1}X_{i-1}^{\top}w(X_{i-1})$ and $\Sigma_{x}=\mathbb{E}[X_{i}X_{i}^{\top}w(X_{i})]$. Then with probability at least $1-4p^{-c_{0}}$ for some constant $c_{0}>0$, it holds that

\|\widehat{\Sigma}_{x}-\Sigma_{x}\|_{\max}\lesssim\gamma\tau^{2}T^{2}n^{-1/2}(\log p)^{3/2}.

We see that the convergence rate of the CLIME estimator in Theorem 2.5 essentially inherits the rate in Theorem 3.5, with an additional factor $\|\Omega_{x}\|_{1}$. The following theorem plays an important role in verifying that $\Delta$ in (2.9) is indeed negligible.

Theorem 3.6 (Estimation of $\Sigma$ by $\nabla^{2}L_{n}(\beta^{*})$).

Assume that $\mathbb{E}[\varepsilon_{ik}^{2}]\leq\sigma^{2}$ for all $1\leq k\leq p$. Then for some constant $c_{1}>0$, with probability at least $1-4p^{-c_{1}}$, it holds that

\|\nabla^{2}L_{n}(\beta^{*})-\Sigma\|_{\max}\lesssim\gamma\tau^{2}T^{2}n^{-1/2}(\log p)^{3/2}.

While the last two theorems make use of Lemma 3.1, the next estimation result, for $\mu$, directly applies the concentration inequality in Liu and Zhang, (2021), thanks to the stronger conditions that $\widehat{\mu}$ satisfies.

Lemma 3.7.

Suppose that $\beta_{k}^{*}$ lies in a bounded $\ell_{1}$-norm ball for all $1\leq k\leq p$ and that $\mathbb{E}[X_{ij}^{2}]\leq C$ for some constant $C>0$ and all $1\leq j\leq p$. Then we have

\mathbb{P}\bigg(|\widehat{\mu}-\mu|_{\infty}\geq\gamma\tau^{2}\sqrt{\frac{\log p}{n}}+|\widehat{\beta}-\beta^{*}|_{1}\bigg)\leq 2p^{-c},

for some positive constant $c$.

4 Gaussian Approximation

Conducting simultaneous inference for high-dimensional data is a hard task, since the central limit theorem fails when the dimension of the random vectors grows as a function of the number of observations $n$, or even exceeds $n$. As an alternative to the central limit theorem, Chernozhukov et al., (2013) proposed a Gaussian approximation theorem, which states that under certain conditions, the distribution of the maximum of a sum of independent high-dimensional random vectors can be approximated by that of the maximum of a sum of Gaussian random vectors with the same covariance matrices as the original vectors. Their Gaussian approximation results allow the ultra-high dimensional case, where the dimension $p$ grows exponentially in $n$. Moreover, they also proved that the Gaussian multiplier bootstrap yields a high-quality approximation of the distribution of the original maximum and showcased a wide range of applications, such as high-dimensional estimation, multiple hypothesis testing, and adaptive specification testing. It is worth noticing that the results of Chernozhukov et al., (2013) are only applicable when the sequence of random vectors is independent.

Zhang and Wu, (2017) generalized Gaussian approximation results to general high-dimensional stationary time series, using the framework of functional dependence measures (Wu, (2005)). We specifically mention that a direct application of the Gaussian approximation from Zhang and Wu, (2017) cannot deliver the desired conclusion in the ultra-high dimensional regime, because the dependence measures capture the dependence structure of the VAR model only coarsely. In what follows, we use a refined argument to establish a new Gaussian approximation result for the VAR model.

By Theorem 2.8, $\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}$ can be approximated by $\sqrt{n}|\Omega\nabla L_{n}(\beta^{*})|_{\infty}$. Hence, we shall build a GA result for $\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})$. Observe that $\sqrt{n}\,\Omega\nabla L_{n}(\beta^{*})\in\mathbb{R}^{p^{2}}$ can be written as

\bigg(\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\Omega_{x}}{\mu_{1}}\psi(\varepsilon_{i1})X_{i-1}w(X_{i-1})\Big)^{\top},\dots,\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\Omega_{x}}{\mu_{p}}\psi(\varepsilon_{ip})X_{i-1}w(X_{i-1})\Big)^{\top}\bigg)^{\top},

so it suffices to establish the GA result for each sub-vector

\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\Omega_{x}}{\mu_{k}}\psi(\varepsilon_{ik})X_{i-1}w(X_{i-1}),\quad k=1,\dots,p.

Fix $1\leq k\leq p$ and denote $\Theta_{k}=\Omega_{x}\mu_{k}^{-1}$. Let $X_{i,m}=\sum_{l=0}^{m}A^{l}\varepsilon_{i-l}$ be the $m$-approximation of $X_{i}$, with $m$ to be determined later. Let $Y_{i}=\psi(\varepsilon_{ik})\Theta_{k}X_{i-1}w(X_{i-1})$ be the quantity for which we will establish the Gaussian approximation and denote $T_{Y}=\sum_{i=1}^{n}Y_{i}$. Analogously, let $Y_{i,m}=\psi(\varepsilon_{ik})\Theta_{k}X_{i-1,m}w(X_{i-1,m})$ be the $m$-approximation of $Y_{i}$ and write $T_{Y,m}=\sum_{i=1}^{n}Y_{i,m}$. For simplicity, assume $n=(m+M)w$, where $M\to\infty$, $m\to\infty$, $w\to\infty$ and $m/M\to 0$. Divide the interval $[1,n]$ into alternating large blocks $L_{b}=[(b-1)(M+m)+1,\,bM+(b-1)m]$ with $M$ points and small blocks $S_{b}=[bM+(b-1)m+1,\,b(M+m)]$ with $m$ points, for $1\leq b\leq w$. Denote

\xi_{b}=\sum_{i\in L_{b}}Y_{i,m}/\sqrt{M},\quad T_{Y,S}=\sum_{b=1}^{w}\sum_{i\in S_{b}}Y_{i,m},\quad T_{Y,L}=\sum_{b=1}^{w}\sum_{i\in L_{b}}Y_{i,m},
Z=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}U_{i},\quad\text{where }U_{i}\sim N\big(0,\,\mu_{k}^{-2}\mathbb{E}[\psi^{2}(\varepsilon_{ik})]\,\Omega_{x}\mathbb{E}[X_{i}X_{i}^{\top}w^{2}(X_{i})]\Omega_{x}^{\top}\big).

Note that the $Y_{i,m}$ from different large blocks $L_{b}$ are independent, i.e. the block sums $\sum_{i\in L_{b}}Y_{i,m}$ are independent across $b=1,\dots,w$. The main result of this section is presented as follows.

Theorem 4.1.

Suppose $\mathbb{E}[\varepsilon_{ik}^{2}]\leq\sigma^{2}$ for all $1\leq k\leq p$ and the odd function $\psi(\cdot)$ satisfies $|\psi(\cdot)|\leq C$ and $|\psi^{\prime}(\cdot)|\leq C$. Suppose the scaling condition $sT^{2}(\log(pn))^{7}/n\leq c_{1}n^{-c_{2}}$ holds. Then for any $\eta>0$, the Gaussian approximation bound

\mathcal{H}:=\sup_{t\in\mathbb{R}}\Big|\mathbb{P}\big(|T_{Y}/\sqrt{n}|_{\infty}\leq t\big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\Big|\lesssim f_{1}(\eta/2,m)+f_{2}(\eta/2,m)+\eta\sqrt{\log p}+\eta\sqrt{\log(1/\eta)}+cn^{-c^{\prime}} \qquad (4.1)

holds for some $c,c^{\prime}>0$, where $f_{1}$ and $f_{2}$ are defined in (A.4) in the appendix.

This theorem gives an upper bound on the supremum of the difference between the distribution of the maximum of the sum of the $Y_{i}$ and that of the maximum of the sum of Gaussian vectors $U_{i}$ with the same covariance. We now present an outline of the proof; the complete proof is left to the appendix.

First, we show that the sums of $Y_{i,m}$ over the small blocks are negligible, so $T_{Y,m}\approx T_{Y,L}$. Next, we prove that the sum of the $Y_{i}$ can be approximated by its $m$-approximation, that is, $T_{Y}\approx T_{Y,m}\approx T_{Y,L}$. Since $T_{Y,L}$ is a sum of the independent random vectors $\{\sum_{i\in L_{b}}Y_{i,m}\}_{b=1}^{w}$, the GA theorem from Chernozhukov et al., (2013) can be applied.
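To illustrate the blocking scheme underlying the proof, here is a minimal sketch of the large/small block index construction, assuming $n=(M+m)w$ exactly; the function name is hypothetical.

```python
import numpy as np

def blocks(n, M, m):
    """1-based index sets of the large blocks L_b (M points) and small blocks S_b (m points)."""
    w = n // (M + m)
    L, S = [], []
    for b in range(1, w + 1):
        start = (b - 1) * (M + m) + 1
        L.append(np.arange(start, start + M))           # L_b = [(b-1)(M+m)+1, bM+(b-1)m]
        S.append(np.arange(start + M, start + M + m))   # S_b = [bM+(b-1)m+1, b(M+m)]
    return L, S
```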

5 Numerical Experiments

In this section, we evaluate the performance of the proposed bootstrap-assisted procedure for simultaneous inference. We consider the model (2.2), where the $\varepsilon_{ij}$ are i.i.d. Student's $t$-distributed with $df=5$ or $10$. Let $s=\lfloor\log p\rfloor$. We pick $n=30$ and $p=10$ in the numerical setup. For the true transition matrix $A=(a_{ij})$, we consider the following designs.

  • (1)

    Banded: $A=(\lambda^{|i-j|}\mathbf{1}\{|i-j|\leq s\})$ with $\lambda=0.5$.

  • (2)

    Block diagonal: $A=\mathrm{diag}\{A_{i}\}$, where each $A_{i}\in\mathbb{R}^{s\times s}$ has $\lambda_{i}$ on the diagonal and $\lambda_{i}^{2}$ on the superdiagonal, with $\lambda_{i}\sim\mathrm{Unif}(-0.8,0.8)$.

The design in (1) is further scaled by $2\rho(A)$ to ensure that $\rho(A)<1$. Hence sparse symmetric matrices are generated in (1) and sparse asymmetric matrices are constructed in (2). We draw quantile-quantile (Q-Q) plots of the data quantiles of $\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}$ versus the data quantiles of $W$ defined in Theorem 2.12, computed from $m=100$ replicates. The Q-Q plots are shown in figure 1 and figure 2 for the banded and block diagonal designs respectively.
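For reproducibility, a minimal sketch of the data-generating process under the two designs follows; the helper names are hypothetical, and design (2) assumes $s$ divides $p$, which holds for $p=10$ and $s=\lfloor\log 10\rfloor=2$.

```python
import numpy as np
from scipy.linalg import block_diag

def banded_design(p, lam=0.5):
    """Design (1): A_ij = lam^{|i-j|} 1{|i-j| <= s}, then scaled so that rho(A) = 1/2."""
    s = int(np.floor(np.log(p)))
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    A = np.where(d <= s, lam ** d, 0.0)
    return A / (2 * np.max(np.abs(np.linalg.eigvals(A))))

def block_diagonal_design(p, seed=None):
    """Design (2): s x s blocks with lam_i on the diagonal and lam_i^2 on the superdiagonal."""
    rng = np.random.default_rng(seed)
    s = int(np.floor(np.log(p)))
    mats = []
    for _ in range(p // s):
        lam = rng.uniform(-0.8, 0.8)
        mats.append(np.diag(np.full(s, lam)) + np.diag(np.full(s - 1, lam**2), k=1))
    return block_diag(*mats)

def simulate_var1(A, n, df=5, burn=200, seed=None):
    """Draw X_1, ..., X_n from X_i = A X_{i-1} + eps_i with i.i.d. t(df) innovations."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    X, out = np.zeros(p), []
    for i in range(burn + n):
        X = A @ X + rng.standard_t(df, size=p)
        if i >= burn:
            out.append(X.copy())
    return np.asarray(out)
```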

Figure 1: The Q-Q plot of the banded design.
Figure 2: The Q-Q plot of the block diagonal design.

Appendix A Proofs of Results in Section 2

Before proceeding with the proofs, we state a helpful lemma that is used repeatedly throughout the paper and present its proof. This simple lemma is an application of the triangle inequality to the product of two matrices.

Lemma A.1.

Let $A,B$ and $\widehat{A},\widehat{B}$ be $p\times p$ symmetric matrices with $\|A-\widehat{A}\|_{1}=o(1)$. Suppose $\|A\|_{1}=O(1)$ and $\|B\|_{1}=O(1)$. Then $\|AB-\widehat{A}\widehat{B}\|_{\max}\lesssim\|A-\widehat{A}\|_{\max}+\|B-\widehat{B}\|_{\max}$.

Proof of Lemma A.1.

Since $\|A\|_{1}=O(1)$ and $\|A-\widehat{A}\|_{1}=o(1)$, we have $\|\widehat{A}\|_{1}\leq\|A-\widehat{A}\|_{1}+\|A\|_{1}=O(1)$. Hence, by the triangle inequality,

\|AB-\widehat{A}\widehat{B}\|_{\max}\leq\|(A-\widehat{A})B\|_{\max}+\|\widehat{A}(B-\widehat{B})\|_{\max}\leq\|B\|_{1}\|A-\widehat{A}\|_{\max}+\|\widehat{A}\|_{1}\|B-\widehat{B}\|_{\max}\lesssim\|A-\widehat{A}\|_{\max}+\|B-\widehat{B}\|_{\max}. ∎

Proof of Theorem 2.5.

By Theorem 3.5, with probability at least $1-4p^{-c_{0}}$,

\|\widehat{\Sigma}_{x}-\Sigma_{x}\|_{\max}\leq\lambda_{n}.

By Theorem 6 of Cai et al., (2011), we have the desired result. ∎

Proof of Theorem 2.7.

Recall that

\Omega=\mathrm{diag}(\mu^{-1})\otimes\Omega_{x}=\begin{bmatrix}\mu_{1}^{-1}\Omega_{x}&0&\dots&0\\ 0&\mu_{2}^{-1}\Omega_{x}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\mu_{p}^{-1}\Omega_{x}\end{bmatrix} \qquad (A.1)

and

\widehat{\Omega}=\mathrm{diag}(\widehat{\mu}^{-1})\otimes\widehat{\Omega}_{x}=\begin{bmatrix}\widehat{\mu}_{1}^{-1}\widehat{\Omega}_{x}&0&\dots&0\\ 0&\widehat{\mu}_{2}^{-1}\widehat{\Omega}_{x}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\widehat{\mu}_{p}^{-1}\widehat{\Omega}_{x}\end{bmatrix} \qquad (A.2)

For $1\leq k\leq p$, consider

\|\widehat{\Omega}_{x}\widehat{\mu}_{k}^{-1}-\Omega_{x}\mu_{k}^{-1}\|_{\max}\leq\|\widehat{\Omega}_{x}-\Omega_{x}\|_{\max}|\widehat{\mu}_{k}^{-1}|+\|\Omega_{x}\|_{\max}|\widehat{\mu}_{k}^{-1}-\mu_{k}^{-1}|\lesssim\|\Omega_{x}\|_{1}\lambda_{n}+\|\Omega_{x}\|_{\max}\frac{|\mu_{k}-\widehat{\mu}_{k}|}{\mu_{k}\widehat{\mu}_{k}}\lesssim\|\Omega_{x}\|_{1}\lambda_{n},

with probability no less than $1-6p^{-c^{\prime}}$ by theorem 2.5 and lemma 3.7. Taking a union bound over all $k$ yields

\|\Omega-\widehat{\Omega}\|_{\max}=\max_{1\leq k\leq p}\|\mu_{k}^{-1}\Omega_{x}-\widehat{\mu}_{k}^{-1}\widehat{\Omega}_{x}\|_{\max}\lesssim\|\Omega_{x}\|_{1}\lambda_{n},

with probability at least $1-6p^{-(c^{\prime}-1)}$. Replacing the max-norm by the $L_{1}$-norm delivers

\|\Omega-\widehat{\Omega}\|_{1}\lesssim\|\Omega_{x}\|_{1}s\lambda_{n}. ∎

The next lemma provides a high-probability bound on $|\nabla L_{n}(\beta^{*})|_{\infty}$, which will be used in the proof of Theorem 2.8.

Lemma A.2.

Suppose that $\mathbb{E}[\varepsilon_{ij}^{2}]\leq C$ for all $1\leq j\leq p$. Then it holds that

\mathbb{P}\big(|\nabla L_{n}(\beta^{*})|_{\infty}\gtrsim\gamma\tau^{2}T(\log p)^{3/2}/\sqrt{n}\big)\leq 4p^{-c},

for some constant $c>0$.

Proof of Lemma A.2.

We shall apply Corollary 3.3. Consider the first coordinate $\nabla L_{n1}(\beta^{*})$ of $\nabla L_{n}(\beta^{*})$. In Corollary 3.3, let $h(\varepsilon_{i})=\psi(\varepsilon_{i1})$ and $G(X_{i})=X_{i1}w(X_{i})$. Observe that $\mathbb{E}[\nabla L_{n}(\beta^{*})]=0$. By Corollary 3.3,

\mathbb{P}(|\nabla L_{n1}(\beta^{*})|\geq x)=\mathbb{P}(|\nabla L_{n1}(\beta^{*})-\mathbb{E}[\nabla L_{n1}(\beta^{*})]|\geq x)=\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}h(\varepsilon_{i})G(X_{i-1})-\mathbb{E}[h(\varepsilon_{i})G(X_{i-1})]\Big|\geq x\bigg)\leq 4\exp\bigg\{-\frac{nx^{2}}{C_{1}\gamma^{2}\tau^{4}T^{2}(\log p)^{2}+C_{2}\tau^{2}T(\log p)x}\bigg\}.

Choosing $x=c^{\prime}\gamma\tau^{2}T(\log p)^{3/2}/\sqrt{n}$, we get

\mathbb{P}\big(|\nabla L_{n1}(\beta^{*})|\geq c^{\prime}\gamma\tau^{2}T(\log p)^{3/2}/\sqrt{n}\big)\leq 4p^{-c},

for some constant c>0c>0. Take sufficiently large cc^{\prime} such that c>1c>1, so by a union bound we obtain

\mathbb{P}\big(|\nabla L_{n}(\beta^{*})|_{\infty}\geq c^{\prime}\gamma\tau^{2}T(\log p)^{3/2}/\sqrt{n}\big)\leq 4p^{-c^{\prime\prime}},

where $c^{\prime\prime}=c-1>0$. ∎

Proof of Theorem 2.8.

By Taylor expansion, we write

\begin{aligned}\sqrt{n}(\check{\beta}-\beta^{*})&=\sqrt{n}(\widehat{\beta}-\beta^{*})+\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})-\sqrt{n}\,\widehat{\Omega}\big(\nabla L_{n}(\widehat{\beta})-\nabla L_{n}(\beta^{*})\big)\\ &=\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})+\sqrt{n}\big[(\widehat{\beta}-\beta^{*})-\widehat{\Omega}\,\nabla^{2}L_{n}(\beta^{*})(\widehat{\beta}-\beta^{*})+R\big]\\ &=\underbrace{\sqrt{n}\,\widehat{\Omega}\,\nabla L_{n}(\beta^{*})}_{A}+\underbrace{\sqrt{n}\big(I_{p^{2}}-\widehat{\Omega}\,\nabla^{2}L_{n}(\beta^{*})\big)(\widehat{\beta}-\beta^{*})}_{\Delta}+\sqrt{n}R,\end{aligned}

where the remainder

R=12ni=1n(ψ′′(zi1)(Xi1(β^1β1))2Xi1w(Xi1),,ψ′′(zip)(Xi1(β^pβp))2Xi1w(Xi1)),R=\frac{1}{2n}\sum_{i=1}^{n}\Big{(}\psi^{\prime\prime}(z_{i1})\big{(}X^{\top}_{i-1}(\widehat{\beta}_{1}-\beta^{*}_{1})\big{)}^{2}X_{i-1}^{\top}w(X_{i-1}),\dots,\psi^{\prime\prime}(z_{ip})\big{(}X^{\top}_{i-1}(\widehat{\beta}_{p}-\beta^{*}_{p})\big{)}^{2}X_{i-1}^{\top}w(X_{i-1})\Big{)}^{\top},

where $z_{ik}=X_{ik}-X_{i-1}^{\top}\widetilde{\beta}_{k}$ for some $\widetilde{\beta}$ lying between $\beta^{*}$ and $\widehat{\beta}$. Now we analyze the terms $R,\Delta,A$ respectively. First, $\sqrt{n}|R|_{\infty}=O_{\mathbb{P}}(\sqrt{n}T^{3}|\widehat{\beta}-\beta^{*}|_{1}^{2})=o(1)$ by assumption (A1). To analyze $\Delta$, denote $H=\nabla^{2}L_{n}(\beta^{*})$. Then we write

\Delta=\sqrt{n}\big(I_{p^{2}}-\widehat{\Omega}H\big)(\widehat{\beta}-\beta^{*})=\sqrt{n}\big(\Omega\Sigma-\widehat{\Omega}H\big)(\widehat{\beta}-\beta^{*}).

Thus, by Theorem 3.6 and Theorem 2.7, with probability tending to 1,

|\Delta|_{\infty} \leq\sqrt{n}\|\Omega\Sigma-\widehat{\Omega}H\|_{\max}|\widehat{\beta}-\beta^{*}|_{1}\leq\sqrt{n}\Big(\|\Omega-\widehat{\Omega}\|_{1}\|\Sigma\|_{\max}+\|H-\Sigma\|_{\max}\|\widehat{\Omega}\|_{1}\Big)|\widehat{\beta}-\beta^{*}|_{1}
\lesssim\sqrt{n}\|\Omega_{x}\|_{1}\lambda_{n}|\widehat{\beta}-\beta^{*}|_{1}\asymp s\gamma\tau^{2}T^{2}(\log p)^{3/2}|\widehat{\beta}-\beta^{*}|_{1}=o(1)

by assumption (A3). Finally, by Lemma A.2 and Theorem 2.7, with probability tending to 1, it holds that

|A-\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty} \leq\|\widehat{\Omega}-\Omega\|_{1}|\sqrt{n}\nabla L_{n}(\beta^{*})|_{\infty}\lesssim\|\Omega_{x}\|_{1}^{2}s\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}
\asymp s^{2}\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}.

Therefore,

|\sqrt{n}(\check{\beta}-\beta^{*})-\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}\leq|\sqrt{n}(\check{\beta}-\beta^{*})-A|_{\infty}+|A-\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}\leq\zeta_{1},

where

\zeta_{1}=s\gamma\tau^{2}T^{2}(\log p)^{3/2}|\widehat{\beta}-\beta^{*}|_{1}+\sqrt{n}T^{3}|\widehat{\beta}-\beta^{*}|_{1}^{2}+s^{2}\gamma^{2}\tau^{4}T^{4}(\log p)^{3}/\sqrt{n}. ∎
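To make the construction analyzed in this proof concrete, here is a minimal sketch of the one-step de-biasing map. It assumes a pilot estimate B_hat (e.g., a robust Lasso fit, with row k holding the coefficients of equation k), an estimate Omega_hat of the inverse Hessian, and the sign convention of the Taylor expansion above; all names are illustrative.

    import numpy as np

    def huber_psi(x, c=1.345):
        return np.clip(x, -c, c)

    def debias(X, B_hat, Omega_hat, w, psi=huber_psi):
        # One-step de-biased estimator (sketch): for each equation k,
        # B_check[k] = B_hat[k] + Omega_hat @ (1/n) sum_i psi(eps_hat_ik) w(X_{i-1}) X_{i-1},
        # mirroring the correction term in the Taylor expansion above; the sign
        # convention is an assumption of this sketch.
        n, p = X.shape[0] - 1, X.shape[1]
        weights = np.array([w(x) for x in X[:-1]])        # w(X_{i-1}), length n
        B_check = B_hat.copy()
        for k in range(p):
            resid = X[1:, k] - X[:-1] @ B_hat[k]          # eps_hat_{ik}
            score = (psi(resid) * weights) @ X[:-1] / n   # (1/n) sum_i psi * w * X_{i-1}
            B_check[k] = B_hat[k] + Omega_hat @ score
        return B_check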

Proof of Theorem 2.9.

The proof of Theorem 4.1 can be easily generalized to the p^{2}-dimensional setting, so it still holds for |\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}. By Theorem 4.1, we have for any \eta>0,

\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\big(|\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}\leq t\big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\bigg|
\lesssim f_{1}(\eta/2,m)+f_{2}(\eta/2,m)+\eta\sqrt{\log p}+\eta\sqrt{\log(1/\eta)}+cn^{-c^{\prime}}, (A.3)

where

f_{1}(x,m)=\frac{c_{1}sp^{3}\gamma^{2}\rho^{m/\tau}}{x^{2}}\quad\text{and}\quad f_{2}(x,m)=2p\exp\bigg\{-\frac{nx^{2}}{2\sqrt{s}TM\sqrt{n}x+4m\omega sT^{2}\sigma^{2}}\bigg\}. (A.4)

Now choose \eta\asymp(\log p)T\sqrt{s\tau}\gamma/n^{1/4}=o(1), \omega\asymp n^{1/2}, M\asymp n^{1/2}, and m=c\tau\log p in (A.3) for some constant c>0. For sufficiently large c, basic algebra shows that

f_{1}(\eta/2,m)\lesssim\frac{s\gamma^{2}}{p^{c-3}\eta^{2}}\asymp\frac{n^{1/2}}{p^{c-3}T^{2}\tau(\log p)^{2}}=o(1), (A.5)

since the order of p^{c-3} dominates that of n^{1/2}. Moreover,

f_{2}(\eta/2,m)\leq 2p\exp\bigg\{-\frac{c_{1}\gamma^{2}\log p}{c_{2}\gamma+c_{3}}\bigg\}\leq 2p\exp\{-c_{4}\log p\}=o(1), (A.6)

for a proper choice of the constants c_{1},c_{2},c_{3}. Also, by assumption (A5), \eta\sqrt{\log p}=o(1) and

\eta\sqrt{\log(1/\eta)}\lesssim\frac{T\sqrt{s\tau}\gamma\log p}{n^{1/4}}\sqrt{\log n}=o(1).

Thus the proof is completed. ∎

Proof of Theorem 2.11.

First, we collect several useful results.

  • (i)

    With probability at least 1-4p^{-c_{1}}, \|\Omega_{x}-\widehat{\Omega}_{x}\|_{1}\leq\|\Omega_{x}\|_{1}^{2}s\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2} and \|\Omega_{x}-\widehat{\Omega}_{x}\|_{\max}\leq\|\Omega_{x}\|_{1}^{2}\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2} by Theorem 2.5. Therefore, \|\Omega_{x}-\widehat{\Omega}_{x}\|_{1}=o(1) and \|\Omega_{x}-\widehat{\Omega}_{x}\|_{\max}=o(1) by assumption (A2).

  • (ii)

    With probability at least 1-2p^{-c_{2}}, |\mu-\widehat{\mu}|_{\infty}\lesssim\gamma\tau^{2}\sqrt{\frac{\log p}{n}}+|\widehat{\beta}-\beta^{*}|_{1}=o(1) by Lemma 3.7, where the order comes from assumptions (A1) and (A2).

  • (iii)

    Similar to the proof of Lemma 3.7, we have with probability at least 1-2p^{-c_{3}},

    \Big|\frac{1}{n}\sum_{i=1}^{n}\psi(\widehat{\varepsilon}_{ij})\psi(\widehat{\varepsilon}_{ik})-\mathbb{E}[\psi(\varepsilon_{ij})\psi(\varepsilon_{ik})]\Big|_{\infty}\lesssim\gamma\tau^{2}\sqrt{\frac{\log p}{n}}+|\widehat{\beta}-\beta^{*}|_{1}=o(1).
  • (iv)

    Similar to the proof of Theorem 3.5, we have with probability at least 1-4p^{-c_{4}},

    \Big\|\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}^{\top}w^{2}(X_{i})-\mathbb{E}[X_{i}X_{i}^{\top}w^{2}(X_{i})]\Big\|_{\max}\lesssim\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}=o(1).

Repeatedly using Lemma A.1, we get

\|\widehat{D}-D\|_{\max} \lesssim\max_{1\leq j,k\leq p}\Big|\frac{1}{\mu_{j}\mu_{k}}-\frac{1}{\widehat{\mu}_{j}\widehat{\mu}_{k}}\Big|+\Big|\frac{1}{n}\sum_{i=1}^{n}\psi(\widehat{\varepsilon}_{ij})\psi(\widehat{\varepsilon}_{ik})-\mathbb{E}[\psi(\varepsilon_{ij})\psi(\varepsilon_{ik})]\Big|_{\infty}
+2\|\Omega_{x}-\widehat{\Omega}_{x}\|_{\max}+\Big\|\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}^{\top}w^{2}(X_{i})-\mathbb{E}[X_{i}X_{i}^{\top}w^{2}(X_{i})]\Big\|_{\max}
\lesssim\gamma\tau^{2}\sqrt{\frac{\log p}{n}}+|\widehat{\beta}-\beta^{*}|_{1}+\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}
\lesssim\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}+|\widehat{\beta}-\beta^{*}|_{1}

with probability at least 1-12p^{-c}, where c=\min_{1\leq i\leq 4}c_{i}. ∎

Proof of Theorem 2.12.

By Theorem 2.8, we see that

\mathbb{P}\bigg(|\sqrt{n}(\check{\beta}-\beta^{*})-\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}>\zeta_{1}\bigg)<\zeta_{2},

where \zeta_{1}\sqrt{1\lor\log(p/\zeta_{1})}=o(1) and \zeta_{2}=o(1). Define \pi(v)=Cv^{1/3}(1\lor\log(p/v))^{2/3} for some constant C>0 and

\Gamma=\|\widehat{D}-D\|_{\max}.

Let c_{z}(\alpha)=\inf\{t\in\mathbb{R}:\mathbb{P}(|\sum_{i=1}^{n}z_{i}/\sqrt{n}|_{\infty}\leq t)\geq 1-\alpha\}, where the sequence \{z_{i}\} is defined in Theorem 2.9. From the proof of Lemma 3.2 in Chernozhukov et al., (2013), we have

\mathbb{P}\Big(c(\alpha)\leq c_{z}(\alpha+\pi(v))\Big) \geq 1-\mathbb{P}(\Gamma>v), (A.7)
\mathbb{P}\Big(c_{z}(\alpha)\leq c(\alpha+\pi(v))\Big) \geq 1-\mathbb{P}(\Gamma>v). (A.8)

Therefore, by Theorem 2.9, (A.7) and (A.8), we have for every v>0,

\sup_{\alpha\in(0,1)}\bigg|\mathbb{P}\Big(|\sqrt{n}\Omega\nabla L_{n}(\beta^{*})|_{\infty}>c(\alpha)\Big)-\alpha\bigg| \lesssim\sup_{\alpha\in(0,1)}\bigg|\mathbb{P}\Big(\big|\sum_{i=1}^{n}z_{i}/\sqrt{n}\big|_{\infty}>c(\alpha)\Big)-\alpha\bigg|+o(1)
\lesssim\pi(v)+\mathbb{P}(\Gamma>v)+o(1).

Furthermore, following the same spirit as the proof of Theorem 3.2 in Chernozhukov et al., (2013), we see that

\sup_{\alpha\in(0,1)}\bigg|\mathbb{P}\Big(\sqrt{n}|\check{\beta}-\beta^{*}|_{\infty}>c(\alpha)\Big)-\alpha\bigg|\lesssim\pi(v)+\mathbb{P}(\Gamma>v)+\zeta_{1}\sqrt{1\lor\log(p/\zeta_{1})}+\zeta_{2}+o(1).

Since \zeta_{1}\sqrt{1\lor\log(p/\zeta_{1})}=o(1) and \zeta_{2}=o(1) by Theorem 2.8, it remains to choose v>0 such that \pi(v)=o(1) and \mathbb{P}(\Gamma>v)=o(1). Let v\asymp s\gamma\tau^{2}T^{2}(\log p)^{3/2}n^{-1/2}+|\widehat{\beta}-\beta^{*}|_{1}. Then \mathbb{P}(\Gamma>v)=o(1) and \pi(v)=o(1) follow from Theorem 2.11 and the scaling assumption. ∎
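In practice the critical value c(\alpha) used in this proof is computed by the multiplier bootstrap. The sketch below assumes that scores is an n\times q array whose i-th row approximates the estimated influence term \widehat{\Omega}\psi(\widehat{\varepsilon}_{i})X_{i-1}w(X_{i-1}); the function name and its defaults are illustrative.

    import numpy as np

    def bootstrap_critical_value(scores, alpha=0.05, B=2000, seed=0):
        # Multiplier bootstrap (sketch): the (1 - alpha) quantile of the sup-norm
        # of the Gaussian-multiplier-weighted score sum approximates c(alpha).
        rng = np.random.default_rng(seed)
        n = scores.shape[0]
        stats = np.empty(B)
        for b in range(B):
            e = rng.standard_normal(n)                     # i.i.d. N(0, 1) multipliers
            stats[b] = np.abs(e @ scores).max() / np.sqrt(n)
        return np.quantile(stats, 1 - alpha)

Rejecting whenever \sqrt{n}|\check{\beta}-\beta_{0}|_{\infty} exceeds this quantile then gives a simultaneous level-\alpha test, which is the content of Theorem 2.12.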

Appendix B Proofs of Results in Section 3

Proof of Lemma 3.1.

Define the filtration \{\mathcal{F}_{i}\} with \mathcal{F}_{i}=\sigma(\varepsilon_{i},\varepsilon_{i-1},\dots), and let P_{j}(\cdot)=\mathbb{E}(\cdot|\mathcal{F}_{j})-\mathbb{E}(\cdot|\mathcal{F}_{j-1}) be the associated projection operator. Since G(X_{i}) is \mathcal{F}_{i}-measurable, P_{j}(G(X_{i}))=0 for j\geq i+1. We can write

\sum_{i=1}^{n}\big(G(X_{i})-\mathbb{E}G(X_{i})\big)=\sum_{j=-\infty}^{n}\bigg(\sum_{i=1}^{n}P_{j}(G(X_{i}))\bigg)=:\sum_{j=-\infty}^{n}L_{j},

where L_{j}=\sum_{i=1}^{n}P_{j}(G(X_{i})). By the Markov inequality, for \lambda>0, we have

\mathbb{P}\bigg(\sum_{i=1}^{n}G(X_{i})-\mathbb{E}G(X_{i})\geq 2x\bigg) \leq\mathbb{P}\bigg(\sum_{j=-\infty}^{-s}L_{j}\geq x\bigg)+\mathbb{P}\bigg(\sum_{j=-s+1}^{n}L_{j}\geq x\bigg)
\leq\mathrm{e}^{-\lambda x}\,\mathbb{E}\bigg[\exp\bigg\{\lambda\sum_{j=-\infty}^{-s}L_{j}\bigg\}\bigg]+\mathrm{e}^{-\lambda x}\,\mathbb{E}\bigg[\exp\bigg\{\lambda\sum_{j=-s+1}^{n}L_{j}\bigg\}\bigg], (B.1)

for some s>0 to be determined later. We shall bound the right-hand side of (B.1) with a suitable choice of \lambda>0. Observing that \{L_{j}\}_{j\leq n} is a sequence of martingale differences with respect to \{\mathcal{F}_{j}\}, we seek an upper bound on \mathbb{E}[\mathrm{e}^{\lambda L_{j}}|\mathcal{F}_{j-1}]. It follows that

|L_{j}| \leq\sum_{i=1\lor j}^{n}\min\Big\{\big|\mathbb{E}[G(X_{i})|\mathcal{F}_{j}]-\mathbb{E}[G(X_{i})|\mathcal{F}_{j-1}]\big|,2B\Big\}
\leq\sum_{i=1\lor j}^{n}\min\Big\{\|A^{i-j}\|_{\infty}\,\mathbb{E}\big[|\varepsilon_{j}-\varepsilon_{j}^{\prime}|_{\infty}\big|\mathcal{F}_{j}\big],2B\Big\}
\leq\sum_{i=1\lor j}^{n}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}, (B.2)

where \varepsilon^{\prime}_{j} is an i.i.d. copy of \varepsilon_{j} and \eta_{j}=\mathbb{E}\big[|\varepsilon_{j1}-\varepsilon_{j1}^{\prime}|\,\big|\,\mathcal{F}_{j}\big].

Denote s=\lfloor\tau\log p/\log(1/\rho)\rfloor+1, so that s is a positive integer and p\rho^{s/\tau}\leq 1. For -s<j\leq 0, we have

|L_{j}| \leq\sum_{i=0}^{\infty}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}
\leq\sum_{i=0}^{s-1}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}+\sum_{i=s}^{\infty}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}
\leq 2sB+\sum_{i=0}^{\infty}\min\Big\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\Big\}.

For 0<j\leq n, we also have

|L_{j}|\leq\sum_{i=j}^{\infty}\min\Big\{p\rho^{-1}\gamma\rho^{(i-j)/\tau}\eta_{j},2B\Big\}\leq 2sB+\sum_{i=0}^{\infty}\min\Big\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\Big\}.

Basic algebra shows that

\mathbb{E}[|L_{j}|^{k}|\mathcal{F}_{j-1}] \overset{(1)}{\leq}\mathbb{E}\bigg[\Big(2sB+\sum_{i=0}^{\infty}\min\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\}\Big)^{k}\bigg]
\leq\mathbb{E}\bigg[2^{k}\Big((2sB)^{k}+\Big(\sum_{i=0}^{\infty}\min\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\}\Big)^{k}\Big)\bigg]
\leq 2^{k}\bigg[(2sB)^{k}+\Big(\sum_{i=0}^{\infty}\Big\|\min\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\}\Big\|_{k}\Big)^{k}\bigg], (B.3)

where (1) comes from the independence of \eta_{j} and \mathcal{F}_{j-1}. To analyze (B.3), we further compute

\Big\|\min\{\rho^{-1}\gamma\rho^{i/\tau}\eta_{j},2B\}\Big\|_{k} =\Big\|2B\,\mathbb{I}\Big(\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\geq 2B\Big)+\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\,\mathbb{I}\Big(\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\leq 2B\Big)\Big\|_{k}
\leq 2B\bigg(\mathbb{P}\Big(\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\geq 2B\Big)\bigg)^{1/k}+\mathbb{E}\Big[\Big(\frac{\gamma}{\rho}\rho^{i/\tau}\eta_{j}\Big)^{2}(2B)^{k-2}\Big]^{1/k}
\leq\Big(4\sigma^{2}\frac{\gamma^{2}}{\rho^{2}}\Big)^{1/k}\rho^{2i/(\tau k)}(2B)^{1-2/k}. (B.4)

Plugging (B.4) into (B.3) yields, for some constants C_{1},C_{2}>0, that

\mathbb{E}[|L_{j}|^{k}|\mathcal{F}_{j-1}] \leq 2^{k}\bigg[(2sB)^{k}+4\sigma^{2}\frac{\gamma^{2}}{\rho^{2}}(2B)^{k-2}\Big(\frac{1}{1-\rho^{2/(\tau k)}}\Big)^{k}\bigg]
\overset{(1)}{\leq}2^{k}\bigg[(2sB)^{k}+4\sigma^{2}\frac{\gamma^{2}}{\rho^{2}}(2B)^{k-2}\Big(\frac{\tau k}{2}\Big)^{k}\rho^{-2/\tau}\big(\log(1/\rho)\big)^{-k}\bigg]
\overset{(2)}{\leq}2^{k}\bigg[(2sB)^{k}+C_{1}\gamma^{2}B^{-2}C_{2}^{k}B^{k}\tau^{k}k!\bigg]
\leq\gamma^{2}(Bs\tau)^{k}k![4+C_{1}B^{-2}(2C_{2})^{k}]\leq\gamma^{2}(Bs\tau)^{k}k!(1+C_{1}B^{-2})(4+2C_{2})^{k}, (B.5)

where (1) uses the inequality 1-x\geq x\log(1/x) for x\in(0,1), and (2) uses Stirling's formula and the fact that \rho^{-2/\tau}\leq\rho^{-2}. Let \tilde{C}_{1}=1+C_{1}B^{-2} and \tilde{C}_{2}=4+2C_{2}. Then we obtain

\mathbb{E}\big[\mathrm{e}^{\lambda L_{j}}|\mathcal{F}_{j-1}\big] \leq 1+\sum_{k=2}^{\infty}\Big[\tilde{C}_{1}\gamma^{2}(\tilde{C}_{2}Bs\tau\lambda)^{k}\Big]=1+\frac{\tilde{C}_{1}\gamma^{2}\tilde{C}_{2}^{2}(Bs\tau)^{2}\lambda^{2}}{1-\tilde{C}_{2}Bs\tau\lambda}
\leq\exp\bigg\{\frac{\tilde{C}_{1}\gamma^{2}\tilde{C}_{2}^{2}(Bs\tau)^{2}\lambda^{2}}{1-\tilde{C}_{2}Bs\tau\lambda}\bigg\}. (B.6)

Furthermore,

\mathbb{E}\Big[\exp\Big\{\lambda\sum_{j=-s+1}^{n}L_{j}\Big\}\Big]\leq\exp\bigg\{\frac{\tilde{C}_{1}\gamma^{2}\tilde{C}_{2}^{2}(Bs\tau)^{2}(s+n)\lambda^{2}}{1-\tilde{C}_{2}Bs\tau\lambda}\bigg\}. (B.7)

Taking \lambda=x(\tilde{C}_{2}Bs\tau x+2\tilde{C}_{1}\gamma^{2}\tilde{C}_{2}^{2}(Bs\tau)^{2}(s+n))^{-1}, by (B.1) and (B.7) we have

\mathbb{P}\bigg(\sum_{j=-s+1}^{n}L_{j}\geq x\bigg)\leq\exp\bigg\{-\frac{x^{2}}{(1+C_{1}B^{-2})\gamma^{2}B^{2}\tau^{4}(\log p)^{2}(\tau\log p+n)+C_{4}\tau^{2}B(\log p)x}\bigg\}. (B.8)

Similarly, for j\leq-s, since p\leq\rho^{-s/\tau},

|L_{j}| \leq\sum_{i=0}^{\infty}\min\Big\{\rho^{-1}\gamma\rho^{(i-j-s)/\tau}\eta_{j},2B\Big\}.

By the same argument, we immediately have

\mathbb{P}\bigg(\sum_{j=-\infty}^{-s}L_{j}\geq x\bigg)\leq\exp\bigg\{-\frac{x^{2}}{C_{3}\gamma^{2}\tau^{3}+C_{4}\tau Bx}\bigg\}, (B.9)

where C_{3}=32\mathrm{e}^{2}\sigma^{2}(2\pi)^{-1/2}[\rho^{2}\log(1/\rho)]^{-3} and C_{4}=8\mathrm{e}[\log(1/\rho)]^{-1}. By (B.8), (B.9) and a symmetrization argument, we complete the proof. ∎
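The Bernstein-type inequality just established can be sanity-checked by simulation. The sketch below estimates tail probabilities of a centered sum of a bounded functional of a VAR(1) path; the functional G, the Laplace innovations, and all constants are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, reps, a = 400, 5, 2000, 0.4

    def G(x):
        # a bounded (|G| <= B = 1) Lipschitz functional of the state
        return np.tanh(x[0])

    sums = np.empty(reps)
    for r in range(reps):
        X = np.zeros(p)
        s = 0.0
        for _ in range(n):
            X = a * X + rng.laplace(size=p)   # symmetric innovations, so E[G(X_i)] = 0
            s += G(X)
        sums[r] = s

    for x in (20.0, 40.0, 60.0):
        # the empirical tail decays roughly like exp(-x^2 / (v + b x)), as in Lemma 3.1
        print(x, np.mean(np.abs(sums) >= x))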

Proof of Corollary 3.3.

It follows from the proof of Lemma 3.1 without any extra technical difficulty. ∎

Proof of Theorem 3.5.

Let G_{jk}:\mathbb{R}^{p}\to\mathbb{R} be defined as G_{jk}(x)=\big(xx^{\top}w(x)\big)_{jk}=x_{j}x_{k}w(x) for j,k=1,\dots,p, and hence |G_{jk}(x)|\leq T. Let u(x)=w^{1/3}(x). Observe that

|G_{jk}(x)-G_{jk}(y)| \leq|x_{j}u(x)\,x_{k}u(x)-y_{j}u(y)\,y_{k}u(y)|u(x)+|y_{j}u(y)\,y_{k}u(y)||u(x)-u(y)|
\leq|x_{j}u(x)-y_{j}u(y)||x_{k}u(x)|+|x_{k}u(x)-y_{k}u(y)||y_{j}u(y)|+T^{2}|u(x)-u(y)|
\leq 3T|x-y|_{\infty}.

By Lemma 3.1, taking x=c\gamma\tau^{2}Tn^{-1/2}(\log p)^{3/2}, we have

\mathbb{P}\bigg(|\widehat{\Sigma}_{x,jk}-\Sigma_{x,jk}|\geq cTx\bigg)=\mathbb{P}\bigg(\Big|\frac{1}{n}\sum_{i=1}^{n}G_{jk}(X_{i-1})-\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}G_{jk}(X_{i-1})\Big|\geq cTx\bigg)\leq 4p^{-c_{1}}.

A union bound yields

\mathbb{P}\bigg(\|\widehat{\Sigma}_{x}-\Sigma_{x}\|_{\max}\geq cTx\bigg)\leq 4p^{-c_{0}},

where c_{0}=c_{1}-2>0, since the union bound runs over the p^{2} entries of \widehat{\Sigma}_{x}. ∎

Proof of Theorem 3.6.

Denote H=\nabla^{2}L_{n}(\beta^{*}). First we write

\|H-\Sigma\|_{\max}=\max_{1\leq k\leq p}\Big\|\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\varepsilon_{ik})X_{i-1}X_{i-1}^{\top}w(X_{i-1})-\mathbb{E}[\psi^{\prime}(\varepsilon_{ik})X_{i-1}X_{i-1}^{\top}w(X_{i-1})]\Big\|_{\max}.

Using Corollary 3.3, it follows from the same argument as in the proof of Theorem 3.5 that, for some constant c_{1}>1 and each fixed 1\leq k\leq p, with probability at least 1-4p^{-c_{1}},

\Big\|\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\varepsilon_{ik})X_{i-1}X_{i-1}^{\top}w(X_{i-1})-\mathbb{E}[\psi^{\prime}(\varepsilon_{ik})X_{i-1}X_{i-1}^{\top}w(X_{i-1})]\Big\|_{\max}\lesssim\gamma\tau^{2}T^{2}n^{-1/2}(\log p)^{3/2}.

Finally, a union bound over 1kp1\leq k\leq p yields the conclusion. ∎

Proof of Lemma 3.7.

The strategy is to consider each component of \widehat{\mu} and take a union bound. Observe that

|\widehat{\mu}_{k}-\mu_{k}|\leq\bigg|\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\widehat{\varepsilon}_{ik})-\mathbb{E}[\psi^{\prime}(\widehat{\varepsilon}_{ik})]\bigg|+|\mathbb{E}\psi^{\prime}(\widehat{\varepsilon}_{ik})-\mathbb{E}\psi^{\prime}(\varepsilon_{ik})|,\quad k=1,2,\dots,p.

Since |\psi^{\prime\prime}| is bounded, by the mean value theorem we have that, for some \xi between x and y,

|\psi^{\prime}(x)-\psi^{\prime}(y)|=|\psi^{\prime\prime}(\xi)(x-y)|\lesssim|x-y|.

So it can be verified that \psi^{\prime}(X_{ik}-X_{i-1}^{\top}\widehat{\beta}_{k}) satisfies the conditions in Corollary 2.5 of Liu and Zhang, (2021). By that corollary, it holds that

\bigg|\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime}(\widehat{\varepsilon}_{ik})-\mathbb{E}[\psi^{\prime}(\widehat{\varepsilon}_{ik})]\bigg|\lesssim\gamma\tau^{2}\sqrt{\frac{\log p}{n}}

with probability at least 1-2p^{-c} for some positive constant c. Moreover, by the Lipschitz property of \psi^{\prime} above,

\max_{1\leq k\leq p}|\mathbb{E}\psi^{\prime}(\widehat{\varepsilon}_{ik})-\mathbb{E}\psi^{\prime}(\varepsilon_{ik})|\lesssim\max_{1\leq k\leq p}\mathbb{E}\big[|X_{i-1}^{\top}(\widehat{\beta}_{k}-\beta^{*}_{k})|\big]\lesssim|\widehat{\beta}-\beta^{*}|_{1},

where the last inequality comes from the fact that X_{i-1} has bounded second moment. ∎
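As a small illustration of the plug-in quantity \widehat{\mu} analyzed here: for the Huber score, \psi^{\prime}(x)=\mathbb{I}(|x|\leq c), so \widehat{\mu}_{k} is simply the fraction of residuals falling inside the clipping region. The helper below is a sketch under that choice of \psi; the names are illustrative.

    import numpy as np

    def mu_hat(X, B_hat, c=1.345):
        # Plug-in estimate of mu_k = E[psi'(eps_ik)] from residuals (sketch),
        # with B_hat the p x p pilot coefficient matrix (row k = equation k).
        resid = X[1:] - X[:-1] @ B_hat.T      # n x p matrix of eps_hat_{ik}
        return (np.abs(resid) <= c).mean(axis=0)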

Appendix C Proofs of Results in Section 4

Before proving Theorem 4.1, we first state and prove the corresponding lemmas in the outline given at the end of Section 4.

Lemma C.1.

Suppose \mathbb{E}[\varepsilon_{ik}^{2}]\leq\sigma^{2} for all 1\leq k\leq p and the odd function \psi(\cdot) satisfies |\psi(\cdot)|\leq C and |\psi^{\prime}(\cdot)|\leq C. Then we have

\mathbb{P}\bigg(\big|(T_{Y}-T_{Y,m})/\sqrt{n}\big|_{\infty}\geq x\bigg)\leq\frac{c_{1}sp^{3}\gamma^{2}\rho^{m/\tau}}{x^{2}}=:f_{1}(x,m),

for some constant c_{1}>0.

Proof of Lemma C.1.

Let D_{i}=Y_{i}-Y_{i,m}. By Markov's inequality, we have

\mathbb{P}\bigg(\sum_{i=1}^{n}D_{ij}/\sqrt{n}\geq x\bigg)\leq\frac{\mathbb{E}\big[\big(\sum_{i=1}^{n}D_{ij}/\sqrt{n}\big)^{2}\big]}{x^{2}}. (C.1)

Notice that the martingale differences \{D_{ij}\}_{i=1}^{n} satisfy

|D_{ij}|\lesssim\sqrt{s}|X_{i}-X_{i,m}|_{\infty}.

Thus,

\|D_{ij}\|_{2}\lesssim\sqrt{s}\big\||X_{i}-X_{i,m}|_{\infty}\big\|_{2}\leq\sqrt{s}\sum_{l=m+1}^{\infty}\|A^{l}\|_{\infty}\big\||\varepsilon_{i-l}|_{\infty}\big\|_{2}\lesssim\sqrt{s}p\gamma\rho^{m/\tau}.

By the Burkholder inequality (Burkholder, (1973)), we have

\mathbb{E}\bigg[\bigg(\sum_{i=1}^{n}D_{ij}/\sqrt{n}\bigg)^{2}\bigg] \lesssim\mathbb{E}[|D_{ij}|^{2}]\lesssim sp^{2}\gamma^{2}\rho^{2m/\tau}. (C.2)

Hence, by (C.1),

\mathbb{P}\bigg(\sum_{i=1}^{n}D_{ij}/\sqrt{n}\geq x\bigg)\leq\frac{c_{1}^{\prime}sp^{2}\gamma^{2}\rho^{2m/\tau}}{x^{2}}.

Finally, a symmetrization argument and a union bound over the coordinates, together with \rho^{2m/\tau}\leq\rho^{m/\tau}, give the desired result. ∎
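The m-approximation Y_{i,m} behind this lemma couples the process with a version driven only by the m most recent innovations, and the coupling error decays like \rho^{m/\tau}. This geometric decay is easy to see numerically; the diagonal transition matrix and all constants below are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    p, a, N = 10, 0.5, 200
    A = a * np.eye(p)

    def terminal_state(eps):
        # run the VAR(1) recursion X_i = A X_{i-1} + eps_i from zero
        X = np.zeros(p)
        for e in eps:
            X = A @ X + e
        return X

    eps = rng.standard_normal((N, p))
    for m in (2, 5, 10, 20):
        eps_m = eps.copy()
        eps_m[:-m] = rng.standard_normal((N - m, p))   # redraw innovations older than m
        gap = np.abs(terminal_state(eps) - terminal_state(eps_m)).max()
        print(m, gap)    # decays geometrically in m, like rho^{m/tau} (= a^m here)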

Lemma C.2.

Under the assumptions in Lemma C.1, it holds that

\mathbb{P}\big(|T_{Y,S}|_{\infty}/\sqrt{n}\geq x\big)\leq 2p\exp\bigg\{-\frac{nx^{2}}{C_{1}\sqrt{s}T\sqrt{n}x+C_{2}m\omega sT^{2}\sigma^{2}}\bigg\}=:f_{2}(x,m).
Proof of Lemma C.2.

By the property of \psi(\cdot) and the mean value theorem, we have |\psi(x)|\leq C|x|. Consider the first coordinate (T_{Y,S})_{1} of T_{Y,S}. We can write (T_{Y,S})_{1}=\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}, where r=m\omega. Observe that \{Y_{i,m,1}\} is a martingale difference sequence adapted to the filtration \{\mathcal{F}_{i}=\sigma(\varepsilon_{i},\varepsilon_{i-1},\dots)\} and that |Y_{i,m,1}|\leq|\psi(\varepsilon_{ik})|\sqrt{s}T\leq C\sqrt{s}T. We shall establish a Bernstein-type inequality for the sum of martingale differences (T_{Y,S})_{1}:

\mathbb{P}((T_{Y,S})_{1}\geq x)\leq\mathrm{e}^{-\lambda x}\,\mathbb{E}\mathrm{e}^{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}},\quad\text{for any }\lambda>0. (C.3)

We now bound \mathbb{E}\mathrm{e}^{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}} from above. By the tower property,

\mathbb{E}\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}\Big\} =\mathbb{E}\bigg[\mathbb{E}\Big[\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}\Big\}\Big|\mathcal{F}_{j_{r}-1}\Big]\bigg]
=\mathbb{E}\bigg[\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r-1}}Y_{i,m,1}\Big\}\mathbb{E}\big[\mathrm{e}^{\lambda Y_{j_{r},m,1}}\big|\mathcal{F}_{j_{r}-1}\big]\bigg] (C.4)

Now, consider

\mathbb{E}[\mathrm{e}^{\lambda Y_{j_{r},m,1}}|\mathcal{F}_{j_{r}-1}] =1+\mathbb{E}\Big[\sum_{t=2}^{\infty}\frac{(\lambda Y_{j_{r},m,1})^{t}}{t!}\Big|\mathcal{F}_{j_{r}-1}\Big]\leq 1+\mathbb{E}\Big[\lambda^{2}T^{2}s\psi^{2}(\varepsilon_{j_{r}k})\sum_{t=0}^{\infty}(\lambda T\sqrt{s}C)^{t}\Big]
\overset{(1)}{\leq}1+\frac{C\lambda^{2}T^{2}s\sigma^{2}}{1-C\lambda T\sqrt{s}}\leq\exp\bigg\{\frac{C\lambda^{2}T^{2}s\sigma^{2}}{1-C\lambda T\sqrt{s}}\bigg\}, (C.5)

where inequality (1) makes use of the fact that \psi^{2}(\varepsilon_{j_{r},k})\leq C^{2}\varepsilon_{j_{r},k}^{2}. Plugging (C.5) into (C.4), we obtain

\mathbb{E}\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}\Big\}\leq\exp\bigg\{\frac{C\lambda^{2}T^{2}s\sigma^{2}}{1-C\lambda T\sqrt{s}}\bigg\}\mathbb{E}\bigg[\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r-1}}Y_{i,m,1}\Big\}\bigg] (C.6)

Iterating this procedure yields

\mathbb{E}\exp\Big\{\lambda\sum_{i=j_{1}}^{j_{r}}Y_{i,m,1}\Big\}\leq\exp\bigg\{\frac{Cm\omega\lambda^{2}T^{2}s\sigma^{2}}{1-C\lambda T\sqrt{s}}\bigg\} (C.7)

Choose \lambda=x(CT\sqrt{s}x+2Cm\omega T^{2}s\sigma^{2})^{-1}; then by (C.3) and (C.7) we have

\mathbb{P}((T_{Y,S})_{1}\geq x)\leq\exp\bigg\{-\frac{x^{2}}{C_{1}T\sqrt{s}x+C_{2}m\omega T^{2}s\sigma^{2}}\bigg\}.

A symmetrization argument and a union bound deliver the desired result. ∎
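The decomposition into T_{Y,L} and T_{Y,S} rests on an alternating big/small block partition of the time indices: large blocks of length M are kept for the Gaussian approximation, while small blocks of length m separate them and are discarded. A minimal sketch of this bookkeeping follows, with M\asymp n^{1/2} and m\asymp\tau\log p as in the proof of Theorem 2.9; the function name is illustrative.

    import numpy as np

    def block_indices(n, M, m):
        # Alternating big/small block partition (sketch): large blocks feed
        # T_{Y,L}; the small separating blocks form T_{Y,S}.
        large, small, i = [], [], 0
        while i < n:
            large.append(np.arange(i, min(i + M, n)))
            i += M
            if i < n:
                small.append(np.arange(i, min(i + m, n)))
                i += m
        return large, small

    large, small = block_indices(n=1000, M=32, m=8)
    print(len(large), "large blocks,", len(small), "small blocks")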

Lemma C.3.

Suppose the scaling condition sT^{2}(\log(pn))^{7}/n\leq c_{3}n^{-c_{4}} holds, and assume that \mathbb{E}[X_{ik}^{2}]\leq C^{\prime} for all 1\leq k\leq p. Then we have the following Gaussian approximation result:

\mathcal{U}:=\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\big(|T_{Y,L}/\sqrt{n}|_{\infty}\leq t\big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\bigg|\leq cn^{-c^{\prime}}

for some constants c,c>0c,c^{\prime}>0.

Proof of Lemma C.3.

Recall that \xi_{b}=\sum_{i\in L_{b}}Y_{i,m}/\sqrt{M}; thus

\mathcal{U}=\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\Big(\Big|\frac{1}{\sqrt{\omega}}\sum_{b=1}^{\omega}\xi_{b}\Big|_{\infty}\leq t\Big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\bigg|.

Observe that \xi_{1},\xi_{2},\dots,\xi_{\omega} are independent random variables. We shall apply Corollary 2.1 of Chernozhukov et al., (2013) by verifying condition (E.1) therein. For completeness, the conditions are stated below.

  • (i)

    c_{1}\leq\mathbb{E}[\xi_{bj}^{2}]\leq c_{2} for all 1\leq j\leq p.

  • (ii)

    \max_{k=1,2}\mathbb{E}[|\xi_{bj}|^{2+k}/B_{n}^{k}]+\mathbb{E}[\exp(|\xi_{bj}/B_{n}|)]\leq 4 for some B_{n}>0 and all 1\leq j\leq p.

  • (iii)

    B_{n}^{2}(\log(pn))^{7}/n\leq c_{3}n^{-c_{4}}.

To verify condition (i), we see that

\mathbb{E}[\xi_{bj}^{2}]\leq c\,\sigma^{2}\mathbb{E}\big[\mathbb{E}[\Omega_{x,j}^{\top}X_{i}w(X_{i})|\varepsilon_{i-m},\dots,\varepsilon_{i}]^{2}\big]\leq c\,\mathbb{E}[(\Omega_{x,j}^{\top}X_{i}w(X_{i}))^{2}]\leq c\,\Omega_{x,j}^{\top}\Sigma_{x}\Omega_{x,j}\leq c\,\Omega_{x,jj},

where \Omega_{x,j} is the j-th row of \Omega_{x} and \Omega_{x,jj} is the j-th diagonal entry of \Omega_{x}. Now we check condition (ii). By Theorem 3.2 of Burkholder, (1973), we have for k\geq 2,

\mathbb{E}[|\xi_{bj}|^{k}]\leq 18k^{k}\mathbb{E}\big[|Y_{ij,m}|^{k}\big]\lesssim k!\,\mathrm{e}^{k}\mathbb{E}\big[|Y_{ij,m}|^{2}\big](\sqrt{s}T)^{k-2}\lesssim k!\,\mathrm{e}^{k}(\sqrt{s}T)^{k-2}.

Therefore, taking B_{n}=C\sqrt{s}T for sufficiently large C>0, we have

\mathbb{E}[\exp(|\xi_{bj}/B_{n}|)]\leq 1+C_{1}\sum_{k=1}^{\infty}(\mathrm{e}/C)^{k}<2.

Moreover, for a suitable choice of C>0,

\max_{k=1,2}\mathbb{E}[|\xi_{bj}|^{2+k}/B_{n}^{k}]<2.

Hence, condition (ii) is satisfied. Condition (iii) is guaranteed by the scaling assumption. ∎
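The conclusion of Lemma C.3 can be visualized with surrogate block sums: independent, exponential-tailed vectors standing in for \xi_{b}. The sketch below compares the distribution of the maximum of their normalized sum with the corresponding Gaussian maximum; the Laplace surrogate and all sizes are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    w, p, reps = 60, 40, 3000
    xi = rng.laplace(size=(reps, w, p))                    # surrogate block sums xi_b
    T = np.abs(xi.sum(axis=1)).max(axis=1) / np.sqrt(w)    # |w^{-1/2} sum_b xi_b|_inf

    Z = np.sqrt(2.0) * rng.standard_normal((reps, p))      # Var of Laplace(0, 1) is 2
    maxZ = np.abs(Z).max(axis=1)

    for q in (0.5, 0.9, 0.95):
        t = np.quantile(maxZ, q)
        print(q, (T <= t).mean(), (maxZ <= t).mean())      # the two CDFs nearly agree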

Now, we are ready to give the proof of Theorem 4.1.

Proof of Theorem 4.1.

By triangle inequality,

\mathcal{H} \leq\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\big(|T_{Y}/\sqrt{n}|_{\infty}\leq t\big)-\mathbb{P}\big(|T_{Y,L}/\sqrt{n}|_{\infty}\leq t\big)\bigg|+\sup_{t\in\mathbb{R}}\bigg|\mathbb{P}\big(|T_{Y,L}/\sqrt{n}|_{\infty}\leq t\big)-\mathbb{P}\big(|Z|_{\infty}\leq t\big)\bigg|
=:I+II. (C.8)

For any η>0\eta>0, elementary calculation shows that

I \leq\mathbb{P}\big(\big|(T_{Y}-T_{Y,L})/\sqrt{n}\big|_{\infty}>\eta\big)+\sup_{t\in\mathbb{R}}\mathbb{P}\Big(\big||T_{Y,L}/\sqrt{n}|_{\infty}-t\big|\leq\eta\Big)
\leq\mathbb{P}\Big(\big|(T_{Y}-T_{Y,m})/\sqrt{n}\big|_{\infty}>\frac{\eta}{2}\Big)+\mathbb{P}\Big(\big|T_{Y,S}/\sqrt{n}\big|_{\infty}>\frac{\eta}{2}\Big)+\sup_{t\in\mathbb{R}}\mathbb{P}\Big(\big||T_{Y,L}/\sqrt{n}|_{\infty}-t\big|\leq\eta\Big) (C.9)

By Lemmas C.1 and C.2,

\mathbb{P}\Big(\big|(T_{Y}-T_{Y,m})/\sqrt{n}\big|_{\infty}>\frac{\eta}{2}\Big)\leq f_{1}(\eta/2,m)\quad\text{and}\quad\mathbb{P}\Big(\big|T_{Y,S}/\sqrt{n}\big|_{\infty}>\frac{\eta}{2}\Big)\leq f_{2}(\eta/2,m). (C.10)

By Lemma C.3 and Theorem 3 of Chernozhukov et al., (2015), we obtain that

\sup_{t\in\mathbb{R}}\mathbb{P}\Big(\big||T_{Y,L}/\sqrt{n}|_{\infty}-t\big|\leq\eta\Big) \leq\sup_{t\in\mathbb{R}}\mathbb{P}\Big(\big||Z|_{\infty}-t\big|\leq\eta\Big)+\mathcal{U}
\lesssim\eta\sqrt{\log p}+\eta\sqrt{\log(1/\eta)}+cn^{-c^{\prime}}, (C.11)

and that

II=\mathcal{U}\leq cn^{-c^{\prime}}. (C.12)

By (C.9), (C.10), (C.11) and (C.12), we obtain the inequality stated in the theorem. ∎

References

  • Basu and Michailidis, (2015) Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567.
  • Bernstein, (1946) Bernstein, S. (1946). The theory of probabilities.
  • Bickel, (1975) Bickel, P. J. (1975). One-step Huber estimates in the linear model. Journal of the American Statistical Association, 70(350):428–434.
  • Bühlmann and Van De Geer, (2011) Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
  • Burkholder, (1973) Burkholder, D. L. (1973). Distribution function inequalities for martingales. The Annals of Probability, pages 19–42.
  • Cai et al., (2011) Cai, T., Liu, W., and Luo, X. (2011). A constrained 1\ell_{1} minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607.
  • Candes and Tao, (2007) Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
  • Chernozhukov et al., (2015) Chernozhukov, V., Chetverikov, D., and Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields, 162(1):47–70.
  • Chernozhukov et al., (2013) Chernozhukov, V., Chetverikov, D., Kato, K., et al. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819.
  • Fan et al., (2017) Fan, J., Li, Q., and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society. Series B, Statistical methodology, 79(1):247.
  • Friedman et al., (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
  • Han et al., (2015) Han, F., Lu, H., and Liu, H. (2015). A direct estimation of high dimensional stationary vector autoregressions. Journal of Machine Learning Research.
  • Hsu et al., (2008) Hsu, N.-J., Hung, H.-L., and Chang, Y.-M. (2008). Subset selection for vector autoregressive processes using lasso. Computational Statistics & Data Analysis, 52(7):3645–3657.
  • Javanmard and Montanari, (2014) Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909.
  • Juselius, (2006) Juselius, K. (2006). The cointegrated VAR model: methodology and applications. Oxford University Press.
  • Li and Zhu, (2008) Li, Y. and Zhu, J. (2008). L1-norm quantile regression. Journal of Computational and Graphical Statistics, 17(1):163–185.
  • Liu and Zhang, (2021) Liu, L. and Zhang, D. (2021). Robust estimation of high-dimensional vector autoregressive models. arXiv preprint arXiv:2109.10354.
  • Loh, (2017) Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866–896.
  • Loh, (2018) Loh, P.-L. (2018). Scale calibration for high-dimensional robust regression. arXiv preprint arXiv:1811.02096.
  • Loh and Wainwright, (2012) Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, pages 1637–1664.
  • Loh and Wainwright, (2017) Loh, P.-L. and Wainwright, M. J. (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455–2482.
  • Massart, (2007) Massart, P. (2007). Concentration inequalities and model selection, volume 6. Springer.
  • Meinshausen and Bühlmann, (2006) Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
  • Merlevède et al., (2009) Merlevède, F., Peligrad, M., Rio, E., et al. (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In High dimensional probability V: the Luminy volume, pages 273–292. Institute of Mathematical Statistics.
  • Nardi and Rinaldo, (2011) Nardi, Y. and Rinaldo, A. (2011). Autoregressive process modeling via the lasso procedure. Journal of Multivariate Analysis, 102(3):528–549.
  • Negahban and Wainwright, (2011) Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069–1097.
  • Negahban et al., (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.
  • Pan et al., (2021) Pan, X., Sun, Q., and Zhou, W.-X. (2021). Iteratively reweighted \ell_{1}-penalized robust regression. Electronic Journal of Statistics, 15(1):3287–3348.
  • Shan, (2005) Shan, J. (2005). Does financial development 'lead' economic growth? A vector auto-regression appraisal. Applied Economics, 37(12):1353–1367.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
  • Van de Geer et al., (2014) Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.
  • Wu, (2005) Wu, W. B. (2005). Nonlinear system theory: Another look at dependence. Proceedings of the National Academy of Sciences, 102(40):14150–14154.
  • Wu and Liu, (2009) Wu, Y. and Liu, Y. (2009). Variable selection in quantile regression. Statistica Sinica, pages 801–817.
  • Wu and Zhou, (2010) Wu, Y. and Zhou, X. (2010). VAR models: estimation, inferences, and applications. In Handbook of Quantitative Finance and Risk Management, pages 1391–1398. Springer.
  • Yuan and Lin, (2007) Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.
  • Zhang and Zhang, (2014) Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.
  • Zhang, (2019) Zhang, D. (2019). Robust estimation of the mean and covariance matrix for high dimensional time series. Statistica Sinica, to appear.
  • Zhang and Wu, (2017) Zhang, D. and Wu, W. B. (2017). Gaussian approximation for high dimensional time series. The Annals of Statistics, 45(5):1895–1919.
  • Zhang and Cheng, (2017) Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768.
  • Zou, (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.