
Minimax Posterior Convergence Rates and Model Selection Consistency in High-dimensional DAG Models based on Sparse Cholesky Factors

Kyoungjae Lee (klee25@nd.edu), Jaeyong Lee (leejyc@gmail.com) and Lizhen Lin (lizhen.lin@nd.edu)

Department of Applied and Computational Mathematics and Statistics, The University of Notre Dame, Notre Dame, Indiana 46556, USA

Department of Statistics, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, South Korea

The University of Notre Dame and Seoul National University

Supplementary to “Minimax Posterior Convergence Rates and Model Selection Consistency in High-dimensional DAG Models based on Sparse Cholesky Factors”

Abstract

In this paper, we study high-dimensional sparse directed acyclic graph (DAG) models under the empirical sparse Cholesky prior. Among our results, strong model selection consistency or graph selection consistency is obtained under more general conditions than those in the existing literature. Compared to Cao, Khare and Ghosh (2017), the required conditions are weakened in terms of the dimensionality, sparsity and lower bound of the nonzero elements in the Cholesky factor. Furthermore, our result does not require the irrepresentable condition, which is necessary for Lasso-type methods. We also derive posterior convergence rates for precision matrices and Cholesky factors with respect to various matrix norms. The obtained posterior convergence rates are the fastest among those of the existing Bayesian approaches. In particular, we prove that our posterior convergence rates for Cholesky factors are minimax or at least nearly minimax, depending on the size of the true sparsity relative to the dimension. A simulation study confirms that the proposed method outperforms the competing methods.

Abstract

In this supplement, we present the remaining proofs for posterior convergence rates and other auxiliary results.

Keywords: DAG model, Precision matrix, Cholesky factor, Posterior convergence rate, Strong model selection consistency

MSC subject classifications: 62C20, 62F15, 62C12

1 Introduction

Detecting the dependence structure of multivariate data is an important and challenging task, especially when the number of variables is much larger than the sample size. Due to advancements in technology, such data are routinely collected in a wide range of areas including genomics, climatology, proteomics and neuroimaging. The estimation of the covariance (or precision) matrix is crucial for revealing the dependence structure. Under the high-dimensional setting, however, the traditional sample covariance matrix is no longer a consistent estimator of the true covariance matrix (Johnstone and Lu, 2009). For consistent estimation of high-dimensional covariance or precision matrices, various restrictive matrix classes have been proposed to reduce the number of effective parameters. They include bandable matrices (Bickel and Levina, 2008; Cai, Zhang and Zhou, 2010; Cai and Yuan, 2012; Banerjee and Ghosal, 2014), sparse matrices (Cai and Zhou, 2012a, b; Banerjee and Ghosal, 2015) and low-dimensional structural matrices such as the sparse spiked covariance (Cai, Ma and Wu, 2015; Gao and Zhou, 2015) and sparse factor models (Fan, Fan and Lv, 2008; Pati et al., 2014). When the class of sparse matrices is of interest, the sparsity pattern can be encoded in many different ways. Sparsity can be imposed on the covariance matrix, the precision matrix or the Cholesky factor, leading to different graphical models. In this paper, we focus on imposing sparsity on the Cholesky factor of the precision matrix.

Consider a sample of data $X_1, \ldots, X_n \overset{i.i.d.}{\sim} N_p(0, \Sigma_n)$, where $N_p(\mu, \Sigma)$ is the $p$-dimensional normal distribution with mean vector $\mu \in \mathbb{R}^p$ and covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. For every positive definite matrix $\Omega_n = \Sigma_n^{-1}$, the modified Cholesky decomposition (MCD) guarantees the existence of a unique Cholesky factor $A_n$ and diagonal matrix $D_n$ such that $\Omega_n = (I_p - A_n)^T D_n^{-1} (I_p - A_n)$. The sparsity of a Gaussian directed acyclic graph (DAG) can be uniquely encoded by the Cholesky factor $A_n$ through the structure of the graph. In this paper, we assume that the parent ordering of the variables is known, which is a common assumption used in the literature such as in Ben-David et al. (2015), Khare et al. (2016), Yu and Bien (2017) and Cao, Khare and Ghosh (2017). The details of this concept will be provided in subsection 2.2. For the estimation of the Cholesky factor $A_n$, the banded assumption and the sparsity assumption are two popular assumptions. Under the banded assumption, the elements of the matrix far from the diagonal are assumed to be all zero, while under the sparsity assumption, there is no constraint on the zero-pattern other than assuming most of the entries are zero. In recent years, various penalized likelihood estimators have been proposed with the sparsity assumption on $A_n$ (Huang et al., 2006; Rothman, Levina and Zhu, 2010; Shojaie and Michailidis, 2010; van de Geer and Bühlmann, 2013; Khare et al., 2016) and the banded assumption on $A_n$ (Yu and Bien, 2017).

On the Bayesian side, relatively few works have dealt with asymptotic properties of the posteriors of high-dimensional Gaussian DAG models. Posterior convergence rates for precision matrices with $G$-Wishart priors (Roverato, 2000) were derived by Banerjee and Ghosal (2014) and Xiang, Khare and Ghosh (2015), where $G$ is a decomposable graph. Note that a decomposable graph can be converted to a perfect DAG, a special case of the DAGs, by ignoring directions. Lee and Lee (2017) obtained the posterior convergence rates and minimax lower bounds for precision matrices, but only bandable Cholesky factors were considered. Posterior convergence rates for precision matrices as well as strong model selection consistency were recently derived by Cao, Khare and Ghosh (2017) for sparse DAG models. However, their results are not adaptive to the unknown sparsity $s_0$, and the conditions required for obtaining such results are somewhat restrictive.

In this paper, we consider high-dimensional sparse Gaussian DAG models where sparsity is imposed via the sparse Cholesky factor. We adopt an empirical Bayes approach with a fractional likelihood. The empirical Bayes approach is justified by showing desirable asymptotic properties of the induced posterior such as strong model selection consistency and optimal posterior convergence rates. In addition, our theoretical results are adaptive to the unknown sparsity $s_0$.

There are four main contributions of this work. First, we show strong model selection consistency under much more general conditions than those in the literature. Specifically, the required conditions on the dimensionality, sparsity, structure of the Cholesky factor $A_n$ and the lower bound of the nonzero elements in $A_n$ (the beta-min condition, which will be described later) are significantly weakened. Second, we derive the minimax or nearly minimax posterior convergence rates for the Cholesky factors under two scenarios: with or without the beta-min condition for the true Cholesky factor. We show that at least one of the posterior convergence rates is minimax, depending on the size of the true sparsity relative to the dimension. To the best of our knowledge, this is the first result on minimax posterior convergence rates in high-dimensional DAG models. Third, we obtain posterior convergence rates for precision matrices with respect to the spectral norm and matrix $\ell_\infty$ norm, which are the fastest among those of existing Bayesian approaches. Compared to Cao, Khare and Ghosh (2017), we achieve a faster posterior convergence rate under more general conditions, except for the bounded eigenvalue condition. Furthermore, their results depend on the unknown sparsity $s_0$, whereas ours do not. Fourth, our method significantly improves the model selection performance in practice. In particular, our method outperforms the other state-of-the-art methods in a simulation study. The theoretical choice of hyperparameters provides good guidelines for practical performance. Note that the choice of the hyperparameter, the individual edge probability $q_n$, in the hierarchical DAG-Wishart prior (Cao, Khare and Ghosh, 2017) can be problematic in practice, as the posterior with the theoretical choice of $q_n$ tends to get stuck at very small models.

The rest of the paper is organized as follows. In section 2, we introduce notation, Gaussian DAG models, the empirical sparse Cholesky prior, the fractional posterior and the parameter class for precision matrices. In section 3, strong model selection consistency, posterior convergence rates and minimax lower bounds for the Cholesky factor, and posterior convergence rates for precision matrices are established. A simulation study focusing on the model selection property is presented in section 4. The proofs of the main results are provided in the supplemental article (Lee, Lee and Lin, 2018).

2 Preliminaries

2.1 Norms and Notations

For any $a, b \in \mathbb{R}$, we denote $a \vee b$ and $a \wedge b$ as the maximum and minimum of $a$ and $b$, respectively. For any $a \in \mathbb{R}$, we denote $\lfloor a \rfloor$ as the largest integer strictly smaller than $a$. For any sequences $a_n$ and $b_n$, $a_n = o(b_n)$ denotes $a_n / b_n \to 0$ as $n \to \infty$. We denote $a_n = O(b_n)$, or equivalently $a_n \lesssim b_n$, if $a_n \leq C b_n$ for some universal constant $C > 0$. We denote the indicator function for a set $A$ as $I(\cdot \in A)$ or $I_A(\cdot)$. For a given $p$-dimensional vector $u = (u_1, \ldots, u_p)^T$ and a set $S \subseteq \{1, \ldots, p\}$, we define $u_S = (u_j)_{j \in S}^T \in \mathbb{R}^{|S|}$, where $|S|$ is the cardinality of $S$. For given index sets $S, S' \subseteq \{1, \ldots, p\}$ and a real matrix $B \in \mathbb{R}^{p \times p}$, we denote $B_{(S, S')}$ as the $|S| \times |S'|$ submatrix consisting only of the $S$th rows and $S'$th columns of $B$, and let $B_S = B_{(S,S)}$. For a real matrix $B$, we denote $S_B$ as the index set of the nonzero elements of $B$ and call $S_B$ the support of $B$. We define $\mathcal{C}_p$ as the class of all $p \times p$ dimensional positive definite matrices. For any $p \times p$ symmetric matrix $B$, $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$ are the minimum and maximum eigenvalues of $B$, respectively.

For any $p$-dimensional vector $u = (u_1, \ldots, u_p)^T$, we define the vector norms $\|u\|_1 = \sum_{j=1}^p |u_j|$, $\|u\|_2 = (\sum_{j=1}^p u_j^2)^{1/2}$ and $\|u\|_{\max} = \max_{1 \leq j \leq p} |u_j|$. For any $p \times p$ matrix $B = (b_{ij})$, we define the spectral norm, matrix $\ell_1$ norm, matrix $\ell_\infty$ norm and Frobenius norm by

\|B\| = \sup_{x \in \mathbb{R}^p, \|x\|_2 = 1} \|Bx\|_2 \,\,=\,\, \big(\lambda_{\max}(B^T B)\big)^{1/2},
\|B\|_1 = \sup_{x \in \mathbb{R}^p, \|x\|_1 = 1} \|Bx\|_1 \,\,=\,\, \max_{1 \leq j \leq p} \sum_{i=1}^p |b_{ij}|,
\|B\|_\infty = \sup_{x \in \mathbb{R}^p, \|x\|_{\max} = 1} \|Bx\|_{\max} \,\,=\,\, \max_{1 \leq i \leq p} \sum_{j=1}^p |b_{ij}|, \quad \text{and}
\|B\|_F = \Big(\sum_{i=1}^p \sum_{j=1}^p b_{ij}^2\Big)^{1/2},

respectively.
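As a quick numerical illustration (ours, not the paper's), the following Python snippet evaluates these four norms with numpy on an arbitrary matrix and checks the identity $\|B\| = (\lambda_{\max}(B^T B))^{1/2}$.

```python
import numpy as np

# Illustration only: evaluate the spectral, matrix l1, matrix l_inf and
# Frobenius norms of an arbitrary matrix B with numpy.
B = np.array([[1.0, -2.0, 0.0],
              [0.5,  3.0, 1.0],
              [0.0, -1.0, 2.0]])

spectral  = np.linalg.norm(B, 2)        # ||B||    : largest singular value
ell_1     = np.linalg.norm(B, 1)        # ||B||_1  : maximum absolute column sum
ell_inf   = np.linalg.norm(B, np.inf)   # ||B||_inf: maximum absolute row sum
frobenius = np.linalg.norm(B, 'fro')    # ||B||_F  : root of the sum of squares

# ||B|| equals the square root of the largest eigenvalue of B^T B.
assert np.isclose(spectral, np.sqrt(np.linalg.eigvalsh(B.T @ B).max()))
print(spectral, ell_1, ell_inf, frobenius)
```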

For a given positive integer $m$, we denote $\chi^2_m$ as the chi-square distribution with $m$ degrees of freedom. For any random variables $Y_1, Y_2$ and $Y_3$, we denote $Y_1 \overset{d}{\equiv} Y_2 \oplus Y_3$ if the distribution of $Y_1$ is equal to that of $Y_2 + Y_3$, where $Y_2$ and $Y_3$ are independent. For given positive numbers $a$ and $b$, $Gamma(a, b)$ and $IG(a, b)$ are the gamma distribution and inverse-gamma distribution with shape parameter $a$ and rate parameter $b$, respectively. $Beta(a, b)$ is the beta distribution whose density function at $x \in (0, 1)$ is proportional to $x^{a-1}(1-x)^{b-1}$. We denote $N_p(X \mid \mu, \Sigma)$ as the density function of $N_p(\mu, \Sigma)$ at $X \in \mathbb{R}^p$. We denote the inverse-Wishart distribution by $IW_p(\nu, \Phi)$, where the degrees of freedom and scale matrix are $\nu > p - 1$ and $\Phi \in \mathcal{C}_p$, respectively.

2.2 Gaussian DAG Models

We consider the model

X_1, \ldots, X_n \mid \Omega_n \,\, \overset{i.i.d.}{\sim} \,\, N_p(0, \Omega_n^{-1}),  (1)

where $\Omega_n = \Sigma_n^{-1}$ is a $p \times p$ precision matrix and $X_i = (X_{i1}, \ldots, X_{ip})^T \in \mathbb{R}^p$ for all $i = 1, \ldots, n$. The MCD guarantees that there exist a unique lower triangular matrix $A_n = (a_{jl})$ and a diagonal matrix $D_n = diag(d_j)$ such that

\Omega_n \,\,=\,\, (I_p - A_n)^T D_n^{-1} (I_p - A_n),  (2)

where $a_{jj} = 0$ and $d_j > 0$ for all $j = 1, \ldots, p$. Let $S_{A_n}$ be the support of the Cholesky factor $A_n$, and $S_j$ be the support of the $j$th row of $A_n$. Let $\mathbb{P}_{\Omega_n}$ and $\mathbb{E}_{\Omega_n}$ be the probability measure and expectation corresponding to the model (1), respectively.
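A minimal sketch of the decomposition (2): since $X_j$ regressed on its predecessors has coefficients $a_{j, 1:j-1} = \Sigma_{1:j-1,1:j-1}^{-1}\Sigma_{1:j-1,j}$ and residual variance $d_j = \Sigma_{jj} - \Sigma_{j,1:j-1} a_{j,1:j-1}$, the factor can be recovered column by column. The Python code below is our illustration with hypothetical inputs, not the paper's implementation.

```python
import numpy as np

# Illustration only (hypothetical inputs): recover the modified Cholesky
# decomposition (2), Omega = (I - A)^T D^{-1} (I - A), from a precision matrix
# by regressing each variable on its predecessors in the given ordering.
def modified_cholesky(Omega):
    p = Omega.shape[0]
    Sigma = np.linalg.inv(Omega)
    A = np.zeros((p, p))                      # strictly lower triangular factor
    d = np.zeros(p)
    d[0] = Sigma[0, 0]
    for j in range(1, p):
        a_j = np.linalg.solve(Sigma[:j, :j], Sigma[:j, j])   # regression coefficients
        A[j, :j] = a_j
        d[j] = Sigma[j, j] - Sigma[:j, j] @ a_j              # residual variance d_j
    return A, np.diag(d)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Omega = M @ M.T + 4 * np.eye(4)               # an arbitrary positive definite matrix
A, D = modified_cholesky(Omega)
T = np.eye(4) - A
assert np.allclose(Omega, T.T @ np.linalg.inv(D) @ T)        # verifies (2)
```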

The model (1) can be interpreted as a Gaussian DAG model depending on the sparsity pattern of $A_n$. For a set of vertices $V = \{1, \ldots, p\}$ and a set of directed edges $E$, a graph $\mathcal{D} = (V, E)$ is said to be a DAG if there are no directed cycles. It is also called a Bayesian network or belief network. In this paper, we assume that the variables have a known natural ordering in which no edges exist from larger vertices to smaller vertices. This has been commonly assumed in the literature including Shojaie and Michailidis (2010), Ben-David et al. (2015), Khare et al. (2016), Yu and Bien (2017) and Cao, Khare and Ghosh (2017). There are relatively few works on DAG models when the ordering of variables is unknown (Kalisch and Bühlmann, 2007; Rütimann and Bühlmann, 2009; van de Geer and Bühlmann, 2013). As discussed in van de Geer and Bühlmann (2013), when the ordering is unknown, a very different technique is needed relative to the known ordering case.

For $i \in V$, define the set of all $i$'s parents, denoted $pa_i(\mathcal{D})$, as the subset of vertices in $V$ that are smaller than $i$ and share an edge with $i$. Any multivariate Gaussian distribution that obeys the directed Markov property with respect to a DAG $\mathcal{D}$ is said to be a Gaussian DAG model over $\mathcal{D}$. To be specific, if $X = (X_1, \ldots, X_p)^T \sim N_p(0, \Omega^{-1})$ and $N_p(0, \Omega^{-1})$ belongs to a Gaussian DAG model over $\mathcal{D}$, then

X_j \,\, \perp \,\, \{X_{j'}\}_{j' < j, \, j' \notin pa_j(\mathcal{D})} \,\, \Big| \,\, (X)_{pa_j(\mathcal{D})},

for each $j = 1, \ldots, p$. Furthermore, if we adopt the MCD as in (2), with the known ordering of variables, $N_p(0, \Omega^{-1})$ belongs to a Gaussian DAG model over $\mathcal{D}$ if and only if $a_{jl} = 0$ whenever $l \notin pa_j(\mathcal{D})$ (Cao, Khare and Ghosh, 2017). In other words, the support of $A$ uniquely determines a DAG $\mathcal{D}$ under the known ordering assumption. The model $X = (X_1, \ldots, X_p)^T \sim N_p(0, \Omega^{-1})$ given $S_A$ is equivalent to a Gaussian DAG model, and it can be represented as a linear autoregressive model,

X_1 \mid d_1 \,\, \sim \,\, N(0, \, d_1),
X_j \mid a_{S_j}, d_j, S_j \,\, \overset{ind}{\sim} \,\, N\Big(\sum_{l \in S_j} X_l a_{jl}, \, d_j\Big), \quad j = 2, \ldots, p,  (3)

where $a_{S_j} = a_{j, S_j} = (a_{jj'})_{j' \in S_j}^T$. For more details on the expression (3), refer to Bickel and Levina (2008) and Ben-David et al. (2015). The autoregressive model interpretation enables us to adopt the priors introduced in the linear regression literature. Since $a_{S_j}$ corresponds to the nonzero elements among $a_j = (a_{j1}, \ldots, a_{j,j-1})^T$, one can use a prior designed for sparse regression coefficient vectors for $a_j$, which is the strategy we introduce in section 2.3.
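The representation (3) can be illustrated by simulating data sequentially; in the sketch below, the sparse factor $A$, its support and the values of $d_j$ are arbitrary choices for illustration only, and the empirical covariance is compared with $(I_p - A)^{-1} D (I_p - A)^{-T}$.

```python
import numpy as np

# Illustration only: generate data from the autoregressive representation (3)
# with an arbitrary sparse Cholesky factor A and diagonal D, and compare the
# sample covariance with (I - A)^{-1} D (I - A)^{-T}.
rng = np.random.default_rng(1)
p, n = 5, 200000
A = np.zeros((p, p))
A[2, 0], A[3, 1], A[4, 2] = 0.8, -0.6, 0.5     # a hypothetical support S_A
d = np.array([1.0, 0.5, 1.5, 0.7, 1.2])

X = np.zeros((n, p))
X[:, 0] = rng.normal(0.0, np.sqrt(d[0]), size=n)
for j in range(1, p):
    mean_j = X[:, :j] @ A[j, :j]               # sum_{l in S_j} X_l a_{jl}
    X[:, j] = rng.normal(mean_j, np.sqrt(d[j]))

T_inv = np.linalg.inv(np.eye(p) - A)
Sigma = T_inv @ np.diag(d) @ T_inv.T
print(np.abs(np.cov(X, rowvar=False) - Sigma).max())   # small for large n
```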

In this paper, we consider the high-dimensional setting where $p = p_n$ is a function of $n$ increasing to infinity as $n \to \infty$ and $p \geq n$. We assume that the data were generated from a true precision matrix $\Omega_{0n}$, where $\Sigma_{0n} = \Omega_{0n}^{-1}$ is the true covariance matrix. Denote the MCD (2) of the true precision matrix by $\Omega_{0n} = (I_p - A_{0n})^T D_{0n}^{-1} (I_p - A_{0n})$, where $A_{0n} = (a_{0,jl})$, $a_{0j} = (a_{0,j1}, \ldots, a_{0,j,j-1})^T$ and $D_{0n} = diag(d_{0j})$. For notational convenience, let $\mathbb{P}_0 = \mathbb{P}_{\Omega_{0n}}$ and $\mathbb{E}_0 = \mathbb{E}_{\Omega_{0n}}$.

We now define some notation related to the data set. Let ${\bf X}_n = (X_1, \ldots, X_n)^T \in \mathbb{R}^{n \times p}$ be the data of size $n$, and $\tilde{X}_j = (X_{1j}, \ldots, X_{nj})^T \in \mathbb{R}^n$ be the $j$th column of ${\bf X}_n$. For a given index set $S \subseteq \{1, \ldots, p\}$, let ${\bf X}_S = (\tilde{X}_j)_{j \in S} \in \mathbb{R}^{n \times |S|}$ be the data matrix consisting only of the $S$th columns of ${\bf X}_n$. Let $Z_{ij} = (X_{i1}, \ldots, X_{i,j-1})^T \in \mathbb{R}^{j-1}$ and $\tilde{Z}_j = (Z_{1j}, \ldots, Z_{nj})^T \in \mathbb{R}^{n \times (j-1)}$ for all $j = 2, \ldots, p$.

For a given positive integer $1 \leq s \leq p$, we define $\Psi_{\max}(s)^2 = \sup_{S: 0 < |S| \leq s} \lambda_{\max}({\bf X}_S^T {\bf X}_S)$ and $\Psi_{\min}(s)^2 = \inf_{S: 0 < |S| \leq s} \lambda_{\min}({\bf X}_S^T {\bf X}_S)$, where the supremum and infimum are taken over all index sets $S \subseteq \{1, \ldots, p\}$. We say that the restricted eigenvalue condition is met for some integer $s$ if $n^{-1} \Psi_{\min}(s)^2$ is bounded away from zero uniformly for all large $n$. This condition has been used in the high-dimensional regression literature to control the behavior of the design matrix. The autoregressive model representation (3) connects the eigenvalues of the precision matrix $\Omega_{0n}$ with those of the design matrix in (3) because the quantity ${\bf X}_{S_j}$ corresponds to the design matrix based on the representation. Thus, the bounded eigenvalue assumption (A1) in section 2.5 essentially corresponds to the restricted eigenvalue condition.
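For small $p$ and $s$, the quantities $\Psi_{\max}(s)^2$ and $\Psi_{\min}(s)^2$ can be computed by brute force over all index sets; the sketch below (with a simulated design, purely for illustration and not part of the paper) does exactly that.

```python
import numpy as np
from itertools import combinations

# Illustration only (feasible just for small p and s): brute-force computation
# of Psi_max(s)^2 and Psi_min(s)^2 over all index sets S with 0 < |S| <= s.
def restricted_eigenvalues(X, s):
    p = X.shape[1]
    psi_max_sq, psi_min_sq = -np.inf, np.inf
    for size in range(1, s + 1):
        for S in combinations(range(p), size):
            eig = np.linalg.eigvalsh(X[:, S].T @ X[:, S])
            psi_max_sq = max(psi_max_sq, eig[-1])
            psi_min_sq = min(psi_min_sq, eig[0])
    return psi_max_sq, psi_min_sq

rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.standard_normal((n, p))                 # simulated design, rows i.i.d. normal
psi_max_sq, psi_min_sq = restricted_eigenvalues(X, s=3)
print(psi_min_sq / n)   # bounded away from zero under the restricted eigenvalue condition
```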

2.3 Empirical Sparse Cholesky Prior

In this paper, we suggest the following prior distribution for our model:

a_{S_j} \mid d_j, S_j \,\, \overset{ind}{\sim} \,\, N_{|S_j|}\Big(\widehat{a}_{S_j}, \,\, \frac{d_j}{\gamma} \big({\bf X}_{S_j}^T {\bf X}_{S_j}\big)^{-1}\Big), \quad j = 2, \ldots, p,
\pi(d_j) \,\, \overset{i.i.d.}{\propto} \,\, d_j^{-\nu_0/2 - 1}, \quad j = 1, \ldots, p,
\pi_j(S_j = S_j') \,\, \propto \,\, \binom{j-1}{|S_j'|}^{-1} f_{nj}(|S_j'|), \quad j = 2, \ldots, p, \,\, S_j' \subseteq \{1, \ldots, j-1\},
f_{nj}(|S_j'|) \,\, \propto \,\, c_1^{-|S_j'|} p^{-c_2 |S_j'|} I(0 \leq |S_j'| \leq R_j \wedge (j-1)), \quad j = 2, \ldots, p,  (4)

for some positive constants $\nu_0, c_1, c_2, R_2, \ldots, R_p$ and $\gamma$, where $f_{nj}$ is a probability mass function on $\{0, 1, \ldots, R_j \wedge (j-1)\}$ and $\widehat{a}_{S_j} = ({\bf X}_{S_j}^T {\bf X}_{S_j})^{-1} {\bf X}_{S_j}^T \tilde{X}_j$. The proposed prior is empirical in the sense that it depends on the data, so we call the prior (4) the empirical sparse Cholesky (ESC) prior. To obtain the desired asymptotic properties, appropriate conditions for the hyperparameters in (4) will be introduced in section 3. Note that the prior for $d_j$ can be generalized to the proper prior $IG(\nu_0/2, \nu_0')$ for some constant $\nu_0' > 0$, and the results in section 3 also hold for this prior choice. However, for computational convenience, we describe and prove the main results with the improper prior $\pi(d_j) \propto d_j^{-\nu_0/2 - 1}$.

For the conditional prior of $a_j$ given $d_j$, we first introduce zero components through the prior $\pi_j$ and impose Zellner's $g$-prior (Zellner, 1986) on the nonzero components, $a_{S_j}$. The use of Zellner's $g$-prior simplifies the calculation of the marginal posterior for $S_j$. Martin, Mess and Walker (2017) suggested a similar prior in the high-dimensional linear regression model. Also note that the ESC prior has a connection to the DAG-Wishart prior (Ben-David et al., 2015; Cao, Khare and Ghosh, 2017). Theorem 7.3 in Ben-David et al. (2015) shows that the DAG-Wishart prior on $(A_n, D_n)$ given a DAG implies an inverse-gamma distribution on $d_j$ and a multivariate normal distribution on the nonzero elements of $a_j$ given $d_j$, where the $(a_j, d_j)$ are mutually independent for all $j = 1, \ldots, p$. Thus, the ESC prior (4) is quite close to the DAG-Wishart prior when the support of $A_n$ is given.

Cao, Khare and Ghosh (2017) used the DAG-Wishart prior to recover the sparse DAG and estimate the precision matrix in high-dimensional settings. Thus, their prior on $(A_n, D_n)$ is quite close to ours, and can be viewed as a set of priors for the autoregressive model (3) as discussed in the previous paragraph. For the support of DAGs, they imposed element-wise sparsity using independent Bernoulli distributions with the hyperparameter $q_n$, which has a nice interpretation as the individual edge probability. Based on the hierarchical DAG-Wishart prior, they obtained strong model selection consistency for the DAG and the posterior convergence rate for the precision matrix with respect to the spectral norm. However, they did not directly adopt the autoregressive model interpretation as in (3), which is different from our approach. By using the ESC prior, we can adopt state-of-the-art techniques on selection consistency for the regression coefficient (Martin, Mess and Walker, 2017) and achieve strong model selection consistency under much weaker conditions than those in Cao, Khare and Ghosh (2017). Furthermore, compared to the existing literature, we obtain faster posterior convergence rates for precision matrices and Cholesky factors under weaker conditions using the techniques introduced by Lee and Lee (2018, 2017) and Martin, Mess and Walker (2017). Indeed, the posterior convergence rates for Cholesky factors are nearly or exactly optimal depending on the size of the true sparsity relative to the dimension.

2.4 $\alpha$-posterior Distribution

We suggest adopting the fractional likelihood with power $\alpha \in (0, 1)$,

L_n(A_n, D_n)^\alpha \,\,=\,\, \prod_{i=1}^n \Big\{ N_p\big(X_i \mid 0, \, (I_p - A_n)^{-1} D_n ((I_p - A_n)^T)^{-1}\big) \Big\}^\alpha.  (5)

The use of the fractional likelihood has received increased attention in recent years (Martin and Walker, 2014; Syring and Martin, 2016; Miller and Dunson, 2018). In this paper, we use the fractional likelihood mainly because of its appealing theoretical properties under relatively weaker conditions compared to the actual posterior (Bhattacharya, Pati and Yang, 2018). In the proof of the main results of this paper, the use of the fractional likelihood enables us to effectively deal with the ratio of estimated residual variances $\widehat{d}_{S_j}$ (the proof of Theorem 3.1) and the ratio of likelihoods $L_{nj}(a_j, d_j)$ (the proof of Lemma 10.2), where $\widehat{d}_{S_j}$ and $L_{nj}(a_j, d_j)$ will be defined later.

Here we give a more detailed justification of using the fractional likelihood. The proposed conditional prior for $a_{S_j}$ in (4) tracks the data closely because it is centered at the least squares estimate. This may cause unexpected inconsistency (Walker and Hjort, 2001). The fractional likelihood approach guards against this by preventing the posterior from following the data too closely. To be more specific, the use of the fractional likelihood can be interpreted as an empirical Bayes procedure by considering

L_n(A_n, D_n)^\alpha \, \pi(A_n, D_n) \,\,=\,\, L_n(A_n, D_n) \, \frac{\pi(A_n, D_n)}{L_n(A_n, D_n)^{1 - \alpha}}.

Hence, the resulting posterior consists of an ordinary likelihood function and a data-dependent prior $\pi(A_n, D_n) / L_n(A_n, D_n)^{1 - \alpha}$. Note that the power $\alpha$ only appears in the prior. From this point of view, the prior is rescaled by a fractional likelihood, which has the effect of penalizing parameter values that track the data too closely, while the penalty effect is controlled by the hyperparameter $\alpha$.

The choice of $\alpha$ can be important from a practitioner's point of view even though the theoretical results in this paper hold for any choice of $0 < \alpha < 1$. In practice, we suggest using $\alpha$ close to 1 to mimic the usual likelihood in finite-sample scenarios if there is no suspicion of model failure, i.e. misspecification. As long as one chooses $\alpha$ sufficiently close to 1, e.g. $\alpha = 0.999$ or $\alpha = 0.9999$, our experience confirms that the $\alpha$-posterior is hardly distinguishable from the “usual” posterior even in a finite-sample scenario.

Remark 2.1.

Grünwald et al. (2017) suggested using $I$-log-SafeBayes (or $R$-log-SafeBayes) to determine $\alpha$, which gives the minimizer $\hat{\alpha}$ of the posterior-expected posterior-randomized loss of prediction (or its variant). The induced posterior is robust to model misspecification in some cases (Grünwald et al., 2017).

The prior (4) and fractional likelihood (5) lead to the following joint posterior distribution,

a_{S_j} \mid d_j, S_j, {\bf X}_n \,\, \overset{ind}{\sim} \,\, N_{|S_j|}\Big(\widehat{a}_{S_j}, \,\, \frac{d_j}{(\alpha + \gamma)} \big({\bf X}_{S_j}^T {\bf X}_{S_j}\big)^{-1}\Big), \quad j = 2, \ldots, p,
d_j \mid S_j, {\bf X}_n \,\, \overset{ind}{\sim} \,\, IG\Big(\frac{\alpha n + \nu_0}{2}, \, \frac{\alpha n}{2} \widehat{d}_{S_j}\Big), \quad j = 1, \ldots, p,
\pi_\alpha(S_j \mid {\bf X}_n) \,\, \propto \,\, \pi_j(S_j) \Big(1 + \frac{\alpha}{\gamma}\Big)^{-\frac{|S_j|}{2}} (\widehat{d}_{S_j})^{-\frac{\alpha n + \nu_0}{2}}, \quad j = 2, \ldots, p,  (6)

where $\widehat{d}_{S_j} = n^{-1} \tilde{X}_j^T (I_n - \tilde{P}_{S_j}) \tilde{X}_j$ and $\tilde{P}_{S_j} = {\bf X}_{S_j}({\bf X}_{S_j}^T {\bf X}_{S_j})^{-1} {\bf X}_{S_j}^T$. We refer to the posterior (6) as the $\alpha$-posterior and denote it by $\pi_\alpha(\cdot \mid {\bf X}_n)$ to clarify that we consider the $\alpha$-fractional likelihood. Throughout the paper, $\alpha \in (0, 1)$ is a fixed constant.
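For a single row $j$, the marginal posterior $\pi_\alpha(S_j \mid {\bf X}_n)$ in (6) can be evaluated up to a normalizing constant by combining $\pi_j(S_j)$ with $\widehat{d}_{S_j}$; the sketch below does this in Python, where the hyperparameter values and the omission of the $R_j$ truncation are our illustrative assumptions rather than the paper's prescriptions.

```python
import numpy as np
from math import lgamma, log

# A hedged sketch of the alpha-posterior (6) for a single row j (0-based column
# index here, so there are j candidate parents): for each candidate support S_j
# we evaluate the unnormalized log marginal posterior log pi_alpha(S_j | X_n).
# Hyperparameter values and the omitted R_j truncation are illustrative only.
def log_post_support(X, j, S, alpha=0.999, gamma=1.0, c1=1.0, c2=2.0, nu0=1.0):
    n, p = X.shape
    xj = X[:, j]
    S = list(S)
    if S:
        XS = X[:, S]
        proj = XS @ np.linalg.solve(XS.T @ XS, XS.T @ xj)   # tilde P_{S_j} tilde X_j
        d_hat = np.sum((xj - proj) ** 2) / n                # hat d_{S_j}
    else:
        d_hat = np.sum(xj ** 2) / n
    k = len(S)
    log_binom = lgamma(j + 1) - lgamma(k + 1) - lgamma(j - k + 1)   # log C(j, k)
    log_prior = -log_binom - k * log(c1) - c2 * k * log(p)          # log pi_j(S_j) + const
    return log_prior - 0.5 * k * np.log1p(alpha / gamma) \
           - 0.5 * (alpha * n + nu0) * log(d_hat)

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.standard_normal((n, p))
X[:, 4] = 0.9 * X[:, 1] + rng.normal(0.0, 0.5, size=n)   # column 4 truly depends on column 1
for S in [(), (1,), (2,), (1, 2)]:
    print(S, round(log_post_support(X, j=4, S=S), 2))
```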

2.5 Parameter Class

For given positive constants $0 < \alpha < 1$, $0 < \epsilon_0 < 1/2$, $C_{\rm bm}$ and a sequence of positive integers $s_0$, we introduce conditions (A1)-(A4) for the true precision matrix:

(A1) $\epsilon_0 \leq \lambda_{\min}(\Omega_{0n}) \leq \lambda_{\max}(\Omega_{0n}) \leq \epsilon_0^{-1}$.
(A2) $\max_{1 \leq j \leq p} \sum_{l=1}^p I(a_{0,jl} \neq 0) \leq s_0$.
(A3) $\min_{(j,l): a_{0,jl} \neq 0} |a_{0,jl}|^2 \,\, \geq \,\, \frac{16}{\alpha(1-\alpha)\,\epsilon_0^2 (1 - 2\epsilon_0)^2} \, C_{\rm bm} \frac{\log p}{n}$.
(A4) $\max_{1 \leq l \leq p} \sum_{j=2}^p I(a_{0,jl} \neq 0) \leq s_0$.

Condition (A1) ensures that the eigenvalues of $\Omega_{0n}$ are bounded by fixed constants, which has been commonly used for the estimation of high-dimensional precision matrices (Ren et al., 2015; Cai, Liu and Zhou, 2016; Banerjee and Ghosal, 2015) as well as high-dimensional DAGs (Yu and Bien, 2017; Khare et al., 2016). In this paper, condition (A1) is mainly used to obtain upper bounds on $d_{0j}$, $d_{0j}^{-1}$ and $\|A_{0n}\|$.

Condition (A2) restricts the number of nonzero elements in each row of $A_{0n}$ to be smaller than $s_0$. Note that $s_0$ may increase to infinity as $n$ gets larger. In our setting, it is equivalent to saying that the cardinality of $pa_j(\mathcal{D}_0)$ is less than $s_0$ for any $j = 2, \ldots, p$, where $\mathcal{D}_0$ is the DAG corresponding to $A_{0n}$.

Condition (A3) is the well-known beta-min condition, which determines the lower bound for the nonzero signals. The beta-min condition has been used for exact support recovery of high-dimensional linear regression coefficients (Wainwright, 2009a; Bühlmann and van de Geer, 2011; Castillo, Schmidt-Hieber and van der Vaart, 2015; Yang, Wainwright and Jordan, 2016; Martin, Mess and Walker, 2017) as well as high-dimensional DAGs (Khare et al., 2016; Yu and Bien, 2017; Cao, Khare and Ghosh, 2017).

Condition (A4) restricts the number of nonzero elements in each column of $A_{0n}$ to be smaller than $s_0$. In other words, the number of edges directed from any vertex is less than $s_0$. This assumption is required to control the posterior behavior of $\|A_n - A_{0n}\|_1$. Note that if we consider only the banded structure for the Cholesky factor as in Yu and Bien (2017), conditions (A2) and (A4) automatically hold for some $s_0$.

Now, we define a class of precision matrices

\mathcal{U}_p \,\,=\,\, \mathcal{U}_p(\epsilon_0, s_0, \alpha, C_{\rm bm}) \,\,=\,\, \Big\{ \Omega \in \mathcal{C}_p : \, \Omega \text{ satisfies (A1)-(A3)} \Big\}.

In section 3, we show that one can achieve strong model selection consistency for any $\Omega_{0n} \in \mathcal{U}_p$. Furthermore, we derive the posterior convergence rates for $A_{0n}$ and show that these are optimal or nearly optimal for the class $\mathcal{U}_p$ (or $\mathcal{U}_p$ without condition (A3)).

Remark 2.2.

Cao, Khare and Ghosh (2017) weakened the bounded eigenvalue condition (A1) by replacing the constant $\epsilon_0$ with a sequence $\epsilon_{0,n}$, which can go to zero at a certain rate. Our results also still hold under a similarly weakened bounded eigenvalue condition with $\epsilon_{0,n}$, but at the cost of strengthening the other conditions. For example, by using a sequence $\epsilon_{0,n}$ in place of a fixed $\epsilon_0$ in the proof of Theorem 3.1, one can see that $s_0 \log p \leq C n \epsilon_{0,n}^2$ for some $C > 0$ and the beta-min condition (A3) with $\epsilon_{0,n}$ in place of $\epsilon_0$ are required.

3 Main Results

We introduce Condition (P) on the hyperparameters in the ESC prior (4), which is necessary for the results in this section. Note that this condition is for the hyperparameters of the prior distribution, which does not affect the true parameter space.

Condition (P) Assume that $\nu_0 = o(n)$, $c_1 = O(1)$, $c_2 \geq 2$ and $\gamma = O(1)$. For given positive constants $0 < \alpha < 1$ and $0 < \epsilon_0 < 1/2$ used in conditions (A1) and (A3), assume that $R_j = \lfloor n (\log p)^{-1} \{(\log n)^{-1} \vee c_3\} \rfloor$ for any $j = 2, \ldots, p$ and some small constant $0 < c_3 < (\epsilon')^2 \epsilon_0^2 / \{128 (1 + 2\epsilon_0)^2\}$, where $\epsilon' = \{(1 - \alpha)/10\}^2$.

The condition $c_2 \geq 2$ is similar to the condition $\kappa \geq 2$ in Yang, Wainwright and Jordan (2016). Note that the constants $c_1$ and $c_2$ in the ESC prior control the row-wise sparsity of the Cholesky factor $A_n$: large values make the posterior prefer small values of $|S_j|$. Thus, the above condition means that we need a certain amount of penalty on $|S_j|$ to achieve the desirable asymptotic properties. The condition on $R_j$ means that $R_j$ is of order $n (\log p)^{-1}$ and smaller than $n (\log p)^{-1} (\epsilon')^2 \epsilon_0^2 / \{128 (1 + 2\epsilon_0)^2\}$, so it can be replaced by the condition $R_j = \lfloor n (\log p)^{-1} (\epsilon')^2 \epsilon_0^2 / \{128 (1 + 2\epsilon_0)^2\} \rfloor$. To ensure $s_0 \leq R_j$, we will assume that $s_0 \leq n (\log p)^{-1} c_3 / 2$ later. In general, assuming $s_0 = O(n (\log p)^{-1})$ or even $s_0 = o(n (\log p)^{-1})$ is essential to prove theoretical properties such as selection consistency and convergence rates. However, it can be unrealistically small for some finite sample size $n$. More importantly, the quantity $\epsilon_0$ is unknown in typical applications, so it is desirable to make the prior work for any choice of $\epsilon_0$. Condition (P) asserts that there is such a prior. We suggest choosing a small enough $c_3$ so that $R_j$ can be regarded as $R_j = \lfloor n (\log p \cdot \log n)^{-1} \rfloor$ for finite samples.

Remark 3.1.

Yang, Wainwright and Jordan (2016) suggested a prior for the linear model similar to the ESC prior, but for the mean vector of the prior $\pi(a_{S_j} \mid d_j, S_j)$, they used a zero mean vector while we use $\widehat{a}_{S_j}$. There are two consequences of the use of the data-dependent mean $\widehat{a}_{S_j}$. First, we do not need an upper bound condition for $\|{\bf X}_{S_{0j}} a_{0, S_{0j}}\|_2$ or $\|a_{0, S_{0j}}\|_2$, while Yang, Wainwright and Jordan (2016) assumed $\|{\bf X}_{S_{0j}} a_{0, S_{0j}}\|_2 \leq g \, d_{0j} \log p$, where $g = \gamma^{-1}$ in this paper. It is known that this type of condition is required if we use Zellner's $g$-prior with zero mean (Shang and Clayton, 2011). Second, to prove model selection consistency, Yang, Wainwright and Jordan (2016) assumed $g = p^{2c}$ for some $c \geq 1/2$, corresponding to $\gamma = p^{-2c}$ in our notation. This is the so-called information paradox of Zellner's $g$-priors (Liang et al., 2008). We do not require this condition and just assume $\gamma = O(1)$.

3.1 Strong Model Selection Consistency

When the recovery of the DAG is of interest, it is desirable to use a Bayesian procedure that guarantees strong model selection consistency. We show that the $\alpha$-posterior warrants this property under mild conditions. As mentioned earlier, the Gaussian DAG model has an interpretation as a sequence of autoregressive models (3), which enables us to adopt the state-of-the-art techniques for the selection consistency of regression coefficients in Martin, Mess and Walker (2017).

To use the results in Martin, Mess and Walker (2017), there are two main issues that need to be addressed. The first is the restricted eigenvalue condition for the design matrix. In our setting, the design matrices consist of columns of the data matrix ${\bf X}_n$, thus each row follows a multivariate normal distribution. We show that under the bounded eigenvalue condition (A1), the restricted eigenvalue condition for any integer $R = o(n)$ automatically holds on some large set $N^c$ having $\mathbb{P}_0$-probability tending to 1 (Lemma 9.1 in Supplementary Material). A similar result appears in Narisetty and He (2014). The second issue is more challenging than the first. Martin, Mess and Walker (2017) considered only the known (fixed) residual variance case, which corresponds to the known $d_{0j}$ case in our setting. The assumption of known residual variance results in a relatively straightforward proof of selection consistency. We extend their techniques to the unknown residual variance case by applying (non-central) chi-square concentration inequalities to the estimated residual variances $\widehat{d}_{S_j}$ for index sets $S_j$, which is motivated by Shin, Bhattacharya and Johnson (2015). This reveals that the ratio of the marginal posteriors $\pi_\alpha(S_j \mid {\bf X}_n) / \pi_\alpha(S_{0j} \mid {\bf X}_n)$ actually behaves like the ratio of the conditional posteriors given $d_{0j}$, $\pi_\alpha(S_j \mid d_{0j}, {\bf X}_n) / \pi_\alpha(S_{0j} \mid d_{0j}, {\bf X}_n)$, with $\mathbb{P}_0$-probability tending to 1, where $S_{0j}$ is the index set of the nonzero elements in the $j$th row of $A_{0n}$.

We also note here that unlike Lasso-type results (and their variants) with a random design matrix (Wainwright, 2009b), our theory does not require the irrepresentable condition on the true covariance matrix. For example, Yu and Bien (2017) and Khare et al. (2016) require the irrepresentable condition for the asymptotic properties of their estimators in DAG models. See section IV of Wainwright (2009b) for more details on the irrepresentable condition.

Theorem 3.1 (Strong model selection consistency).

For given positive constants $0 < \alpha < 1$, $0 < \epsilon_0 < 1/2$, $C_{\rm bm} > c_2 + 2$ and an integer $s_0$, assume that $\Omega_{0n}$ satisfies conditions (A1), (A2) and (A3), i.e. $\Omega_{0n} \in \mathcal{U}_p$. Consider model (1) and the ESC prior (4) with Condition (P). If $s_0 \log p \leq n \, c_3 / 2$,

\sup_{\Omega_{0n} \in \mathcal{U}_p} \mathbb{E}_0 \Big[ \pi_\alpha(S_{A_n} \neq S_{A_{0n}} \mid {\bf X}_n) \Big] \,\,=\,\, o(1).

The assumption $s_0 \log p = o(n)$ or $s_0 \log p \leq c n$ for some constant $c > 0$ is widely used in the high-dimensional sparse covariance or precision matrix estimation literature. In Theorem 3.1, we assume the less restrictive condition $s_0 \log p \leq n \, c_3 / 2$, which automatically guarantees $s_0 \leq R_j$ for all $j = 2, \ldots, p$. Note that the constant $c_3$ is defined in Condition (P).

It is worthwhile to compare our result to those of Cao, Khare and Ghosh (2017), Yu and Bien (2017) and Khare et al. (2016). Note that in these works it is also assumed that the ordering of variables is known. Cao, Khare and Ghosh (2017) showed strong model selection consistency using the hierarchical DAG-Wishart prior. They assumed variants of conditions (A1), (A2) and (A3). First, they relaxed condition (A1) by letting $\epsilon_{0,n} \to 0$ such that $(\log p / n)^{1/2 - 1/(2+k)} = o(\epsilon_{0,n}^4)$ for some $k > 0$, instead of a fixed $\epsilon_0 > 0$. Second, they assumed the same condition (A2) but further assumed $s_0^{2+k} \sqrt{\log p / n} = o(1)$ and $(\log p / n)^{k/(4k+8)} \log n = o(1)$ and considered only the DAGs with the total number of edges at most $8^{-1} s_0 (n / \log p)^{(1+k)/(2+k)}$, which can be restrictive. Note that, when $p \geq n$, it does not include the banded Cholesky factor having $s_0$ nonzero elements in each row. Third, they assumed a somewhat stronger beta-min condition than (A3), which requires $\min_{j,l: a_{0,jl} \neq 0} |a_{0,jl}|^2 \geq M_n^2 s_0^2 \epsilon_{0,n}^{-1} (\log p / n)^{1/(2+k)}$ for some $k > 0$ and some sequence $M_n \to \infty$. Thus, their assumptions on the tuple $(n, p, s_0)$ as well as the parameter class are much more restrictive than ours, except for the bounded eigenvalue condition. Furthermore, the choice of hyperparameter in the hierarchical DAG-Wishart prior depends on the unknown sparsity parameter $s_0$, thus it is not adaptive to the unknown parameter. More specifically, the hyperparameter $q_n$ in the hierarchical DAG-Wishart prior should be set at $q_n = s_0 (\log p / n)^{1/(2+k)}$ for some $k > 0$ to achieve strong model selection consistency.

Yu and Bien (2017) suggested a penalized maximum likelihood estimator for the Cholesky factor of the precision matrix and proved exact signed support recovery under the condition $\rho^{-2} \|D_{0n}\| \epsilon_0^{-1} (12 \pi^2 s_0 + 32) \log p < n$. They considered the class of precision matrices satisfying condition (A1) and having a banded structure with row-specific bandwidths $s_{0j} = |S_{0j}|$ such that $a_{0,jl} = 0$ for all $1 \leq l < j - s_{0j}$ and $2 \leq j \leq p$. Thus, by taking $s_0 = \max_j s_{0j}$, their class satisfies conditions (A2) and (A4). They also assumed the beta-min condition, $\min_{j,l: a_{0,jl} \neq 0} |d_{0j}^{-1/2} a_{0,jl}| \geq 8 \rho^{-1} \sqrt{2 \|D_{0n}\| \log p / n} \big(4 \max_j \|\Sigma_{0n, S_{0j}}^{-1}\|_\infty + 5 \epsilon_0^{-1}\big)$. In general, it holds that $\|\Sigma_{0n, S_{0j}}^{-1}\|_\infty = O(s_{0j}^{1/2})$ without further assumptions, thus the above condition implies that the minimum nonzero $|d_{0j}^{-1/2} a_{0,jl}|$ is bounded below by a constant multiple of $\sqrt{s_0 \log p / n}$, which is stronger than condition (A3). Furthermore, they assumed the irrepresentable condition

\max_{2 \leq j \leq p} \max_{1 \leq l \leq j, \, l \in S_{0j}^c} \|(\Sigma_{0n})_{(l, S_{0j})} (\Sigma_{0n, S_{0j}})^{-1}\|_1 \,\, \leq \,\, \frac{6(1 - \rho)}{\pi^2}

for some constant $\rho \in (0, 1]$. Therefore, they only considered the banded Cholesky factor and used a somewhat stronger beta-min condition as well as the irrepresentable condition. However, the comparison with our result (Theorem 3.1) is not straightforward because their exact signed support recovery property is stronger than the selection consistency proved in Theorem 3.1.

Khare et al. (2016) proved the signed support recovery property of the convex sparse Cholesky selection (CSCS) method when the data vectors $X_1, \ldots, X_n$ are a random sample from a sub-Gaussian distribution. They assumed condition (A1) as well as (stronger) variants of conditions (A2) and (A3): they assumed $\sum_{j=2}^p s_{0j} = o(n / \log n)$ (which is stronger than $s_0 \log p \leq n c_3 / 2$) and $\min_{j,l: a_{0,jl} \neq 0} |a_{0,jl}|^2 \geq M_n s_0^2 \log n / n$ for some $M_n \to \infty$. Furthermore, they considered only the moderately high-dimensional setting, i.e. $p = O(n^c)$ for some constant $c > 0$. They also required an irrepresentable condition similar to that in Yu and Bien (2017).

3.2 Posterior Convergence Rates for Cholesky Factors

In this subsection, we derive the posterior convergence rates for the Cholesky factors in two different scenarios depending on the presence of the beta-min condition (A3). First, under the beta-min condition, we show the posterior convergence rates and minimax lower bounds with respect to the matrix $\ell_\infty$ norm and Frobenius norm. The obtained posterior convergence rates are nearly minimax, and become exactly minimax if $\log p = O(s_0)$ and $\log j = O(s_{0j})$ for all $j = 2, \ldots, p$. We also derive the posterior convergence rate and minimax lower bound with respect to the matrix $\ell_\infty$ norm without the beta-min condition. The obtained posterior convergence rate turns out to be nearly minimax, and it is exactly minimax if $s_0 \leq p^\beta$ for some $0 < \beta < 1$. Note that regardless of the relation between $s_0$ and $p$, at least one of the scenarios achieves the minimax rate. In particular, we attain the minimax rate for both scenarios if $C \log p \leq s_0 \leq p^\beta$ for some constant $C > 0$. Figure 1 describes the range of $s_0$ in which the minimax rate can be obtained.

Figure 1: For a given $0 < \beta < 1$, the figure describes the range of $s_0$ in which the minimax rate for the Cholesky factor can be obtained: the rate is minimax without condition (A3) when $s_0$ is below the order of $\log p$, minimax in both cases when $s_0$ is between the orders of $\log p$ and $p^\beta$, and minimax with condition (A3) when $s_0$ is above the order of $p^\beta$. (A3) denotes the beta-min condition.

3.2.1 Posterior Convergence Rates for Cholesky Factors under Beta-min Condition

Define $\widehat{A}_n = (\widehat{a}_{jl})$, where $(\widehat{a}_{jl})_{l \in S_{0j}} = \widehat{a}_{S_{0j}}$ and $(\widehat{a}_{jl})_{l \in S_{0j}^c} = 0$. Thus, $\widehat{A}_n$ is the empirical estimate of $A_{0n}$ with the true support $S_{A_{0n}}$. To obtain the posterior convergence rate for the Cholesky factor, we use a divide and conquer strategy similar to that of Lee and Lee (2018, 2017). Specifically, we decompose the posterior contraction probability into two parts as follows:

\pi_\alpha\Big(\|A_n - A_{0n}\| \geq 2 \epsilon_n' \mid {\bf X}_n\Big) \,\, \leq \,\, \pi_\alpha\Big(\|A_n - \widehat{A}_n\| \geq \epsilon_n' \mid {\bf X}_n\Big) + \pi_\alpha\Big(\|\widehat{A}_n - A_{0n}\| \geq \epsilon_n' \mid {\bf X}_n\Big)  (7)

for some positive sequence $\epsilon_n'$. As in subsection 3.1, we concentrate on a large set $N^c$ allowing us to handle the posterior contraction probability easily. The first part of the right hand side of (7) describes how the posterior distribution concentrates around the empirical estimate $\widehat{A}_n$. We use the selection consistency result in Theorem 3.1, and we focus only on the set $S_{A_n} = S_{A_{0n}}$. This enables us to deal with the posterior distribution for $A_n$ easily, but at the cost of the beta-min condition (A3), which is usually not essential for convergence rate results. Through the posterior distribution (6) given $S_{A_n} = S_{A_{0n}}$, we can obtain the contraction probability for $\|A_n - \widehat{A}_n\|$ using the concentration inequality for chi-square random variables. Taking the expectation of the second part of the right hand side of (7) gives the contraction probability of $\widehat{A}_n$, $\mathbb{P}_0\big[\|\widehat{A}_n - A_{0n}\| \geq \epsilon_n'\big]$.

Theorem 3.2 (Posterior convergence rates for $A_{0n}$ with beta-min condition).

For given positive constants $0 < \alpha < 1$, $0 < \epsilon_0 < 1/2$, $C_{\rm bm} > c_2 + 2$ and an integer $s_0$, assume that $\Omega_{0n}$ satisfies conditions (A1), (A2) and (A3), i.e. $\Omega_{0n} \in \mathcal{U}_p$. Consider model (1) and the ESC prior (4) with Condition (P). If $s_0 \log p = o(n)$,

\sup_{\Omega_{0n} \in \mathcal{U}_p} \mathbb{E}_0 \Big[\pi_\alpha\big(\|A_n - A_{0n}\|_\infty \geq K_{\rm chol} \sqrt{s_0} \Big(\frac{s_0 + \log p}{n}\Big)^{1/2} \,\, \big| \,\, {\bf X}_n\big)\Big] \,\,=\,\, o(1),
\sup_{\Omega_{0n} \in \mathcal{U}_p} \mathbb{E}_0 \Big[\pi_\alpha\big(\|A_n - A_{0n}\|_F^2 \geq K_{\rm chol} \frac{\sum_{j=2}^p (s_{0j} + \log j)}{n} \,\, \big| \,\, {\bf X}_n\big)\Big] \,\,=\,\, o(1)

for some constant $K_{\rm chol} > 0$.

Khare et al. (2016) obtained the convergence rate $\sum_{j=2}^p s_{0j} \lambda_n$ for estimating the Cholesky factor under the spectral norm in a moderately high-dimensional setting, i.e. $p = O(n^c)$ for some constant $c > 0$, where $\lambda_n$ is the tuning parameter in the CSCS method. They also assumed condition (A1) as well as the (stronger) variants of conditions (A2) and (A3) described in section 3.1. Because they assumed $\sqrt{\sum_{j=2}^p s_{0j} \log p / n} = o(\lambda_n)$, $\sum_{j=2}^p s_{0j} \lambda_n$ is strictly slower than $(\sum_{j=2}^p s_{0j})^{3/2} \sqrt{\log p / n}$ in terms of the rate, which implies that their convergence rate is slower than the posterior convergence rate obtained in this paper.

In fact, it turns out that the posterior convergence rates in Theorem 3.2 are nearly optimal. Theorem 3.3 gives the frequentist minimax lower bounds for the class $\mathcal{U}_p$, which are of independent interest. Note that the rates of Theorem 3.2 are exactly optimal if $\log p = O(s_0)$ and $\log j = O(s_{0j})$ for all $j = 2, \ldots, p$, matching the minimax rates of Theorem 3.3. The key idea for proving the minimax lower bounds is to break down the model (1) into a set of linear regression models.

Theorem 3.3 (Minimax lower bounds for $A_{0n}$ with beta-min condition).

For given positive constants $0 < \alpha < 1$, $\epsilon_0$, $C_{\rm bm}$ and an integer $s_0$, assume that $\Omega_{0n}$ satisfies conditions (A1), (A2) and (A3), i.e. $\Omega_{0n} \in \mathcal{U}_p$. Consider model (1). Then,

\inf_{\widehat{A}_n} \sup_{\Omega_{0n} \in \mathcal{U}_p} \mathbb{E}_0 \|\widehat{A}_n - A_{0n}\|_\infty \,\, \geq \,\, c \cdot \frac{s_0}{\sqrt{n}},
\inf_{\widehat{A}_n} \sup_{\Omega_{0n} \in \mathcal{U}_p} \mathbb{E}_0 \|\widehat{A}_n - A_{0n}\|_F^2 \,\, \geq \,\, c \, \frac{\sum_{j=2}^p s_{0j}}{n}

for some constant $c > 0$, where the infimum is taken over all possible estimators $\widehat{A}_n$.

3.2.2 Posterior Convergence Rates for Cholesky Factors without Beta-min Condition

For a given positive constant $\epsilon_0$ and a sequence of positive integers $s_0$, we define a class of precision matrices,

\mathcal{U}_p^0 \,\,=\,\, \mathcal{U}_p^0(\epsilon_0, s_0) \,\,=\,\, \Big\{ \Omega \in \mathcal{C}_p : \, \Omega \text{ satisfies (A1) and (A2)} \Big\}.

Note that in the definition of $\mathcal{U}_p^0$, we do not require the beta-min condition. Theorem 3.4 gives the posterior convergence rate for the class $\mathcal{U}_p^0$. For Theorem 3.4, we use the ESC prior (4) but let $d_j \sim IG(\nu_0/2, \nu_0')$ for some constant $\nu_0' > 0$ instead of $\pi(d_j) \propto d_j^{-\nu_0/2 - 1}$. We call this the modified ESC (MESC) prior. As mentioned before, Theorems 3.1, 3.2 and 3.6 in section 3 also hold for the MESC prior, but we state Theorems 3.1, 3.2 and 3.6 with the ESC prior for computational convenience.

We consider the denominator and numerator of the posterior probability $\pi_\alpha(\|A_n - A_{0n}\|_\infty \geq \epsilon_n')$ separately, for some positive sequence $\epsilon_n'$. For any $j = 2, \ldots, p$, let $R_{nj}(a_j, d_j) = L_{nj}(a_j, d_j) / L_{nj}(a_{0j}, d_{0j})$ be the likelihood ratio, where

L_{nj}(a_j, d_j) \,\,=\,\, (2 \pi d_j)^{-n/2} \exp\big\{ - \|\tilde{X}_j - \tilde{Z}_j a_j\|_2^2 / (2 d_j) \big\}.

Dealing with the likelihood ratio $R_{nj}(a_j, d_j)$ is one of the main tasks in proving Theorem 3.4. Lemma 10.1, Lemma 10.2 and Lemma 10.3 in Supplementary Material describe how we deal with the likelihood ratio $R_{nj}(a_j, d_j)$ in the denominator and numerator.

Theorem 3.4 (Posterior convergence rate for $A_{0n}$ without beta-min condition).

For given positive constants $0 < \alpha < 1$, $0 < \epsilon_0 < 1/2$ and an integer $s_0$, assume that $\Omega_{0n}$ satisfies conditions (A1) and (A2), i.e. $\Omega_{0n} \in \mathcal{U}_p^0$. Consider model (1) with the MESC prior with Condition (P). If $s_0 \log p = o(n)$ and $\nu_0 = O(1)$, then

\sup_{\Omega_{0n} \in \mathcal{U}_p^0} \mathbb{E}_0 \left[\pi_\alpha\big(\|A_n - A_{0n}\|_\infty \geq K_{\rm chol}' \, s_0 \Big(\frac{\log p}{n}\Big)^{1/2} \,\, \big| \,\, {\bf X}_n\big)\right] \,\,=\,\, o(1)

for some constant $K_{\rm chol}' > 0$.

Yu and Bien (2017) obtained the convergence rate $\max_j \|\Sigma_{0n, S_{0j}}^{-1}\|_\infty \cdot \|A_{0n}\|_\infty s_0 \sqrt{\log p / n} + \max_j \|\Sigma_{0n, S_{0j}}^{-1}\|_\infty^2 s_0^2 \log p / n$ for the Cholesky factor with respect to the matrix $\ell_\infty$ norm. As stated before, they assumed condition (A1), the banded Cholesky factor structure (which corresponds to conditions (A2) and (A4) in this paper) and the irrepresentable condition. Note that their convergence rate coincides with ours only if $\|A_{0n}\|_\infty$ and $\max_j \|\Sigma_{0n, S_{0j}}^{-1}\|_\infty$ are bounded and $s_0^2 \log p = O(n)$.

To the best of our knowledge, this is the first result on the posterior convergence rate for the high-dimensional sparse Cholesky factor without the beta-min condition. Interestingly, the obtained posterior convergence rate is the same as the minimax convergence rate for the $s_0$-sparse coefficient vector in regression models when $s_0 \leq p^\beta$ for some $0 < \beta < 1$. Note that the condition $s_0 \leq p^\beta$ is not restrictive in the high-dimensional setting, because it is met if $n \leq p^\beta$. Theorem 3.5 confirms that the above posterior convergence rate is nearly minimax for any $\Omega_{0n} \in \mathcal{U}_p^0$. Similar to Theorem 3.3, the key idea for proving Theorem 3.5 is to break down the model into a set of linear regression models.

Theorem 3.5 (Minimax lower bound for A0nA_{0n} without beta-min condition).

For a given constant ϵ0\epsilon_{0} and an integer s0s_{0}, assume that Ω0n\Omega_{0n} satisfies conditions (A1) and (A2), i.e. Ω0n𝒰p0\Omega_{0n}\in\mathcal{U}_{p}^{0}. Consider model (12). Then,

infA^nsupΩ0n𝒰p0𝔼0A^nA0n\displaystyle\inf_{\widehat{A}_{n}}\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{0}}\mathbb{E}_{0}\|\widehat{A}_{n}-A_{0n}\|_{\infty} \displaystyle\geq cs0(log(p/s0)n)1/2\displaystyle c\cdot s_{0}\left(\frac{\log(p/s_{0})}{n}\right)^{1/2}

for some constant c>0c>0.

Remark 3.2.

If we assume s0pβs_{0}\leq p^{\beta} for some 0<β<10<\beta<1, then log(p/s0)\log(p/s_{0}) has the same rate as logp\log p, and the minimax lower bound in Theorem 3.5 becomes of order s0logp/ns_{0}\sqrt{\log p/n}. As noted above, this assumption is reasonable in the high-dimensional setting.

3.3 Posterior Convergence Rates for Precision Matrices

In this subsection, we derive the posterior convergence rates for the precision matrices with respect to various matrix norms. Define Ω^n=(IpA^n)TD^n1(IpA^n)\widehat{\Omega}_{n}=(I_{p}-\widehat{A}_{n})^{T}\widehat{D}_{n}^{-1}(I_{p}-\widehat{A}_{n}), where A^n\widehat{A}_{n} and D^n=diag(d^S0j)\widehat{D}_{n}=diag(\widehat{d}_{S_{0j}}) are the empirical estimates of A0nA_{0n} and D0nD_{0n} with the true support SA0nS_{A_{0n}}. Similar to the previous subsection, we use the divide and conquer strategy to deal with the posterior probability. For the recovery of Ω0n=(IpA0n)TD0n1(IpA0n)\Omega_{0n}=(I_{p}-A_{0n})^{T}D_{0n}^{-1}(I_{p}-A_{0n}), we further assume condition (A4). For given positive constants ϵ0,Cbm\epsilon_{0},C_{\rm bm} and a sequence of positive integers s0s_{0}, define the parameter class as follows:

\mathcal{U}_{p}^{*}\,\,=\,\,\mathcal{U}_{p}^{*}(\epsilon_{0},s_{0},\alpha,C_{\rm bm})\,\,=\,\,\big\{\Omega\in\mathcal{C}_{p}:\,\,\Omega\text{ satisfies (A1)-(A4)}\big\}.

Theorem 3.6 shows the posterior convergence rates for the precision matrix with respect to the spectral norm and matrix \ell_{\infty} norm.

Theorem 3.6 (Posterior convergence rates for Ω0n\Omega_{0n}).

For given positive constants 0<α<10<\alpha<1, 0<ϵ0<1/20<\epsilon_{0}<1/2, Cbm>c2+2C_{\rm bm}>c_{2}+2 and an integer s0s_{0}, assume that Ω0n\Omega_{0n} satisfies conditions (A1)-(A4), i.e. Ω0n𝒰p\Omega_{0n}\in\mathcal{U}_{p}^{*}. Consider model (12) and the ESC prior (4) with Condition (P) and ν02=O(nlogp)\nu_{0}^{2}=O(n\log p). If s03/2(s0+logp)=o(n)s_{0}^{3/2}(s_{0}+\log p)=o(n), then

supΩ0n𝒰p𝔼0[πα(ΩnΩ0nKconvs03/4(s0+logpn)1/2|𝐗n)]=o(1),\displaystyle\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{*}}\mathbb{E}_{0}\Big{[}\pi_{\alpha}\big{(}\|\Omega_{n}-\Omega_{0n}\|\geq K_{\rm conv}s_{0}^{3/4}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}\,\,\big{|}\,\,{\bf X}_{n}\big{)}\Big{]}=o(1),

and, if s0(s0+logp)=o(n)s_{0}(s_{0}+\log p)=o(n), then

supΩ0n𝒰p𝔼0[πα(ΩnΩ0nKconvIpA0ns0(s0+logpn)1/2|𝐗n)]=o(1)\displaystyle\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{*}}\mathbb{E}_{0}\Big{[}\pi_{\alpha}\big{(}\|\Omega_{n}-\Omega_{0n}\|_{\infty}\geq K_{\rm conv}\cdot\|I_{p}-A_{0n}\|_{\infty}s_{0}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}\,\,\big{|}\,\,{\bf X}_{n}\big{)}\Big{]}=o(1)

for some constant Kconv>0K_{\rm conv}>0.

It is worthwhile to compare our result with other existing results. Cao, Khare and Ghosh (2017) obtained the posterior convergence rate s02ϵ0,n2logp/ns_{0}^{2}\,\epsilon_{0,n}^{-2}\sqrt{\log p/n} for the precision matrix with respect to the spectral norm. As discussed in Section 3.1, they assumed variants of conditions (A1), (A2) and (A3). They further assumed condition (A4). Although they did not state explicitly that condition (A4) was used, this condition is necessary to apply Lemma 3.1 of Xiang, Khare and Ghosh (2015) in their proof. If we assume the bounded eigenvalue condition (A1), their convergence rate becomes s02logp/ns_{0}^{2}\sqrt{\log p/n}, which is slower than the convergence rate in Theorem 3.6. Note that they assumed s02+klogp/n=o(1)s_{0}^{2+k}\sqrt{\log p/n}=o(1) for some constant k>0k>0, which is stronger than our assumption s03/2(s0+logp)=o(n)s_{0}^{3/2}(s_{0}+\log p)=o(n). Thus, we obtain faster posterior convergence rates under more general conditions on the tuple (n,p,s0)(n,p,s_{0}) and the parameter class, except for the bounded eigenvalue condition.

Yu and Bien (2017) considered the same parameter class that they used to prove strong model selection consistency, but with the beta-min condition dropped. They derived the convergence rate

maxj(Σ0n,S0j)1D0n1/2(IpA0n)s0(logpn)1/2\max_{j}\|(\Sigma_{0n,S_{0j}})^{-1}\|_{\infty}\|D_{0n}^{-1/2}(I_{p}-A_{0n})\|_{\infty}\,s_{0}\,\Big{(}\frac{\log p}{n}\Big{)}^{1/2}

for the precision matrix with respect to the matrix \ell_{\infty} norm. Note that this convergence rate depends on the rate of maxj(Σ0n,S0j)1D0n1/2(IpA0n)\max_{j}\|(\Sigma_{0n,S_{0j}})^{-1}\|_{\infty}\|D_{0n}^{-1/2}(I_{p}-A_{0n})\|_{\infty}. In general, it holds that maxj(Σ0n,S0j)1=O(s0j1/2)\max_{j}\|(\Sigma_{0n,S_{0j}})^{-1}\|_{\infty}=O(s_{0j}^{1/2}). Thus, their convergence rate is slower than the posterior convergence rate in Theorem 3.6, without a further assumption on Σ0n\Sigma_{0n} guaranteeing maxj(Σ0n,S0j)1=O((s0/logp)+1).\max_{j}\|(\Sigma_{0n,S_{0j}})^{-1}\|_{\infty}=O(\sqrt{(s_{0}/\log p)+1}).

4 Numerical Results

The use of the ESC prior not only guarantees optimal or near-optimal asymptotic properties but also allows us to conduct posterior inference easily. In this section, we carry out simulation studies to illustrate the model selection performance of our method. For comparison, we chose state-of-the-art methods for high-dimensional sparse DAG models and measured the performance of each method. The simulation study confirms that our ESC prior outperforms the other existing methods.

4.1 Metropolis-Hastings Algorithm

Recall that, by (6), the marginal posterior distribution for Sj{1,,j1}S_{j}\subseteq\{1,\ldots,j-1\} can be derived analytically as

πα(Sj𝐗n)\displaystyle\pi_{\alpha}(S_{j}\mid{\bf X}_{n}) \displaystyle\propto πj(Sj)(1+αγ)|Sj|2(d^Sj)αn+ν02\displaystyle\pi_{j}(S_{j})\left(1+\frac{\alpha}{\gamma}\right)^{-\frac{|S_{j}|}{2}}(\widehat{d}_{S_{j}})^{-\frac{\alpha n+\nu_{0}}{2}} (8)

for all j=2,,pj=2,\ldots,p, up to a normalizing constant. Thus, we can run the Rao-Blackwellized Metropolis-Hastings algorithm for each j=2,,pj=2,\ldots,p in parallel. Here, we briefly summarize the algorithm used for the inference, where LL is the number of posterior samples:

  1. Run the following steps for j=2,,pj=2,\ldots,p.

     (a) Set the initial value Sj(1)S_{j}^{(1)}.

     (b) For each l=2,,Ll=2,\ldots,L,

         i. sample Sjnewq(Sj(l1))S_{j}^{new}\sim q(\cdot\mid S_{j}^{(l-1)});

         ii. compute the acceptance probability

             p_{acc}=\min\left\{1,\,\,\frac{\pi_{\alpha}(S_{j}^{new}\mid{\bf X}_{n})q(S_{j}^{(l-1)}\mid S_{j}^{new})}{\pi_{\alpha}(S_{j}^{(l-1)}\mid{\bf X}_{n})q(S_{j}^{new}\mid S_{j}^{(l-1)})}\right\},

             and set Sj(l)=SjnewS_{j}^{(l)}=S_{j}^{new} with probability paccp_{acc}; otherwise, set Sj(l)=Sj(l1)S_{j}^{(l)}=S_{j}^{(l-1)}.

We chose the proposal kernel q(SS)q(S^{\prime}\mid S) that forms a new set SS^{\prime} either by changing a randomly selected nonzero component to 0 (with probability 0.50.5) or by changing a randomly selected zero component to 1 (with probability 0.50.5).
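For concreteness, a minimal Python sketch of this per-column Rao-Blackwellized Metropolis-Hastings update is given below. The function log_prior standing in for the support prior, the crude handling of boundary cases (empty or full support), and all names are our own illustrative choices rather than the authors' implementation; the marginal posterior follows (8), with the residual variance computed by regressing the j-th column on the candidate parent columns indexed by the current support.

```python
import numpy as np

def log_marginal_post(S, x_j, Z_j, alpha, gamma, nu0, log_prior):
    """Log marginal posterior (8) of a support S_j, up to an additive constant.

    log_prior(S) is a placeholder for log pi_j(S_j); the exact support prior is
    assumed to be supplied by the user.
    """
    n = x_j.shape[0]
    if len(S) == 0:
        resid = x_j
    else:
        Z_S = Z_j[:, S]
        coef, *_ = np.linalg.lstsq(Z_S, x_j, rcond=None)
        resid = x_j - Z_S @ coef
    d_hat = resid @ resid / n                      # estimated residual variance hat{d}_{S_j}
    return (log_prior(S)
            - 0.5 * len(S) * np.log1p(alpha / gamma)
            - 0.5 * (alpha * n + nu0) * np.log(d_hat))

def mh_column(x_j, Z_j, L, alpha, gamma, nu0, log_prior, rng):
    """Metropolis-Hastings chain for a single column j; returns the visited supports."""
    p_j = Z_j.shape[1]                             # number of candidate parents (j - 1)
    S = []                                         # initial value S_j^{(1)}
    cur = log_marginal_post(S, x_j, Z_j, alpha, gamma, nu0, log_prior)
    samples = [list(S)]
    for _ in range(L - 1):
        # add/delete proposal: flip one zero entry to nonzero or vice versa,
        # each move chosen with probability 0.5 (boundary cases handled crudely)
        if len(S) > 0 and (len(S) == p_j or rng.random() < 0.5):
            drop = S[rng.integers(len(S))]
            S_new = [k for k in S if k != drop]
            log_q_fwd = np.log(0.5 / len(S))
            log_q_bwd = np.log(0.5 / (p_j - len(S_new)))
        else:
            zeros = [k for k in range(p_j) if k not in S]
            S_new = sorted(S + [zeros[rng.integers(len(zeros))]])
            log_q_fwd = np.log(0.5 / len(zeros))
            log_q_bwd = np.log(0.5 / len(S_new))
        new = log_marginal_post(S_new, x_j, Z_j, alpha, gamma, nu0, log_prior)
        if np.log(rng.random()) < new - cur + log_q_bwd - log_q_fwd:
            S, cur = S_new, new
        samples.append(list(S))
    return samples
```

Since the columns are updated independently, the chains for j=2,…,p can be run in parallel, as noted above.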

The marginal posterior for SjS_{j} is controlled by the prior πj(Sj)\pi_{j}(S_{j}), the penalty term (1+α/γ)|Sj|/2(1+\alpha/\gamma)^{-|S_{j}|/2} and the estimated residual variance d^Sj\widehat{d}_{S_{j}}. The data favor models with small estimated residual variance, while the prior and the penalty term place more weight on simpler models. The marginal posterior of SjS_{j} therefore favors models that balance fit to the data against model complexity.

To use the Metropolis-Hastings algorithm, we need to choose the tuning parameters. Apart from their impact on the theoretical results, the tuning parameters also influence the practical performance. As Martin, Mess and Walker (2017) suggested, we set α=0.999\alpha=0.999 to mimic the Bayesian model with the ordinary likelihood. In the simulation study, as long as 1α1-\alpha was close to zero, the performance was insensitive to the particular choice of α\alpha. The hyperparameters were chosen as γ=0.1\gamma=0.1, ν0=0\nu_{0}=0, c1=0.0005c_{1}=0.0005 and c2=2c_{2}=2 to satisfy Condition (P).

4.2 Simulation Setting

For the simulation study, we considered the sparse Cholesky settings similar to those used in Khare et al. (2016). We randomly chose 3% or 4% of the lower triangular entries of the Cholesky factor A0nA_{0n} and sampled their values from a uniform distribution on [0.7,0.3][0.3,0.7][-0.7,-0.3]\cup[0.3,0.7]. The remaining entries were set to zero. The entries of the diagonal matrix D0nD_{0n} were sampled from a uniform distribution on [2,5][2,5]. Given the precision matrix Ω0n=(IpA0n)TD0n1(IpA0n)\Omega_{0n}=(I_{p}-A_{0n})^{T}D_{0n}^{-1}(I_{p}-A_{0n}), the data sets were generated from the multivariate normal distribution Np(0,Ω0n1)N_{p}(0,\Omega_{0n}^{-1}) with (n=100,p=300)(n=100,p=300) and (n=200,p=500)(n=200,p=500).
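As an illustration of this setting, a short Python sketch that generates one such data set is given below. It assumes the precision matrix is formed exactly as Ω0n=(Ip−A0n)TD0n−1(Ip−A0n) and that inverting it directly is acceptable at these dimensions; the function and variable names are ours.

```python
import numpy as np

def simulate_dag_data(n, p, sparsity, rng):
    """Generate one data set under the sparse Cholesky simulation setting."""
    A = np.zeros((p, p))
    rows, cols = np.tril_indices(p, k=-1)              # strictly lower-triangular positions
    m = rows.size
    chosen = rng.choice(m, size=int(round(sparsity * m)), replace=False)
    signs = rng.choice([-1.0, 1.0], size=chosen.size)
    A[rows[chosen], cols[chosen]] = signs * rng.uniform(0.3, 0.7, size=chosen.size)
    d = rng.uniform(2.0, 5.0, size=p)                   # diagonal entries of D_{0n}
    L = (np.eye(p) - A) / np.sqrt(d)[:, None]           # D^{-1/2}(I_p - A)
    Omega = L.T @ L                                      # (I_p - A)^T D^{-1} (I_p - A)
    X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
    return X, A, d

rng = np.random.default_rng(1)
X, A0, d0 = simulate_dag_data(n=100, p=300, sparsity=0.03, rng=rng)
```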

4.3 Other Competing Methods

We compared the model selection performance of our method with those of other existing methods: the empirical Bayes (EB) procedure in Martin, Mess and Walker (2017), which we will denote as EB.M, hierarchical DAG-Wishart (DAG-W) prior (Cao, Khare and Ghosh, 2017) and convex sparse Cholesky selection (CSCS) (Khare et al., 2016).

  1. (EB.M): Because EB.M was originally proposed for regression coefficient estimation, it can be applied independently to estimate each a0ja_{0j} for j=2,,pj=2,\ldots,p. We set the hyperparameters α\alpha, γ\gamma, c1c_{1} and c2c_{2} to be the same as those in our setting for a fair comparison. Note that Martin, Mess and Walker (2017) used γ=0.001\gamma=0.001, c1=1c_{1}=1 and c2=0.05c_{2}=0.05 in their simulation study, but in our simulations these choices did not yield better results: they tended to produce unacceptably large FDR values. The key difference between our method and EB.M lies in how the diagonal matrix DnD_{n} is inferred. Martin, Mess and Walker (2017) suggested plugging in the cross-validation based Lasso residual sum of squares estimate (Reid, Tibshirani and Friedman, 2016) of d0jd_{0j}, while we impose a prior on djd_{j} and integrate it out to obtain the marginal posterior for SjS_{j}. Thus, EB.M ignores the uncertainty in djd_{j} and replaces it with a plug-in estimate.

  2. (DAG-W): The hierarchical DAG-Wishart prior (Cao, Khare and Ghosh, 2017) enables one to calculate the marginal posterior for the DAG analytically. Note that Cao, Khare and Ghosh (2017) used a log-posterior score search algorithm instead of a Markov chain Monte Carlo (MCMC) algorithm: they generated sets of candidate graphs by using frequentist approaches and by thresholding the modified Cholesky factor of (n1𝐗nT𝐗n+0.5Ip)1(n^{-1}{\bf X}_{n}^{T}{\bf X}_{n}+0.5I_{p})^{-1}, and the graph maximizing the log-posterior was chosen as the final estimate. In our simulation study, we implemented the log-posterior score search algorithm as well as a Metropolis-Hastings algorithm based on the marginal posterior for the DAG, for a comprehensive comparison. For the implementation, we set the shape parameters at αj(𝒟)=Sj+10\alpha_{j}(\mathcal{D})=S_{j}+10 and the scale matrix at Un=IpU_{n}=I_{p} as they suggested, where 𝒟\mathcal{D} is the DAG corresponding to {Sj}j=2p\{S_{j}\}_{j=2}^{p}. The critical part is the choice of the hyperparameter qnq_{n}, the individual edge probability. It was shown that the choice qn=eηnnq_{n}=e^{-\eta_{n}n} leads to strong model selection consistency, where ηn=s0(logp/n)1/(2+k)\eta_{n}=s_{0}(\log p/n)^{1/(2+k)} for some k>0k>0. Thus, the theoretical choice of qnq_{n} depends on the unknown parameter s0s_{0} and the constant k>0k>0. Furthermore, even with s0=1s_{0}=1 and k=0k=0, the resulting qnq_{n} is too small to allow the posterior to explore the model space efficiently. We observed that the choice qn=eηnnq_{n}=e^{-\eta_{n}n} causes the posterior to get stuck at very small models and fail to detect the true model. For example, for the setting (n=100,p=300)(n=100,p=300) with sparsity 3%, the corresponding posterior with qn=eηnnq_{n}=e^{-\eta_{n}n} concluded that the true Cholesky factor is a zero matrix, i.e. it never selected any nonzero variable. Thus, in our simulation study, we conducted the simulation only for the two choices qn=0.01q_{n}=0.01 and qn=0.001q_{n}=0.001, although they might not guarantee strong model selection consistency. For the log-posterior score search, we chose qn=eηnnq_{n}=e^{-\eta_{n}n} as in Cao, Khare and Ghosh (2017).

  3. (CSCS): We chose the CSCS method (Khare et al., 2016) as a state-of-the-art frequentist competitor. The tuning parameter λn\lambda_{n} in the CSCS method was selected by the Bayesian Information Criterion (BIC)-like measure defined in section 2.3 of Khare et al. (2016). We calculated the values of the BIC-like measure for λn\lambda_{n} from 0.1 to 5.1 with an increment of 0.1, and the value of λn\lambda_{n} minimizing the BIC-like measure was chosen for the estimation (a schematic of this grid search is sketched after this list).
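The following minimal sketch shows the grid search just described; fit_cscs and bic_like_score are placeholders for the CSCS solver and the BIC-like measure of Khare et al. (2016), neither of which is reproduced here, so both are assumed to be supplied by the user.

```python
import numpy as np

def select_lambda(X, fit_cscs, bic_like_score, grid=None):
    """Pick the CSCS tuning parameter minimizing a BIC-like measure over a grid."""
    if grid is None:
        grid = np.arange(0.1, 5.1 + 1e-8, 0.1)   # lambda from 0.1 to 5.1 in steps of 0.1
    scores = [bic_like_score(X, fit_cscs(X, lam)) for lam in grid]
    return float(grid[int(np.argmin(scores))])
```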

4.4 Results

We ran the Metropolis-Hastings algorithm for each data set to conduct posterior inferences. Every MCMC chain ran for 24,000 iterations with a burn-in period of 4,000, so we obtained 20,000 posterior samples. We used the models selected by the CSCS method as the initial states for MCMC chains. We constructed the final model by collecting indices with inclusion probabilities, π(ajl0𝐗n)\pi(a_{jl}\neq 0\mid{\bf X}_{n}), exceeding 0.5.

To measure the model selection performance, the number of errors, false discovery rate (FDR), true positive rate (TPR) and inclusion probabilities were reported. We calculated the mean inclusion probability for zero entries in A0nA_{0n} and denote it by p¯0\bar{p}_{0}. Similarly, the mean inclusion probability for nonzero entries in A0nA_{0n} is denoted by p¯1\bar{p}_{1}. More specifically, we calculated

p¯0\displaystyle\bar{p}_{0} =\displaystyle= 1j=2p(j1s0j)j=2plS0jπ(ajl0𝐗n),\displaystyle\frac{1}{\sum_{j=2}^{p}(j-1-s_{0j})}\sum_{j=2}^{p}\sum_{l\notin S_{0j}}\pi(a_{jl}\neq 0\mid{\bf X}_{n}),
p¯1\displaystyle\bar{p}_{1} =\displaystyle= 1j=2ps0jj=2plS0jπ(ajl0𝐗n).\displaystyle\frac{1}{\sum_{j=2}^{p}s_{0j}}\sum_{j=2}^{p}\sum_{l\in S_{0j}}\pi(a_{jl}\neq 0\mid{\bf X}_{n}).
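A small Python sketch of how these summaries can be computed from the estimated inclusion probabilities is given below. The array incl_prob, holding the Monte Carlo estimates of π(ajl≠0∣Xn), the 0.5 threshold and the function names are our own illustrative choices following the description above.

```python
import numpy as np

def selection_metrics(incl_prob, A0, threshold=0.5):
    """Number of errors, FDR, TPR and the mean inclusion probabilities p0_bar, p1_bar."""
    lower = np.tril(np.ones_like(A0, dtype=bool), k=-1)  # candidate entries (j, l), l < j
    truth = (A0 != 0) & lower                             # true support of A_{0n}
    selected = (incl_prob > threshold) & lower            # entries with inclusion prob > 0.5
    tp = np.sum(selected & truth)
    fp = np.sum(selected & ~truth)
    fn = np.sum(~selected & truth)
    n_errors = int(fp + fn)
    fdr = fp / max(tp + fp, 1)
    tpr = tp / max(tp + fn, 1)
    p0_bar = incl_prob[lower & ~truth].mean()             # average over zero entries
    p1_bar = incl_prob[truth].mean()                       # average over nonzero entries
    return n_errors, fdr, tpr, p0_bar, p1_bar
```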
Table 1: ESC, EB.M, DAG-W and CSCS denote our method (empirical sparse Cholesky prior), the empirical Bayes procedure proposed by Martin, Mess and Walker (2017), the hierarchical Bayesian model using the DAG-Wishart prior (Cao, Khare and Ghosh, 2017) and convex sparse Cholesky selection (Khare et al., 2016), respectively. Sp: sparsity; FDR: false discovery rate; TPR: true positive rate; p¯0\bar{p}_{0}: the mean inclusion probability for zero entries in A0nA_{0n}; p¯1\bar{p}_{1}: the mean inclusion probability for nonzero entries in A0nA_{0n}.
(n,p,Sp)(n,p,\text{Sp}) Method # of errors FDR TPR p¯0\bar{p}_{0} p¯1\bar{p}_{1}
(100, 300, 3%) ESC 264 0.0361 0.8349 0.0071 0.8321
EB.M 419 0.1083 0.7836 0.0041 0.7828
DAG-W(qn=0.01)(q_{n}=0.01) 285 0.0208 0.8052 0.0024 0.8036
DAG-W(qn=0.001)(q_{n}=0.001) 462 0.0122 0.6647 0.0006 0.6688
DAG-W(log-score) 1194 0.0065 0.1130 \cdot \cdot
CSCS 2188 0.6433 0.7799 \cdot \cdot
(100, 300, 4%) ESC 389 0.0494 0.8261 0.0084 0.8194
EB.M 325 0.0347 0.7866 0.0020 0.7815
DAG-W(qn=0.01)(q_{n}=0.01) 422 0.0295 0.7887 0.0032 0.7873
DAG-W(qn=0.001)(q_{n}=0.001) 644 0.0216 0.6555 0.0011 0.6556
DAG-W(log-score) 1619 0.0056 0.0981 \cdot \cdot
CSCS 4025 0.7766 0.8045 \cdot \cdot
(200, 500, 3%) ESC 103 0.0118 0.9842 0.0039 0.9796
EB.M 212 0.0075 0.9506 0.0005 0.9509
DAG-W(qn=0.01)(q_{n}=0.01) 98 0.0049 0.9786 0.0010 0.9773
DAG-W(qn=0.001)(q_{n}=0.001) 182 0.0022 0.9535 0.0002 0.9519
DAG-W(log-score) 4285 0.0000 0.1412 \cdot \cdot
CSCS 10214 0.7397 0.9388 \cdot \cdot
(200, 500, 4%) ESC 153 0.0061 0.9754 0.0043 0.9650
EB.M 281 0.0038 0.9473 0.0005 0.9457
DAG-W(qn=0.01)(q_{n}=0.01) 163 0.0041 0.9713 0.0011 0.9684
DAG-W(qn=0.001)(q_{n}=0.001) 295 0.0017 0.9425 0.0002 0.9416
DAG-W(log-score) 4341 0.0000 0.1301 \cdot \cdot
CSCS 14632 0.7550 0.9285 \cdot \cdot

The simulation results are summarized in Table 1. The ESC prior generally performs better than the competing methods. EB.M works reasonably well, but its overall performance is worse than that of the ESC prior. The DAG-Wishart prior tends to have low TPR and low mean inclusion probability p¯1\bar{p}_{1} when qn=0.001q_{n}=0.001. When qn=0.01q_{n}=0.01, which is close to the unknown true sparsity level, the DAG-Wishart prior performs reasonably well, but the ESC prior still works better. However, the true sparsity is in general unknown, so setting qnq_{n} close to the true sparsity is challenging in practice. The log-posterior score search algorithm for DAG-Wishart is computationally efficient even for large pp, but in our settings it tends to have low TPR (along with low FDR). The CSCS method has high TPR values, but at the same time it has high FDR values. Thus, the simulation study confirms that our ESC prior not only has nice theoretical properties but also outperforms the other existing methods in practice.

5 Acknowledgement

We thank Ryan Martin for helpful discussions about the techniques for proving selection consistency. We would like to thank two referees for their valuable comments which have led to improvements of an earlier version of the paper. We would also like to thank Kshitij Khare and Syed Rahman for sharing with us their code to implement the CSCS method (Khare et al., 2016). Kyoungjae Lee thanks Xuan Cao for sharing her code to implement the log-posterior score search algorithm and helpful discussions about the DAG-Wishart prior.

References

  • Arias-Castro, E. and Lounici, K. (2014). Estimation and variable selection with exponential weights. Electron. J. Stat. 8 328–354.
  • Banerjee, S. and Ghosal, S. (2014). Posterior convergence rates for estimating large precision matrices using graphical models. Electron. J. Stat. 8 2111–2137.
  • Banerjee, S. and Ghosal, S. (2015). Bayesian structure learning in graphical models. J. Multivariate Anal. 136 147–162.
  • Ben-David, E., Li, T., Massam, H. and Rajaratnam, B. (2015). High dimensional Bayesian inference for Gaussian directed acyclic graph models. arXiv:1109.4371v5.
  • Bhattacharya, A., Pati, D. and Yang, Y. (2018). Bayesian fractional posteriors. Ann. Statist., to appear.
  • Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227.
  • Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford.
  • Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Berlin Heidelberg.
  • Cai, T. T., Liu, W. and Zhou, H. H. (2016). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Ann. Statist. 44 455–488.
  • Cai, T. T., Ma, Z. and Wu, Y. (2015). Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields 161 781–815.
  • Cai, T. T. and Yuan, M. (2012). Adaptive covariance matrix estimation through block thresholding. Ann. Statist. 40 2014–2042.
  • Cai, T. T., Zhang, C.-H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38 2118–2144.
  • Cai, T. T. and Zhou, H. H. (2012a). Minimax estimation of large covariance matrices under \ell_{1}-norm. Statist. Sinica 22 1319–1349.
  • Cai, T. T. and Zhou, H. H. (2012b). Optimal rates of convergence for sparse covariance matrix estimation. Ann. Statist. 40 2389–2420.
  • Cao, X., Khare, K. and Ghosh, M. (2017). Posterior graph selection and estimation consistency for high-dimensional Bayesian DAG models. Ann. Statist., to appear.
  • Castillo, I., Schmidt-Hieber, J. and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Ann. Statist. 43 1986–2018.
  • Duchi, J. (2016). Lecture Notes for Statistics 311/Electrical Engineering 377. URL: https://stanford.edu/class/stats311/Lectures/full_notes.pdf. Last visited on 2016/02/23.
  • Eldar, Y. C. and Kutyniok, G. (2012). Compressed Sensing: Theory and Applications. Cambridge University Press.
  • Fan, J., Fan, Y. and Lv, J. (2008). High dimensional covariance matrix estimation using a factor model. J. Econometrics 147 186–197.
  • Gao, C. and Zhou, H. H. (2015). Rate-optimal posterior contraction for sparse PCA. Ann. Statist. 43 785–818.
  • Grünwald, P. and van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12 1069–1103.
  • Hero, A. O., Ma, B., Michel, O. and Gorman, J. (2001). Alpha-divergence for classification, indexing and retrieval. Communication and Signal Processing Laboratory, Technical Report CSPL-328, University of Michigan.
  • Huang, J. Z., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93 85–98.
  • Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc. 104 682–693.
  • Kalisch, M. and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8 613–636.
  • Khare, K., Oh, S., Rahman, S. and Rajaratnam, B. (2016). A convex framework for high-dimensional sparse Cholesky based covariance estimation. arXiv:1610.02436.
  • Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28 1302–1338.
  • Lee, K. and Lee, J. (2017). Estimating large precision matrices via modified Cholesky decomposition. arXiv:1707.01143.
  • Lee, K. and Lee, J. (2018). Optimal Bayesian minimax rates for unconstrained large covariance matrices. Bayesian Anal., to appear.
  • Lee, K., Lee, J. and Lin, L. (2018). Supplementary material for “Minimax Posterior Convergence Rates and Model Selection Consistency in High-dimensional DAG Models based on Sparse Cholesky Factors”.
  • Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association 103 410–423.
  • Martin, R., Mess, R. and Walker, S. G. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli 23 1822–1847.
  • Martin, R. and Walker, S. G. (2014). Asymptotically minimax empirical Bayes estimation of a sparse normal mean vector. Electron. J. Stat. 8 2188–2206.
  • Miller, J. W. and Dunson, D. B. (2018). Robust Bayesian inference via coarsening. Journal of the American Statistical Association, to appear.
  • Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. Ann. Statist. 42 789–817.
  • Pati, D., Bhattacharya, A., Pillai, N. S. and Dunson, D. (2014). Posterior contraction in sparse Bayesian factor models for massive covariance matrices. Ann. Statist. 42 1102–1130.
  • Reid, S., Tibshirani, R. and Friedman, J. (2016). A study of error variance estimation in Lasso regression. Statist. Sinica 26 35–67.
  • Ren, Z., Sun, T., Zhang, C.-H. and Zhou, H. H. (2015). Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Ann. Statist. 43 991–1026.
  • Rothman, A. J., Levina, E. and Zhu, J. (2010). A new approach to Cholesky-based covariance regularization in high dimensions. Biometrika 97 539–550.
  • Roverato, A. (2000). Cholesky decomposition of a hyper inverse Wishart matrix. Biometrika 87 99–112.
  • Rütimann, P. and Bühlmann, P. (2009). High dimensional sparse covariance estimation via directed acyclic graphs. Electronic Journal of Statistics 3 1133–1160.
  • Shang, Z. and Clayton, M. K. (2011). Consistency of Bayesian linear model selection with a growing number of parameters. Journal of Statistical Planning and Inference 141 3463–3474.
  • Shin, M., Bhattacharya, A. and Johnson, V. E. (2015). Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. arXiv:1507.07106.
  • Shojaie, A. and Michailidis, G. (2010). Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika 97 519–538.
  • Syring, N. and Martin, R. (2016). Scaling the Gibbs posterior credible regions. arXiv:1509.00922.
  • van de Geer, S. and Bühlmann, P. (2013). \ell_{0}-penalized maximum likelihood for sparse directed acyclic graphs. Ann. Statist. 41 536–567.
  • Wainwright, M. J. (2009a). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inform. Theory 55 5728–5741.
  • Wainwright, M. J. (2009b). Sharp thresholds for high-dimensional and noisy sparsity recovery using \ell_{1}-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202.
  • Walker, S. and Hjort, N. L. (2001). On Bayesian consistency. J. R. Stat. Soc. Ser. B Stat. Methodol. 63 811–821.
  • Xiang, R., Khare, K. and Ghosh, M. (2015). High dimensional posterior convergence rates for decomposable graphical models. Electron. J. Stat. 9 2828–2854.
  • Yang, Y., Wainwright, M. J. and Jordan, M. I. (2016). On the computational complexity of high-dimensional Bayesian variable selection. Ann. Statist. 44 2497–2532.
  • Ye, F. and Zhang, C.-H. (2010). Rate minimaxity of the Lasso and Dantzig selector for the \ell_{q} loss in \ell_{r} balls. J. Mach. Learn. Res. 11 3519–3540.
  • Yu, G. and Bien, J. (2017). Learning local dependence in ordered data. J. Mach. Learn. Res. 18 1–60.
  • Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno De Finetti 6 233–243.


6 Proof of Theorem 3.1

Proof.

Note that

πα(SAnSA0n𝐗n)\displaystyle\pi_{\alpha}(S_{A_{n}}\neq S_{A_{0n}}\mid{\bf X}_{n}) =\displaystyle= j=2pπα(SjS0j𝐗n)\displaystyle\sum_{j=2}^{p}\pi_{\alpha}(S_{j}\neq S_{0j}\mid{\bf X}_{n}) (9)
=\displaystyle= j=2p{πα(SjS0j𝐗n)+πα(SjS0j𝐗n)}.\displaystyle\sum_{j=2}^{p}\Big{\{}\pi_{\alpha}(S_{j}\supsetneq S_{0j}\mid{\bf X}_{n})+\pi_{\alpha}(S_{j}\nsupseteq S_{0j}\mid{\bf X}_{n})\Big{\}}.

The first term of (9) is of order o(1)o(1) by Lemma 6.1, so we only need to consider the second term of (9). Note that

αn+ν02log(d^Sjd^S0j)\displaystyle-\frac{\alpha n+\nu_{0}}{2}\log\left(\frac{\widehat{d}_{S_{j}}}{\widehat{d}_{S_{0j}}}\right) =\displaystyle= αn+ν02log[1d^S0jd^Sjd^S0j]\displaystyle-\frac{\alpha n+\nu_{0}}{2}\log\left[1-\frac{\widehat{d}_{S_{0j}}-\widehat{d}_{S_{j}}}{\widehat{d}_{S_{0j}}}\right]
\displaystyle\leq αn+ν02d^S0jd^Sjd^S0j(1d^S0jd^Sjd^S0j)1\displaystyle\frac{\alpha n+\nu_{0}}{2}\cdot\frac{\widehat{d}_{S_{0j}}-\widehat{d}_{S_{j}}}{\widehat{d}_{S_{0j}}}\left(1-\frac{\widehat{d}_{S_{0j}}-\widehat{d}_{S_{j}}}{\widehat{d}_{S_{0j}}}\right)^{-1}
\displaystyle\equiv α+ν0n2d0jn(d^S0jd^Sj)(1+Q^n)1,\displaystyle\frac{\alpha+\frac{\nu_{0}}{n}}{2d_{0j}}\cdot n\left(\widehat{d}_{S_{0j}}-\widehat{d}_{S_{j}}\right)(1+\widehat{Q}_{n})^{-1},

where

Q^n\displaystyle\widehat{Q}_{n} :=\displaystyle:= (d^S0jd0j1)(d^S0jd^S0jSjd0j)+(d^Sjd^S0jSjd0j).\displaystyle\left(\frac{\widehat{d}_{S_{0j}}}{d_{0j}}-1\right)-\left(\frac{\widehat{d}_{S_{0j}}-\widehat{d}_{S_{0j}\cup S_{j}}}{d_{0j}}\right)+\left(\frac{\widehat{d}_{S_{j}}-\widehat{d}_{S_{0j}\cup S_{j}}}{d_{0j}}\right).

The inequality holds because log(1x)x/(1x)-\log(1-x)\leq x/(1-x) for any x<1x<1. For a given 0<α<10<\alpha<1, we define the events

N1,Sj,α,χ2c\displaystyle N_{1,S_{j},\alpha,\chi^{2}}^{c} :=\displaystyle:= {𝐗n:|d^S0jd0j1|(4ϵn|S0j|n|S0j|n,4ϵn|S0j|n|S0j|n)},\displaystyle\left\{{\bf X}_{n}:\Big{|}\frac{\widehat{d}_{S_{0j}}}{d_{0j}}-1\Big{|}\in\Big{(}-4\sqrt{\epsilon^{\prime}}\frac{n-|S_{0j}|}{n}-\frac{|S_{0j}|}{n},4\sqrt{\epsilon^{\prime}}\frac{n-|S_{0j}|}{n}-\frac{|S_{0j}|}{n}\Big{)}\right\},
N2,Sj,α,χ2c\displaystyle N_{2,S_{j},\alpha,\chi^{2}}^{c} :=\displaystyle:= {𝐗n:0<d^S0jd^S0jSjd0j<4ϵ+|S0jSj||S0j|n},\displaystyle\left\{{\bf X}_{n}:0<\frac{\widehat{d}_{S_{0j}}-\widehat{d}_{S_{0j}\cup S_{j}}}{d_{0j}}<4\epsilon^{\prime}+\frac{|S_{0j}\cup S_{j}|-|S_{0j}|}{n}\right\},
N3,Sj,α,χ2c\displaystyle N_{3,S_{j},\alpha,\chi^{2}}^{c} :=\displaystyle:= {𝐗n:0<d^Sjd^S0jSjd0j<ϵ+λ^nn},\displaystyle\left\{{\bf X}_{n}:0<\frac{\widehat{d}_{S_{j}}-\widehat{d}_{S_{0j}\cup S_{j}}}{d_{0j}}<\epsilon^{\prime}+\frac{\widehat{\lambda}_{n}}{n}\right\},

where ϵ:=((1α)/10)2\epsilon^{\prime}:=((1-\alpha)/10)^{2} and λ^n:=(InP~Sj)Z~ja0j22/d0j\widehat{\lambda}_{n}:=\|(I_{n}-\tilde{P}_{S_{j}})\tilde{Z}_{j}a_{0j}\|_{2}^{2}/d_{0j}. Let NSj,α,χ2c:=k=13Nk,Sj,α,χ2cN_{S_{j},\alpha,\chi^{2}}^{c}:=\cap_{k=1}^{3}N_{k,S_{j},\alpha,\chi^{2}}^{c} and ν1:=(1+α/γ)1/2\nu_{1}:=(1+\alpha/\gamma)^{1/2}, then

j=2p𝔼0[πα(SjS0j𝐗n)]\displaystyle\sum_{j=2}^{p}\,\mathbb{E}_{0}\left[\pi_{\alpha}(S_{j}\nsupseteq S_{0j}\mid{\bf X}_{n})\right]
j=2pSj:SjS0j{0(N1,Sj,α,χ2)+0(N2,Sj,α,χ2)+0(N3,Sj,α,χ2)}\displaystyle\leq\sum_{j=2}^{p}\sum_{S_{j}:S_{j}\nsupseteq S_{0j}}\left\{\mathbb{P}_{0}(N_{1,S_{j},\alpha,\chi^{2}})+\mathbb{P}_{0}(N_{2,S_{j},\alpha,\chi^{2}})+\mathbb{P}_{0}(N_{3,S_{j},\alpha,\chi^{2}})\right\} (10)
+j=2pSj:SjS0j𝔼0[πj(Sj)πj(S0j)ν1|S0j||Sj|exp(α+ν0n2d0j(1+Q^n)n(d^S0jd^Sj))INSj,α,χ2c].\displaystyle+\sum_{j=2}^{p}\sum_{S_{j}:S_{j}\nsupseteq S_{0j}}\mathbb{E}_{0}\left[\frac{\pi_{j}(S_{j})}{\pi_{j}(S_{0j})}\nu_{1}^{|S_{0j}|-|S_{j}|}\exp\left(\frac{\alpha+\frac{\nu_{0}}{n}}{2d_{0j}(1+\widehat{Q}_{n})}\cdot n\left(\widehat{d}_{S_{0j}}-\widehat{d}_{S_{j}}\right)\right)I_{N_{S_{j},\alpha,\chi^{2}}^{c}}\right]. (11)

If we show that (10) and (11) are of order o(1)o(1), the proof is complete. Since (10) is of order o(1)o(1) by Lemma 6.2, we focus on (11). Note that on the event 𝐗nNSj,α,χ2c{\bf X}_{n}\in N_{S_{j},\alpha,\chi^{2}}^{c},

min(Q^n)\displaystyle\min(\widehat{Q}_{n}) :=\displaystyle:= 4ϵ|S0j|n4ϵ|S0jSj||S0j|n\displaystyle-4\sqrt{\epsilon^{\prime}}-\frac{|S_{0j}|}{n}-4\epsilon^{\prime}-\frac{|S_{0j}\cup S_{j}|-|S_{0j}|}{n}
\displaystyle\leq Q^n  5ϵ+λ^nn=:max(Q^n)\displaystyle\widehat{Q}_{n}\,\,\leq\,\,5\sqrt{\epsilon^{\prime}}+\frac{\widehat{\lambda}_{n}}{n}\,\,=:\,\,\max(\widehat{Q}_{n})

for all sufficiently large nn. Also note that, for a fixed SjS0jS_{j}\nsupseteq S_{0j} and given Z~j\tilde{Z}_{j},

n(d^S0jd^Sj)\displaystyle n(\widehat{d}_{S_{0j}}-\widehat{d}_{S_{j}}) =\displaystyle= X~jT(P~SjP~S0j)X~j\displaystyle\tilde{X}_{j}^{T}(\tilde{P}_{S_{j}}-\tilde{P}_{S_{0j}})\tilde{X}_{j}
𝑑\displaystyle\overset{d}{\equiv} (a0jTZ~jT+ϵ~jT)(P~SjP~S0j)(Z~ja0j+ϵ~j)\displaystyle(a_{0j}^{T}\tilde{Z}_{j}^{T}+\tilde{\epsilon}_{j}^{T})(\tilde{P}_{S_{j}}-\tilde{P}_{S_{0j}})(\tilde{Z}_{j}a_{0j}+\tilde{\epsilon}_{j})
=\displaystyle= (InP~Sj)Z~ja0j222ϵ~jT(InP~Sj)Z~ja0j+ϵ~jT(P~SjP~S0j)ϵ~j\displaystyle-\|(I_{n}-\tilde{P}_{S_{j}})\tilde{Z}_{j}a_{0j}\|_{2}^{2}-2\tilde{\epsilon}_{j}^{T}(I_{n}-\tilde{P}_{S_{j}})\tilde{Z}_{j}a_{0j}+\tilde{\epsilon}_{j}^{T}(\tilde{P}_{S_{j}}-\tilde{P}_{S_{0j}})\tilde{\epsilon}_{j}
\displaystyle\leq d0jλ^n2ϵ~jT(InP~Sj)Z~ja0jϵ~jT(P~SjP~SjS0j)ϵ~j\displaystyle-d_{0j}\widehat{\lambda}_{n}-2\tilde{\epsilon}_{j}^{T}(I_{n}-\tilde{P}_{S_{j}})\tilde{Z}_{j}a_{0j}\oplus\tilde{\epsilon}_{j}^{T}(\tilde{P}_{S_{j}}-\tilde{P}_{S_{j}\cap S_{0j}})\tilde{\epsilon}_{j}
=:\displaystyle=: d0jλ^n2Vj,SjWj,Sj,\displaystyle-d_{0j}\widehat{\lambda}_{n}-2V_{j,S_{j}}\oplus W_{j,S_{j}},

where ϵ~jNn(0,d0jIn)\tilde{\epsilon}_{j}\sim N_{n}(0,d_{0j}I_{n}). For a given Z~j\tilde{Z}_{j}, it is easy to show that Vj,Sj/d0jN(0,d0jλ^n)V_{j,S_{j}}/\sqrt{d_{0j}}\sim N(0,d_{0j}\widehat{\lambda}_{n}) and Wj,Sj/d0jχ|Sj||S0jSj|2W_{j,S_{j}}/d_{0j}\sim\chi^{2}_{|S_{j}|-|S_{0j}\cap S_{j}|} under 0\mathbb{P}_{0}. Then,

𝔼0[exp(α+ν0n2d0j(1+Q^n)n(d^S0jd^Sj))INSj,α,χ2c|Z~j]\displaystyle\mathbb{E}_{0}\left[\exp\left(\frac{\alpha+\frac{\nu_{0}}{n}}{2d_{0j}(1+\widehat{Q}_{n})}\cdot n\left(\widehat{d}_{S_{0j}}-\widehat{d}_{S_{j}}\right)\right)I_{N_{S_{j},\alpha,\chi^{2}}^{c}}\,\Big{|}\,\tilde{Z}_{j}\right]
\displaystyle\leq Q{min(Q^n),max(Q^n)}𝔼0[exp((α+ν0n)(d0jλ^n+2Vj,Sj)2d0j(1+Q))exp((α+ν0n)Wj,Sj2d0j(1+Q))|Z~j]\displaystyle\hskip-14.22636pt\sum_{Q\in\{\min(\widehat{Q}_{n}),\max(\widehat{Q}_{n})\}}\hskip-14.22636pt\mathbb{E}_{0}\Bigg{[}\exp\Bigg{(}-\frac{(\alpha+\frac{\nu_{0}}{n})(d_{0j}\widehat{\lambda}_{n}+2V_{j,S_{j}})}{2d_{0j}(1+Q)}\Bigg{)}\exp\Bigg{(}\frac{(\alpha+\frac{\nu_{0}}{n})W_{j,S_{j}}}{2d_{0j}(1+Q)}\Bigg{)}\,\Big{|}\,\tilde{Z}_{j}\Bigg{]}
\displaystyle\leq Q{min(Q^n),max(Q^n)}𝔼0[exp((α+ν0n)(d0jλ^n+2Vj,Sj)2d0j(1+Q))|Z~j](1α+ν0n1+Q)|Sj||S0jSj|2\displaystyle\hskip-14.22636pt\sum_{Q\in\{\min(\widehat{Q}_{n}),\max(\widehat{Q}_{n})\}}\hskip-14.22636pt\mathbb{E}_{0}\Bigg{[}\exp\Bigg{(}-\frac{(\alpha+\frac{\nu_{0}}{n})(d_{0j}\widehat{\lambda}_{n}+2V_{j,S_{j}})}{2d_{0j}(1+Q)}\Bigg{)}\,\Big{|}\,\tilde{Z}_{j}\Bigg{]}\left(1-\frac{\alpha+\frac{\nu_{0}}{n}}{1+Q}\right)^{-\frac{|S_{j}|-|S_{0j}\cap S_{j}|}{2}}

The second inequality follows from the moment generating function of the chi-square distribution because α+ν0/n<1+min(Q^n)\alpha+\nu_{0}/n<1+\min(\widehat{Q}_{n}) for all sufficiently large nn. From the moment generating function of the normal distribution, we have

𝔼0[exp((α+ν0n)(d0jλ^n+2Vj,Sj)2d0j(1+Q))|Z~j]\displaystyle\mathbb{E}_{0}\Bigg{[}\exp\Bigg{(}-\frac{(\alpha+\frac{\nu_{0}}{n})(d_{0j}\widehat{\lambda}_{n}+2V_{j,S_{j}})}{2d_{0j}(1+Q)}\Bigg{)}\,\Big{|}\,\tilde{Z}_{j}\Bigg{]}
=\displaystyle= exp{α+ν0n2(1+Q)(1α+ν0n1+Q)λ^n}\displaystyle\exp\left\{-\frac{\alpha+\frac{\nu_{0}}{n}}{2(1+Q)}\cdot\bigg{(}1-\frac{\alpha+\frac{\nu_{0}}{n}}{1+Q}\bigg{)}\cdot\widehat{\lambda}_{n}\right\}
\displaystyle\leq exp{α+ν0n2(1+max(Q^n))(1α+ν0n1+min(Q^n))λ^n}\displaystyle\exp\left\{-\frac{\alpha+\frac{\nu_{0}}{n}}{2(1+\max(\widehat{Q}_{n}))}\cdot\bigg{(}1-\frac{\alpha+\frac{\nu_{0}}{n}}{1+\min(\widehat{Q}_{n})}\bigg{)}\cdot\widehat{\lambda}_{n}\right\}
\displaystyle\leq exp{α+ν0n2(1+5ϵ+λ^nn)(1α+ν0n14ϵ5ϵ)λ^n},\displaystyle\exp\left\{-\frac{\alpha+\frac{\nu_{0}}{n}}{2(1+5\sqrt{\epsilon^{\prime}}+\frac{\widehat{\lambda}_{n}}{n})}\cdot\bigg{(}1-\frac{\alpha+\frac{\nu_{0}}{n}}{1-4\sqrt{\epsilon^{\prime}}-5\epsilon^{\prime}}\bigg{)}\cdot\widehat{\lambda}_{n}\right\},

where Q=min(Q^n)Q=\min(\widehat{Q}_{n}) or max(Q^n)\max(\widehat{Q}_{n}). Note that

d0jλ^n\displaystyle d_{0j}\widehat{\lambda}_{n} =\displaystyle= (InP~Sj)Z~S0jSjca0,S0jSjc22\displaystyle\|(I_{n}-\tilde{P}_{S_{j}})\tilde{Z}_{S_{0j}\cap S_{j}^{c}}a_{0,S_{0j}\cap S_{j}^{c}}\|_{2}^{2}
\displaystyle\geq λmin(Z~S0jSjTZ~S0jSj)a0,S0jSjc22\displaystyle\lambda_{\min}(\tilde{Z}_{S_{0j}\cup S_{j}}^{T}\tilde{Z}_{S_{0j}\cup S_{j}})\|a_{0,S_{0j}\cap S_{j}^{c}}\|_{2}^{2}
\displaystyle\geq λmin(Z~S0jSjTZ~S0jSj)(|S0j||SjS0j|)minj,l:a0,jl0|a0,jl|2\displaystyle\lambda_{\min}(\tilde{Z}_{S_{0j}\cup S_{j}}^{T}\tilde{Z}_{S_{0j}\cup S_{j}})\cdot(|S_{0j}|-|S_{j}\cap S_{0j}|)\cdot\min_{j,l:a_{0,jl}\neq 0}|a_{0,jl}|^{2}

by Lemma 5 of Arias-Castro and Lounici (2014). Define a set

Nj,Sj\displaystyle N_{j,S_{j}} :=\displaystyle:= {𝐗n:n1λmin(Z~S0jSjTZ~S0jSj)(12ϵ0)2ϵ0}\displaystyle\left\{{\bf X}_{n}:n^{-1}\lambda_{\min}(\tilde{Z}_{S_{0j}\cup S_{j}}^{T}\tilde{Z}_{S_{0j}\cup S_{j}})\leq(1-2\epsilon_{0})^{2}\epsilon_{0}\right\}
{𝐗n:n1λmax(Z~S0jSjTZ~S0jSj)(1+2ϵ0)2ϵ01}.\displaystyle\cap\,\,\left\{{\bf X}_{n}:n^{-1}\lambda_{\max}(\tilde{Z}_{S_{0j}\cup S_{j}}^{T}\tilde{Z}_{S_{0j}\cup S_{j}})\geq(1+2\epsilon_{0})^{2}\epsilon_{0}^{-1}\right\}.

By Corollary 5.35 in Eldar and Kutyniok (2012), we have 0(Nj,Sj)4exp(nϵ02/2)\mathbb{P}_{0}(N_{j,S_{j}})\leq 4\exp(-n\epsilon_{0}^{2}/2) for all sufficiently large nn. Thus, on the event 𝐗nNj,Sjc{\bf X}_{n}\in N_{j,S_{j}}^{c}, we have λ^nd0j1(12ϵ0)2ϵ0(|S0j||SjS0j|)nminj,l:a0,jl0|a0,jl|2\widehat{\lambda}_{n}\geq d_{0j}^{-1}(1-2\epsilon_{0})^{2}\epsilon_{0}\cdot(|S_{0j}|-|S_{j}\cap S_{0j}|)\cdot n\min_{j,l:a_{0,jl}\neq 0}|a_{0,jl}|^{2}, which implies

𝔼0[exp{α+ν0n2(1+5ϵ+λ^nn)(1α+ν0n14ϵ5ϵ)λ^n}]\displaystyle\mathbb{E}_{0}\left[\exp\left\{-\frac{\alpha+\frac{\nu_{0}}{n}}{2(1+5\sqrt{\epsilon^{\prime}}+\frac{\widehat{\lambda}_{n}}{n})}\cdot\bigg{(}1-\frac{\alpha+\frac{\nu_{0}}{n}}{1-4\sqrt{\epsilon^{\prime}}-5\epsilon^{\prime}}\bigg{)}\cdot\widehat{\lambda}_{n}\right\}\right]
\displaystyle\leq 𝔼0[exp{α2(1α14ϵ5ϵ)(1+5ϵλ^n+1n)1}]\displaystyle\mathbb{E}_{0}\left[\exp\left\{-\frac{\alpha}{2}\cdot\Big{(}1-\frac{\alpha}{1-4\sqrt{\epsilon^{\prime}}-5\epsilon^{\prime}}\Big{)}\cdot\Big{(}\frac{1+5\sqrt{\epsilon^{\prime}}}{\widehat{\lambda}_{n}}+\frac{1}{n}\Big{)}^{-1}\right\}\right]
\displaystyle\leq 𝔼0[exp{α2(1α2)(2λ^n+1n)1}]\displaystyle\mathbb{E}_{0}\left[\exp\left\{-\frac{\alpha}{2}\cdot\Big{(}\frac{1-\alpha}{2}\Big{)}\cdot\Big{(}\frac{2}{\widehat{\lambda}_{n}}+\frac{1}{n}\Big{)}^{-1}\right\}\right]
\displaystyle\leq exp{α(1α)4ϵ02(12ϵ0)24(|S0j||SjS0j|)nminj,l:a0,jl0|a0,jl|2}+0(Nj,Sj)\displaystyle\exp\left\{-\frac{\alpha(1-\alpha)}{4}\cdot\frac{\epsilon_{0}^{2}(1-2\epsilon_{0})^{2}}{4}(|S_{0j}|-|S_{j}\cap S_{0j}|)\cdot n\min_{j,l:a_{0,jl}\neq 0}|a_{0,jl}|^{2}\right\}+\mathbb{P}_{0}(N_{j,S_{j}})
\displaystyle\leq exp{(|S0j||SjS0j|)Cbmlogp}+0(Nj,Sj)\displaystyle\exp\Big{\{}-(|S_{0j}|-|S_{j}\cap S_{0j}|)\cdot C_{\rm bm}\log p\Big{\}}+\mathbb{P}_{0}(N_{j,S_{j}})

for all sufficiently large nn. Note that the second inequality holds because ϵ=(1α)/10\sqrt{\epsilon^{\prime}}=(1-\alpha)/10. Thus, (11) is bounded above by

j=2pSj:SjS0jπj(Sj)πj(S0j)ν1s0j|Sj|ν2|Sj||S0jSj|2exp{(s0j|SjS0j|)Cbmlogp}\displaystyle\sum_{j=2}^{p}\sum_{S_{j}:S_{j}\nsupseteq S_{0j}}\frac{\pi_{j}(S_{j})}{\pi_{j}(S_{0j})}\nu_{1}^{s_{0j}-|S_{j}|}\nu_{2}^{|S_{j}|-|S_{0j}\cap S_{j}|}\cdot 2\exp\Big{\{}-(s_{0j}-|S_{j}\cap S_{0j}|)C_{\rm bm}\log p\Big{\}}
+j=2pSj:SjS0jπj(Sj)πj(S0j)ν1s0j|Sj|ν2|Sj||S0jSj|4exp(nϵ022)\displaystyle+\quad\sum_{j=2}^{p}\sum_{S_{j}:S_{j}\nsupseteq S_{0j}}\frac{\pi_{j}(S_{j})}{\pi_{j}(S_{0j})}\nu_{1}^{s_{0j}-|S_{j}|}\nu_{2}^{|S_{j}|-|S_{0j}\cap S_{j}|}\cdot 4\exp\Big{(}-\frac{n\epsilon_{0}^{2}}{2}\Big{)}
j=2ps=0Rjt=0(s0j1)s(s0jt)(js0jst)(js0j)(js)(ν1ν2c1pc2)s0js 6(ν2pCbm)s0jt\displaystyle\leq\sum_{j=2}^{p}\sum_{s=0}^{R_{j}}\sum_{t=0}^{(s_{0j}-1)\wedge s}\binom{s_{0j}}{t}\binom{j-s_{0j}}{s-t}\frac{\binom{j}{s_{0j}}}{\binom{j}{s}}\Big{(}\frac{\nu_{1}}{\nu_{2}}c_{1}p^{c_{2}}\Big{)}^{s_{0j}-s}\,6\,(\nu_{2}p^{-C_{\rm bm}})^{s_{0j}-t}

for all sufficiently large nn and some constant Cbm>0C_{\rm bm}>0, where ν2:=(1(α+ν0/n)/(14ϵ5ϵ))1/2\nu_{2}:=(1-(\alpha+\nu_{0}/n)/(1-4\sqrt{\epsilon^{\prime}}-5\epsilon^{\prime}))^{-1/2}. Note that

(s0jt)(js0jst)(js0j)(js)\displaystyle\frac{\binom{s_{0j}}{t}\binom{j-s_{0j}}{s-t}\binom{j}{s_{0j}}}{\binom{j}{s}} =\displaystyle= (st)(jss0jt)sst×ps0jt=(ps)s0jt×s(s0js),\displaystyle\binom{s}{t}\binom{j-s}{s_{0j}-t}\,\,\leq\,\,s^{s-t}\times p^{s_{0j}-t}\,\,=\,\,(ps)^{s_{0j}-t}\times s^{-(s_{0j}-s)},

so the last term can be decomposed as

j=2ps=0s0j1t=0s(ν1ν2sc1pc2)s0js(ν2pCbm+1s)s0jt\displaystyle\sum_{j=2}^{p}\sum_{s=0}^{s_{0j}-1}\sum_{t=0}^{s}\Big{(}\frac{\nu_{1}}{\nu_{2}s}c_{1}p^{c_{2}}\Big{)}^{s_{0j}-s}(\nu_{2}p^{-C_{\rm bm}+1}s)^{s_{0j}-t} \displaystyle\lesssim j=2ps=0s0j1(ν1c1pCbm+c2+1)s0js\displaystyle\sum_{j=2}^{p}\sum_{s=0}^{s_{0j}-1}\left(\nu_{1}c_{1}p^{-C_{\rm bm}+c_{2}+1}\right)^{s_{0j}-s}
\displaystyle\lesssim ν1c1pCbm+c2+2=o(1), and\displaystyle\nu_{1}c_{1}p^{-C_{\rm bm}+c_{2}+2}\,\,=\,\,o(1),\text{ and}
j=2ps=s0jRjt=0s0j1(ν1ν2sc1pc2)s0js(ν2pCbm+1s)s0jt\displaystyle\sum_{j=2}^{p}\sum_{s=s_{0j}}^{R_{j}}\sum_{t=0}^{s_{0j}-1}\Big{(}\frac{\nu_{1}}{\nu_{2}s}c_{1}p^{c_{2}}\Big{)}^{s_{0j}-s}(\nu_{2}p^{-C_{\rm bm}+1}s)^{s_{0j}-t} \displaystyle\lesssim j=2pν2pCbm+1Rj\displaystyle\sum_{j=2}^{p}\nu_{2}p^{-C_{\rm bm}+1}R_{j}
\displaystyle\leq supjRjν2pCbm+2=o(1),\displaystyle\sup_{j}R_{j}\cdot\nu_{2}p^{-C_{\rm bm}+2}\,\,=\,\,o(1),

provided that Cbm>c2+2C_{\rm bm}>c_{2}+2. Note that s0jRjs_{0j}\leq R_{j} because of Condition (P) and s0logpnc3/2s_{0}\log p\leq n\,c_{3}/2. ∎

6.1 Lemmas for the proof of Theorem 3.1

Lemma 6.1.

Under the conditions in Theorem 3.1, we have

j=2pπα(SjS0j𝐗n)\displaystyle\sum_{j=2}^{p}\pi_{\alpha}(S_{j}\supsetneq S_{0j}\mid{\bf X}_{n}) =\displaystyle= o(1).\displaystyle o(1).
Proof.

For a given SjS0jS_{j}\supsetneq S_{0j}, we have

πα(Sj𝐗n)\displaystyle\pi_{\alpha}(S_{j}\mid{\bf X}_{n}) \displaystyle\leq πα(Sj𝐗n)πα(S0j𝐗n)\displaystyle\frac{\pi_{\alpha}(S_{j}\mid{\bf X}_{n})}{\pi_{\alpha}(S_{0j}\mid{\bf X}_{n})}
=\displaystyle= πj(Sj)πj(S0j)(1+αγ)|Sj||S0j|2(d^Sjd^S0j)αn+ν02.\displaystyle\frac{\pi_{j}(S_{j})}{\pi_{j}(S_{0j})}\left(1+\frac{\alpha}{\gamma}\right)^{-\frac{|S_{j}|-|S_{0j}|}{2}}\cdot\left(\frac{\widehat{d}_{S_{j}}}{\widehat{d}_{S_{0j}}}\right)^{-\frac{\alpha n+\nu_{0}}{2}}.

Note that nd0j1d^Sj=d0j1X~jT(InP~Sj)X~jχn|Sj|2nd_{0j}^{-1}\widehat{d}_{S_{j}}=d_{0j}^{-1}\tilde{X}_{j}^{T}(I_{n}-\tilde{P}_{S_{j}})\tilde{X}_{j}\sim\chi^{2}_{n-|S_{j}|} and nd0j1d^S0j𝑑nd0j1d^Sjχ|Sj||S0j|2nd_{0j}^{-1}\widehat{d}_{S_{0j}}\overset{d}{\equiv}nd_{0j}^{-1}\widehat{d}_{S_{j}}\oplus\chi^{2}_{|S_{j}|-|S_{0j}|} given Z~j=(Z1j,,Znj)T\tilde{Z}_{j}=(Z_{1j},\ldots,Z_{nj})^{T} under 0\mathbb{P}_{0}, which implies d^Sj/d^S0jBeta((n|Sj|)/2,(|Sj||S0j|)/2)\widehat{d}_{S_{j}}/\widehat{d}_{S_{0j}}\sim Beta\left((n-|S_{j}|)/2,(|S_{j}|-|S_{0j}|)/2\right) and

𝔼0(d^Sjd^S0j)αn+ν02\displaystyle\mathbb{E}_{0}\left(\frac{\widehat{d}_{S_{j}}}{\widehat{d}_{S_{0j}}}\right)^{-\frac{\alpha n+\nu_{0}}{2}} =\displaystyle= Γ(n|S0j|2)Γ(n|Sj|2)Γ(n(1α)ν0|Sj|2)Γ(n(1α)ν0|S0j|2)\displaystyle\frac{\Gamma\left(\frac{n-|S_{0j}|}{2}\right)}{\Gamma\left(\frac{n-|S_{j}|}{2}\right)}\cdot\frac{\Gamma\left(\frac{n(1-\alpha)-\nu_{0}-|S_{j}|}{2}\right)}{\Gamma\left(\frac{n(1-\alpha)-\nu_{0}-|S_{0j}|}{2}\right)}
\displaystyle\leq (n|S0j|22)|Sj||S0j|2(2n(1α)ν0|Sj|)|Sj||S0j|2\displaystyle\left(\frac{n-|S_{0j}|-2}{2}\right)^{\frac{|S_{j}|-|S_{0j}|}{2}}\cdot\left(\frac{2}{n(1-\alpha)-\nu_{0}-|S_{j}|}\right)^{\frac{|S_{j}|-|S_{0j}|}{2}}
\displaystyle\leq (2(n2)n(1α))|Sj||S0j|2(21α)|Sj||S0j|2,\displaystyle\left(\frac{2(n-2)}{n(1-\alpha)}\right)^{\frac{|S_{j}|-|S_{0j}|}{2}}\,\,\leq\,\,\left(\frac{2}{1-\alpha}\right)^{\frac{|S_{j}|-|S_{0j}|}{2}},

where the second inequality holds because ν0+|Sj|ν0+Rjn(1α)/2\nu_{0}+|S_{j}|\leq\nu_{0}+R_{j}\leq n(1-\alpha)/2 for all large nn. Let cα,γ=(1+α/γ)1/2(2/(1α))1/2c_{\alpha,\gamma}=(1+\alpha/\gamma)^{-1/2}(2/(1-\alpha))^{1/2} and s0j=|S0j|s_{0j}=|S_{0j}|, then we have

j=2p𝔼0πα(SjS0j𝐗n)\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}(S_{j}\supsetneq S_{0j}\mid{\bf X}_{n}) \displaystyle\leq j=2pSj:SjS0jπj(Sj)πj(S0j)cα,γ|Sj||S0j|\displaystyle\sum_{j=2}^{p}\sum_{S_{j}:S_{j}\supsetneq S_{0j}}\frac{\pi_{j}(S_{j})}{\pi_{j}(S_{0j})}c_{\alpha,\gamma}^{|S_{j}|-|S_{0j}|}
\displaystyle\leq j=2ps=s0j+1Rj(ss0j)(cα,γc1pc2)ss0j\displaystyle\sum_{j=2}^{p}\sum_{s=s_{0j}+1}^{R_{j}}\binom{s}{s_{0j}}\left(\frac{c_{\alpha,\gamma}}{c_{1}p^{c_{2}}}\right)^{s-s_{0j}}
\displaystyle\leq j=2ps=s0j+1Rj(cα,γsc1pc2)ss0jj=2pcα,γRjc1pc2.\displaystyle\sum_{j=2}^{p}\sum_{s=s_{0j}+1}^{R_{j}}\left(\frac{c_{\alpha,\gamma}s}{c_{1}p^{c_{2}}}\right)^{s-s_{0j}}\,\,\lesssim\,\,\sum_{j=2}^{p}\,\,\frac{c_{\alpha,\gamma}R_{j}}{c_{1}p^{c_{2}}}.

The last display is of order o(1)o(1) because we assume that c22c_{2}\geq 2. ∎

Lemma 6.2.

Under the conditions in Theorem 3.1, we have

j=2pSj:SjS0j{0(N1,Sj,α,χ2)+0(N2,Sj,α,χ2)+0(N3,Sj,α,χ2)}\displaystyle\sum_{j=2}^{p}\sum_{S_{j}:S_{j}\supsetneq S_{0j}}\Big{\{}\mathbb{P}_{0}(N_{1,S_{j},\alpha,\chi^{2}})+\mathbb{P}_{0}(N_{2,S_{j},\alpha,\chi^{2}})+\mathbb{P}_{0}(N_{3,S_{j},\alpha,\chi^{2}})\Big{\}} =\displaystyle= o(1),\displaystyle o(1),

where N1,Sj,α,χ2,N2,Sj,α,χ2N_{1,S_{j},\alpha,\chi^{2}},N_{2,S_{j},\alpha,\chi^{2}} and N3,Sj,α,χ2N_{3,S_{j},\alpha,\chi^{2}} are defined in the proof of Theorem 3.1.

Proof.

By Lemma 1 in Laurent and Massart (2000), P(χk2k2kx+2x)exp(x)P(\chi_{k}^{2}-k\geq 2\sqrt{kx}+2x)\leq\exp(-x) and P(kχk22kx)exp(x)P(k-\chi_{k}^{2}\geq 2\sqrt{kx})\leq\exp(-x) for all x>0x>0. It is easy to check that

0(N1,Sj,α,χ2)\displaystyle\mathbb{P}_{0}(N_{1,S_{j},\alpha,\chi^{2}}) =\displaystyle= 0(|(ns0j)1χns0j21|4ϵ)\displaystyle\mathbb{P}_{0}\Big{(}|(n-s_{0j})^{-1}\chi_{n-s_{0j}}^{2}-1|\geq 4\sqrt{\epsilon^{\prime}}\Big{)}
\displaystyle\leq 2eϵ(ns0j)  2eϵn2,\displaystyle 2e^{-\epsilon^{\prime}(n-s_{0j})}\,\,\leq\,\,2e^{-\frac{\epsilon^{\prime}n}{2}},
0(N2,Sj,α,χ2)\displaystyle\mathbb{P}_{0}(N_{2,S_{j},\alpha,\chi^{2}}) =\displaystyle= 0((|S0jSj|s0j)1χ|S0jSj|s0j214ϵn|S0jSj|s0j)\displaystyle\mathbb{P}_{0}\Big{(}(|S_{0j}\cup S_{j}|-s_{0j})^{-1}\chi_{|S_{0j}\cup S_{j}|-s_{0j}}^{2}-1\geq\frac{4\epsilon^{\prime}n}{|S_{0j}\cup S_{j}|-s_{0j}}\Big{)}
\displaystyle\leq eϵn\displaystyle e^{-\epsilon^{\prime}n}

for all sufficiently large nn. For the third term 0(N3,Sj,α,χ2)\mathbb{P}_{0}(N_{3,S_{j},\alpha,\chi^{2}}), note that n(d^Sjd^S0jSj)/d0jn(\widehat{d}_{S_{j}}-\widehat{d}_{S_{0j}\cup S_{j}})/d_{0j} follows the noncentral chi-square distribution with |S0jSj||Sj||S_{0j}\cup S_{j}|-|S_{j}| degrees of freedom and the noncentrality parameter λ^n\widehat{\lambda}_{n} under 0\mathbb{P}_{0} given Z~j\tilde{Z}_{j}. Note that on the event 𝐗nNj,Sjc{\bf X}_{n}\in N_{j,S_{j}}^{c} defined in the proof of Theorem 3.1,

λ^n\displaystyle\widehat{\lambda}_{n} =\displaystyle= d0j1(InP~Sj)Z~ja0j22λmax(Z~S0jTZ~S0j)d0j1a0j22\displaystyle d_{0j}^{-1}\|(I_{n}-\tilde{P}_{S_{j}})\tilde{Z}_{j}a_{0j}\|_{2}^{2}\,\,\leq\,\,\lambda_{\max}(\tilde{Z}_{S_{0j}}^{T}\tilde{Z}_{S_{0j}})\cdot d_{0j}^{-1}\|a_{0j}\|_{2}^{2}
\displaystyle\leq ϵ01n(1+2ϵ0)2d0j1/2a0j2\displaystyle\epsilon_{0}^{-1}n(1+2\epsilon_{0})^{2}\cdot\|d_{0j}^{-1/2}a_{0j}\|^{2}
\displaystyle\leq ϵ01n(1+2ϵ0)22{d0j1/2(eja0j)22+d0j1}\displaystyle\epsilon_{0}^{-1}n(1+2\epsilon_{0})^{2}\cdot 2\Big{\{}\|d_{0j}^{-1/2}(e_{j}-a_{0j})\|_{2}^{2}+d_{0j}^{-1}\Big{\}}
\displaystyle\leq ϵ01n(1+2ϵ0)22ϵ01,\displaystyle\epsilon_{0}^{-1}n(1+2\epsilon_{0})^{2}\cdot 2\epsilon_{0}^{-1},

where eje_{j} is the unit vector whose jjth element is 1 and the others are zero. By Lemma 4 in Shin, Bhattacharya and Johnson (2015),

0(N3,Sj,α,χ2)\displaystyle\mathbb{P}_{0}(N_{3,S_{j},\alpha,\chi^{2}}) \displaystyle\leq 𝔼0[C(ϵn2(|S0jSj||Sj|))|S0jSj||Sj|2e|S0jSj||Sj|2ϵn2\displaystyle\mathbb{E}_{0}\Bigg{[}C\Big{(}\frac{\epsilon^{\prime}n}{2(|S_{0j}\cup S_{j}|-|S_{j}|)}\Big{)}^{\frac{|S_{0j}\cup S_{j}|-|S_{j}|}{2}}e^{\frac{|S_{0j}\cup S_{j}|-|S_{j}|}{2}-\frac{\epsilon^{\prime}n}{2}}
+{Cλ^nϵneϵ2n232λ^n1}]\displaystyle\quad+\quad\Big{\{}C\frac{\widehat{\lambda}_{n}}{\epsilon^{\prime}n}e^{-\frac{\epsilon^{\prime 2}n^{2}}{32\widehat{\lambda}_{n}}}\wedge 1\Big{\}}\Bigg{]}
\displaystyle\leq eϵn4+𝔼0[Cλ^nϵneϵ2n232λ^nINj,Sjc]+0(Nj,Sj)\displaystyle e^{-\frac{\epsilon^{\prime}n}{4}}+\mathbb{E}_{0}\left[C\frac{\widehat{\lambda}_{n}}{\epsilon^{\prime}n}e^{-\frac{\epsilon^{\prime 2}n^{2}}{32\widehat{\lambda}_{n}}}I_{N_{j,S_{j}}^{c}}\right]+\mathbb{P}_{0}(N_{j,S_{j}})
\displaystyle\leq eϵn4+eϵ2ϵ0264(1+2ϵ0)2n+4eϵ02n2\displaystyle e^{-\frac{\epsilon^{\prime}n}{4}}+e^{-\frac{\epsilon^{\prime 2}\epsilon_{0}^{2}}{64(1+2\epsilon_{0})^{2}}\,n}+4e^{-\frac{\epsilon_{0}^{2}n}{2}}

for all sufficiently large nn and some constant C>0C>0. Thus, by Condition (P), the proof is complete. ∎

7 Proofs of Posterior Convergence Rates for Precision Matrices

Recall that we consider the model

X1,,XnΩn\displaystyle X_{1},\ldots,X_{n}\mid\Omega_{n} i.i.d\displaystyle\overset{i.i.d}{\sim} Np(0,Ωn1),\displaystyle N_{p}(0,\Omega_{n}^{-1}), (12)

where Ωn=Σn1\Omega_{n}=\Sigma_{n}^{-1} is a p×pp\times p precision matrix and Xi=(Xi1,,Xip)TpX_{i}=(X_{i1},\ldots,X_{ip})^{T}\in\mathbb{R}^{p} for all i=1,,ni=1,\ldots,n.

We also introduce some notation that will be used in the proofs throughout the supplementary material. We define Var^(Xj)=n1X~j22\widehat{{\rm Var}}(X_{j})=n^{-1}\|\tilde{X}_{j}\|_{2}^{2} for j=1,,pj=1,\ldots,p. For a given index set S{1,,p}S\subseteq\{1,\ldots,p\}, we define Var^(ZS)=n1𝐗ST𝐗S\widehat{{\rm Var}}(Z_{S})=n^{-1}{\bf X}_{S}^{T}{\bf X}_{S} and Cov^(ZS,Xj)=n1𝐗STX~j\widehat{{\rm Cov}}(Z_{S},X_{j})=n^{-1}{\bf X}_{S}^{T}\tilde{X}_{j}.
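For concreteness, the following Python sketch (with simulated data; the dimensions, the column index j and the index set S are arbitrary illustration choices, and X̃_j is taken here simply as the jth column of the data matrix) computes the empirical quantities defined above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))      # rows are the observations X_1, ..., X_n

j = 4                                # an arbitrary column (0-based indexing)
S = [0, 2, 3]                        # an arbitrary index set S

var_hat_Xj = np.sum(X[:, j] ** 2) / n          # \widehat{Var}(X_j) = n^{-1} ||X_j||_2^2
var_hat_ZS = X[:, S].T @ X[:, S] / n           # \widehat{Var}(Z_S) = n^{-1} X_S^T X_S
cov_hat_ZS_Xj = X[:, S].T @ X[:, j] / n        # \widehat{Cov}(Z_S, X_j) = n^{-1} X_S^T X_j
print(var_hat_Xj, var_hat_ZS.shape, cov_hat_ZS_Xj.shape)
```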

Lemma 7.1.

Let 𝐗n{\bf X}_{n} be the random sample of size nn from Np(0,Σ0n)N_{p}(0,\Sigma_{0n}) with ϵ0λmin(Σ0n)λmax(Σ0n)ϵ01\epsilon_{0}\leq\lambda_{\min}(\Sigma_{0n})\leq\lambda_{\max}(\Sigma_{0n})\leq\epsilon_{0}^{-1} for some constant 0<ϵ0<1/20<\epsilon_{0}<1/2. Define Cmax=(1+2ϵ0)2C_{\max}=(1+2\epsilon_{0})^{2}, Cmin=(12ϵ0)2C_{\min}=(1-2\epsilon_{0})^{2},

N1,R,ϵ0\displaystyle N_{1,R,\epsilon_{0}} =\displaystyle= {𝐗n:n1Ψmax(R)2Cmaxϵ01} and\displaystyle\Big{\{}{\bf X}_{n}:n^{-1}\Psi_{\max}(R)^{2}\geq C_{\max}\epsilon_{0}^{-1}\Big{\}}\quad\text{ and}
N2,R,ϵ0\displaystyle N_{2,R,\epsilon_{0}} =\displaystyle= {𝐗n:n1Ψmin(R)2Cminϵ0},\displaystyle\Big{\{}{\bf X}_{n}:n^{-1}\Psi_{\min}(R)^{2}\leq C_{\min}\epsilon_{0}\Big{\}},

for some positive integer RR. If R=o(n)R=o(n), we have

0(N1,R,ϵ0)\displaystyle\mathbb{P}_{0}(N_{1,R,\epsilon_{0}}) \displaystyle\leq 2exp(n2ϵ02+Rlogp+logR) and\displaystyle 2\exp\left(-\frac{n}{2}\epsilon_{0}^{2}+R\log p+\log R\right)\quad\text{ and}
0(N2,R,ϵ0)\displaystyle\mathbb{P}_{0}(N_{2,R,\epsilon_{0}}) \displaystyle\leq 2exp(n2ϵ02+Rlogp+logR)\displaystyle 2\exp\left(-\frac{n}{2}\epsilon_{0}^{2}+R\log p+\log R\right)

for all sufficiently large nn.

Proof.

We only prove the upper bound for 0(N2,R,ϵ0)\mathbb{P}_{0}(N_{2,R,\epsilon_{0}}), because the upper bound for 0(N1,R,ϵ0)\mathbb{P}_{0}(N_{1,R,\epsilon_{0}}) can be proved easily by the similar arguments. For any given index set S{1,,p}S\subseteq\{1,\ldots,p\} such that 0<|S|R0<|S|\leq R, it is easy to show that n1Σ0n,S1/2𝐗ST𝐗SΣ0n,S1/2W|S|(n,n1I|S|)n^{-1}\Sigma_{0n,S}^{-1/2}{\bf X}_{S}^{T}{\bf X}_{S}\Sigma_{0n,S}^{-1/2}\sim W_{|S|}(n,n^{-1}I_{|S|}) and λmin(Σ0n,S)ϵ0\lambda_{\min}(\Sigma_{0n,S})\geq\epsilon_{0}. Let Cmin=(12ϵ0)2C_{\min}=(1-2\epsilon_{0})^{2}. By Corollary 5.35 in Eldar and Kutyniok (2012) with t=ϵ0nt=\epsilon_{0}\sqrt{n},

0(n1λmin(𝐗ST𝐗S)Cminϵ0)\displaystyle\mathbb{P}_{0}\left(n^{-1}\lambda_{\min}({\bf X}_{S}^{T}{\bf X}_{S})\leq C_{\min}\epsilon_{0}\right) \displaystyle\leq 0(n1ϵ0λmin(Σ0n,S1/2𝐗ST𝐗SΣ0n,S1/2)Cminϵ0)\displaystyle\mathbb{P}_{0}\left(n^{-1}\epsilon_{0}\lambda_{\min}(\Sigma_{0n,S}^{-1/2}{\bf X}_{S}^{T}{\bf X}_{S}\Sigma_{0n,S}^{-1/2})\leq C_{\min}\epsilon_{0}\right)
\displaystyle\leq 2exp(n2ϵ02)\displaystyle 2\exp\left(-\frac{n}{2}\epsilon_{0}^{2}\right)

for all sufficiently large nn because R=o(n)R=o(n). Thus, we have

0(N2,R,ϵ0)\displaystyle\mathbb{P}_{0}\left(N_{2,R,\epsilon_{0}}\right) =\displaystyle= 0(infS:0<|S|Rn1λmin(𝐗ST𝐗S)Cminϵ0)\displaystyle\mathbb{P}_{0}\left(\inf_{S:0<|S|\leq R}n^{-1}\lambda_{\min}({\bf X}_{S}^{T}{\bf X}_{S})\leq C_{\min}\epsilon_{0}\right)
\displaystyle\leq S:0<|S|R0(n1λmin(𝐗ST𝐗S)Cminϵ0)\displaystyle\sum_{S:0<|S|\leq R}\mathbb{P}_{0}\left(n^{-1}\lambda_{\min}({\bf X}_{S}^{T}{\bf X}_{S})\leq C_{\min}\epsilon_{0}\right)
\displaystyle\leq R×pR×2exp(n2ϵ02)\displaystyle R\times p^{R}\times 2\exp\left(-\frac{n}{2}\epsilon_{0}^{2}\right)
=\displaystyle= 2exp(n2ϵ02+Rlogp+logR)\displaystyle 2\exp\left(-\frac{n}{2}\epsilon_{0}^{2}+R\log p+\log R\right)

for all sufficiently large nn. ∎
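As a numerical illustration of the events controlled in Lemma 7.1 (an informal sketch, not part of the proof; it checks a single index set S rather than taking the extremes over all sets of size at most R, and all constants are arbitrary), one can verify that the eigenvalues of n^{-1} X_S^T X_S typically stay inside the band between C_min ε_0 and C_max ε_0^{-1}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, eps0 = 500, 50, 0.1
Sigma0 = np.diag(np.linspace(eps0, 1.0 / eps0, p))   # eigenvalues within [eps0, 1/eps0]
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)

S = [0, 7, 23, 41]                                   # one index set with |S| small relative to n
eigvals = np.linalg.eigvalsh(X[:, S].T @ X[:, S] / n)
C_min, C_max = (1 - 2 * eps0) ** 2, (1 + 2 * eps0) ** 2
print(eigvals.min() > C_min * eps0, eigvals.max() < C_max / eps0)   # expected: True True
```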

Lemma 7.2.

Let 𝐗n{\bf X}_{n} be the random sample of size nn from Np(0,Σ0n)N_{p}(0,\Sigma_{0n}) with ϵ0λmin(Σ0n)λmax(Σ0n)ϵ01\epsilon_{0}\leq\lambda_{\min}(\Sigma_{0n})\leq\lambda_{\max}(\Sigma_{0n})\leq\epsilon_{0}^{-1} for some constant 0<ϵ0<1/20<\epsilon_{0}<1/2. For a given constant Kdiff>0K_{\rm diff}>0, define

N1,S0,ϵ0\displaystyle N_{1,S_{0},\epsilon_{0}} =\displaystyle= {𝐗n:max2jpVar^(ZS0j{j})(1+2ϵ0)2ϵ01},\displaystyle\left\{{\bf X}_{n}:\max_{2\leq j\leq p}\|\widehat{{\rm Var}}(Z_{S_{0j}\cup\{j\}})\|\geq(1+2\epsilon_{0})^{2}\epsilon_{0}^{-1}\right\},
N2,S0,ϵ0\displaystyle N_{2,S_{0},\epsilon_{0}} =\displaystyle= {𝐗n:max2jpVar^1(ZS0j{j})(12ϵ0)2ϵ01},\displaystyle\left\{{\bf X}_{n}:\max_{2\leq j\leq p}\|\widehat{{\rm Var}}^{-1}(Z_{S_{0j}\cup\{j\}})\|\geq(1-2\epsilon_{0})^{-2}\epsilon_{0}^{-1}\right\},
N3,S0,ϵ0\displaystyle N_{3,S_{0},\epsilon_{0}} =\displaystyle= {𝐗n:max2jpVar^(ZS0j{j})Var(ZS0j{j})Kdiffs0+logpn} and\displaystyle\left\{{\bf X}_{n}:\max_{2\leq j\leq p}\|\widehat{{\rm Var}}(Z_{S_{0j}\cup\{j\}})-{\rm Var}(Z_{S_{0j}\cup\{j\}})\|\geq\sqrt{K_{\rm diff}\cdot\frac{s_{0}+\log p}{n}}\right\}\quad\text{ and}
N4,S0,ϵ0\displaystyle N_{4,S_{0},\epsilon_{0}} =\displaystyle= {𝐗n:max2jpVar^1(ZS0j{j})Var1(ZS0j{j})Cϵ0Kdiffs0+logpn},\displaystyle\left\{{\bf X}_{n}:\max_{2\leq j\leq p}\|\widehat{{\rm Var}}^{-1}(Z_{S_{0j}\cup\{j\}})-{\rm Var}^{-1}(Z_{S_{0j}\cup\{j\}})\|\geq C_{\epsilon_{0}}\sqrt{K_{\rm diff}\cdot\frac{s_{0}+\log p}{n}}\right\},

where Cϵ0=(12ϵ0)2ϵ02C_{\epsilon_{0}}=(1-2\epsilon_{0})^{-2}\epsilon_{0}^{-2}. Let NS0,ϵ0=j=14Nj,S0,ϵ0N_{S_{0},\epsilon_{0}}=\cup_{j=1}^{4}N_{j,S_{0},\epsilon_{0}}. If s0+logp=o(n)s_{0}+\log p=o(n), there exists a universal constant C>0C>0 such that

0(NS0,ϵ0)\displaystyle\mathbb{P}_{0}\big{(}N_{S_{0},\epsilon_{0}}\big{)} \displaystyle\leq 6pexp(n2ϵ02)+4p5s0exp(KdiffCϵ02(s0+logp))\displaystyle 6\cdot p\exp\left(-\frac{n}{2}\epsilon_{0}^{2}\right)+4\cdot p5^{s_{0}}\exp\Big{(}-K_{\rm diff}C\epsilon_{0}^{2}(s_{0}+\log p)\Big{)}

for all sufficiently large nn.

Proof.

It is easy to show that

0(N1,S0,ϵ0)+0(N2,S0,ϵ0)\displaystyle\mathbb{P}_{0}\big{(}N_{1,S_{0},\epsilon_{0}}\big{)}+\mathbb{P}_{0}\big{(}N_{2,S_{0},\epsilon_{0}}\big{)} \displaystyle\leq 4pexp(n2ϵ02)\displaystyle 4\cdot p\exp\left(-\frac{n}{2}\epsilon_{0}^{2}\right)

by arguments similar to those in the proof of Lemma 7.1. Thus, it suffices to show that

0(N3,S0,ϵ0)\displaystyle\mathbb{P}_{0}\big{(}N_{3,S_{0},\epsilon_{0}}\big{)} 2p5s0exp(KdiffCϵ02(s0+logp)),\displaystyle\leq 2\cdot p5^{s_{0}}\exp\Big{(}-K_{\rm diff}C\epsilon_{0}^{2}(s_{0}+\log p)\Big{)}, (13)
0(N4,S0,ϵ0)\displaystyle\mathbb{P}_{0}\big{(}N_{4,S_{0},\epsilon_{0}}\big{)} 2p{5s0exp(KdiffCϵ02(s0+logp))+exp(n2ϵ02)}\displaystyle\leq 2p\cdot\left\{5^{s_{0}}\exp\Big{(}-K_{\rm diff}C\epsilon_{0}^{2}(s_{0}+\log p)\Big{)}+\exp\left(-\frac{n}{2}\epsilon_{0}^{2}\right)\right\} (14)

for all sufficiently large nn and some constant C>0C>0. Note that for any p×pp\times p symmetric matrix VV, there exist vjpv_{j}\in\mathbb{R}^{p} with vj2=1\|v_{j}\|_{2}=1 for j=1,,5pj=1,\ldots,5^{p} such that

V\displaystyle\|V\| \displaystyle\leq 4sup1j5p|vjTVvj|,\displaystyle 4\cdot\sup_{1\leq j\leq 5^{p}}|v_{j}^{T}Vv_{j}|,

by page 2141 of Cai, Zhang and Zhou (2010). Thus,

0(N3,S0,ϵ0)\displaystyle\mathbb{P}_{0}\big{(}N_{3,S_{0},\epsilon_{0}}\big{)}
=\displaystyle= 0(max2jpVar^(ZS0j{j})Var(ZS0j{j})Kdiffs0+logpn)\displaystyle\mathbb{P}_{0}\left(\max_{2\leq j\leq p}\|\widehat{{\rm Var}}(Z_{S_{0j}\cup\{j\}})-{\rm Var}(Z_{S_{0j}\cup\{j\}})\|\geq\sqrt{K_{\rm diff}\frac{s_{0}+\log p}{n}}\right)
\displaystyle\leq 0(max2jpW^S0j{j}Iϵ0Kdiffs0+logpn)\displaystyle\mathbb{P}_{0}\left(\max_{2\leq j\leq p}\|\widehat{W}_{S_{0j}\cup\{j\}}-I\|\geq\epsilon_{0}\sqrt{K_{\rm diff}\frac{s_{0}+\log p}{n}}\right)
\displaystyle\leq p5s0max2jpsup1k5s0j+10(|vkT(W^S0j{j}I)vk|ϵ04Kdiffs0+logpn),\displaystyle p5^{s_{0}}\max_{2\leq j\leq p}\sup_{1\leq k\leq 5^{s_{0j}+1}}\mathbb{P}_{0}\left(|v_{k}^{T}(\widehat{W}_{S_{0j}\cup\{j\}}-I)v_{k}|\geq\frac{\epsilon_{0}}{4}\sqrt{K_{\rm diff}\frac{s_{0}+\log p}{n}}\right),

where W^S0j{j}:=Var(ZS0j{j})1/2Var^(ZS0j{j})Var(ZS0j{j})1/2\widehat{W}_{S_{0j}\cup\{j\}}:={\rm Var}(Z_{S_{0j}\cup\{j\}})^{-1/2}\widehat{{\rm Var}}(Z_{S_{0j}\cup\{j\}}){\rm Var}(Z_{S_{0j}\cup\{j\}})^{-1/2}, and W^S0j{j}W|S0j|+1(n,n1I)\widehat{W}_{S_{0j}\cup\{j\}}\sim W_{|S_{0j}|+1}(n,n^{-1}I). Note that nvjTW^S0j{j}vjχn2n\,v_{j}^{T}\widehat{W}_{S_{0j}\cup\{j\}}v_{j}\sim\chi_{n}^{2} by the property of Wishart distribution, and P(|χn2n|2nt+2t)exp(t)P(|\chi_{n}^{2}-n|\geq 2\sqrt{nt}+2t)\leq\exp(-t) for all t>0t>0 by Lemma 1 in Laurent and Massart (2000). Thus,

0(N3,S0,ϵ0)\displaystyle\mathbb{P}_{0}\big{(}N_{3,S_{0},\epsilon_{0}}\big{)} \displaystyle\leq 2p5s0exp(ϵ0244Kdiff(s0+logp)),\displaystyle 2\cdot p5^{s_{0}}\exp\left(-\frac{\epsilon_{0}^{2}}{4^{4}}K_{\rm diff}(s_{0}+\log p)\right),
which gives (13) with C=44C=4^{-4}.

Similarly,

0(N4,S0,ϵ0)\displaystyle\mathbb{P}_{0}\big{(}N_{4,S_{0},\epsilon_{0}}\big{)} \displaystyle\leq 0(N4,S0,ϵ0N2,S0,ϵ0c)+0(N2,S0,ϵ0)\displaystyle\mathbb{P}_{0}\big{(}N_{4,S_{0},\epsilon_{0}}\cap N_{2,S_{0},\epsilon_{0}}^{c}\big{)}+\mathbb{P}_{0}\big{(}N_{2,S_{0},\epsilon_{0}}\big{)}
\displaystyle\leq 0(max2jpVar^(ZS0j{j})Var(ZS0j{j})Kdiffs0+logpn)\displaystyle\mathbb{P}_{0}\left(\max_{2\leq j\leq p}\|\widehat{{\rm Var}}(Z_{S_{0j}\cup\{j\}})-{\rm Var}(Z_{S_{0j}\cup\{j\}})\|\geq\sqrt{K_{\rm diff}\frac{s_{0}+\log p}{n}}\right)
+  2pexp(n2ϵ02)\displaystyle+\,\,2\cdot p\exp\left(-\frac{n}{2}\epsilon_{0}^{2}\right)
\displaystyle\leq 2p5s0exp(KdiffCϵ02(s0+logp))+2pexp(n2ϵ02)\displaystyle 2\cdot p5^{s_{0}}\exp\Big{(}-K_{\rm diff}C\epsilon_{0}^{2}(s_{0}+\log p)\Big{)}+2\cdot p\exp\left(-\frac{n}{2}\epsilon_{0}^{2}\right)

for all sufficiently large nn; thus, we have (14). The second inequality follows from

Var^1(ZS0j{j})Var1(ZS0j{j})\displaystyle\|\widehat{{\rm Var}}^{-1}(Z_{S_{0j}\cup\{j\}})-{\rm Var}^{-1}(Z_{S_{0j}\cup\{j\}})\|
\displaystyle\leq Var^1(ZS0j{j})Var1(ZS0j{j})Var^(ZS0j{j})Var(ZS0j{j}).\displaystyle\|\widehat{{\rm Var}}^{-1}(Z_{S_{0j}\cup\{j\}})\|\|{\rm Var}^{-1}(Z_{S_{0j}\cup\{j\}})\|\|\widehat{{\rm Var}}(Z_{S_{0j}\cup\{j\}})-{\rm Var}(Z_{S_{0j}\cup\{j\}})\|.

Lemma 7.3.

Let 𝐗n{\bf X}_{n} be the random sample of size nn from Np(0,Ω0n1)N_{p}(0,\Omega_{0n}^{-1}) with Ω0n\Omega_{0n} satisfying (A1), (A2) and (A4) for some constant 0<ϵ0<1/20<\epsilon_{0}<1/2 and a sequence of positive integers s0s_{0}. Let NS0,ϵ0N_{S_{0},\epsilon_{0}} be the set defined at Lemma 7.2. If s0+logp=o(n)s_{0}+\log p=o(n) and s03/2(s0+logp)=O(n)s_{0}^{3/2}(s_{0}+\log p)=O(n), we have

Ω^nΩ0n\displaystyle\|\widehat{\Omega}_{n}-\Omega_{0n}\| \displaystyle\lesssim s03/4(s0+logpn)1/2\displaystyle s_{0}^{3/4}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}

on 𝐗nNS0,ϵ0c{\bf X}_{n}\in N_{S_{0},\epsilon_{0}}^{c}, for all sufficiently large nn. If we further assume s0(s0+logp)=O(n)s_{0}(s_{0}+\log p)=O(n), then

Ω^nΩ0n\displaystyle\|\widehat{\Omega}_{n}-\Omega_{0n}\|_{\infty} \displaystyle\lesssim IpA0ns0(s0+logpn)1/2\displaystyle\|I_{p}-A_{0n}\|_{\infty}\cdot s_{0}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}

on 𝐗nNS0,ϵ0c{\bf X}_{n}\in N_{S_{0},\epsilon_{0}}^{c}, for all sufficiently large nn.

Proof.

Throughout the proof, we only consider the event 𝐗nNS0,ϵ0c{\bf X}_{n}\in N_{S_{0},\epsilon_{0}}^{c}. Consider the spectral norm case first. By the triangle inequality,

Ω^nΩ0nIpA^n2D^n1D0n1+IpA^nD0n1A^nA0n+IpA0nD0n1A^nA0n.\displaystyle\begin{split}&\quad\,\,\,\|\widehat{\Omega}_{n}-\Omega_{0n}\|\\ \,\,&\leq\,\,\|I_{p}-\widehat{A}_{n}\|^{2}\cdot\|\widehat{D}_{n}^{-1}-D_{0n}^{-1}\|+\|I_{p}-\widehat{A}_{n}\|\cdot\|D_{0n}^{-1}\|\cdot\|\widehat{A}_{n}-A_{0n}\|\\ &+\,\,\|I_{p}-A_{0n}\|\cdot\|D_{0n}^{-1}\|\cdot\|\widehat{A}_{n}-A_{0n}\|.\end{split} (15)

Note that

A^nA0n=maxja^S0ja0,S0j1s0maxja^S0ja0,S0j2s0{maxjVar1(ZS0j)(Cov^(ZS0j,Xj)Cov(ZS0j,Xj))2+maxj(Var^1(ZS0j)Var1(ZS0j))Cov^(ZS0j,Xj)}s0(s0+logpn)1/2\displaystyle\begin{split}\|\widehat{A}_{n}-A_{0n}\|_{\infty}\,\,&=\,\,\max_{j}\|\widehat{a}_{S_{0j}}-a_{0,S_{0j}}\|_{1}\\ \,\,&\leq\,\,\sqrt{s_{0}}\max_{j}\|\widehat{a}_{S_{0j}}-a_{0,S_{0j}}\|_{2}\\ \,\,&\leq\,\,\sqrt{s_{0}}\Big{\{}\max_{j}\|{\rm Var}^{-1}(Z_{S_{0j}})\cdot\big{(}\widehat{{\rm Cov}}(Z_{S_{0j}},X_{j})-{\rm Cov}(Z_{S_{0j}},X_{j})\big{)}\|_{2}\\ \,\,&\quad+\,\,\max_{j}\|\big{(}\widehat{{\rm Var}}^{-1}(Z_{S_{0j}})-{\rm Var}^{-1}(Z_{S_{0j}})\big{)}\cdot\widehat{{\rm Cov}}(Z_{S_{0j}},X_{j})\|\Big{\}}\\ \,\,&\lesssim\,\,\sqrt{s_{0}}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}\end{split} (16)

by the definition of NS0,ϵ0cN_{S_{0},\epsilon_{0}}^{c}. Similarly, it is easy to show that

A^nA0n1\displaystyle\|\widehat{A}_{n}-A_{0n}\|_{1} \displaystyle\leq s0maxja^S0ja0,S0jmax\displaystyle s_{0}\max_{j}\|\widehat{a}_{S_{0j}}-a_{0,S_{0j}}\|_{\max}
\displaystyle\leq s0maxja^S0ja0,S0j2\displaystyle s_{0}\max_{j}\|\widehat{a}_{S_{0j}}-a_{0,S_{0j}}\|_{2}
\displaystyle\lesssim s0(s0+logpn)1/2.\displaystyle s_{0}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}.

Thus, we have

A^nA0n\displaystyle\|\widehat{A}_{n}-A_{0n}\| \displaystyle\leq A^nA0n1/2A^nA0n11/2\displaystyle\|\widehat{A}_{n}-A_{0n}\|_{\infty}^{1/2}\cdot\|\widehat{A}_{n}-A_{0n}\|_{1}^{1/2}
\displaystyle\leq s03/4(s0+logpn)1/2.\displaystyle s_{0}^{3/4}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}.

On the other hand, note that

D^n1D0n1\displaystyle\|\widehat{D}_{n}^{-1}-D_{0n}^{-1}\| \displaystyle\leq D^n1D0n1D^nD0n\displaystyle\|\widehat{D}_{n}^{-1}\|\cdot\|{D}_{0n}^{-1}\|\cdot\|\widehat{D}_{n}-D_{0n}\|
\displaystyle\leq (12ϵ0)2ϵ01ϵ01D^nD0n\displaystyle(1-2\epsilon_{0})^{-2}\epsilon_{0}^{-1}\cdot\epsilon_{0}^{-1}\cdot\|\widehat{D}_{n}-D_{0n}\|

and

D^nD0n\displaystyle\|\widehat{D}_{n}-{D}_{0n}\| =\displaystyle= maxj|dS0j^d0j|\displaystyle\max_{j}|\widehat{d_{S_{0j}}}-d_{0j}|
\displaystyle\leq maxj|Var^(Xj)Var(Xj)|\displaystyle\max_{j}\Big{|}\widehat{{\rm Var}}(X_{j})-{\rm Var}(X_{j})\Big{|}
+maxj|Cov^(Xj,ZS0j)a^S0jCov(Xj,ZS0j)a0,S0j|\displaystyle+\quad\max_{j}\Big{|}\widehat{{\rm Cov}}(X_{j},Z_{S_{0j}})\widehat{a}_{S_{0j}}-{\rm Cov}(X_{j},Z_{S_{0j}})a_{0,S_{0j}}\Big{|}
\displaystyle\lesssim (s0+logpn)1/2.\displaystyle\left(\frac{s_{0}+\log p}{n}\right)^{1/2}.

Also note that

IpA^n\displaystyle\|I_{p}-\widehat{A}_{n}\| \displaystyle\leq IpA0n+A^nA0n\displaystyle\|I_{p}-A_{0n}\|+\|\widehat{A}_{n}-A_{0n}\|
\displaystyle\leq ϵ01+s03/4(s0+logpn)1/2,\displaystyle\epsilon_{0}^{-1}+s_{0}^{3/4}\left(\frac{s_{0}+\log p}{n}\right)^{1/2},

where the last display is of order O(1)O(1) provided that s03/2(s0+logp)=O(n)s_{0}^{3/2}(s_{0}+\log p)=O(n). The second inequality follows from ϵ01(IpA0n)TD0n1(IpA0n)λmin(D0n1)(IpA0n)(IpA0n)Tϵ0IpA0n2\epsilon_{0}^{-1}\geq\|(I_{p}-A_{0n})^{T}D_{0n}^{-1}(I_{p}-A_{0n})\|\geq\lambda_{\min}(D_{0n}^{-1})\|(I_{p}-A_{0n})(I_{p}-A_{0n})^{T}\|\geq\epsilon_{0}\|I_{p}-A_{0n}\|^{2}. By (15), we have shown the spectral norm result.

Now, consider the matrix \ell_{\infty} norm case. Similar to (15), by the triangle inequality,

Ω^nΩ0nIpA^n1IpA^nD^n1D0n1+IpA^n1D0n1A^nA0n+IpA0nD0n1A^nA0n1.\displaystyle\begin{split}\|\widehat{\Omega}_{n}-\Omega_{0n}\|_{\infty}\,\,&\leq\,\,\|I_{p}-\widehat{A}_{n}\|_{1}\cdot\|I_{p}-\widehat{A}_{n}\|_{\infty}\cdot\|\widehat{D}_{n}^{-1}-D_{0n}^{-1}\|\\ &+\,\,\|I_{p}-\widehat{A}_{n}\|_{1}\cdot\|D_{0n}^{-1}\|\cdot\|\widehat{A}_{n}-A_{0n}\|_{\infty}\\ &+\,\,\|I_{p}-A_{0n}\|_{\infty}\cdot\|D_{0n}^{-1}\|\cdot\|\widehat{A}_{n}-A_{0n}\|_{1}.\end{split} (17)

From the above arguments and (17), we only need to show that

IpA^n1\displaystyle\|I_{p}-\widehat{A}_{n}\|_{1} \displaystyle\lesssim s0.\displaystyle\sqrt{s_{0}}.

It is easy to show that

IpA^n1\displaystyle\|I_{p}-\widehat{A}_{n}\|_{1} \displaystyle\leq IpA0n1+A^nA0n1\displaystyle\|I_{p}-A_{0n}\|_{1}+\|\widehat{A}_{n}-A_{0n}\|_{1}
\displaystyle\lesssim IpA0n1+s0(s0+logpn)1/2\displaystyle\|I_{p}-A_{0n}\|_{1}+s_{0}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}
\displaystyle\lesssim IpA0n1+s0\displaystyle\|I_{p}-A_{0n}\|_{1}+\sqrt{s_{0}}

because we assume that s0(s0+logp)=O(n)s_{0}(s_{0}+\log p)=O(n). If we show that IpA0n1s0+1IpA0n\|I_{p}-A_{0n}\|_{1}\leq\sqrt{s_{0}+1}\,\|I_{p}-A_{0n}\|, the proof is complete. Let ac,0ja_{c,0j} be the jjth column vector of A0nA_{0n} and ejpe_{j}\in\mathbb{R}^{p} be the unit vector whose jjth element is 1 and the others are 0; then

IpA0n1\displaystyle\|I_{p}-A_{0n}\|_{1} =\displaystyle= maxjejac,0j1\displaystyle\max_{j}\|e_{j}-a_{c,0j}\|_{1}
\displaystyle\leq s0+1maxjejac,0j2,\displaystyle\sqrt{s_{0}+1}\max_{j}\|e_{j}-a_{c,0j}\|_{2},

by the condition (A4). Note that maxjejac,0j2\max_{j}\|e_{j}-a_{c,0j}\|_{2} is the maximum 2\ell_{2} norm of the columns of IpA0nI_{p}-A_{0n}, which is no larger than IpA0n\|I_{p}-A_{0n}\|. Since IpA0nϵ01\|I_{p}-A_{0n}\|\leq\epsilon_{0}^{-1}, we have IpA^n1s0\|I_{p}-\widehat{A}_{n}\|_{1}\lesssim\sqrt{s_{0}}. ∎
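The estimator analyzed in Lemma 7.3 is a node-wise least squares plug-in based on the modified Cholesky decomposition of the precision matrix, Ω = (I_p − A)^T D^{-1} (I_p − A). The Python sketch below is an informal illustration only (a toy sparse Cholesky factor, the true supports treated as known, and arbitrary dimensions); it shows how such a plug-in estimator is assembled and how its spectral-norm error can be evaluated.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 8

# Toy sparse Cholesky factor (0-based indexing): each node regresses on its predecessor.
A0 = np.zeros((p, p))
for j in range(1, p):
    A0[j, j - 1] = 0.5
d0 = np.ones(p)
Omega0 = (np.eye(p) - A0).T @ np.diag(1.0 / d0) @ (np.eye(p) - A0)
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega0), size=n)

# Node-wise least squares on the (assumed known) true supports S_{0j}.
A_hat, d_hat = np.zeros((p, p)), np.empty(p)
for j in range(p):
    S = [j - 1] if j >= 1 else []
    if S:
        a_j, *_ = np.linalg.lstsq(X[:, S], X[:, j], rcond=None)
        A_hat[j, S] = a_j
        resid = X[:, j] - X[:, S] @ a_j
    else:
        resid = X[:, j]
    d_hat[j] = np.mean(resid ** 2)

Omega_hat = (np.eye(p) - A_hat).T @ np.diag(1.0 / d_hat) @ (np.eye(p) - A_hat)
print(np.linalg.norm(Omega_hat - Omega0, 2))     # spectral-norm error; small for large n
```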

Recall that we consider the beta-min condition (A3) such that

minj,l:a0,jl0|a0,jl|2\displaystyle\min_{j,l:a_{0,jl}\neq 0}|a_{0,jl}|^{2} \displaystyle\geq 16α(1α)ϵ02(12ϵ0)2Cbmlogpn\displaystyle\frac{16}{\alpha(1-\alpha)\,\epsilon_{0}^{2}(1-2\epsilon_{0})^{2}}\cdot C_{\rm bm}\cdot\frac{\log p}{n}

for some constant Cbm>0C_{\rm bm}>0 and 0<α<10<\alpha<1.
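As a quick numerical reading of this condition (with purely illustrative choices of the constants and of n and p, not values prescribed by the paper), the implied lower bound on the nonzero Cholesky entries is of the order of the square root of log p over n:

```python
import numpy as np

alpha, eps0, C_bm = 0.5, 0.1, 3.0        # illustrative constants, not prescribed values
n, p = 500, 1000
lower_bound_sq = 16.0 / (alpha * (1 - alpha) * eps0 ** 2 * (1 - 2 * eps0) ** 2) \
    * C_bm * np.log(p) / n
print(np.sqrt(lower_bound_sq))           # minimal |a_{0,jl}| allowed by condition (A3)
```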

Lemma 7.4.

For given positive constants 0<α<10<\alpha<1, 0<ϵ0<1/20<\epsilon_{0}<1/2, Cbm>c2+2C_{\rm bm}>c_{2}+2 and an integer s0s_{0}, assume model (12) and the ESC prior with Condition (P). If s0logp=o(n)s_{0}\log p=o(n), then

supΩ0n𝒰p𝔼0[πα(AnA^nK1s0(s0+logpn)1/2|𝐗n)]\displaystyle\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{*}}\mathbb{E}_{0}\left[\pi_{\alpha}\left(\|A_{n}-\widehat{A}_{n}\|_{\infty}\geq K_{1}\sqrt{s_{0}}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)\right] =\displaystyle= o(1),\displaystyle o(1),
supΩ0n𝒰p𝔼0[πα(AnA^n1K1s0(s0+logpn)1/2|𝐗n)]\displaystyle\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{*}}\mathbb{E}_{0}\left[\pi_{\alpha}\left(\|A_{n}-\widehat{A}_{n}\|_{1}\geq K_{1}\sqrt{s_{0}}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)\right] =\displaystyle= o(1)\displaystyle o(1)

for some constant K1>0K_{1}>0.

Proof.

We closely follow the proof of Lemma 7.4 in Lee and Lee (2017). Let Ω0n𝒰p\Omega_{0n}\in\mathcal{U}_{p}^{*} and NS0,ϵ0N_{S_{0},\epsilon_{0}} be the set defined in Lemma 7.2. Note that

𝔼0[πα(AnA^nK1s0(s0+logpn)1/2|𝐗n)]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\left(\|A_{n}-\widehat{A}_{n}\|_{\infty}\geq K_{1}\sqrt{s_{0}}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)\right]
\displaystyle\leq 𝔼0[πα(AnA^nK1s0(s0+logpn)1/2|𝐗n)INS0,ϵ0c]+o(1)\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\left(\|A_{n}-\widehat{A}_{n}\|_{\infty}\geq K_{1}\sqrt{s_{0}}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)I_{N_{S_{0},\epsilon_{0}}^{c}}\right]+o(1)

by Lemma 7.2. Then, on 𝐗nNS0,ϵ0c{\bf X}_{n}\in N_{S_{0},\epsilon_{0}}^{c}, if s0logp=o(n)s_{0}\log p=o(n),

AnA^n\displaystyle\|A_{n}-\widehat{A}_{n}\|_{\infty} \displaystyle\leq maxjs0aS0ja^S0j2\displaystyle\max_{j}\sqrt{s_{0}}\big{\|}a_{S_{0j}}-\widehat{a}_{S_{0j}}\big{\|}_{2}
\displaystyle\lesssim maxjs0djnn(α+γ)djVar^1/2(ZS0j)(aS0ja^S0j)2\displaystyle\max_{j}\sqrt{\frac{s_{0}\,d_{j}}{n}}\cdot\bigg{\|}\sqrt{\frac{n(\alpha+\gamma)}{d_{j}}}\cdot\widehat{{\rm Var}}^{1/2}(Z_{S_{0j}})(a_{S_{0j}}-\widehat{a}_{S_{0j}})\bigg{\|}_{2}
=:\displaystyle=: maxjs0djnstd(aS0j)2.\displaystyle\max_{j}\sqrt{\frac{s_{0}\,d_{j}}{n}}\cdot\|std(a_{S_{0j}})\|_{2}.

Note that the first inequality follows from the strong model selection consistency in Theorem 3.1, so we can always concentrate on the set SAn=SA0nS_{A_{n}}=S_{A_{0n}}. Also note that

𝔼0[πα(maxjdjstd(aS0j)2K1(s0+logp)1/2|𝐗n)INS0,ϵ0c]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\left(\max_{j}\sqrt{d_{j}}\cdot\|std(a_{S_{0j}})\|_{2}\geq K_{1}^{\prime}\left(s_{0}+\log p\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)I_{N_{S_{0},\epsilon_{0}}^{c}}\right]
\displaystyle\leq 𝔼0[πα(maxj(1+2ϵ0)4ϵ01std(aS0j)2K1(s0+logp)1/2|𝐗n)INS0,ϵ0c]+o(1)\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\left(\max_{j}\sqrt{(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1}}\cdot\|std(a_{S_{0j}})\|_{2}\geq K_{1}^{\prime}\left(s_{0}+\log p\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)I_{N_{S_{0},\epsilon_{0}}^{c}}\right]+o(1)

for some constant K1>0K_{1}^{\prime}>0, by arguments similar to those used in the proof of Lemma 7.5. It only remains to show that

πα(maxj(1+2ϵ0)4ϵ01std(aS0j)2K1(s0+logp)1/2|𝐗n)\displaystyle\pi_{\alpha}\left(\max_{j}\sqrt{(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1}}\cdot\|std(a_{S_{0j}})\|_{2}\geq K_{1}^{\prime}\left(s_{0}+\log p\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right) =\displaystyle= o(1).\displaystyle o(1).

We can check that std(aS0j)22𝐗nindχs0j2\|std(a_{S_{0j}})\|_{2}^{2}\mid{\bf X}_{n}\overset{ind}{\sim}\chi^{2}_{s_{0j}}. By Lemma 1 in Laurent and Massart (2000), we have P(χk23(k+x))exp(x)P\big{(}\chi_{k}^{2}\geq 3(k+x)\big{)}\leq\exp(-x) for all x>0x>0, where χk2\chi_{k}^{2} is the chi-square random variable with degrees of freedom kk. Thus,

πα(maxjstd(aS0j)22K12(1+2ϵ0)4ϵ01(s0+logp)𝐗n)\displaystyle\pi_{\alpha}\left(\max_{j}\|std(a_{S_{0j}})\|_{2}^{2}\geq\frac{K_{1}^{\prime 2}}{(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1}}(s_{0}+\log p)\,\,\mid\,\,{\bf X}_{n}\right)
\displaystyle\leq pexp(K123(1+2ϵ0)4ϵ01(s0+logp)+s0),\displaystyle p\cdot\exp\left(-\frac{K_{1}^{\prime 2}}{3(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1}}(s_{0}+\log p)+s_{0}\right),

where the last display is of order o(1)o(1) by taking K12=6(1+2ϵ0)4ϵ01K_{1}^{\prime 2}=6(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1}.

Note that all rows of AnA_{n} are a posteriori independent, so each column of AnA_{n} has a multivariate normal posterior distribution with a diagonal covariance matrix. Because of the condition (A4), there are at most s0s_{0} nonzero elements in each column of A0nA_{0n}. Then, by similar arguments, we have

supΩ0n𝒰p𝔼0[πα(AnA^n1K1s0(s0+logpn)1/2|𝐗n)]\displaystyle\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{*}}\mathbb{E}_{0}\left[\pi_{\alpha}\left(\|A_{n}-\widehat{A}_{n}\|_{1}\geq K_{1}\sqrt{s_{0}}\left(\frac{s_{0}+\log p}{n}\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)\right] =\displaystyle= o(1).\displaystyle o(1).

Lemma 7.5.

Let 𝐗n{\bf X}_{n} be the random sample of size nn from Np(0,Σ0n)N_{p}(0,\Sigma_{0n}) with ϵ0λmin(Σ0n)λmax(Σ0n)ϵ01\epsilon_{0}\leq\lambda_{\min}(\Sigma_{0n})\leq\lambda_{\max}(\Sigma_{0n})\leq\epsilon_{0}^{-1} for some small constant 0<ϵ0<1/20<\epsilon_{0}<1/2. Consider the model (12) and the ESC prior with Condition (P). Let N1,R,ϵ0N_{1,R,\epsilon_{0}} and N2,R,ϵ0N_{2,R,\epsilon_{0}} be the sets defined at Lemma 7.1. Then, for a given constant 0<α<10<\alpha<1 and an integer 2jp2\leq j\leq p, we have

πα(M1djM2𝐗n)\displaystyle\pi_{\alpha}(M_{1}\leq d_{j}\leq M_{2}\mid{\bf X}_{n}) \displaystyle\geq 12enCα,ϵ0on 𝐗nN1,Rj,ϵ0cN2,Rj,ϵ0c,\displaystyle 1-2e^{-nC_{\alpha,\epsilon_{0}}}\quad\text{on \,\,${\bf X}_{n}\in N_{1,R_{j},\epsilon_{0}}^{c}\cap N_{2,R_{j},\epsilon_{0}}^{c}$},

for all sufficiently large nn and some constant Cα,ϵ0>0C_{\alpha,\epsilon_{0}}>0 depending only on α\alpha and ϵ0\epsilon_{0}, where M1(12ϵ0)4ϵ0M_{1}\leq(1-2\epsilon_{0})^{4}\epsilon_{0} and M2(1+2ϵ0)4ϵ01M_{2}\geq(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1}.

Proof.

Let NRj,ϵ0c=N1,Rj,ϵ0cN2,Rj,ϵ0cN_{R_{j},\epsilon_{0}}^{c}=N_{1,R_{j},\epsilon_{0}}^{c}\cap N_{2,R_{j},\epsilon_{0}}^{c}. Throughout the proof, we consider only the event 𝐗nNRj,ϵ0c{\bf X}_{n}\in N_{R_{j},\epsilon_{0}}^{c}. Since

minSj{1,,j1}:0<|Sj|Rjπα(M1djM2Sj,𝐗n)\displaystyle\min_{S_{j}\subseteq\{1,\ldots,j-1\}:\atop 0<|S_{j}|\leq R_{j}}\pi_{\alpha}(M_{1}\leq d_{j}\leq M_{2}\mid S_{j},{\bf X}_{n})
\displaystyle\leq Sj{1,,j1}:0<|Sj|Rjπα(M1djM2Sj,𝐗n)πα(Sj𝐗n)\displaystyle\sum_{S_{j}\subseteq\{1,\ldots,j-1\}:\atop 0<|S_{j}|\leq R_{j}}\pi_{\alpha}(M_{1}\leq d_{j}\leq M_{2}\mid S_{j},{\bf X}_{n})\pi_{\alpha}(S_{j}\mid{\bf X}_{n})
=\displaystyle= πα(M1djM2𝐗n),\displaystyle\pi_{\alpha}(M_{1}\leq d_{j}\leq M_{2}\mid{\bf X}_{n}),

it suffices to prove that

πα(M1djM2Sj,𝐗n)\displaystyle\pi_{\alpha}(M_{1}\leq d_{j}\leq M_{2}\mid S_{j},{\bf X}_{n}) \displaystyle\geq 12enCα,ϵ0\displaystyle 1-2e^{-nC_{\alpha,\epsilon_{0}}} (18)

for any jj and Sj{1,,j1}S_{j}\subseteq\{1,\ldots,j-1\} such that 0<|Sj|Rj0<|S_{j}|\leq R_{j}. Note that

d^Sj\displaystyle\widehat{d}_{S_{j}} \displaystyle\leq Var^(Xj)Cmaxϵ01,\displaystyle\widehat{{\rm Var}}(X_{j})\,\,\leq\,\,C_{\max}\epsilon_{0}^{-1},
d^Sj1\displaystyle\widehat{d}_{S_{j}}^{-1} \displaystyle\leq Var^1/2(ZSj{j})(a^Sj1)22\displaystyle\Big{\|}\widehat{{\rm Var}}^{1/2}(Z_{S_{j}\cup\{j\}})\cdot\binom{-\widehat{a}_{S_{j}}}{1}\Big{\|}_{2}^{-2}
\displaystyle\leq λmin(Var^(ZSj{j}))1Cmin1ϵ01,\displaystyle\lambda_{\min}(\widehat{{\rm Var}}(Z_{S_{j}\cup\{j\}}))^{-1}\,\,\leq\,\,C_{\min}^{-1}\epsilon_{0}^{-1},

where Cmax=(1+2ϵ0)2C_{\max}=(1+2\epsilon_{0})^{2} and Cmin=(12ϵ0)2C_{\min}=(1-2\epsilon_{0})^{2}. It is easy to check that dj1Sj,𝐗nGamma((αn+ν0)/2,αnd^Sj/2)d_{j}^{-1}\mid S_{j},{\bf X}_{n}\sim Gamma((\alpha n+\nu_{0})/2,\alpha n\,\widehat{d}_{S_{j}}/2), where Gamma(a,b)Gamma(a,b) is the gamma distribution with the shape parameter a>0a>0 and rate parameter b>0b>0. Note that

πα(dj<M1Sj,𝐗n)\displaystyle\pi_{\alpha}(d_{j}<M_{1}\mid S_{j},{\bf X}_{n})
=\displaystyle= πα(dj1αn+ν0αnd^Sj1>M11αn+ν0αnd^Sj1Sj,𝐗n)\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{j}}^{-1}>M_{1}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{j}}^{-1}\mid S_{j},{\bf X}_{n}\right)
\displaystyle\leq πα(dj1αn+ν0αnd^Sj1>M11αn+ν0αnCmin1ϵ01Sj,𝐗n)\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{j}}^{-1}>M_{1}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}C_{\min}^{-1}\epsilon_{0}^{-1}\mid S_{j},{\bf X}_{n}\right)

If YY follows a sub-gamma distribution with variance factor ν\nu and scale parameter cc, then

P(Y>2νt+ct)P(Y<2νtct)\displaystyle P\left(Y>\sqrt{2\nu t}+ct\right)\vee P\left(Y<-\sqrt{2\nu t}-ct\right) \displaystyle\leq et\displaystyle e^{-t} (19)

for all t>0t>0, by page 29 of Boucheron, Lugosi and Massart (2013); in particular, a centered Gamma(a,b)Gamma(a,b) random variable is sub-gamma with ν=a/b2\nu=a/b^{2} and c=1/bc=1/b. Thus, by (19) with t=αn(ϵ0/2)2t=\alpha n(\epsilon_{0}/2)^{2},

eαn(ϵ0/2)2\displaystyle e^{-\alpha n(\epsilon_{0}/2)^{2}} \displaystyle\geq πα(dj1αn+ν0αnd^Sj1>d^Sj1(ϵ0αn+ν0αn+ϵ022)|Sj,𝐗n)\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{j}}^{-1}>\widehat{d}_{S_{j}}^{-1}\Big{(}\epsilon_{0}\sqrt{\frac{\alpha n+\nu_{0}}{\alpha n}}+\frac{\epsilon_{0}^{2}}{2}\Big{)}\,\Big{|}\,S_{j},{\bf X}_{n}\right)
\displaystyle\geq πα(dj1αn+ν0αnd^Sj1>Cmin1(αn+ν0αn+ϵ02)|Sj,𝐗n).\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{j}}^{-1}>C_{\min}^{-1}\Big{(}\sqrt{\frac{\alpha n+\nu_{0}}{\alpha n}}+\frac{\epsilon_{0}}{2}\Big{)}\,\Big{|}\,S_{j},{\bf X}_{n}\right).

Note that for all sufficiently large nn and small ϵ0\epsilon_{0},

αn+ν0αnCmin1ϵ01+Cmin1(αn+ν0αn+ϵ02)\displaystyle\frac{\alpha n+\nu_{0}}{\alpha n}C_{\min}^{-1}\epsilon_{0}^{-1}+C_{\min}^{-1}\Big{(}\sqrt{\frac{\alpha n+\nu_{0}}{\alpha n}}+\frac{\epsilon_{0}}{2}\Big{)} \displaystyle\leq (12ϵ0)4ϵ01,\displaystyle(1-2\epsilon_{0})^{-4}\epsilon_{0}^{-1},

which implies

πα(dj<M1Sj,𝐗n)\displaystyle\pi_{\alpha}(d_{j}<M_{1}\mid S_{j},{\bf X}_{n}) \displaystyle\leq eαn(ϵ0/2)2\displaystyle e^{-\alpha n(\epsilon_{0}/2)^{2}} (20)

provided that M1(12ϵ0)4ϵ0M_{1}\leq(1-2\epsilon_{0})^{4}\epsilon_{0}. On the other hand, note that

πα(dj>M2Sj,𝐗n)\displaystyle\pi_{\alpha}(d_{j}>M_{2}\mid S_{j},{\bf X}_{n}) =\displaystyle= πα(dj1αn+ν0αnd^Sj1<M21αn+ν0αnd^Sj1Sj,𝐗n)\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{j}}^{-1}<M_{2}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{j}}^{-1}\mid S_{j},{\bf X}_{n}\right)

and by (19) with t=αn(ϵ0/2)2t=\alpha n(\epsilon_{0}/2)^{2},

eαn(ϵ0/2)2\displaystyle e^{-\alpha n(\epsilon_{0}/2)^{2}} \displaystyle\geq πα(dj1αn+ν0αnd^Sj1<d^Sj1(ϵ0αn+ν0αn+ϵ022)|Sj,𝐗n).\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{j}}^{-1}<-\widehat{d}_{S_{j}}^{-1}\Big{(}\epsilon_{0}\sqrt{\frac{\alpha n+\nu_{0}}{\alpha n}}+\frac{\epsilon_{0}^{2}}{2}\Big{)}\,\Big{|}\,S_{j},{\bf X}_{n}\right).

Similarly, they imply

πα(dj>M2Sj,𝐗n)\displaystyle\pi_{\alpha}(d_{j}>M_{2}\mid S_{j},{\bf X}_{n}) \displaystyle\leq eαn(ϵ0/2)2\displaystyle e^{-\alpha n(\epsilon_{0}/2)^{2}} (21)

for all sufficiently large nn and small ϵ0\epsilon_{0}, provided that M2(1+2ϵ0)4ϵ01M_{2}\geq(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1}.

Hence, (20) and (21) imply the desired result (18) with Cα,ϵ0=α(ϵ0/2)2C_{\alpha,\epsilon_{0}}=\alpha(\epsilon_{0}/2)^{2}. ∎
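The mechanism behind Lemma 7.5 can also be seen numerically: given S_j, the posterior of d_j^{-1} is a gamma distribution concentrating around the inverse of the estimate d̂_{S_j}, so the posterior mass of the interval [M_1, M_2] is essentially one. The sketch below (illustrative values for α, ν_0, n and d̂_{S_j}; a sanity check, not part of the proof) evaluates this mass with the gamma distribution function.

```python
import numpy as np
from scipy import stats

alpha, nu0, n = 0.5, 2.0, 1000           # illustrative values only
d_hat, eps0 = 0.8, 0.1                   # d_hat stands in for \hat{d}_{S_j}
M1 = (1 - 2 * eps0) ** 4 * eps0
M2 = (1 + 2 * eps0) ** 4 / eps0

# d_j^{-1} | S_j, X_n ~ Gamma(shape=(alpha*n + nu_0)/2, rate=alpha*n*d_hat/2),
# so {M1 <= d_j <= M2} = {1/M2 <= d_j^{-1} <= 1/M1}.
shape, rate = (alpha * n + nu0) / 2, alpha * n * d_hat / 2
post = stats.gamma(a=shape, scale=1.0 / rate)
print(post.cdf(1.0 / M1) - post.cdf(1.0 / M2))   # posterior probability of [M1, M2], close to 1
```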

Lemma 7.6.

For given positive constants 0<α<10<\alpha<1, 0<ϵ0<1/20<\epsilon_{0}<1/2, Cbm>c2+2C_{\rm bm}>c_{2}+2 and an integer s0s_{0}, assume model (12) and the ESC prior with Condition (P) and ν02=O(nlogp)\nu_{0}^{2}=O(n\log p). If s0logp=o(n)s_{0}\log p=o(n), then

supΩ0n𝒰p𝔼0[πα(Dn1D^n1K2(logpn)1/2|𝐗n)]\displaystyle\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{*}}\mathbb{E}_{0}\left[\pi_{\alpha}\left(\|D_{n}^{-1}-\widehat{D}_{n}^{-1}\|\geq K_{2}\left(\frac{\log p}{n}\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)\right] =\displaystyle= o(1)\displaystyle o(1)

for some constant K2>0K_{2}>0.

Proof.

Let Ω0n𝒰p\Omega_{0n}\in\mathcal{U}_{p}^{*} and NS0,ϵ0N_{S_{0},\epsilon_{0}} be the set defined at Lemma 7.2. Similar to the proof of Lemma 7.4, if s0logp=o(n)s_{0}\log p=o(n), we can always concentrate on djSj=S0j,𝐗nindIG((αn+ν0)/2,αnd^S0j/2)d_{j}\mid S_{j}=S_{0j},{\bf X}_{n}\overset{ind}{\sim}IG((\alpha n+\nu_{0})/2,\alpha n\widehat{d}_{S_{0j}}/2) for all j=1,,pj=1,\ldots,p. Then, αnd^S0jdj1Sj=S0j,𝐗nindχαn+ν02\alpha n\,\widehat{d}_{S_{0j}}d_{j}^{-1}\mid S_{j}=S_{0j},{\bf X}_{n}\overset{ind}{\sim}\chi^{2}_{\alpha n+\nu_{0}} for all j=1,,pj=1,\ldots,p. By Lemma 1 in Laurent and Massart (2000), P(χk2k2kx+2x)exp(x)P(\chi_{k}^{2}-k\geq 2\sqrt{kx}+2x)\leq\exp(-x) and P(kχk22kx)exp(x)P(k-\chi_{k}^{2}\geq 2\sqrt{kx})\leq\exp(-x) for all x>0x>0. Thus,

exp(x)\displaystyle\exp(-x)
\displaystyle\geq πα(αnd^S0jdj1(αn+ν0)2(αn+ν0)x+2xSj=S0j,𝐗n)\displaystyle\pi_{\alpha}\left(\alpha n\widehat{d}_{S_{0j}}d_{j}^{-1}-(\alpha n+\nu_{0})\geq 2\sqrt{(\alpha n+\nu_{0})x}+2x\mid S_{j}=S_{0j},{\bf X}_{n}\right)
=\displaystyle= πα(dj1αn+ν0αnd^S0j12αnd^S0j1(αn+ν0)x+2xαnd^S0j1Sj=S0j,𝐗n)\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\frac{\alpha n+\nu_{0}}{\alpha n}\widehat{d}_{S_{0j}}^{-1}\geq\frac{2}{\alpha n}\widehat{d}_{S_{0j}}^{-1}\sqrt{(\alpha n+\nu_{0})x}+\frac{2x}{\alpha n}\widehat{d}_{S_{0j}}^{-1}\mid S_{j}=S_{0j},{\bf X}_{n}\right)
=\displaystyle= πα(dj1d^S0j1d^S0j1αn[2(αn+ν0)x+2x+ν0]Sj=S0j,𝐗n)\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\widehat{d}_{S_{0j}}^{-1}\geq\frac{\widehat{d}_{S_{0j}}^{-1}}{\alpha n}\Big{[}2\sqrt{(\alpha n+\nu_{0})x}+2x+\nu_{0}\Big{]}\mid S_{j}=S_{0j},{\bf X}_{n}\right)
\displaystyle\geq πα(dj1d^S0j1(12ϵ0)2ϵ0αn[2(αn+ν0)x+2x+ν0]Sj=S0j,𝐗n)INS0,ϵ0c.\displaystyle\pi_{\alpha}\left(d_{j}^{-1}-\widehat{d}_{S_{0j}}^{-1}\geq\frac{(1-2\epsilon_{0})^{-2}}{\epsilon_{0}\alpha n}\Big{[}2\sqrt{(\alpha n+\nu_{0})x}+2x+\nu_{0}\Big{]}\mid S_{j}=S_{0j},{\bf X}_{n}\right)I_{N_{S_{0},\epsilon_{0}}^{c}}.

Similarly, also note that

exp(x)\displaystyle\exp(-x) \displaystyle\geq πα(αnd^S0jdj1+(αn+ν0)2(αn+ν0)xSj=S0j,𝐗n)\displaystyle\pi_{\alpha}\left(-\alpha n\widehat{d}_{S_{0j}}d_{j}^{-1}+(\alpha n+\nu_{0})\geq 2\sqrt{(\alpha n+\nu_{0})x}\mid S_{j}=S_{0j},{\bf X}_{n}\right)
\displaystyle\geq πα(dj1+d^S0j1d^S0j1αn[2(αn+ν0)xν0]Sj=S0j,𝐗n)\displaystyle\pi_{\alpha}\left(-d_{j}^{-1}+\widehat{d}_{S_{0j}}^{-1}\geq\frac{\widehat{d}_{S_{0j}}^{-1}}{\alpha n}\Big{[}2\sqrt{(\alpha n+\nu_{0})x}-\nu_{0}\Big{]}\mid S_{j}=S_{0j},{\bf X}_{n}\right)
\displaystyle\geq πα(dj1+d^S0j1(12ϵ0)2ϵ0αn2(αn+ν0)xSj=S0j,𝐗n).\displaystyle\pi_{\alpha}\left(-d_{j}^{-1}+\widehat{d}_{S_{0j}}^{-1}\geq\frac{(1-2\epsilon_{0})^{-2}}{\epsilon_{0}\alpha n}\cdot 2\sqrt{(\alpha n+\nu_{0})x}\mid S_{j}=S_{0j},{\bf X}_{n}\right).

Let x=Clogpx=C\log p with some constant C>1C>1, then

(12ϵ0)2ϵ0αn[2(αn+ν0)x+2x+ν0]\displaystyle\frac{(1-2\epsilon_{0})^{-2}}{\epsilon_{0}\alpha n}\Big{[}2\sqrt{(\alpha n+\nu_{0})x}+2x+\nu_{0}\Big{]} \displaystyle\leq K2(logpn)1/2\displaystyle K_{2}\left(\frac{\log p}{n}\right)^{1/2}

for all sufficiently large nn and some large constant K2>0K_{2}>0. Thus, we have

𝔼0[πα(Dn1D^n1K2(logpn)1/2|𝐗n)]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\left(\|D_{n}^{-1}-\widehat{D}_{n}^{-1}\|\geq K_{2}\left(\frac{\log p}{n}\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)\right]
\displaystyle\leq 𝔼0[πα(maxj|dj1d^S0j1|K2(logpn)1/2|𝐗n)INS0,ϵ0c]+0(NS0,ϵ0)\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\left(\max_{j}|d_{j}^{-1}-\widehat{d}_{S_{0j}}^{-1}|\geq K_{2}\left(\frac{\log p}{n}\right)^{1/2}\,\,\Big{|}\,\,{\bf X}_{n}\right)I_{N_{S_{0},\epsilon_{0}}^{c}}\right]+\mathbb{P}_{0}(N_{S_{0},\epsilon_{0}})
\displaystyle\leq 2pexp(Clogp)+o(1)=o(1).\displaystyle 2p\cdot\exp(-C\log p)+o(1)\,\,=\,\,o(1).

Proof of Theorem 3.6.

Let Ω0n𝒰p\Omega_{0n}\in\mathcal{U}_{p}^{*}, ϵn=s03/4(s0+logp)/n\epsilon_{n}=s_{0}^{3/4}\sqrt{(s_{0}+\log p)/n} and assume s03/2(s0+logp)=o(n)s_{0}^{3/2}(s_{0}+\log p)=o(n). Consider the spectral norm case first. Then,

𝔼0[πα(ΩnΩ0nKconvϵn𝐗n)]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\big{(}\|\Omega_{n}-\Omega_{0n}\|\geq K_{\rm conv}\epsilon_{n}\mid{\bf X}_{n}\big{)}\right]
\displaystyle\leq 𝔼0[πα(ΩnΩ^nKconv2ϵn𝐗n)]+0(Ω^nΩ0nKconv2ϵn)\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\|\Omega_{n}-\widehat{\Omega}_{n}\|\geq\frac{K_{\rm conv}}{2}\epsilon_{n}\mid{\bf X}_{n}\Big{)}\right]+\mathbb{P}_{0}\Big{(}\|\widehat{\Omega}_{n}-\Omega_{0n}\|\geq\frac{K_{\rm conv}}{2}\epsilon_{n}\Big{)}
=\displaystyle= 𝔼0[πα(ΩnΩ^nKconv2ϵn𝐗n)]+o(1)\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\|\Omega_{n}-\widehat{\Omega}_{n}\|\geq\frac{K_{\rm conv}}{2}\epsilon_{n}\mid{\bf X}_{n}\Big{)}\right]+o(1)

for some large constant Kconv>0K_{\rm conv}>0 by Lemma 7.2 and Lemma 7.3, so it suffices to prove

𝔼0[πα(ΩnΩ^nKconv2ϵn𝐗n)]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\|\Omega_{n}-\widehat{\Omega}_{n}\|\geq\frac{K_{\rm conv}}{2}\epsilon_{n}\mid{\bf X}_{n}\Big{)}\right] =\displaystyle= o(1).\displaystyle o(1).

By applying (15), Lemma 7.4 and Lemma 7.6, it is easy to prove the above result for some large constant Kconv>0K_{\rm conv}>0.

Let ϵn=IpA0ns0(s0+logp)/n\epsilon_{n}^{*}=\|I_{p}-A_{0n}\|_{\infty}s_{0}\sqrt{(s_{0}+\log p)/n} and assume s0(s0+logp)=o(n)s_{0}(s_{0}+\log p)=o(n). Note that

𝔼0[πα(ΩnΩ0nKconvϵn𝐗n)]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\big{(}\|\Omega_{n}-\Omega_{0n}\|_{\infty}\geq K_{\rm conv}\epsilon_{n}^{*}\mid{\bf X}_{n}\big{)}\right]
\displaystyle\leq 𝔼0[πα(ΩnΩ^nKconv2ϵn𝐗n)]+0(Ω^nΩ0nKconv2ϵn)\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\|\Omega_{n}-\widehat{\Omega}_{n}\|_{\infty}\geq\frac{K_{\rm conv}}{2}\epsilon_{n}^{*}\mid{\bf X}_{n}\Big{)}\right]+\mathbb{P}_{0}\Big{(}\|\widehat{\Omega}_{n}-\Omega_{0n}\|_{\infty}\geq\frac{K_{\rm conv}}{2}\epsilon_{n}^{*}\Big{)}
=\displaystyle= 𝔼0[πα(ΩnΩ^nKconv2ϵn𝐗n)]+o(1)\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\|\Omega_{n}-\widehat{\Omega}_{n}\|_{\infty}\geq\frac{K_{\rm conv}}{2}\epsilon_{n}^{*}\mid{\bf X}_{n}\Big{)}\right]+o(1)

for some large constant Kconv>0K_{\rm conv}>0 by Lemma 7.2 and Lemma 7.3. Again, the last display is of order o(1)o(1) by (17), Lemma 7.4 and Lemma 7.6. ∎

8 Proofs of Posterior Convergence Rates for Cholesky Factors

Proof of Theorem 3.2.

Consider Ω0n𝒰p\Omega_{0n}\in\mathcal{U}_{p} and assume s0logp=o(n)s_{0}\log p=o(n). Then,

𝔼0[πα(AnA0nKchols0(s0+logpn)1/2𝐗n)]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\|A_{n}-A_{0n}\|_{\infty}\geq K_{\rm chol}\sqrt{s_{0}}\Big{(}\frac{s_{0}+\log p}{n}\Big{)}^{1/2}\mid{\bf X}_{n}\Big{)}\right] (22)
\displaystyle\leq 𝔼0[πα(AnA^nKchol2s0(s0+logpn)1/2𝐗n)]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\|A_{n}-\widehat{A}_{n}\|_{\infty}\geq\frac{K_{\rm chol}}{2}\sqrt{s_{0}}\Big{(}\frac{s_{0}+\log p}{n}\Big{)}^{1/2}\mid{\bf X}_{n}\Big{)}\right]
+\displaystyle+ 0(A^nA0nKchol2s0(s0+logpn)1/2).\displaystyle\mathbb{P}_{0}\Big{(}\|\widehat{A}_{n}-A_{0n}\|_{\infty}\geq\frac{K_{\rm chol}}{2}\sqrt{s_{0}}\Big{(}\frac{s_{0}+\log p}{n}\Big{)}^{1/2}\Big{)}. (23)

Note that (22) is of order o(1)o(1) for some constant Kchol>0K_{\rm chol}>0 by Lemma 7.4. On the other hand, (23) is also of order o(1)o(1) for some constant Kchol>0K_{\rm chol}>0 by Lemma 7.2 and (16). Note that AnA0nF2=j=2paja0j22\|A_{n}-A_{0n}\|_{F}^{2}=\sum_{j=2}^{p}\|a_{j}-a_{0j}\|_{2}^{2} and

𝔼0[πα(j=2paja0j22Kcholj=2p(s0j+logjn)𝐗n)]\displaystyle\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\sum_{j=2}^{p}\|a_{j}-a_{0j}\|_{2}^{2}\geq K_{\rm chol}\sum_{j=2}^{p}\Big{(}\frac{s_{0j}+\log j}{n}\Big{)}\mid{\bf X}_{n}\Big{)}\right]
\displaystyle\leq j=2p𝔼0[πα(aja^j22Kchol2(s0j+logjn)𝐗n)]\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\left[\pi_{\alpha}\Big{(}\|a_{j}-\widehat{a}_{j}\|_{2}^{2}\geq\frac{K_{\rm chol}}{2}\Big{(}\frac{s_{0j}+\log j}{n}\Big{)}\mid{\bf X}_{n}\Big{)}\right]
+\displaystyle+ j=2p0(a^ja0j22Kchol2(s0j+logjn)),\displaystyle\sum_{j=2}^{p}\mathbb{P}_{0}\Big{(}\|\widehat{a}_{j}-a_{0j}\|_{2}^{2}\geq\frac{K_{\rm chol}}{2}\Big{(}\frac{s_{0j}+\log j}{n}\Big{)}\Big{)},

where the last two displays are of order o(1)o(1) for some constant Kchol>0K_{\rm chol}>0. This can be checked by slight modifications of Lemma 7.2, Lemma 7.4 and the proof of Lemma 7.3, using s0js_{0j} and logj\log j in place of s0s_{0} and logp\log p. ∎

Lemma 8.1.

For a given constant 0<ϵ0<1/20<\epsilon_{0}<1/2 and an integer s0s_{0}, assume model (12) and the MESC prior with Condition (P) and ν0=O(1)\nu_{0}=O(1). Let Bn={𝐗n:|X~j22𝐗S0ja^S0j22nd0j|[j2logn]1}B_{n}=\Big{\{}{\bf X}_{n}:\big{|}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}-n\,d_{0j}\big{|}\geq[j^{2}\log n]^{-1}\Big{\}}. For given constants 0<α<10<\alpha<1, M1(12ϵ0)4ϵ0M_{1}\leq(1-2\epsilon_{0})^{4}\epsilon_{0}, M2(1+2ϵ0)4ϵ01M_{2}\geq(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1} and 2jp2\leq j\leq p,

Dnj\displaystyle D_{nj} :=\displaystyle:= Sj:0<|Sj|RjM1M2Rnj(aj,dj)απ(aSjdj,Sj)πj(Sj)π(dj)𝑑aSjδ0(daj,Sjc)𝑑dj\displaystyle\sum_{S_{j}:0<|S_{j}|\leq R_{j}}\int_{M_{1}}^{M_{2}}\int R_{nj}(a_{j},d_{j})^{\alpha}\pi(a_{S_{j}}\mid d_{j},S_{j})\pi_{j}(S_{j})\pi(d_{j})da_{S_{j}}\delta_{0}(da_{j,S_{j}^{c}})dd_{j}
\displaystyle\geq πj(S0j)(1+αγ)s0j2Cdennj2logn on the event Bn,\displaystyle\pi_{j}(S_{0j})\cdot\left(1+\frac{\alpha}{\gamma}\right)^{-\frac{s_{0j}}{2}}\frac{C_{\rm den}}{nj^{2}\log n}\quad\quad\text{ on the event }B_{n},

for some constant Cden=Cden(ϵ0,ν0,ν0)>0C_{\rm den}=C_{\rm den}(\epsilon_{0},\nu_{0},\nu_{0}^{\prime})>0 for any Ω0n𝒰p0\Omega_{0n}\in\mathcal{U}_{p}^{0}.

Proof.

Note that

Dnj\displaystyle D_{nj}
\displaystyle\geq πj(S0j)M1M2Rnj(aj,dj)απ(aS0jdj,S0j)π(dj)𝑑aS0jδ0(daj,S0jc)𝑑dj\displaystyle\pi_{j}(S_{0j})\int_{M_{1}}^{M_{2}}\int R_{nj}(a_{j},d_{j})^{\alpha}\pi(a_{S_{0j}}\mid d_{j},S_{0j})\pi(d_{j})da_{S_{0j}}\delta_{0}(da_{j,S_{0j}^{c}})dd_{j}
=\displaystyle= πj(S0j)M1M2(d0jdj)αn2eα2{dj1X~j𝐗S0jaS0j22d0j1X~j𝐗S0ja0,S0j22}\displaystyle\pi_{j}(S_{0j})\int_{M_{1}}^{M_{2}}\int\Big{(}\frac{d_{0j}}{d_{j}}\Big{)}^{\frac{\alpha n}{2}}e^{-\frac{\alpha}{2}\big{\{}d_{j}^{-1}\|\tilde{X}_{j}-{\bf X}_{S_{0j}}a_{S_{0j}}\|_{2}^{2}-d_{0j}^{-1}\|\tilde{X}_{j}-{\bf X}_{S_{0j}}a_{0,S_{0j}}\|_{2}^{2}\big{\}}}
×det[2πdjγ(𝐗S0jT𝐗S0j)1]12e12(aS0ja^S0j)Tγdj𝐗S0jT𝐗S0j(aS0ja^S0j)π(dj)daS0jddj\displaystyle\times\,\,\det\Big{[}2\pi\frac{d_{j}}{\gamma}({\bf X}_{S_{0j}}^{T}{\bf X}_{S_{0j}})^{-1}\Big{]}^{-\frac{1}{2}}e^{-\frac{1}{2}(a_{S_{0j}}-\widehat{a}_{S_{0j}})^{T}\frac{\gamma}{d_{j}}{\bf X}_{S_{0j}}^{T}{\bf X}_{S_{0j}}(a_{S_{0j}}-\widehat{a}_{S_{0j}})}\pi(d_{j})da_{S_{0j}}dd_{j}
=\displaystyle= πj(S0j)(1+αγ)s0j2eα2d0j𝐗S0j(a^S0ja0,S0j)22\displaystyle\pi_{j}(S_{0j})\left(1+\frac{\alpha}{\gamma}\right)^{-\frac{s_{0j}}{2}}e^{\frac{\alpha}{2d_{0j}}\|{\bf X}_{S_{0j}}(\widehat{a}_{S_{0j}}-a_{0,S_{0j}})\|_{2}^{2}}
×M1M2(d0jdj)αn2eα2(dj1d0j1){X~j22𝐗S0ja^S0j22}π(dj)ddj\displaystyle\times\,\,\int_{M_{1}}^{M_{2}}\Big{(}\frac{d_{0j}}{d_{j}}\Big{)}^{\frac{\alpha n}{2}}e^{-\frac{\alpha}{2}(d_{j}^{-1}-d_{0j}^{-1})\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}}\pi(d_{j})dd_{j}
\displaystyle\geq πj(S0j)(1+αγ)s0j2M1M2(d0jdj)αn2eα2(dj1d0j1){X~j22𝐗S0ja^S0j22}π(dj)𝑑dj.\displaystyle\pi_{j}(S_{0j})\left(1+\frac{\alpha}{\gamma}\right)^{-\frac{s_{0j}}{2}}\int_{M_{1}}^{M_{2}}\Big{(}\frac{d_{0j}}{d_{j}}\Big{)}^{\frac{\alpha n}{2}}e^{-\frac{\alpha}{2}(d_{j}^{-1}-d_{0j}^{-1})\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}}\pi(d_{j})dd_{j}.

The second equality follows by integrating out the multivariate normal density. Let IG(xa,b)IG(x\mid a,b) denote the density function of IG(a,b)IG(a,b) at xx; then

M1M2(d0jdj)αn2eα2(dj1d0j1){X~j22𝐗S0ja^S0j22}π(dj)𝑑dj\displaystyle\int_{M_{1}}^{M_{2}}\Big{(}\frac{d_{0j}}{d_{j}}\Big{)}^{\frac{\alpha n}{2}}e^{-\frac{\alpha}{2}(d_{j}^{-1}-d_{0j}^{-1})\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}}\pi(d_{j})dd_{j} (24)
=\displaystyle= M1M2IG(djαn/2+1,α/2{X~j22𝐗S0ja^S0j22})IG(d0jαn/2+1,α/2{X~j22𝐗S0ja^S0j22})π(dj)𝑑dj.\displaystyle\int_{M_{1}}^{M_{2}}\frac{IG\big{(}d_{j}\mid\alpha n/2+1,\alpha/2\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}\big{)}}{IG\big{(}d_{0j}\mid\alpha n/2+1,\alpha/2\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}\big{)}}\cdot\pi(d_{j})dd_{j}.

Note that the ratio of the inverse-gamma density functions is larger than 1 if djd_{j} is between d0jd_{0j} and {X~j22𝐗S0ja^S0j22}/n\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}/n. Since we focus on the event {𝐗n:|X~j22𝐗S0ja^S0j22nd0j|[j2logn]1}\Big{\{}{\bf X}_{n}:\big{|}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}-n\,d_{0j}\big{|}\geq[j^{2}\log n]^{-1}\Big{\}}, if d0j{X~j22𝐗S0ja^S0j22}/nd_{0j}\geq\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}/n, (24) is bounded below by

M1{X~j22𝐗S0ja^S0j22}/nd0jπ(dj)𝑑dj\displaystyle\int_{M_{1}\vee\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}/n}^{d_{0j}}\pi(d_{j})dd_{j} \displaystyle\geq Cdennj2logn\displaystyle\frac{C_{\rm den}}{nj^{2}\log n}

for some constant Cden=Cden(ϵ0,ν0,ν0)>0C_{\rm den}=C_{\rm den}(\epsilon_{0},\nu_{0},\nu_{0}^{\prime})>0, because we assume M1(12ϵ0)4ϵ0M_{1}\leq(1-2\epsilon_{0})^{4}\epsilon_{0}. Similarly, if d0j{X~j22𝐗S0ja^S0j22}/nd_{0j}\leq\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}/n, (24) is bounded below by Cden/(nj2logn)C_{\rm den}/(nj^{2}\log n) for some constant Cden>0C_{\rm den}>0. ∎
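The density-ratio step in the proof above can be spot-checked numerically: the inverse-gamma density appearing in (24) is unimodal with mode close to the residual mean square, so its value at any d_j between d_{0j} and that quantity is at least its value at d_{0j}. The Python sketch below uses illustrative numbers only and is not part of the proof.

```python
import numpy as np
from scipy import stats

alpha, n = 0.5, 1000                     # illustrative values only
d0 = 1.0                                 # stands in for d_{0j}
rms = 1.1                                # stands in for (||X_j||^2 - ||X_{S_0j} a_hat||^2) / n

a = alpha * n / 2 + 1                    # shape of the inverse-gamma in (24)
b = alpha / 2 * (rms * n)                # scale parameter of that inverse-gamma
dens = stats.invgamma(a, scale=b)

grid = np.linspace(d0, rms, 200)         # d_j between d_{0j} and the residual mean square
print(np.all(dens.pdf(grid) >= dens.pdf(d0)))   # the density ratio stays at or above 1
```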

Lemma 8.2.

For a given constant 0<ϵ0<1/20<\epsilon_{0}<1/2 and an integer s0s_{0}, assume model (12) and the MESC prior with Condition (P) and ν0=O(1)\nu_{0}=O(1). If s0logp=o(n)s_{0}\log p=o(n) and 0<α<10<\alpha<1, then

j=2p𝔼0πα(SjCdims0𝐗n)\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\big{(}S_{j}\geq C_{\rm dim}s_{0}\mid{\bf X}_{n}\big{)} =\displaystyle= o(1)\displaystyle o(1)

for some constant Cdim>0C_{\rm dim}>0 and for any Ω0n𝒰p0\Omega_{0n}\in\mathcal{U}_{p}^{0}.

Proof.

Let Bn={𝐗n:|X~j22𝐗S0ja^S0j22nd0j|[j2logn]1}B_{n}=\Big{\{}{\bf X}_{n}:\big{|}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}-n\,d_{0j}\big{|}\geq[j^{2}\log n]^{-1}\Big{\}} as defined at Lemma 8.1 and

Nnj(SjCdims0)\displaystyle N_{nj}(S_{j}\geq C_{\rm dim}s_{0})
=\displaystyle= Sj:|Sj|Cdims0M1M2Rnj(aj,dj)απ(aSjdj,Sj)πj(Sj)π(dj)𝑑aSjδ0(daj,Sjc)𝑑dj.\displaystyle\sum_{S_{j}:|S_{j}|\geq C_{\rm dim}s_{0}}\int_{M_{1}}^{M_{2}}\int R_{nj}(a_{j},d_{j})^{\alpha}\pi(a_{S_{j}}\mid d_{j},S_{j})\pi_{j}(S_{j})\pi(d_{j})da_{S_{j}}\delta_{0}(da_{j,S_{j}^{c}})dd_{j}.

By Lemma 7.5, we have

j=2p𝔼0πα(SjCdims0𝐗n)\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\big{(}S_{j}\geq C_{\rm dim}s_{0}\mid{\bf X}_{n}\big{)}
\displaystyle\leq j=2p𝔼0πα(SjCdims0,M1djM2𝐗n)+o(1)\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\big{(}S_{j}\geq C_{\rm dim}s_{0},\,\,M_{1}\leq d_{j}\leq M_{2}\mid{\bf X}_{n}\big{)}\,\,+\,\,o(1)

for some constants M1(12ϵ0)4ϵ0M_{1}\leq(1-2\epsilon_{0})^{4}\epsilon_{0}, M2(1+2ϵ0)4ϵ01M_{2}\geq(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1} and Cα,ϵ0>0C_{\alpha,\epsilon_{0}}>0 because we assume s0logp=o(n)s_{0}\log p=o(n). In fact, Lemma 7.5 assumes the ESC prior, but it is easy to show that it also holds for the MESC prior for some constant ν0>0\nu_{0}^{\prime}>0. For any 2jp2\leq j\leq p, we have

𝔼0πα(SjCdims0,M1djM2𝐗n)\displaystyle\mathbb{E}_{0}\pi_{\alpha}\big{(}S_{j}\geq C_{\rm dim}s_{0},\,\,M_{1}\leq d_{j}\leq M_{2}\mid{\bf X}_{n}\big{)}
\displaystyle\leq 𝔼0[Nnj(SjCdims0)]eC1s0jπj(S0j)C2nj2logn\displaystyle\mathbb{E}_{0}\big{[}N_{nj}(S_{j}\geq C_{\rm dim}s_{0})\big{]}\frac{e^{C_{1}s_{0j}}}{\pi_{j}(S_{0j})}C_{2}nj^{2}\log n
+0(|X~j22𝐗S0ja^S0j22nd0j|1j2logn)\displaystyle+\,\,\mathbb{P}_{0}\Big{(}\big{|}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}-n\,d_{0j}\big{|}\leq\frac{1}{j^{2}\log n}\Big{)}

for some positive constants C1C_{1} and C2C_{2} by Lemma 8.1. Since d0j1{X~j22𝐗S0ja^S0j22}χns0j2d_{0j}^{-1}\big{\{}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}\big{\}}\sim\chi_{n-s_{0j}}^{2} under 0\mathbb{P}_{0} given Z~j\tilde{Z}_{j},

j=2p0(|X~j22𝐗S0ja^S0j22nd0j|1j2logn)\displaystyle\sum_{j=2}^{p}\mathbb{P}_{0}\Big{(}\big{|}\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}-n\,d_{0j}\big{|}\leq\frac{1}{j^{2}\log n}\Big{)} \displaystyle\leq j=2p1d0jj2logn=o(1),\displaystyle\sum_{j=2}^{p}\frac{1}{d_{0j}\,j^{2}\log n}\,\,=\,\,o(1),

so we only need to focus on the first term in the last display. Note that

𝔼0[Nnj(SjCdims0)|Z~j]\displaystyle\mathbb{E}_{0}\big{[}N_{nj}(S_{j}\geq C_{\rm dim}s_{0})\,\big{|}\,\tilde{Z}_{j}\big{]}
=\displaystyle= M1M2Sj:|Sj|Cdims0πj(Sj)𝔼0[Rnj(aj,dj)απ(aSjdj,Sj)|Z~j]π(dj)daSjδ0(daj,Sjc)ddj.\displaystyle\int_{M_{1}}^{M_{2}}\int\sum_{S_{j}:|S_{j}|\geq C_{\rm dim}s_{0}}\pi_{j}(S_{j})\cdot\mathbb{E}_{0}\left[R_{nj}(a_{j},d_{j})^{\alpha}\pi(a_{S_{j}}\mid d_{j},S_{j})\,\big{|}\,\tilde{Z}_{j}\right]\pi(d_{j})da_{S_{j}}\delta_{0}(da_{j,S_{j}^{c}})dd_{j}.

Let aSj+a_{S_{j}+} be a pp-dimensional vector such that (aSj+)Sj=aSj(a_{S_{j}+})_{S_{j}}=a_{S_{j}} and (aSj+)l=0(a_{S_{j}+})_{l}=0 for all lSjl\notin S_{j}. It is easy to see that

𝔼0[Rnj(aSj+,dj)απ(aSjdj,Sj)|Z~j]\displaystyle\mathbb{E}_{0}\left[R_{nj}(a_{S_{j}+},d_{j})^{\alpha}\pi(a_{S_{j}}\mid d_{j},S_{j})\,\big{|}\,\tilde{Z}_{j}\right]
\displaystyle\leq {𝔼0[Rnj(aSj+,dj)h1α|Z~j]}1h1×{𝔼0[π(aSjdj,Sj)h2|Z~j]}1h2\displaystyle\left\{\mathbb{E}_{0}\big{[}R_{nj}(a_{S_{j}+},d_{j})^{h_{1}\alpha}\,\big{|}\,\tilde{Z}_{j}\big{]}\right\}^{\frac{1}{h_{1}}}\times\left\{\mathbb{E}_{0}\big{[}\pi(a_{S_{j}}\mid d_{j},S_{j})^{h_{2}}\,\big{|}\,\tilde{Z}_{j}\big{]}\right\}^{\frac{1}{h_{2}}}
\displaystyle\leq exp(h1α(1h1α)2(h1αd0j+(1h1α)dj)Z~j(aSj+a0)22)\displaystyle\exp\left(-\frac{h_{1}\alpha(1-h_{1}\alpha)}{2(h_{1}\alpha d_{0j}+(1-h_{1}\alpha)d_{j})}\|\tilde{Z}_{j}(a_{S_{j}+}-a_{0})\|_{2}^{2}\right)
×exp(n2h1log[h1αd0j+(1h1α)djd0jh1αdj1h1α])×{𝔼0[π(aSjdj,Sj)h2]}1h2\displaystyle\times\,\,\exp\left(-\frac{n}{2h_{1}}\log\Big{[}\frac{h_{1}\alpha d_{0j}+(1-h_{1}\alpha)d_{j}}{d_{0j}^{h_{1}\alpha}d_{j}^{1-h_{1}\alpha}}\Big{]}\right)\times\left\{\mathbb{E}_{0}\big{[}\pi(a_{S_{j}}\mid d_{j},S_{j})^{h_{2}}\big{]}\right\}^{\frac{1}{h_{2}}}

for any constants h1,h2>1h_{1},h_{2}>1 such that h11+h21=1h_{1}^{-1}+h_{2}^{-1}=1 and h1α<1h_{1}\alpha<1. The first inequality follows from Hölder’s inequality, and the second inequality follows from the closed form of the Rényi divergence between two multivariate normal distributions (Hero et al. (2001)). Note that the first term in the last display is bounded above by

exp(h1α(1h1α)2(d0j+M2)Z~j(aSj+a0)22)\displaystyle\exp\left(-\frac{h_{1}\alpha(1-h_{1}\alpha)}{2(d_{0j}+M_{2})}\|\tilde{Z}_{j}(a_{S_{j}+}-a_{0})\|_{2}^{2}\right) \displaystyle\leq 1\displaystyle 1

for any M1djM2M_{1}\leq d_{j}\leq M_{2}. Also note that the second term in the last display is bounded above by 1 because

log(h1αd0j+(1h1α)dj)h1αlogd0j+(1h1α)logdj\displaystyle\log\big{(}h_{1}\alpha d_{0j}+(1-h_{1}\alpha)d_{j}\big{)}\geq h_{1}\alpha\log d_{0j}+(1-h_{1}\alpha)\log d_{j}

by Jensen’s inequality. For the last term, we have

{𝔼0[π(aSjdj,Sj)h2|Z~j]}1h2\displaystyle\left\{\mathbb{E}_{0}\big{[}\pi(a_{S_{j}}\mid d_{j},S_{j})^{h_{2}}\,\big{|}\,\tilde{Z}_{j}\big{]}\right\}^{\frac{1}{h_{2}}}
=\displaystyle= det[2πdjγ(𝐗SjT𝐗Sj)1]12{𝔼0(exp{γh22dj𝐗Sj(aSja^Sj)22}|Z~j)}1h2\displaystyle\det\Big{[}2\pi\frac{d_{j}}{\gamma}({\bf X}_{S_{j}}^{T}{\bf X}_{S_{j}})^{-1}\Big{]}^{-\frac{1}{2}}\cdot\left\{\mathbb{E}_{0}\Big{(}\exp\big{\{}-\frac{\gamma h_{2}}{2d_{j}}\|{\bf X}_{S_{j}}(a_{S_{j}}-\widehat{a}_{S_{j}})\|_{2}^{2}\big{\}}\,\big{|}\,\tilde{Z}_{j}\Big{)}\right\}^{\frac{1}{h_{2}}}
=\displaystyle= det[2πdjγ(𝐗SjT𝐗Sj)1]12×(1+γdjh2d0j)|Sj|2h2\displaystyle\det\Big{[}2\pi\frac{d_{j}}{\gamma}({\bf X}_{S_{j}}^{T}{\bf X}_{S_{j}})^{-1}\Big{]}^{-\frac{1}{2}}\times\Big{(}1+\frac{\gamma}{d_{j}}h_{2}d_{0j}\Big{)}^{-\frac{|S_{j}|}{2h_{2}}}
×exp(γ2(dj+γh2d0j)𝐗Sj(aSj[𝐗SjT𝐗Sj]1𝐗SjTZ~ja0j)22)\displaystyle\times\,\,\exp\left(-\frac{\gamma}{2(d_{j}+\gamma h_{2}d_{0j})}\|{\bf X}_{S_{j}}(a_{S_{j}}-[{\bf X}_{S_{j}}^{T}{\bf X}_{S_{j}}]^{-1}{\bf X}_{S_{j}}^{T}\tilde{Z}_{j}a_{0j})\|_{2}^{2}\right)
=\displaystyle= (1+γdjh2d0j)12(1h21)|Sj|N|Sj|(aSj|[𝐗SjT𝐗Sj]1𝐗SjTZ~ja0j,(djγ+h2d0j)[𝐗SjT𝐗Sj]1).\displaystyle\left(1+\frac{\gamma}{d_{j}}h_{2}d_{0j}\right)^{\frac{1}{2}(1-h_{2}^{-1})|S_{j}|}\cdot N_{|S_{j}|}\Big{(}a_{S_{j}}\,\big{|}\,[{\bf X}_{S_{j}}^{T}{\bf X}_{S_{j}}]^{-1}{\bf X}_{S_{j}}^{T}\tilde{Z}_{j}a_{0j},\,\big{(}\frac{d_{j}}{\gamma}+h_{2}d_{0j}\big{)}[{\bf X}_{S_{j}}^{T}{\bf X}_{S_{j}}]^{-1}\Big{)}.

The second equality follows from the moment generating function of the noncentral chi-square distribution because d0j1𝐗Sj(a^SjaSj)22d_{0j}^{-1}\|{\bf X}_{S_{j}}(\widehat{a}_{S_{j}}-a_{S_{j}})\|_{2}^{2} is a noncentral chi-square random variable with |Sj||S_{j}| degrees of freedom and noncentrality parameter d0j1𝐗Sj(aSj[𝐗SjT𝐗Sj]1𝐗SjTZ~ja0j)22d_{0j}^{-1}\|{\bf X}_{S_{j}}(a_{S_{j}}-[{\bf X}_{S_{j}}^{T}{\bf X}_{S_{j}}]^{-1}{\bf X}_{S_{j}}^{T}\tilde{Z}_{j}a_{0j})\|_{2}^{2} under 0\mathbb{P}_{0} given Z~j\tilde{Z}_{j}. Thus,

j=2p𝔼0πα(SjCdims0𝐗n)\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\big{(}S_{j}\geq C_{\rm dim}s_{0}\mid{\bf X}_{n}\big{)}
\displaystyle\leq j=2p𝔼0[Nnj(SjCdims0)]eC1s0jπj(S0j)C2nj2logn+o(1)\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\big{[}N_{nj}(S_{j}\geq C_{\rm dim}s_{0})\big{]}\cdot\frac{e^{C_{1}s_{0j}}}{\pi_{j}(S_{0j})}C_{2}nj^{2}\log n\,\,+\,\,o(1)
\displaystyle\leq j=2pSj:|Sj|Cdims0πj(Sj)(1+γM1h2d0j)12(1h21)|Sj|eC1s0jπj(S0j)C2nj2logn+o(1)\displaystyle\sum_{j=2}^{p}\sum_{S_{j}:|S_{j}|\geq C_{\rm dim}s_{0}}\pi_{j}(S_{j})\cdot\left(1+\frac{\gamma}{M_{1}}h_{2}d_{0j}\right)^{\frac{1}{2}(1-h_{2}^{-1})|S_{j}|}\cdot\frac{e^{C_{1}s_{0j}}}{\pi_{j}(S_{0j})}C_{2}nj^{2}\log n\,\,+\,\,o(1)
\displaystyle\leq j=2peC1s0jπj(S0j)C2nj2lognSj:|Sj|Cdims0eC3|Sj|πj(Sj)+o(1)\displaystyle\sum_{j=2}^{p}\,\frac{e^{C_{1}s_{0j}}}{\pi_{j}(S_{0j})}C_{2}nj^{2}\log n\sum_{S_{j}:|S_{j}|\geq C_{\rm dim}s_{0}}e^{C_{3}|S_{j}|}\pi_{j}(S_{j})\,\,+\,\,o(1)
\displaystyle\leq j=2peC1s0j+C4s0jlogj+4log(nj)(eC3c1pc2)Cdims0+o(1)\displaystyle\sum_{j=2}^{p}\,e^{C_{1}s_{0j}+C_{4}s_{0j}\log j+4\log(n\vee j)}\cdot\left(\frac{e^{C_{3}}}{c_{1}p^{c_{2}}}\right)^{C_{\rm dim}s_{0}}\,\,+\,\,o(1)

for some positive constants C3C_{3} and C4C_{4}. The last term is of order o(1)o(1) for some large constant Cdim>0C_{\rm dim}>0. ∎

Lemma 8.3.

For a given constant 0<ϵ0<1/20<\epsilon_{0}<1/2 and an integer s0s_{0}, assume model (12) and the MESC prior with Condition (P). For given constants 0<α<10<\alpha<1, M1(12ϵ0)4ϵ0M_{1}\leq(1-2\epsilon_{0})^{4}\epsilon_{0}, M2(1+2ϵ0)4ϵ01M_{2}\geq(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1} and 2jp2\leq j\leq p, define δn=s0logp/Ψmin(C2s0)2\delta_{n}^{\prime}=\sqrt{s_{0}\log p/\Psi_{\min}(C_{2}s_{0})^{2}},

Bnj(C1)\displaystyle B_{nj}(C_{1}) =\displaystyle= {aj:aja0j22C1δn2} and\displaystyle\big{\{}a_{j}:\|a_{j}-a_{0j}\|_{2}^{2}\geq C_{1}\delta_{n}^{\prime 2}\,\big{\}}\quad\text{ and}
Nnj\displaystyle N_{nj} =\displaystyle= Sj:0<|Sj|C3s0M1M2Bnj(C1)Rnj(aj,dj)απ(aSjdj,Sj)πj(Sj)π(dj)𝑑aSjδ0(daj,Sjc)𝑑dj,\displaystyle\sum_{S_{j}:0<|S_{j}|\leq C_{3}s_{0}}\int_{M_{1}}^{M_{2}}\int_{B_{nj}(C_{1})}R_{nj}(a_{j},d_{j})^{\alpha}\pi(a_{S_{j}}\mid d_{j},S_{j})\pi_{j}(S_{j})\pi(d_{j})da_{S_{j}}\delta_{0}(da_{j,S_{j}^{c}})dd_{j},

for some positive constants C1C_{1}, C2>1C_{2}>1 and C3=C21>0C_{3}=C_{2}-1>0. Then, we have

𝔼0(Nnj)\displaystyle\mathbb{E}_{0}\big{(}N_{nj}\big{)} \displaystyle\leq eCnum,1C1s0logpSj:0<|Sj|RjCnum,2|Sj|πj(Sj)\displaystyle e^{-C_{\rm num,1}\cdot C_{1}s_{0}\log p}\sum_{S_{j}:0<|S_{j}|\leq R_{j}}C_{\rm num,2}^{|S_{j}|}\pi_{j}(S_{j})

for some positive constants Cnum,1=Cnum,1(M2,ϵ0,α)C_{\rm num,1}=C_{\rm num,1}(M_{2},\epsilon_{0},\alpha) and Cnum,2=Cnum,2(M1,ϵ0,γ)C_{\rm num,2}=C_{\rm num,2}(M_{1},\epsilon_{0},\gamma) for any Ω0n𝒰p0\Omega_{0n}\in\mathcal{U}_{p}^{0}.

Proof.

By the definition of Ψmin\Psi_{\min}, it is easy to see that

Z~j(aja0j)22\displaystyle\|\tilde{Z}_{j}(a_{j}-a_{0j})\|_{2}^{2} \displaystyle\geq Ψmin(|Saja0j|)2aja0j22.\displaystyle\Psi_{\min}(|S_{a_{j}-a_{0j}}|)^{2}\|a_{j}-a_{0j}\|_{2}^{2}.

Note that Ψmin(|Saja0j|)Ψmin(|Saj|+|Sa0j|)Ψmin(C2s0)\Psi_{\min}(|S_{a_{j}-a_{0j}}|)\geq\Psi_{\min}(|S_{a_{j}}|+|S_{a_{0j}}|)\geq\Psi_{\min}(C_{2}s_{0}) for any SjS_{j} such that 0<|Sj|C3s00<|S_{j}|\leq C_{3}s_{0}. Thus, it suffices to prove the result with respect to the set Bnj(C1):={aj:Z~j(aja0j)22C1s0logp}B_{nj}^{\prime}(C_{1}):=\big{\{}a_{j}:\|\tilde{Z}_{j}(a_{j}-a_{0j})\|_{2}^{2}\geq C_{1}s_{0}\log p\big{\}} instead of Bnj(C1)B_{nj}(C_{1}). Let aSj+a_{S_{j}+} be a pp-dimensional vector such that (aSj+)Sj=aSj(a_{S_{j}+})_{S_{j}}=a_{S_{j}} and (aSj+)l=0(a_{S_{j}+})_{l}=0 for all lSjl\notin S_{j}. The rest of the proof follows directly from the proof of Lemma 8.2. We have

𝔼0(Nnj)\displaystyle\mathbb{E}_{0}\big{(}N_{nj}\big{)}
\displaystyle\leq 𝔼0[M1M2Bnj(C1)Sj:0<|Sj|Rjexp(h1α(1h1α)2(d0j+M2)Z~j(aSj+a0)22)\displaystyle\mathbb{E}_{0}\Bigg{[}\int_{M_{1}}^{M_{2}}\int_{B_{nj}^{\prime}(C_{1})}\sum_{S_{j}:0<|S_{j}|\leq R_{j}}\exp\left(-\frac{h_{1}\alpha(1-h_{1}\alpha)}{2(d_{0j}+M_{2})}\|\tilde{Z}_{j}(a_{S_{j}+}-a_{0})\|_{2}^{2}\right)
×N|Sj|(aSj[𝐗SjT𝐗Sj]1𝐗SjTZ~ja0j,(djγ+h2d0j)[𝐗SjT𝐗Sj]1)\displaystyle\quad\times\,\,N_{|S_{j}|}\Big{(}a_{S_{j}}\mid[{\bf X}_{S_{j}}^{T}{\bf X}_{S_{j}}]^{-1}{\bf X}_{S_{j}}^{T}\tilde{Z}_{j}a_{0j},\,\big{(}\frac{d_{j}}{\gamma}+h_{2}d_{0j}\big{)}[{\bf X}_{S_{j}}^{T}{\bf X}_{S_{j}}]^{-1}\Big{)}
×(1+γM1h2d0j)12(1h21)|Sj|πj(Sj)π(dj)daSjddj]\displaystyle\quad\times\,\,\left(1+\frac{\gamma}{M_{1}}h_{2}d_{0j}\right)^{\frac{1}{2}(1-h_{2}^{-1})|S_{j}|}\pi_{j}(S_{j})\pi(d_{j})da_{S_{j}}dd_{j}\Bigg{]}
\displaystyle\leq eCnum,1C1s0logpSj:0<|Sj|RjCnum,2|Sj|πj(Sj),\displaystyle e^{-C_{\rm num,1}\cdot C_{1}s_{0}\log p}\sum_{S_{j}:0<|S_{j}|\leq R_{j}}C_{\rm num,2}^{|S_{j}|}\pi_{j}(S_{j}),

where Cnum,1=h1α(1h1α)/(2[ϵ01+M2])C_{\rm num,1}=h_{1}\alpha(1-h_{1}\alpha)/(2[\epsilon_{0}^{-1}+M_{2}]) and Cnum,2=(1+γh2ϵ01/M1)21(1h21)C_{\rm num,2}=(1+\gamma h_{2}\epsilon_{0}^{-1}/M_{1})^{2^{-1}(1-h_{2}^{-1})} for any constants h1,h2>1h_{1},h_{2}>1 such that h1α<1h_{1}\alpha<1 and h11+h21=1h_{1}^{-1}+h_{2}^{-1}=1. The second inequality holds because for any Ω0n𝒰p0\Omega_{0n}\in\mathcal{U}_{p}^{0}, we have d0jϵ01d_{0j}\leq\epsilon_{0}^{-1}. ∎

Proof of Theorem 3.4.

For some constant Kchol>0K_{\rm chol}^{\prime}>0, let δn=Kchols0logp/n\delta_{n}=K_{\rm chol}^{\prime}\sqrt{s_{0}\log p/n}. Note that

𝔼0πα(AnA0ns0δn|𝐗n)\displaystyle\mathbb{E}_{0}\pi_{\alpha}\left(\|A_{n}-A_{0n}\|_{\infty}\geq\sqrt{s_{0}}\delta_{n}\,\,\big{|}\,\,{\bf X}_{n}\right)
=\displaystyle= 𝔼0πα(max2jpaja0j1s0δn|𝐗n)\displaystyle\mathbb{E}_{0}\pi_{\alpha}\left(\max_{2\leq j\leq p}\|a_{j}-a_{0j}\|_{1}\geq\sqrt{s_{0}}\delta_{n}\,\,\big{|}\,\,{\bf X}_{n}\right)
\displaystyle\leq j=2p𝔼0πα(aja0j1s0δn|𝐗n)\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\left(\|a_{j}-a_{0j}\|_{1}\geq\sqrt{s_{0}}\delta_{n}\,\,\big{|}\,\,{\bf X}_{n}\right)
\displaystyle\leq j=2p𝔼0πα(aja0j1s0δn,SjC1s0|𝐗n)+j=2p𝔼0πα(SjC1s0𝐗n)\displaystyle\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\left(\|a_{j}-a_{0j}\|_{1}\geq\sqrt{s_{0}}\delta_{n},\,S_{j}\leq C_{1}s_{0}\,\,\big{|}\,\,{\bf X}_{n}\right)+\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\big{(}S_{j}\geq C_{1}s_{0}\mid{\bf X}_{n}\big{)}

for some constant C1>0C_{1}>0. The second term in the last display is of order o(1)o(1) by Lemma 8.2. Also note that

\begin{align*}
&\mathbb{E}_{0}\pi_{\alpha}\Big(\|a_{j}-a_{0j}\|_{1}\geq\sqrt{s_{0}}\delta_{n},\,|S_{j}|\leq C_{1}s_{0}\mid{\bf X}_{n}\Big) \\
&\quad\leq\; \mathbb{E}_{0}\pi_{\alpha}\Big(\|a_{j}-a_{0j}\|_{2}\geq(C_{1}+1)^{-1}\delta_{n},\,|S_{j}|\leq C_{1}s_{0}\mid{\bf X}_{n}\Big) \\
&\quad\leq\; \mathbb{E}_{0}\pi_{\alpha}\left(\|a_{j}-a_{0j}\|_{2}\geq C_{2}K_{\rm chol}^{\prime}\Big(\frac{s_{0}\log p}{\Psi_{\min}(C_{2}s_{0})^{2}}\Big)^{1/2},\,|S_{j}|\leq C_{1}s_{0}\mid{\bf X}_{n}\right)
\;+\;\mathbb{P}_{0}\Big(n^{-1}\Psi_{\min}(C_{2}s_{0})^{2}\leq C_{\min}\epsilon_{0}\Big),
\end{align*}

where $C_{2}=(C_{1}+1)^{-1}\sqrt{C_{\min}\epsilon_{0}}$ and $C_{\min}=(1-2\epsilon_{0})^{4}(1-\epsilon_{0})$.
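The two inequalities in the last display can be justified as follows (a sketch, using that $a_{j}-a_{0j}$ is supported on $S_{j}\cup S_{0j}$ and that $s_{0j}\leq s_{0}$). On the event $\{|S_{j}|\leq C_{1}s_{0}\}$, the Cauchy–Schwarz inequality gives
\begin{align*}
\sqrt{s_{0}}\,\delta_{n}\;\leq\;\|a_{j}-a_{0j}\|_{1}\;\leq\;\sqrt{(C_{1}+1)s_{0}}\,\|a_{j}-a_{0j}\|_{2},
\end{align*}
so that $\|a_{j}-a_{0j}\|_{2}\geq(C_{1}+1)^{-1/2}\delta_{n}\geq(C_{1}+1)^{-1}\delta_{n}$, which yields the first inequality. For the second inequality, on the event $\{n^{-1}\Psi_{\min}(C_{2}s_{0})^{2}>C_{\min}\epsilon_{0}\}$ we have
\begin{align*}
(C_{1}+1)^{-1}\delta_{n}
\;=\;(C_{1}+1)^{-1}K_{\rm chol}^{\prime}\sqrt{\frac{s_{0}\log p}{n}}
\;\geq\;(C_{1}+1)^{-1}\sqrt{C_{\min}\epsilon_{0}}\,K_{\rm chol}^{\prime}\Big(\frac{s_{0}\log p}{\Psi_{\min}(C_{2}s_{0})^{2}}\Big)^{1/2}
\;=\;C_{2}K_{\rm chol}^{\prime}\Big(\frac{s_{0}\log p}{\Psi_{\min}(C_{2}s_{0})^{2}}\Big)^{1/2},
\end{align*}
while on the complementary event the $\alpha$-posterior probability is simply bounded by one, which produces the additive term $\mathbb{P}_{0}\big(n^{-1}\Psi_{\min}(C_{2}s_{0})^{2}\leq C_{\min}\epsilon_{0}\big)$.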

Since we assume $s_{0}=o(n)$,
\begin{align*}
\sum_{j=2}^{p}\mathbb{P}_{0}\Big(n^{-1}\Psi_{\min}(C_{2}s_{0})^{2}\leq C_{\min}\epsilon_{0}\Big) \;=\; o(1)
\end{align*}

by Lemma 7.1. Let δn=s0logp/Ψmin(C2s0)2\delta_{n}^{\prime}=\sqrt{s_{0}\log p/\Psi_{\min}(C_{2}s_{0})^{2}}, then by Lemma 7.5,

\begin{align*}
&\sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\left(\|a_{j}-a_{0j}\|_{2}\geq C_{2}K_{\rm chol}^{\prime}\delta_{n}^{\prime},\,|S_{j}|\leq C_{1}s_{0}\mid{\bf X}_{n}\right) \\
&\quad\leq\; \sum_{j=2}^{p}\mathbb{E}_{0}\pi_{\alpha}\left(\|a_{j}-a_{0j}\|_{2}\geq C_{2}K_{\rm chol}^{\prime}\delta_{n}^{\prime},\,|S_{j}|\leq C_{1}s_{0},\,M_{1}\leq d_{j}\leq M_{2}\mid{\bf X}_{n}\right)\;+\;o(1)
\end{align*}

for some constants $M_{1}\leq(1-2\epsilon_{0})^{4}\epsilon_{0}$ and $M_{2}\geq(1+2\epsilon_{0})^{4}\epsilon_{0}^{-1}$, because we assume that $s_{0}\log p=o(n)$. Although Lemma 7.5 is stated for the ESC prior, it is straightforward to show that it also holds for the MESC prior for some constant $\nu_{0}^{\prime}>0$. As before, let $a_{S_{j}+}$ denote the $p$-dimensional vector with $(a_{S_{j}+})_{S_{j}}=a_{S_{j}}$ and $(a_{S_{j}+})_{l}=0$ for all $l\notin S_{j}$. By Lemma 8.1, we have

\begin{align}
&\pi_{\alpha}\left(\|a_{j}-a_{0j}\|_{2}\geq C_{2}K_{\rm chol}^{\prime}\delta_{n}^{\prime},\,|S_{j}|\leq C_{1}s_{0},\,M_{1}\leq d_{j}\leq M_{2}\mid{\bf X}_{n}\right) \nonumber\\
&\quad=\;\frac{\sum_{S_{j}:0<|S_{j}|\leq C_{1}s_{0}}\int_{M_{1}}^{M_{2}}\int_{\|a_{j}-a_{0j}\|_{2}\geq C_{2}K_{\rm chol}^{\prime}\delta_{n}^{\prime}}R_{nj}(a_{S_{j}+},d_{j})^{\alpha}\,\pi(a_{S_{j}}\mid d_{j},S_{j})\,\pi_{j}(S_{j})\,\pi(d_{j})\,da_{S_{j}}\,dd_{j}}{\sum_{S_{j}:0<|S_{j}|\leq R_{j}}\int_{M_{1}}^{M_{2}}\int R_{nj}(a_{S_{j}+},d_{j})^{\alpha}\,\pi(a_{S_{j}}\mid d_{j},S_{j})\,\pi_{j}(S_{j})\,\pi(d_{j})\,da_{S_{j}}\,dd_{j}} \nonumber\\
&\quad=:\;\frac{N_{nj}}{D_{nj}} \nonumber\\
&\quad\leq\; N_{nj}\cdot\frac{e^{C_{3}s_{0j}}}{\pi_{j}(S_{0j})}\cdot C_{4}nj^{2}\log n \tag{26}\\
&\qquad+\;I\Big(\big|\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}-n\,d_{0j}\big|\leq[j^{2}\log n]^{-1}\Big) \nonumber
\end{align}

for some positive constants $C_{3}$ and $C_{4}$. Note that

\begin{align*}
\sum_{j=2}^{p}\mathbb{P}_{0}\Big(\big|\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}-n\,d_{0j}\big|\leq\frac{1}{j^{2}\log n}\Big)
\;\leq\;\sum_{j=2}^{p}\frac{1}{d_{0j}\,j^{2}\log n}\;=\;o(1).
\end{align*}
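One way to see the probability bound above is the following sketch (assuming, as in the DAG model considered here, that conditionally on the columns ${\bf X}_{S_{0j}}$ we have $\tilde{X}_{j}\sim N_{n}({\bf X}_{S_{0j}}a_{0,S_{0j}},\,d_{0j}I_{n})$ under $\mathbb{P}_{0}$, that $\widehat{a}_{S_{0j}}$ is the least squares estimator, and that $n-s_{0j}\geq 2$). The residual sum of squares $\|\tilde{X}_{j}\|_{2}^{2}-\|{\bf X}_{S_{0j}}\widehat{a}_{S_{0j}}\|_{2}^{2}$ is then distributed as $d_{0j}\chi^{2}_{n-s_{0j}}$, and since the $\chi^{2}_{m}$ density is bounded by $1/2$ for every $m\geq 2$,
\begin{align*}
\mathbb{P}_{0}\Big(\big|d_{0j}\chi^{2}_{n-s_{0j}}-n\,d_{0j}\big|\leq\frac{1}{j^{2}\log n}\Big)
\;=\;\mathbb{P}_{0}\Big(\big|\chi^{2}_{n-s_{0j}}-n\big|\leq\frac{1}{d_{0j}\,j^{2}\log n}\Big)
\;\leq\;\frac{2}{d_{0j}\,j^{2}\log n}\cdot\frac{1}{2}
\;=\;\frac{1}{d_{0j}\,j^{2}\log n}.
\end{align*}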

Hence, we only need to focus on (26). By Lemma 8.3,

\begin{align*}
\frac{e^{C_{3}s_{0j}}}{\pi_{j}(S_{0j})}\cdot C_{4}nj^{2}\log n\cdot\mathbb{E}_{0}\big(N_{nj}\big)
&\leq \frac{e^{C_{3}s_{0j}}}{\pi_{j}(S_{0j})}\cdot C_{4}nj^{2}\log n\cdot e^{-C_{5}K_{\rm chol}^{\prime}s_{0}\log p}\sum_{S_{j}:0<|S_{j}|\leq R_{j}}C_{6}^{|S_{j}|}\,\pi_{j}(S_{j}) \\
&\leq e^{C_{3}s_{0j}+C_{7}s_{0j}\log j+4\log(n\vee j)-C_{5}K_{\rm chol}^{\prime}s_{0}\log p}
\end{align*}

for some positive constants $C_{5}$, $C_{6}$ and $C_{7}$. The sum of the last display over all $j$ is of order $o(1)$ for a sufficiently large $K_{\rm chol}^{\prime}$. Thus, we have proved

\begin{align*}
\mathbb{E}_{0}\pi_{\alpha}\left(\|A_{n}-A_{0n}\|_{\infty}\geq\sqrt{s_{0}}\delta_{n}\mid{\bf X}_{n}\right) \;=\; o(1)
\end{align*}

for some large constant $K_{\rm chol}^{\prime}>0$. ∎

9 Proofs of Minimax Lower Bounds

Proof of Theorem 3.3.

Note that

\begin{align*}
\|\widehat{A}_{n}-A_{0n}\|_{\infty}
\;=\;\max_{j}\|\widehat{a}_{j}-a_{0j}\|_{1}
\;\geq\;\|\widehat{a}_{S_{0j}}-a_{0,S_{0j}}\|_{1}
\end{align*}

for any estimator $\widehat{A}_{n}$ and any $2\leq j\leq p$. Thus, it suffices to show that

\begin{align}
\inf_{\widehat{a}_{S_{0j}}}\sup_{\Omega_{0n}\in\mathcal{U}_{p}}\mathbb{E}_{0}\|\widehat{a}_{S_{0j}}-a_{0,S_{0j}}\|_{1}
\;\geq\; c\cdot\frac{s_{0j}}{\sqrt{n}} \tag{27}
\end{align}

for some constant $c>0$ and any $2\leq j\leq p$. The right-hand side of (27) is the minimax lower bound rate for standard linear regression with an $s_{0j}$-dimensional coefficient vector, so (27) follows from a slight modification of Example 13.12 in Duchi (2016).
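To give a rough idea of where the $s_{0j}/\sqrt{n}$ rate comes from (a heuristic sketch under an orthogonal design with columns of Euclidean norm $\sqrt{n}$; it is not the argument of Duchi (2016)): each of the $s_{0j}$ coefficients is then a Gaussian location parameter observed with standard deviation of order $n^{-1/2}$, the minimax risk in absolute error for a single such parameter is of order $n^{-1/2}$, and the $\ell_{1}$ risk decouples across coordinates, so that
\begin{align*}
\inf_{\widehat{\theta}}\sup_{\theta\in\mathbb{R}^{s_{0j}}}\mathbb{E}\|\widehat{\theta}-\theta\|_{1}
\;\gtrsim\; s_{0j}\cdot\frac{1}{\sqrt{n}}.
\end{align*}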

Similarly, it is easy to check that
\begin{align*}
\inf_{\widehat{A}_{n}}\sup_{\Omega_{0n}\in\mathcal{U}_{p}}\mathbb{E}_{0}\|\widehat{A}_{n}-A_{0n}\|_{F}^{2}
&\geq \inf_{\widehat{A}_{n}}\sup_{\Omega_{0n}\in\mathcal{U}_{p}}\sum_{j=2}^{p}\mathbb{E}_{0}\|\widehat{a}_{S_{0j}}-a_{0,S_{0j}}\|_{2}^{2} \\
&\geq c\,\frac{\sum_{j=2}^{p}s_{0j}}{n}
\end{align*}

for some constant $c>0$, by Duchi (2016). ∎

Proof of Theorem 3.5.

Note that

\begin{align*}
\inf_{\widehat{A}_{n}}\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{0}}\mathbb{E}_{0}\|\widehat{A}_{n}-A_{0n}\|_{\infty}
&= \inf_{\widehat{A}_{n}}\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{0}}\mathbb{E}_{0}\left[\max_{j}\|\widehat{a}_{j}-a_{0j}\|_{1}\right] \\
&\geq \inf_{\widehat{A}_{n}}\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{0}}\max_{j}\mathbb{E}_{0}\|\widehat{a}_{j}-a_{0j}\|_{1}
\end{align*}

and

\begin{align*}
\inf_{\widehat{A}_{n}}\sup_{\Omega_{0n}\in\mathcal{U}_{p}^{0}}\max_{j}\mathbb{E}_{0}\|\widehat{a}_{j}-a_{0j}\|_{1}
&\geq \sup_{\Omega_{0n}\in\mathcal{U}_{p}^{0}}\max_{j}\,c\cdot s_{0j}\left(\frac{\log(j/s_{0j})}{n}\right)^{1/2} \\
&= c\cdot s_{0}\left(\frac{\log(p/s_{0})}{n}\right)^{1/2}
\end{align*}

for some constant $c>0$ by Ye and Zhang (2010). ∎
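The first inequality in the last display can be read as an application of the sparse regression lower bound of Ye and Zhang (2010) to each column separately: under $\mathbb{P}_{0}$, the $j$-th column obeys (a sketch of the reduction, in our reading of the notation) $\tilde{X}_{j}\mid{\bf X}_{1:(j-1)}\sim N_{n}\big({\bf X}_{1:(j-1)}a_{0j},\,d_{0j}I_{n}\big)$ with an $s_{0j}$-sparse coefficient vector $a_{0j}\in\mathbb{R}^{j-1}$, and the minimax $\ell_{1}$ lower bound for $s$-sparse regression with $d$ covariates is of order $s\sqrt{\log(d/s)/n}$.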

10 Simulation under Misspecified Models

Although it slightly departs from the main topic of the paper, it might be worth comparing results for different choices of α\alpha when the model is misspecified. To investigate misspecified DAG models, we generated the data sets from the pp-dimensional multivariate Laplace distribution with zero mean and covariance matrix Σ0n\Sigma_{0n}. The covariance matrix Σ0n\Sigma_{0n} was generated based on the MCD of Ω0n=Σ0n1\Omega_{0n}=\Sigma_{0n}^{-1} as before, where only 3%\% of entries of the Cholesky factor A0nA_{0n} were drawn from a uniform distribution on [0.7,0.3][0.3,0.7][-0.7,-0.3]\cup[0.3,0.7]. The entries of the diagonal matrix D0nD_{0n} were sampled from a uniform distribution on [2,5][2,5]. We generated the data sets under two settings: (n=100,p=300)(n=100,p=300) and (n=200,p=500)(n=200,p=500).
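To make the data-generating mechanism concrete, the following sketch (ours, not the authors' code) generates data under these settings. It assumes the MCD parameterization $\Omega_{0n}=(I_{p}-A_{0n})^{T}D_{0n}^{-1}(I_{p}-A_{0n})$ and a Gaussian-scale-mixture construction of the multivariate Laplace distribution; the exact parameterization used in the simulation may differ.

import numpy as np

def generate_misspecified_data(n, p, sparsity=0.03, seed=0):
    """Sketch of the misspecified data-generating process described above."""
    rng = np.random.default_rng(seed)

    # Sparse strictly lower-triangular Cholesky factor A_0n: about `sparsity`
    # of the entries are drawn from Unif([-0.7, -0.3] U [0.3, 0.7]).
    A = np.zeros((p, p))
    rows, cols = np.tril_indices(p, k=-1)
    nonzero = rng.random(rows.size) < sparsity
    signs = rng.choice([-1.0, 1.0], size=nonzero.sum())
    A[rows[nonzero], cols[nonzero]] = signs * rng.uniform(0.3, 0.7, nonzero.sum())

    # Diagonal matrix D_0n with entries from Unif[2, 5], and the MCD
    # Omega_0n = (I - A)^T D^{-1} (I - A), Sigma_0n = Omega_0n^{-1}.
    d = rng.uniform(2.0, 5.0, size=p)
    I_A = np.eye(p) - A
    Omega = I_A.T @ np.diag(1.0 / d) @ I_A
    Sigma = np.linalg.inv(Omega)

    # Multivariate Laplace via a Gaussian scale mixture: X = sqrt(W) * Z with
    # Z ~ N_p(0, Sigma) and W ~ Exp(1), so that Cov(X) = E[W] * Sigma = Sigma.
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    W = rng.exponential(1.0, size=n)
    X = np.sqrt(W)[:, None] * Z
    return X, A, d

X, A0, d0 = generate_misspecified_data(n=100, p=300)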

Table 2: Results for the ESC prior with different choices of $\alpha$ are shown. Sp: sparsity; FDR: false discovery rate; TPR: true positive rate; $\bar{p}_{0}$: the mean inclusion probability for zero entries in $A_{0n}$; $\bar{p}_{1}$: the mean inclusion probability for nonzero entries in $A_{0n}$.
(n, p, Sp)  $\alpha$  # of errors  FDR  TPR  $\bar{p}_{0}$  $\bar{p}_{1}$
(100, 300, 3%) 0.999 778 0.3234 0.8074 0.0252 0.8038
0.8 547 0.1843 0.7665 0.0157 0.7620
0.6 568 0.0773 0.6305 0.0100 0.6287
0.4 899 0.0275 0.3413 0.0073 0.3625
0.2 1272 0.0133 0.0550 0.0065 0.0903
(200, 500, 3%) 0.999 1053 0.2020 0.9621 0.0161 0.9552
0.8 591 0.0980 0.9447 0.0105 0.9362
0.6 466 0.0370 0.9105 0.0066 0.8980
0.4 955 0.0146 0.7560 0.0047 0.7430
0.2 2856 0.0100 0.2392 0.0042 0.2559
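For reference, the summary measures reported in Table 2 can be computed from the marginal posterior inclusion probabilities as in the sketch below (our illustration; it assumes that an entry is selected when its inclusion probability exceeds 0.5 and that the number of errors counts false positives plus false negatives, which may differ from the exact conventions used for the table).

import numpy as np

def selection_metrics(incl_prob, A0, threshold=0.5):
    # incl_prob: marginal posterior inclusion probabilities for the strictly
    # lower-triangular entries (same shape as the true Cholesky factor A0).
    rows, cols = np.tril_indices(A0.shape[0], k=-1)
    truth = A0[rows, cols] != 0
    prob = incl_prob[rows, cols]
    selected = prob > threshold

    tp = np.sum(selected & truth)
    fp = np.sum(selected & ~truth)
    fn = np.sum(~selected & truth)

    n_errors = fp + fn
    fdr = fp / max(tp + fp, 1)          # false discovery rate
    tpr = tp / max(truth.sum(), 1)      # true positive rate
    p0_bar = prob[~truth].mean()        # mean inclusion prob. of zero entries
    p1_bar = prob[truth].mean()         # mean inclusion prob. of nonzero entries
    return n_errors, fdr, tpr, p0_bar, p1_bar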

The simulation results are summarized in Table 2. Based on these results, when the model is misspecified, the choice $\alpha\approx 1$ might not be desirable because it tends to yield a high FDR. Instead, a slightly smaller choice of $\alpha$ gives reasonable performance; in our experiments, $\alpha=0.8$ performed well in terms of the number of errors and the other criteria.

In our settings, as the power $\alpha$ decreases, the FDR, TPR, $\bar{p}_{0}$ and $\bar{p}_{1}$ all decrease. A closer look at the posterior samples shows that the variables selected with a smaller $\alpha$ tend to form a subset of those selected with a larger $\alpha$, which supports this observation. We are not sure whether this trend always holds, but a rough intuition is the following: a smaller value of $\alpha$ weakens the effect of the factor $(\widehat{d}_{S_{j}})^{-(\alpha n+\nu_{0})/2}$ in $\pi_{\alpha}(S_{j}\mid{\bf X}_{n})$, which is the term that favors larger models $S_{j}$, while the main penalty term $\pi_{j}(S_{j})$ is unchanged; hence $\pi_{\alpha}(S_{j}\mid{\bf X}_{n})$ tends to prefer smaller subsets $S_{j}$ as $\alpha$ decreases.

In summary, in our problem, a choice of $0<\alpha<1$ can improve the variable selection performance compared with $\alpha\approx 1$. The choice $\alpha=0.8$ gave reasonable results in our simulation, but there is no theoretical guideline for choosing $\alpha$, which might be an interesting topic for future research.