
Statistical Foundation of Variational Bayes Neural Networks

Shrijita Bhattacharya bhatta61@msu.edu Tapabrata Maiti maiti@msu.edu Department of Statistics and Probability, Michigan State University
Abstract

Despite the popularity of Bayesian neural networks in recent years, their use is somewhat limited in complex and big data situations due to the computational cost associated with full posterior evaluations. Variational Bayes (VB) provides a useful alternative to circumvent the computational cost and time complexity associated with the generation of samples from the true posterior using Markov Chain Monte Carlo (MCMC) techniques. The efficacy of VB methods is well established in the machine learning literature. However, their potential broader impact is hindered by a lack of theoretical validity from a statistical perspective. In this paper, we establish the fundamental result of posterior consistency for the mean-field variational posterior (VP) of a feed-forward artificial neural network model. The paper underlines the conditions needed to guarantee that the VP concentrates around Hellinger neighborhoods of the true density function. Additionally, the role of the scale parameter and its influence on the convergence rates is discussed. The paper relies mainly on two results: (1) the rate at which the true posterior grows and (2) the rate at which the KL-distance between the posterior and the variational posterior grows. The theory provides a guideline for building prior distributions for Bayesian NN models, along with an assessment of the accuracy of the corresponding VB implementation.

keywords:
Neural networks, Variational posterior, Mean-field family, Hellinger neighborhood, Kullback-Leibler divergence, Sieve theory, Prior mass, Variational Bayes.
journal: Arxiv

1 Introduction

Bayesian neural networks (BNNs) have been comprehensively studied in the works of Bishop [1997], Neal [1992], Lampinen and Vehtari [2001], etc. More recent developments which establish the efficacy of BNNs can be found in the works of Sun et al. [2017], Mullachery et al. [2018], Hubin et al. [2018], Liang et al. [2018], Javid et al. [2020] and the references therein. The theoretical foundation of BNNs laid by Lee [2000] widens the scope to a broader community. However, in the age of big data applications, the conventional Bayesian approach is computationally inefficient. Thus, alternative computational approaches, such as variational Bayes (VB), have become popular among machine learning and applied researchers. Although there have been many works on algorithm development for VB in recent years, the theoretical advancement on estimation accuracy is rather limited. This article provides statistical validity of neural network models with variational inference, along with some theory-driven practical guidelines for implementation.

In this article, we mainly focus on feed-forward neural networks with a single hidden layer and a logistic activation function. Let the number of inputs be denoted by $p$ and the number of hidden nodes by $k_n$, where the number of nodes is allowed to increase as a function of $n$. The true regression function, $E(Y|X=\boldsymbol{x})=f_0(\boldsymbol{x})$, is modeled as a neural network of the form

f(\boldsymbol{x})=\beta_{0}+\sum_{j=1}^{k_{n}}\beta_{j}\psi\Big(\gamma_{j0}+\sum_{h=1}^{p}\gamma_{jh}x_{h}\Big) (1)

where $\psi(u)=1/(1+\exp(-u))$ is the logistic activation function. With a Gaussian prior on each of the parameters, Lee [2000] establishes the posterior consistency of neural networks under the simple setup where the scale parameter $\sigma^2=V(Y|X=\boldsymbol{x})$ is fixed at 1. The results in Lee [2000] mainly exploit Barron et al. [1999], a fundamental contribution that laid down the framework for posterior consistency in nonparametric regression settings. In this paper, we closely mimic the regression model of Lee [2000] by assuming $y=f_0(\boldsymbol{x})+\xi$, where $f_0(\boldsymbol{x})$ is the true regression function and $\xi$ follows $N(0,\sigma^2)$.
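For concreteness, the following minimal sketch simulates data from the regression model $y=f_{0}(\boldsymbol{x})+\xi$ with $f_{0}$ itself a network of the form (1); all sizes and weight values are illustrative assumptions, not quantities prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(u):
    """Logistic activation psi(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def nn(x, beta0, beta, gamma0, gamma):
    """Single-hidden-layer network as in (1):
    f(x) = beta0 + sum_j beta_j * psi(gamma_j0 + gamma_j^T x)."""
    return beta0 + psi(x @ gamma.T + gamma0) @ beta    # x: (n, p)

# illustrative sizes and weights
n, p, k = 200, 3, 5
beta0, beta = 0.5, rng.normal(size=k)
gamma0, gamma = rng.normal(size=k), rng.normal(size=(k, p))

# data from the regression model: y = f0(x) + xi, xi ~ N(0, sigma0^2)
sigma0 = 1.0
X = rng.uniform(size=(n, p))                           # x ~ U(0,1)^p
y = nn(X, beta0, beta, gamma0, gamma) + rng.normal(0.0, sigma0, size=n)
```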

The joint posterior distribution of a neural network model is generally evaluated by popular Markov Chain Monte Carlo (MCMC) sampling techniques, like Gibbs sampling, Metropolis-Hastings, etc. (see Neal [1996], Lee [2004], and Ghosh et al. [2004] for more details). Despite the versatility and popularity of MCMC-based approaches, Bayesian estimation suffers from computational costs, scalability, and time constraints, along with other implementation issues such as the choice of proposal densities and the generation of sample paths. Variational Bayes emerged as an important alternative to overcome the drawbacks of MCMC implementation (see Blei et al. [2017]). Many recent works have discussed the application of variational inference to Bayesian neural networks, e.g., Logsdon et al. [2009], Graves [2011], Carbonetto and Stephens [2012], Blundell et al. [2015], Sun et al. [2019]. Although there is a plethora of literature implementing variational inference for neural networks, the theoretical properties of the variational posterior in BNNs remain relatively unexplored, and this limits the use of this powerful computational tool beyond the machine learning community.

Some of the previous works that focused on theoretical properties of the variational posterior include the frequentist consistency of variational inference in parametric models in the presence of latent variables (see Wang and Blei [2019]). Optimal risk bounds for mean-field variational Bayes for Gaussian mixture (GM) and Latent Dirichlet allocation (LDA) models have been discussed in Pati et al. [2017]. The work of Yang et al. [2017] develops Bayes risk bounds for $\alpha$-variational inference in GM and LDA models. A more recent work, Zhang and Gao [2017], discusses variational posterior consistency rates in Gaussian sequence models, infinite exponential families and piece-wise constant models. In order to evaluate the validity of a posterior in nonparametric models, one must establish its consistency and rates of contraction. To the best of our knowledge, the problem of posterior consistency has not been studied in the context of variational Bayes neural network models.

Our contribution: Our theoretical development of posterior consistency, an essential property in nonparametric Bayesian statistics, provides confidence in using variational Bayes neural network models across disciplines. Our theoretical results help assess the estimation accuracy for a given training sample and model complexity. Specifically, we establish the conditions needed for the variational posterior consistency of feed-forward neural networks. We establish that a simple Gaussian mean-field approximation is good enough to achieve consistency for the variational posterior. In this direction, we show that an $\varepsilon$-Hellinger neighborhood of the true density function receives probability close to 1 under the variational posterior. For the true posterior (Lee [2000]), the posterior probability of an $\varepsilon$-Hellinger neighborhood grows at the rate $1-e^{-\epsilon n^{\delta}}$. In contrast, we show that for the variational posterior this rate becomes $1-\epsilon/n^{\delta}$. The reason for this difference is twofold: (1) the KL-distance between the variational posterior and the true posterior does not grow at a rate greater than $n^{1-\delta}$ for some $0\leq\delta<1$; (2) since the true posterior probability of an $\varepsilon$-Hellinger neighborhood grows at the rate $1-e^{-\epsilon n^{\delta}}$, the variational posterior probability must grow at the rate $1-\epsilon/n^{\delta}$, as otherwise the rate of growth of the KL-distance could not be controlled. We also give the conditions on the approximating neural network and on the rate of growth in the number of nodes needed to ensure that the variational posterior achieves consistency. As a last contribution, we show that the VB estimator of the regression function converges to the true regression function.

Further, our investigation shows that although the variational posterior (VP) is asymptotically consistent, the posterior probability of $\varepsilon$-Hellinger neighborhoods does not converge to 1 as fast as under the true posterior. In addition, one requires that the absolute values of the parameters in the approximating neural network grow at a controlled rate (sum of squares less than $n^{1-\delta}$ for some $0\leq\delta<1$), a condition not needed when dealing with an MCMC-based implementation. When the parameters grow as a polynomial function of $n$ ($O(n^{v})$, $v>1$), one can choose a flatter prior (a prior whose variance increases with $n$) in order to guarantee VP consistency.

VP consistency has been established irrespective of whether $\sigma$ is known or unknown, and the differences in practice have been discussed. It has been shown that one must guard against using Gaussian distributions as a variational family for $\sigma$. Since the KL-distance between the variational posterior and the true posterior must be controlled, one must ensure that quantities like $E(\log X)$ and $E(1/X^{2})$ are defined under the variational distribution of $\sigma$. We thereby discuss two choices of variational family on $\sigma$: (1) an inverse-gamma distribution, (2) a normal distribution on the log-transformed $\sigma$. While the second approach may seem intuitively appealing if one were to use fully Gaussian variational families, it comes with a drawback. Indeed, under the reparametrized $\sigma$, the variational posterior is consistent only if the rate of growth in the number of nodes is slower than under the original parametrization. However, a slower growth in the number of nodes makes it more and more difficult to find an approximating neural network which converges fast enough to the true function.

The outline of the paper is as follows. In Section 2, we present the notation and the terminology of consistency for the variational posterior. In Section 3, we present the consistency results when the scale parameter is known. In Section 4, we present the consistency results when the scale parameter is unknown, under two sets of variational families. In Section 5, we show that the Bayes estimates obtained from the variational posterior converge to the true regression function and scale parameter. Finally, Section 6 ends with a discussion and conclusions from our current work.

2 Model and Assumptions

Suppose the true regression model has the form:

y_{i}=f_{0}(\boldsymbol{x}_{i})+\xi_{i}

where $\xi_{1},\cdots,\xi_{n}$ are i.i.d. $N(0,\sigma_{0}^{2})$ random variables and $\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{n}$ are feature vectors with $\boldsymbol{x}_{i}\in\mathbb{R}^{p}$. For the purposes of this paper, we assume that the number of covariates $p$ is fixed.

Thus, the true conditional density of $Y|X=\boldsymbol{x}$ is

l_{0}(y,\boldsymbol{x})\propto\exp\Big(-\frac{1}{2\sigma_{0}^{2}}(y-f_{0}(\boldsymbol{x}))^{2}\Big) (2)

which implies the true likelihood function is

L_{0}=\prod_{i=1}^{n}l_{0}(y_{i},\boldsymbol{x}_{i}) (3)

Universal approximation: By Hornik et al. [1989], for every function $f_{0}$ such that $\int f_{0}^{2}(x)dx<\infty$ and every $\epsilon>0$, there exists a neural network $f$ such that $||f-f_{0}||_{2}<\epsilon$. This has led to the ubiquitous use of neural networks as a modeling approximation to a wide class of regression functions.
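As an informal illustration of this approximation property (not the construction used in Hornik et al. [1989]), the sketch below fits only the outer weights of a network of form (1) by least squares, with randomly drawn inner weights, and reports the empirical $L_{2}$ error as $k_{n}$ grows; the target $f_{0}(x)=\sin(2\pi x)$ and all other settings are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
psi = lambda u: 1.0 / (1.0 + np.exp(-u))

f0 = lambda x: np.sin(2 * np.pi * x)          # illustrative target function
xg = np.linspace(0, 1, 2000)                  # grid approximating the L2 norm

for k in [2, 8, 32, 128]:
    # random inner weights; only the outer weights are fitted
    g0, g1 = rng.normal(0, 10, size=k), rng.normal(0, 10, size=k)
    Phi = psi(np.outer(xg, g1) + g0)          # hidden-layer features
    design = np.column_stack([np.ones_like(xg), Phi])
    beta, *_ = np.linalg.lstsq(design, f0(xg), rcond=None)
    err = np.sqrt(np.mean((design @ beta - f0(xg)) ** 2))
    print(f"k_n = {k:4d}   empirical L2 error = {err:.4f}")
```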

In this paper, we assume that the true regression function $f_{0}$ can be approximated by a neural network

f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})=\beta_{0}+\sum_{j=1}^{k_{n}}\beta_{j}\psi(\gamma_{j}^{\top}\boldsymbol{x}),\quad\boldsymbol{\theta}_{n}=(\beta_{j},\gamma_{jh})_{j\in J,h\in H},\quad J=\{0,\cdots,k_{n}\},\quad H=\{0,\cdots,p\} (4)

where $k_{n}$, the number of nodes, increases as a function of $n$, while $p$, the number of covariates, is fixed. Thus, the total number of parameters grows at the same rate as the number of nodes, i.e. $K(n)=1+k_{n}(p+1)\sim k_{n}$.

Suppose there exists a neural network $f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})=\beta_{00}+\sum_{j=1}^{k_{n}}\beta_{j0}\psi(\gamma_{j0}^{\top}\boldsymbol{x})$ such that

(A1)\quad||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}=o(n^{-\delta}) (5)

Note that if $f_{0}$ is a neural network function itself, then (A1) holds trivially for all $0\leq\delta<1$ irrespective of the choice of $k_{n}$. Theorem 2 of Siegel and Xu [2019] shows that with $k_{n}=n$, $\delta$ can be chosen in the range $0\leq\delta<1/2$. Mimicking the steps of Theorem 2 of Siegel and Xu [2019], it can be shown that with $k_{n}=n^{a}$, $a>1/2$, $\delta$ can be chosen anywhere in the range $0\leq\delta<a-1/2$. For a given choice of $k_{n}$, whether (A1) holds or not depends on the entropy of the true function. Assumptions of a similar form can also be found in Shen [1997] (see conditions C and C$'$) and Shen et al. [2019] (see condition C3).

Note that condition (A1) characterizes the rate at which a neural network function approaches the true function. The next set of conditions characterizes the rate at which the coefficients of the approximating neural network solution grow. Suppose one of the following two conditions holds:

(A2)\quad\sum_{i=1}^{K(n)}\theta_{i0n}^{2}=o(n^{1-\delta}),\;0\leq\delta<1 (6)
(A3)\quad\sum_{i=1}^{K(n)}\theta_{i0n}^{2}=O(n^{v}),\;v\geq 1 (7)

Note that condition (A2) ensures that the sum of squares of the coefficients grows at a rate slower than $n$. White [1990] proved consistency properties of feed-forward neural networks with $\sum_{i=1}^{K(n)}|\theta_{i0n}|=o(n^{1/4})$, which implies $\sum_{i=1}^{K(n)}|\theta_{i0n}|^{2}\leq(\sum_{i=1}^{K(n)}|\theta_{i0n}|)^{2}=o(n^{1/2})$, i.e. $0\leq\delta<1/2$. Blei et al. [2017] studied consistency properties for parametric models wherein one requires the assumption that $-\log p(\theta_{0})$ be bounded (see Relations (44) and (53) in Blei et al. [2017]). With a normal prior of the form $p(\boldsymbol{\theta}_{n})\propto\exp(-\sum_{i=1}^{K(n)}\theta_{in}^{2})$, the same condition reduces to $\sum_{i=1}^{K(n)}\theta_{i0n}^{2}$ being bounded at a suitable rate. Indeed, condition (A2) guarantees that the rate of growth of the KL-distance between the true and the variational posterior is well controlled.

Condition (A3) is a relaxed version of (A2), where the sum of squares of the coefficients is allowed to grow at a rate polynomial in $n$. A standard prior independent of $n$ might fail to guarantee convergence. We thereby assume a flatter prior whose variance increases with $n$ in order to allow for consistency through variational Bayes. Note that if $f_{0}$ is a neural network function itself, conditions (A2) and (A3) hold trivially.

Kullback-Leibler divergence: Let $P$ and $Q$ be two probability distributions with densities $p$ and $q$ respectively; then

d_{KL}(q,p)=\int_{\mathcal{X}}\log\frac{q(x)}{p(x)}q(x)dx

Hellinger distance: Let $P$ and $Q$ be two probability distributions with densities $p$ and $q$ respectively; then

d_{H}(q,p)=\int_{\mathcal{X}}(\sqrt{q(x)}-\sqrt{p(x)})^{2}dx
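Both quantities are straightforward to compute numerically. The sketch below evaluates $d_{KL}(q,p)$ and $d_{H}(q,p)$ on a grid for two illustrative normal densities; note that $d_{H}$ as defined above is the squared Hellinger distance (without the usual factor $1/2$), for which the standard information inequality $d_{H}\leq d_{KL}$ holds.

```python
import numpy as np

# two illustrative densities: q = N(0, 1) and p = N(0.5, 1.2^2)
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
normal = lambda x, m, s: np.exp(-(x - m) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
q, p = normal(x, 0.0, 1.0), normal(x, 0.5, 1.2)

# d_KL(q, p) = int q log(q/p);  d_H(q, p) = int (sqrt(q) - sqrt(p))^2
d_kl = np.sum(q * np.log(q / p)) * dx
d_h = np.sum((np.sqrt(q) - np.sqrt(p)) ** 2) * dx

print("d_KL =", d_kl, "  d_H =", d_h)   # both nonnegative
print("d_H <= d_KL:", d_h <= d_kl)      # squared Hellinger is bounded by KL
```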

Distribution of the feature vector: In order to establish posterior consistency, we assume that the feature vector $\boldsymbol{x}\sim U(0,1)^{p}$. Although this is not a requirement for the model, it simplifies steps of the proof, since the joint density function of $(Y,X)$ becomes

g_{Y,X}(y,\boldsymbol{x})=g_{Y|X}(y|\boldsymbol{x})g_{X}(\boldsymbol{x})=g_{Y|X}(y|\boldsymbol{x}) (8)

Thus, it suffices to deal with the conditional density of $Y|X=\boldsymbol{x}$.

3 Consistency of variational posterior with $\sigma$ known

In this section, we begin with the simple model where the scale parameter $\sigma_{0}$ is known. For the simple Gaussian mean-field family in (13), we establish that the variational posterior is consistent as long as assumption (A1) together with (A2) or (A3) holds. We also discuss how the rates contrast with those in Lee [2000], which established the consistency of the true posterior.

Sieve Theory: Let $\boldsymbol{\omega}_{n}=\boldsymbol{\theta}_{n}$; then

l_{\boldsymbol{\omega}_{n}}(y,\boldsymbol{x})=\frac{1}{\sqrt{2\pi\sigma_{0}^{2}}}\exp\Big(-\frac{1}{2\sigma_{0}^{2}}(y-f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x}))^{2}\Big) (9)

where $\boldsymbol{\theta}_{n}$ and $f_{\boldsymbol{\theta}_{n}}$ are defined in (4). The sieve is then defined as:

\mathcal{G}_{n}=\Big\{l_{\boldsymbol{\omega}_{n}}(y,\boldsymbol{x}),\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}\Big\},\qquad\mathcal{F}_{n}=\Big\{\boldsymbol{\theta}_{n}:|\theta_{in}|\leq C_{n}\Big\} (10)

Likelihood:

L(\boldsymbol{\omega}_{n})=\prod_{i=1}^{n}l_{\boldsymbol{\omega}_{n}}(y_{i},\boldsymbol{x}_{i}) (11)

Posterior: Let $p(\boldsymbol{\omega}_{n})$ denote the prior on $\boldsymbol{\omega}_{n}$. Then, the posterior is given by

\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})=\frac{L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})}{\int L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}} (12)

Variational Family: The variational family for $\boldsymbol{\omega}_{n}$ is given by

\mathcal{Q}_{n}=\left\{q:q(\boldsymbol{\omega}_{n})=\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\tilde{s}^{2}_{in}}}e^{-\frac{(\theta_{in}-\tilde{m}_{in})^{2}}{2\tilde{s}_{in}^{2}}}\right\} (13)

Let the variational posterior be denoted by

\pi^{*}(\boldsymbol{\omega}_{n})=\underset{q\in\mathcal{Q}_{n}}{\operatorname{argmin}}\;d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n})) (14)

Hellinger neighborhood: Define the neighborhood of the true density $l_{0}$ as

\mathcal{V}_{\varepsilon}=\{\boldsymbol{\omega}_{n}:d_{H}(l_{0},l_{\boldsymbol{\omega}_{n}})<\varepsilon\} (15)

where the Hellinger distance $d_{H}(l_{0},l_{\boldsymbol{\omega}_{n}})$ is given by

d_{H}(l_{0},l_{\boldsymbol{\omega}_{n}})=\int\int\left(\sqrt{l_{\boldsymbol{\omega}_{n}}(\boldsymbol{x},y)}-\sqrt{l_{0}(\boldsymbol{x},y)}\right)^{2}d\boldsymbol{x}dy

Note that the above simplified form of the Hellinger distance is due to (8).

In the following two theorems, for two classes of priors, we establish the posterior consistency of $\pi^{*}$, i.e. the variational posterior concentrates in $\varepsilon$-small Hellinger neighborhoods of the true density $l_{0}$. Note that assumptions (A2) and (A3) impose a restriction on the rate of growth of the sum of squares of the coefficients of the approximating neural network solution. Under (A2), we show that a standard normal prior on all the parameters works. However, under the weaker assumption (A3), a normal prior whose variance increases with $n$ is needed. Additionally, we show that for the variational posterior to achieve consistency, the number of parameters, or equivalently the number of nodes $k_{n}$, needs to grow in a controlled fashion.
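As a computational aside, the variational posterior (14) over the family (13) is what standard reparametrization-based VB routines approximate in practice. The following minimal numerical sketch illustrates (13)-(14) for the known-$\sigma$ model: it maximizes a fixed-draw Monte Carlo approximation of the evidence lower bound, which is equivalent to minimizing $d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))$ over $\mathcal{Q}_{n}$. The data-generating function, sizes and optimizer are illustrative assumptions, not part of the theory.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
psi = lambda u: 1.0 / (1.0 + np.exp(-u))

# illustrative data: y = f0(x) + xi with sigma0 known, as in this section
n, p, k, sigma0, zeta = 200, 2, 3, 1.0, 1.0
X = rng.uniform(size=(n, p))
f0 = lambda X: 1.0 + psi(X @ np.array([2.0, -1.0]) + 0.5)   # illustrative f0
y = f0(X) + rng.normal(0.0, sigma0, size=n)

K = 1 + k + k * (p + 1)    # beta_0, beta_j, gamma_j0..gamma_jp (order k_n)

def f_theta(theta, X):
    """Network (4): beta_0 + sum_j beta_j * psi(gamma_j^T x)."""
    beta0, beta = theta[0], theta[1:k + 1]
    gam = theta[k + 1:].reshape(k, p + 1)     # row j: (gamma_j0, gamma_j)
    return beta0 + psi(X @ gam[:, 1:].T + gam[:, 0]) @ beta

def log_lik(theta):
    r = y - f_theta(theta, X)
    return -0.5 * np.sum(r**2) / sigma0**2 - 0.5 * n * np.log(2 * np.pi * sigma0**2)

eps = rng.normal(size=(64, K))   # fixed base draws: sample-average ELBO

def neg_elbo(lam):
    m, log_s = lam[:K], lam[K:]
    s = np.exp(log_s)
    thetas = m + s * eps                           # reparameterized draws from q
    ell = np.mean([log_lik(t) for t in thetas])    # estimate of E_q log L
    # closed-form KL(q || prior) for diagonal Gaussians, prior N(0, zeta^2 I)
    kl = 0.5 * np.sum((s**2 + m**2) / zeta**2 - 1.0 - 2 * log_s + 2 * np.log(zeta))
    return -(ell - kl)             # maximizing ELBO minimizes KL to the posterior

lam0 = np.concatenate([np.zeros(K), np.full(K, -2.0)])
res = minimize(neg_elbo, lam0, method="L-BFGS-B")
m_tilde, s_tilde = res.x[:K], np.exp(res.x[K:])
print("fitted variational means:", np.round(m_tilde, 2))
```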

Theorem 3.1.

Suppose the number of nodes $k_{n}$ satisfies

(C1)\quad k_{n}\sim n^{a} (16)

In addition, suppose assumptions (A1) and (A2) hold for some $0\leq\delta<1-a$.

Then, with a normal prior for each entry in $\boldsymbol{\omega}_{n}$ as follows

p(\boldsymbol{\omega}_{n})=\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}} (17)

we have

\pi^{*}(\mathcal{V}_{\varepsilon}^{c})=o_{P_{0}^{n}}(n^{-\delta})

Note that conditions (16) and (17) agree with those assumed in Theorem 1 of Lee [2000]. Since $\pi^{*}(\mathcal{V}^{c}_{\varepsilon})=o_{P_{0}^{n}}(n^{-\delta})$, the variational posterior is consistent with $\delta$ as small as 0. Indeed, $\delta=0$ imposes the least restriction on the convergence rate and the coefficient growth rate of the approximating neural network (see assumptions (A1) and (A2)). As $\delta$ grows, restrictions on the approximating neural network increase, but that guarantees faster convergence of the variational posterior. Expanding upon the Bayesian posterior consistency established in Lee [2000], one can show that $\pi(\mathcal{V}_{\varepsilon}^{c}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})=o_{P_{0}^{n}}(e^{-n^{\delta}})$ for any $0\leq\delta<1$ (see Relation (88) in Lee [2000]). Thus, the probability of an $\varepsilon$-Hellinger neighborhood grows at the rate $1-\epsilon(1/n)^{\delta}$ for the variational posterior, in contrast to $1-\epsilon(e^{-n})^{\delta}$ for the true posterior. For parametric models, the corresponding rate for the variational posterior was found to be $1-\epsilon(1/n)$ (see the second equation on page 38 of Blei et al. [2017]). Note that the consistency of the true posterior requires no assumptions on the approximating neural network function, whereas for the variational posterior, both assumptions (A1) and (A2) must be satisfied to guarantee convergence.

Theorem 3.2.

Suppose the number of nodes $k_{n}$ satisfies condition (C1). In addition, suppose assumptions (A1) and (A3) hold for some $0\leq\delta<1-a$ and $v>1$.

Then, with a normal prior for each entry in $\boldsymbol{\omega}_{n}$ as follows

p(\boldsymbol{\omega}_{n})=\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}n^{u}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}n^{u}}},\quad u>v (18)

we have

\pi^{*}(\mathcal{V}_{\varepsilon}^{c})=o_{P_{0}^{n}}(n^{-\delta})

Observe that the consistency rate in Theorem 3.2 agrees with the one in Theorem 3.1. In order to prove both Theorems 3.1 and 3.2, a crucial step is to show that $d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=o_{P_{0}^{n}}(n^{1-\delta})$. To this end, it suffices to show that $d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=o_{P_{0}^{n}}(n^{1-\delta})$ for some $q\in\mathcal{Q}_{n}$. Indeed, this choice of $q$ varies in order to adjust for the changing nature of the prior from (17) to (18) (see statements (1) and (2) in Lemma 7.9).

We next present the proof of Theorems 3.1 and 3.2. The first crucial step of the proof is to establish that $d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))$ is bounded below by a quantity which is determined by the rate of consistency of the true posterior (see the quantities $A_{n}$ and $B_{n}$ in the proof below). The second crucial step is to show that $d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))$ is bounded above at a rate which can exceed the rate of its lower bound only if the variational posterior is consistent.

Proof of Theorems 3.1 and 3.2.

With $\mathcal{V}_{\varepsilon}$ as in (15), we have

d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=\underbrace{\int_{\mathcal{V}_{\varepsilon}}\pi^{*}(\boldsymbol{\omega}_{n})\log\frac{\pi^{*}(\boldsymbol{\omega}_{n})}{\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})}d\boldsymbol{\omega}_{n}}_{③}+\underbrace{\int_{\mathcal{V}_{\varepsilon}^{c}}\pi^{*}(\boldsymbol{\omega}_{n})\log\frac{\pi^{*}(\boldsymbol{\omega}_{n})}{\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})}d\boldsymbol{\omega}_{n}}_{④} (19)

Without loss of generality, $\pi^{*}(\mathcal{V}_{\varepsilon})>0$ and $\pi^{*}(\mathcal{V}_{\varepsilon}^{c})>0$.

③=-\pi^{*}(\mathcal{V}_{\varepsilon})\int_{\mathcal{V}_{\varepsilon}}\frac{\pi^{*}(\boldsymbol{\omega}_{n})}{\pi^{*}(\mathcal{V}_{\varepsilon})}\log\frac{\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})}{\pi^{*}(\boldsymbol{\omega}_{n})}d\boldsymbol{\omega}_{n}
\geq-\pi^{*}(\mathcal{V}_{\varepsilon})\log\int_{\mathcal{V}_{\varepsilon}}\frac{\pi^{*}(\boldsymbol{\omega}_{n})}{\pi^{*}(\mathcal{V}_{\varepsilon})}\frac{\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})}{\pi^{*}(\boldsymbol{\omega}_{n})}d\boldsymbol{\omega}_{n}\quad\text{(Jensen's inequality)}
\geq\pi^{*}(\mathcal{V}_{\varepsilon})\log\frac{\pi^{*}(\mathcal{V}_{\varepsilon})}{\pi(\mathcal{V}_{\varepsilon}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})}\geq\pi^{*}(\mathcal{V}_{\varepsilon})\log\pi^{*}(\mathcal{V}_{\varepsilon})\quad\text{(since }\log\pi(\mathcal{V}_{\varepsilon}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})\leq 0\text{)}

Similarly,

④\geq\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\log\frac{\pi^{*}(\mathcal{V}_{\varepsilon}^{c})}{\pi(\mathcal{V}_{\varepsilon}^{c}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})}
\geq\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\log\pi^{*}(\mathcal{V}_{\varepsilon}^{c})-\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\log\pi(\mathcal{V}_{\varepsilon}^{c}|\boldsymbol{y}_{n},\boldsymbol{X}_{n}) (20)

Now let us consider

\log\pi(\mathcal{V}_{\varepsilon}^{c}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})=\log\frac{\int_{\mathcal{V}_{\varepsilon}^{c}}L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}{\int L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}
=\underbrace{\log\int_{\mathcal{V}_{\varepsilon}^{c}}(L(\boldsymbol{\omega}_{n})/L_{0})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}_{A_{n}}\underbrace{-\log\int(L(\boldsymbol{\omega}_{n})/L_{0})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}_{B_{n}} (21)

Using (21) in (20), we get

④\geq\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\log\pi^{*}(\mathcal{V}_{\varepsilon}^{c})-\pi^{*}(\mathcal{V}_{\varepsilon}^{c})A_{n}-\pi^{*}(\mathcal{V}_{\varepsilon}^{c})B_{n} (22)

Combining (19) and (22), we get

d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))\geq\pi^{*}(\mathcal{V}_{\varepsilon})\log\pi^{*}(\mathcal{V}_{\varepsilon})+\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\log\pi^{*}(\mathcal{V}_{\varepsilon}^{c})-\pi^{*}(\mathcal{V}_{\varepsilon}^{c})A_{n}-\pi^{*}(\mathcal{V}_{\varepsilon}^{c})B_{n} (23)
\geq-\log 2-\pi^{*}(\mathcal{V}_{\varepsilon}^{c})A_{n}-\pi^{*}(\mathcal{V}_{\varepsilon}^{c})B_{n} (24)

where the last inequality follows since $x\log x+(1-x)\log(1-x)\geq-\log 2$ for $0<x<1$.

Therefore,

\boxed{d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))+\log 2+\pi^{*}(\mathcal{V}_{\varepsilon}^{c})B_{n}\geq-\pi^{*}(\mathcal{V}_{\varepsilon}^{c})A_{n}} (25)

By Proposition 7.17,

-A_{n}\geq-\log 2+n\varepsilon^{2}+o_{P_{0}^{n}}(1)
\implies-A_{n}\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\geq-\log 2+n\varepsilon^{2}\pi^{*}(\mathcal{V}_{\varepsilon}^{c})+o_{P_{0}^{n}}(1)
\implies\pi^{*}(\mathcal{V}_{\varepsilon}^{c})n\varepsilon^{2}\leq d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))+2\log 2+\pi^{*}(\mathcal{V}_{\varepsilon}^{c})B_{n}+o_{P_{0}^{n}}(1)

By Proposition 7.18,

\pi^{*}(\mathcal{V}_{\varepsilon}^{c})B_{n}=o_{P_{0}^{n}}(n^{1-\delta})

By Proposition 7.19,

d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=o_{P_{0}^{n}}(n^{1-\delta})

Therefore,

\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\leq o_{P_{0}^{n}}(n^{-\delta})+o_{P_{0}^{n}}(n^{-1})=o_{P_{0}^{n}}(n^{-\delta})

In the above proof we have assumed $\pi^{*}(\mathcal{V}_{\varepsilon})>0$ and $\pi^{*}(\mathcal{V}_{\varepsilon}^{c})>0$. If $\pi^{*}(\mathcal{V}_{\varepsilon}^{c})=0$, there is nothing to prove. If $\pi^{*}(\mathcal{V}_{\varepsilon})=0$, then following the steps of the proof we would get $\varepsilon^{2}=o_{P_{0}^{n}}(n^{-\delta})$, which is a contradiction. ∎

The main step in the above proof is (25), which we discuss next. The quantity $e^{A_{n}}$ is decomposed into two parts:

e^{A_{n}}=\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}(L(\boldsymbol{\omega}_{n})/L_{0})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}+\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}^{c}}(L(\boldsymbol{\omega}_{n})/L_{0})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}

Whereas the first term is controlled using the Hellinger bracketing entropy of $\mathcal{F}_{n}$, the second term is controlled by the fact that the prior gives negligible probability outside $\mathcal{F}_{n}$. Thus, the main factor influencing $e^{A_{n}}$ is a suitable choice of the sequence of spaces $\mathcal{F}_{n}$. Indeed, our choice of $\mathcal{F}_{n}$ is the same as that in Lee [2000], with $k_{n}\sim n^{a}$ and $C_{n}=e^{n^{b-a}}$. Such a choice allows one to control the Hellinger bracketing entropy of $\mathcal{F}_{n}$ while at the same time controlling the prior mass of $\mathcal{F}_{n}^{c}$.

The second quantity, $B_{n}$, is controlled by the rate at which the prior gives mass to shrinking KL neighborhoods of the true density $l_{0}$. Indeed, the quantity $B_{n}$ appears again when computing bounds on $d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))$ for some $q\in\mathcal{Q}_{n}$ (see Proposition 7.19). If $\delta=0$, $B_{n}$ can be controlled even without assumptions (A1) and (A2). However, if $\delta>0$, assumptions (A1) and (A2) are needed in order to guarantee that $B_{n}$ grows at a rate less than $n^{1-\delta}$.

The last quantity, $d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))$, is controlled at a rate less than $n^{1-\delta}$ by showing that there exists a $q\in\mathcal{Q}_{n}$ (see (62) and (65)) such that $d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=o_{P_{0}^{n}}(n^{1-\delta})$. Both assumptions (A1) and (A2) play an important role in guaranteeing that such a $q$ exists.

4 Consistency of variational posterior with $\sigma$ unknown

In this section, we assume that the scale parameter $\sigma$ is unknown. In this case, our approximating variational family is slightly different from (13). Whereas we still assume a mean-field Gaussian family on $\boldsymbol{\theta}_{n}$, our approximating family for $\sigma$ cannot be Gaussian. An important criterion to guarantee the consistency of the variational posterior is to ensure that $\int d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}$ is well bounded (see Lemma 7.11). When $\sigma$ is unknown, $d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})$ involves terms like $\log\sigma$ and $1/\sigma^{2}$, both of whose integrals are undefined under a normally distributed $q$. We thereby adopt two versions of $q$ for $\sigma$: first, an inverse-gamma distribution on $\sigma^{2}$ and, second, a normal distribution on the log-transformed $\sigma$ (see Sections 4.1 and 4.2 respectively). Each has its respective advantage in terms of determining the rate of consistency of the variational posterior. In this section, we work only with assumption (A2); assumption (A3) can be handled in a way exactly similar to Section 3.
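The failure of a Gaussian variational family for $\sigma$ is easy to see numerically: under a normal $q$ on $\sigma$, the density is bounded away from 0 near $\sigma=0$ while $1/\sigma^{2}$ is not integrable there, so $E_{q}(1/\sigma^{2})$ diverges; under an inverse-gamma $q$ on $\sigma^{2}$ it is finite. A quick Monte Carlo check (all shape and location values below are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(3)

# inverse-gamma q on sigma^2 (illustrative shape a and scale b):
a, b = 3.0, 2.0
sig2 = invgamma(a, scale=b).rvs(10**6, random_state=rng)
print("IG q:  E[1/sigma^2] =", np.mean(1 / sig2), " (exact: a/b =", a / b, ")")
print("IG q:  E[log sigma^2] =", np.mean(np.log(sig2)), " (finite)")

# normal q on sigma: E[1/sigma^2] diverges, so Monte Carlo never stabilizes
for m in [10**4, 10**5, 10**6, 10**7]:
    sig = rng.normal(0.5, 1.0, size=m)        # illustrative normal q
    print(f"normal q, m = {m:>8d}:  mean(1/sigma^2) = {np.mean(1 / sig**2):.3g}")
```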

4.1 Inverse-gamma prior on $\sigma$

Sieve Theory: Let $\boldsymbol{\omega}_{n}=(\boldsymbol{\theta}_{n},\sigma^{2})$, where $\boldsymbol{\theta}_{n}$ and $f_{\boldsymbol{\theta}_{n}}$ are defined in (4); then

l_{\boldsymbol{\omega}_{n}}(y,\boldsymbol{x})=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\Big(-\frac{1}{2\sigma^{2}}(y-f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x}))^{2}\Big) (26)

The sieve is defined as follows.

\mathcal{G}_{n}=\Big\{l_{\boldsymbol{\omega}_{n}}(y,\boldsymbol{x}),\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}\Big\},\qquad\mathcal{F}_{n}=\Big\{(\boldsymbol{\theta}_{n},\sigma^{2}):|\theta_{in}|\leq C_{n},\;1/C_{n}^{2}\leq\sigma^{2}\leq D_{n}\Big\} (27)

The definitions for likelihood, posterior and Hellinger neighborhood agree with those given in (11), (12) and (15) as in Section 3.

Prior distribution: We propose a normal prior on each $\theta_{in}$ and an inverse-gamma prior on $\sigma^{2}$:

p(\boldsymbol{\omega}_{n})=\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big(\frac{1}{\sigma^{2}}\Big)^{\alpha+1}e^{-\frac{\lambda}{\sigma^{2}}}\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}} (28)

Variational Family: The variational family for $\boldsymbol{\omega}_{n}$ is given by

\mathcal{Q}_{n}=\left\{q:q(\boldsymbol{\omega}_{n})=\frac{\tilde{b}_{n}^{\tilde{a}_{n}}}{\Gamma(\tilde{a}_{n})}\Big(\frac{1}{\sigma^{2}}\Big)^{\tilde{a}_{n}+1}e^{-\frac{\tilde{b}_{n}}{\sigma^{2}}}\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\tilde{s}^{2}_{in}}}e^{-\frac{(\theta_{in}-\tilde{m}_{in})^{2}}{2\tilde{s}_{in}^{2}}}\right\} (29)

The variational posterior has the same definition as in (14).

The following theorem shows that when the parameter $\sigma$ is unknown, the variational posterior is still consistent; however, the rate worsens by a factor of $n^{\epsilon}$.

Theorem 4.1.

Suppose the number of nodes satisfies condition (C1). In addition, suppose assumptions (A1) and (A2) hold for some $0<\delta<1-a$. Then, for any $\epsilon>0$,

\pi^{*}(\mathcal{V}_{\varepsilon}^{c})=o_{P_{0}^{n}}(n^{\epsilon-\delta})

Note that by Theorem 4.1, the posterior is consistent iff $\epsilon-\delta<0$, which is indeed the case as long as $\delta>0$, since $\epsilon$ can be chosen arbitrarily small. Whether such a $\delta$ exists or not depends on the entropy of the function $f_{0}$ (see the discussion section in Shen et al. [2019]). Mimicking the steps of Theorem 2 of Siegel and Xu [2019], it can be shown that with $k_{n}=n^{a}$, $a>1/2$, $\delta$ can be chosen anywhere in the range $0\leq\delta<1/2$.

Proof.

The proof mimics the steps in the proof of Theorems 3.1 and 3.2 up to equation (25).

By Proposition 7.22, for any $0<r<1$,

-A_{n}\geq-\log 2+n^{r}\varepsilon^{2}+o_{P_{0}^{n}}(1)
\implies-A_{n}\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\geq-\log 2+n^{r}\varepsilon^{2}\pi^{*}(\mathcal{V}_{\varepsilon}^{c})+o_{P_{0}^{n}}(1)
\implies\pi^{*}(\mathcal{V}_{\varepsilon}^{c})n^{r}\varepsilon^{2}\leq d_{KL}(\pi^{*}(\boldsymbol{\omega}_{n}),\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))+2\log 2+\pi^{*}(\mathcal{V}_{\varepsilon}^{c})B_{n}+o_{P_{0}^{n}}(1)

By Proposition 7.23,

\pi^{*}(\mathcal{V}_{\varepsilon}^{c})B_{n}=o_{P_{0}^{n}}(n^{1-\delta})

By Proposition 7.24,

d_{KL}(\pi^{*}(\boldsymbol{\omega}_{n}),\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=o_{P_{0}^{n}}(n^{1-\delta})

Therefore, with $r=1-\epsilon$, we have

\pi^{*}(\mathcal{V}_{\varepsilon}^{c})\leq o_{P_{0}^{n}}(n^{1-\delta-r})+o_{P_{0}^{n}}(n^{-r})=o_{P_{0}^{n}}(n^{\epsilon-\delta})+o_{P_{0}^{n}}(n^{\epsilon-1})=o_{P_{0}^{n}}(n^{\epsilon-\delta})

∎

Similar to the proof of Theorem 3.1, the quantity $e^{A_{n}}$ is decomposed into two parts:

e^{A_{n}}=\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}(L(\boldsymbol{\omega}_{n})/L_{0})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}+\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}^{c}}(L(\boldsymbol{\omega}_{n})/L_{0})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}

Whereas the first term is controlled using the Hellinger bracketing entropy of $\mathcal{F}_{n}$ at the rate $e^{-n\varepsilon^{2}}$, the second term is controlled by the prior probability of $\mathcal{F}_{n}^{c}$ at the rate $e^{-n^{r}}$, $0<r<1$. Since the prior probability of $\mathcal{F}_{n}^{c}$ is now controlled at a slightly slower rate than in Theorem 3.1, an additional $\epsilon$ term appears in the overall consistency rate of the variational posterior.

Remark 4.2.

With $k_{n}\sim n^{a}$ and $\mathcal{F}_{n}$ as in (27), we choose $C_{n}=e^{n^{b-a}}$ and $D_{n}=e^{n^{b}}$, $0<a<b<1$, to prove the posterior consistency statement of Theorem 4.1. By suitably choosing $\mathcal{F}_{n}$ as a function of $\varepsilon$, one may be able to refine the proof to obtain a rate of $o_{P_{0}^{n}}(n^{-\delta})$ instead of $o_{P_{0}^{n}}(n^{\epsilon-\delta})$. However, the proof becomes more involved, and such an $\varepsilon$-dependent choice of $\mathcal{F}_{n}$ has been avoided for the purposes of this paper.

Remark 4.3.

When $\sigma$ is unknown, in order to control $d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))$ at a rate less than $n^{1-\delta}$, $q(\boldsymbol{\theta}_{n})$ has the same form as in the proof of Theorem 3.1. However, we cannot choose a normally distributed $q$ for $\sigma^{2}$. The convergence of $d_{KL}(\pi^{*}(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))$ is determined by the term $\int d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}$, which involves terms like $\frac{1}{2\sigma^{2}}$ and $\log\sigma^{2}$ (see (7.3)). The expectation of these terms is not defined under a normal $q$ but is well defined under an inverse-gamma distribution, hence the inverse-gamma variational family for $q(\sigma^{2})$.

4.2 Normal prior on the log-transformed $\sigma$

Given the wide popularity of Gaussian mean-field approximations, we next use a normal variational distribution on the log-transformed $\sigma$ and compare and contrast it with the case where an inverse-gamma variational distribution is placed on the scale parameter. In Section 3.3 of Blei et al. [2017], it has been posited that a Gaussian VB posterior can be used to approximate a wide class of posteriors. However, as mentioned in Section 4.1, a normal $q$ would cause $E_{Q}d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})$ to be undefined. One way out of this impasse is reparametrizing $\sigma$ as $\sigma_{\rho}=\log(1+\exp(\rho))$, where a normal prior is used for $\rho$. In the following section, we show that this approach works but comes with the disadvantage that the number of nodes $k_{n}$ needs to grow at a rate slower than $n^{1/2}$. The main drawback is that if the number of nodes does not grow sufficiently fast, it may be difficult to find a neural network which well approximates the true function.
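A quick numerical sanity check of the reparametrization (values below are illustrative): since $\sigma_{\rho}=\log(1+e^{\rho})\approx e^{\rho}$ as $\rho\to-\infty$, quantities like $E(1/\sigma_{\rho}^{2})$ reduce to lognormal-type moments and remain finite under a Gaussian $q$ on $\rho$.

```python
import numpy as np

rng = np.random.default_rng(4)

softplus = lambda r: np.log1p(np.exp(r))    # sigma_rho = log(1 + e^rho)

# sigma_rho stays positive, and softplus(rho) ~ e^rho as rho -> -infinity,
# so 1/sigma_rho^2 ~ e^{-2 rho} has a finite expectation under a normal q
# on rho (illustrative mean and standard deviation below):
rho = rng.normal(0.5, 0.3, size=10**6)
sigma = softplus(rho)
print("E[1/sigma_rho^2]   =", np.mean(1 / sigma**2))       # finite
print("E[log sigma_rho^2] =", np.mean(np.log(sigma**2)))   # finite
```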

Sieve Theory: Let $\boldsymbol{\omega}_{n}=(\boldsymbol{\theta}_{n},\rho)$, where $\boldsymbol{\theta}_{n}$ and $f_{\boldsymbol{\theta}_{n}}$ are as defined in (4). With $\sigma_{\rho}=\log(1+e^{\rho})$, we have

l_{\boldsymbol{\omega}_{n}}(y,\boldsymbol{x})=\frac{1}{\sqrt{2\pi\sigma_{\rho}^{2}}}\exp\Big(-\frac{1}{2\sigma_{\rho}^{2}}(y-f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x}))^{2}\Big) (30)

The sieve is defined as follows.

\mathcal{G}_{n}=\Big\{l_{\boldsymbol{\omega}_{n}}(y,\boldsymbol{x}),\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}\Big\},\qquad\mathcal{F}_{n}=\Big\{(\boldsymbol{\theta}_{n},\rho):|\theta_{in}|\leq C_{n},\;|\rho|<\log C_{n}\Big\} (31)

The definitions for likelihood, posterior and Hellinger neighborhood agree with those given in (11), (12) and (15) as in Section 3.

Prior distribution: We propose a normal prior on each $\theta_{in}$ and on $\rho$ as follows

p(\boldsymbol{\omega}_{n})=\frac{1}{\sqrt{2\pi\eta^{2}}}e^{-\frac{\rho^{2}}{2\eta^{2}}}\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}} (32)

Variational Family: The variational family for $\boldsymbol{\omega}_{n}$ is given by

\mathcal{Q}_{n}=\left\{q:q(\boldsymbol{\omega}_{n})=\frac{1}{\sqrt{2\pi\tilde{s}^{2}_{0n}}}e^{-\frac{(\rho-\tilde{m}_{0n})^{2}}{2\tilde{s}_{0n}^{2}}}\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\tilde{s}^{2}_{in}}}e^{-\frac{(\theta_{in}-\tilde{m}_{in})^{2}}{2\tilde{s}_{in}^{2}}}\right\} (33)

The variational posterior has the same definition as in (14).

In the following theorem, we show that even with $\sigma$ reparametrized as $\log(1+e^{\rho})$, the variational posterior is consistent.

Theorem 4.4.

Suppose the number of nodes satisfies condition (C1) with $a<1/2$. In addition, suppose assumptions (A1) and (A2) hold for $0\leq\delta<1-a$. Then,

\pi^{*}(\mathcal{V}_{\varepsilon}^{c})=o_{P_{0}^{n}}(n^{-\delta})
Proof.

The proof mimics the steps in the proof of Theorems 3.1 and 3.2, with Propositions 7.17, 7.18 and 7.19 replaced by Propositions 7.27, 7.28 and 7.29 respectively. ∎

Remark 4.5.

With $k_{n}\sim n^{a}$ and $\mathcal{F}_{n}$ as in (31), we choose $C_{n}=e^{n^{b-a}}$, where $0<a<b<1$. In order to ensure that the prior gives negligible mass outside $\mathcal{F}_{n}$, one requires $\pi_{n}(\mathcal{F}_{n}^{c})<e^{-ns}$ for some $s>0$. With a normal prior on $\rho$, $P(|\rho|>\log C_{n})\sim\frac{1}{\log C_{n}}e^{-(\log C_{n})^{2}}$, which is less than $e^{-n}$ provided $2(b-a)>1$, i.e. $a<1/2$ (since $b<1$). Hence the requirement of a slow growth in the number of nodes.

5 Consistency of variational Bayes

In this section, we show that if the variational posterior is consistent, the variational Bayes estimators of $\sigma$ and $f_{\boldsymbol{\theta}_{n}}$ converge to the true $\sigma_{0}$ and $f_{0}$. The proof uses ideas from Barron et al. [1999] and Corollary 1 in Lee [2000]. Let

\hat{f}_{n}(\boldsymbol{x})=\int f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})\pi^{*}(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}
\hat{\sigma}^{2}_{n}=\int\sigma^{2}\pi^{*}(\sigma^{2})d\sigma^{2} (34)
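In practice, both integrals in (34) are evaluated by Monte Carlo averaging over draws from the fitted variational posterior. The sketch below assumes an already-fitted mean-field Gaussian $q$ on $\boldsymbol{\theta}_{n}$ and an inverse-gamma $q$ on $\sigma^{2}$; all fitted values are invented for illustration.

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(5)
psi = lambda u: 1.0 / (1.0 + np.exp(-u))

# hypothetical fitted variational parameters (invented for illustration)
k, p = 2, 1
m_tilde = np.array([0.3, 1.1, -0.8, 0.4, 2.0, -0.5, 1.5])  # length 1 + k + k(p+1)
s_tilde = np.full(m_tilde.size, 0.1)
a_n, b_n = 50.0, 45.0                      # q(sigma^2) = IG(a_n, scale=b_n)

def f_theta(theta, x):
    beta0, beta = theta[0], theta[1:k + 1]
    gam = theta[k + 1:].reshape(k, p + 1)
    return beta0 + psi(x @ gam[:, 1:].T + gam[:, 0]) @ beta

# Monte Carlo versions of the two integrals in (34):
x = rng.uniform(size=(5, p))               # a few evaluation points
thetas = m_tilde + s_tilde * rng.normal(size=(2000, m_tilde.size))
f_hat = np.mean([f_theta(t, x) for t in thetas], axis=0)       # \hat f_n(x)
sig2_hat = invgamma(a_n, scale=b_n).rvs(10**5, random_state=rng).mean()
print("f_hat:", np.round(f_hat, 3))
print("sigma2_hat:", round(sig2_hat, 3), " (exact mean: b_n/(a_n-1) =", b_n / (a_n - 1), ")")
```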
Corollary 5.1 (Variational Bayes consistency).

Suppose $\hat{f}_{n}$ and $\hat{\sigma}_{n}^{2}$ are defined as in (34); then

\int(\hat{f}_{n}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}=o_{P_{0}^{n}}(1)
\frac{\hat{\sigma}_{n}}{\sigma_{0}}=1+o_{P_{0}^{n}}(1) (35)
Proof.

Let

\hat{l}_{n}(y,\boldsymbol{x})=\int l_{\boldsymbol{\omega}_{n}}(y,\boldsymbol{x})\pi^{*}(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}
d_{H}(\hat{l}_{n}(y,\boldsymbol{x}),l_{0}(y,\boldsymbol{x}))=d_{H}\left(\int l_{\boldsymbol{\omega}_{n}}\pi^{*}(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n},l_{0}\right)
\leq\int d_{H}(l_{\boldsymbol{\omega}_{n}},l_{0})\pi^{*}(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\quad\text{(Jensen's inequality)}
=\int_{\mathcal{V}_{\varepsilon}}d_{H}(l_{\boldsymbol{\omega}_{n}},l_{0})\pi^{*}(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}+\int_{\mathcal{V}_{\varepsilon}^{c}}d_{H}(l_{\boldsymbol{\omega}_{n}},l_{0})\pi^{*}(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}
\leq\varepsilon+o_{P_{0}^{n}}(1)

Letting $\varepsilon\to 0$, we get $d_{H}(\hat{l}_{n}(y,\boldsymbol{x}),l_{0}(y,\boldsymbol{x}))=o_{P_{0}^{n}}(1)$. Now,

\hat{l}_{n}(y,\boldsymbol{x})=\frac{1}{\sqrt{2\pi\hat{\sigma}^{2}_{n}}}e^{-\frac{1}{2\hat{\sigma}_{n}^{2}}(y-\hat{f}_{n}(\boldsymbol{x}))^{2}}

Now, let us consider the form of

d_{H}(\hat{l}_{n},l_{0})=\int\int\left(\sqrt{\hat{l}_{n}(y,\boldsymbol{x})}-\sqrt{l_{0}(y,\boldsymbol{x})}\right)^{2}dyd\boldsymbol{x}
=2-2\int\int\sqrt{\hat{l}_{n}(y,\boldsymbol{x})l_{0}(y,\boldsymbol{x})}dyd\boldsymbol{x}
=2-2\int\int\frac{1}{\sqrt{2\pi\hat{\sigma}_{n}\sigma_{0}}}\exp\left\{-\frac{1}{4}\left(\frac{(y-\hat{f}_{n}(\boldsymbol{x}))^{2}}{\hat{\sigma}_{n}^{2}}+\frac{(y-f_{0}(\boldsymbol{x}))^{2}}{\sigma_{0}^{2}}\right)\right\}dyd\boldsymbol{x}
=2-2\underbrace{\sqrt{\frac{2}{\hat{\sigma}_{n}/\sigma_{0}+\sigma_{0}/\hat{\sigma}_{n}}}}_{①}\underbrace{\int e^{-\frac{1}{4(\hat{\sigma}^{2}_{n}+\sigma_{0}^{2})}(\hat{f}_{n}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}}d\boldsymbol{x}}_{②}

Since $d_{H}(\hat{l}_{n},l_{0})=o_{P_{0}^{n}}(1)$, we have ①$\times$②$\stackrel{P_{0}^{n}}{\longrightarrow}1$.

Note that ①$\leq 1$ and ②$\leq 1$; thus ①$\stackrel{P_{0}^{n}}{\longrightarrow}1$ and ②$\stackrel{P_{0}^{n}}{\longrightarrow}1$.

Since $x+1/x\geq 2$ with equality iff $x=1$, we have

①\stackrel{P_{0}^{n}}{\longrightarrow}1\implies\hat{\sigma}_{n}\stackrel{P_{0}^{n}}{\longrightarrow}\sigma_{0}

We shall next show

②\stackrel{P_{0}^{n}}{\longrightarrow}1\implies\int(\hat{f}_{n}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}\stackrel{P_{0}^{n}}{\longrightarrow}0

We shall instead show that for any sequence $\{n\}$, there exists a further subsequence $\{n_{k}\}$ such that $\int(\hat{f}_{n_{k}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}\stackrel{a.s.}{\longrightarrow}0$.

Since ②$\stackrel{P_{0}^{n}}{\to}1$, there exists a subsequence $\{n_{k}\}$ such that

\int e^{-\frac{1}{4(\hat{\sigma}^{2}_{n_{k}}+\sigma_{0}^{2})}(\hat{f}_{n_{k}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}}d\boldsymbol{x}\stackrel{a.s.}{\longrightarrow}1

This implies

\frac{1}{4(\hat{\sigma}^{2}_{n_{k}}+\sigma_{0}^{2})}(\hat{f}_{n_{k}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}\stackrel{a.s.}{\to}0\quad a.e.\;\boldsymbol{x}

(for details see proof of Corollary 1 in Lee [2000]).

Thus, using Scheffé's theorem (Scheffe [1947]), we have

\int\frac{1}{4(\hat{\sigma}^{2}_{n_{k}}+\sigma_{0}^{2})}(\hat{f}_{n_{k}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}\stackrel{a.s.}{\to}0

which implies

\int\frac{1}{4(\hat{\sigma}^{2}_{n}+\sigma_{0}^{2})}(\hat{f}_{n}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}=o_{P_{0}^{n}}(1)

Since $\hat{\sigma}_{n}\stackrel{P_{0}^{n}}{\to}\sigma_{0}$, applying Slutsky's theorem, we get

\int(\hat{f}_{n}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}=o_{P_{0}^{n}}(1)

∎

6 Discussion

In this paper, we have highlighted the conditions which guarantee that the variational posterior of feed-forward neural networks is consistent. A variational family as simple as a Gaussian mean-field family is good enough to ensure that the variational posterior is consistent, provided the entropy of the true function $f_{0}$ is well behaved. In other words, $f_{0}$ has an approximating neural network solution which approximates $f_{0}$ at a fast enough rate while ensuring that the number of nodes and the $L_{2}$ norm of the NN parameters grow in a controlled manner. Conditions of this form are often needed when one tries to establish the consistency of neural networks in a frequentist setup (see condition C3 in Shen et al. [2019]). Whereas the variational posterior presents a scalable alternative to MCMC, unlike MCMC its consistency cannot be guaranteed without certain restrictions on the entropy of the true function. Two other main contributions of the paper are that (1) a Gaussian family may not always be the best choice for a variational family (see Section 4) and (2) one may need a prior with variance growing in $n$ when the rate of growth of the $L_{2}$ norm of the approximating NN is high (see Theorem 3.2).

Although we have quantified the consistency of the variational posterior, the rate of contraction of the variational posterior still needs to be explored. We suspect that this rate would be closely related to the rate of contraction of the true posterior under mild assumptions on the entropy of the function $f_{0}$. By following the ideas of the proofs in this paper, one may be able to quantify conditions on the entropy of $f_{0}$ which guarantee the consistency of the variational posterior when one uses a deep neural network instead of a one-layer neural network. Similarly, the effect of hierarchical priors and hyperparameters on the rate of convergence of the variational posterior needs to be explored.

7 Appendix

7.1 General Lemmas

Lemma 7.1.

Let $p$ and $q$ be any two density functions. Then

E_{p}\left(\left|\log\frac{p}{q}\right|\right)\leq d_{KL}(p,q)+\frac{2}{e}
Proof.

The proof is the same as that of Lemma 4 in Lee [2000]. ∎

Lemma 7.2.

Let $f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})=\beta_{00}+\sum_{j=1}^{k_{n}}\beta_{j0}\psi(\gamma_{j0}^{\top}\boldsymbol{x})$ be a fixed neural network and let $\boldsymbol{\theta}_{n}$ satisfy

|\theta_{in}-\theta_{i0n}|\leq\epsilon,\quad i=1,\cdots,K(n).

Then,

\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x}))^{2}d\boldsymbol{x}\leq 8\left(k_{n}^{2}+(p+1)^{2}\Big(\sum_{i=1}^{K(n)}|\theta_{i0n}|\Big)^{2}\right)\epsilon^{2}
Proof.

This proof uses some ideas from Lemma 6 in Lee [2000]. Note that

f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})=\beta_{0}+\sum_{j=1}^{k_{n}}\beta_{j}\psi(\gamma_{j}^{\top}\boldsymbol{x}),\qquad f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})=\beta_{00}+\sum_{j=1}^{k_{n}}\beta_{j0}\psi(\gamma_{j0}^{\top}\boldsymbol{x})

Therefore,

|f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})|\leq|\beta_{0}-\beta_{00}|+\sum_{j=1}^{k_{n}}|\beta_{j}\psi(\gamma_{j}^{\top}\boldsymbol{x})-\beta_{j0}\psi(\gamma_{j0}^{\top}\boldsymbol{x})|

Let $u_{j}=-\gamma_{j0}^{\top}\boldsymbol{x}$ and $r_{j}=(\gamma_{j0}-\gamma_{j})^{\top}\boldsymbol{x}$; then

|f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})|\leq|\beta_{0}-\beta_{00}|+\sum_{j=1}^{k_{n}}\Big|\frac{\beta_{j}}{1+e^{u_{j}+r_{j}}}-\frac{\beta_{j0}}{1+e^{u_{j}}}\Big|
=|\beta_{0}-\beta_{00}|+\sum_{j=1}^{k_{n}}\Big|\frac{\beta_{j}(1+e^{u_{j}})-\beta_{j0}(1+e^{u_{j}+r_{j}})}{(1+e^{u_{j}+r_{j}})(1+e^{u_{j}})}\Big|
\leq|\beta_{0}-\beta_{00}|+\sum_{j=1}^{k_{n}}\frac{|\beta_{j}-\beta_{j0}|+|\beta_{j}e^{u_{j}}-\beta_{j0}e^{u_{j}+r_{j}}|}{(1+e^{u_{j}+r_{j}})(1+e^{u_{j}})}
\leq|\beta_{0}-\beta_{00}|+2\sum_{j=1}^{k_{n}}|\beta_{j}-\beta_{j0}|+\sum_{j=1}^{k_{n}}|\beta_{j0}||1-e^{r_{j}}|

Since, for $\epsilon$ small, $|r_{j}|\leq(p+1)\epsilon<1$, we have $|1-e^{r_{j}}|<2|r_{j}|\leq 2(p+1)\epsilon$.

|f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})|\leq 2k_{n}\epsilon+2\epsilon(p+1)\sum_{j=1}^{k_{n}}|\beta_{j0}|\leq 2k_{n}\epsilon+2\epsilon(p+1)\sum_{i=1}^{K(n)}|\theta_{i0n}|

Using

(a+b)^{2}\leq 2(a^{2}+b^{2}) (36)

the proof follows. ∎

Lemma 7.3.

With $|\sigma/\sigma_{0}-1|<\delta$,

1. $h_{1}(\sigma)=\frac{1}{2}\log\frac{\sigma^{2}}{\sigma_{0}^{2}}-\frac{1}{2}\left(1-\frac{\sigma_{0}^{2}}{\sigma^{2}}\right)\leq\delta^{2}$
2. $h_{2}(\sigma)=\frac{1}{2\sigma^{2}}\leq\frac{1}{2\sigma_{0}^{2}(1-\delta)^{2}}$
Proof.

Let $x=\sigma/\sigma_{0}$; then

1. $h_{1}(x)=\frac{1}{2}\log x^{2}-\frac{1}{2}\left(1-\frac{1}{x^{2}}\right)$, where $|x-1|<\delta$. The function $h_{1}(x)$ satisfies

h_{1}(x)\leq(x-1)h_{1}^{\prime}(1)+\frac{(x-1)^{2}}{2}h_{1}^{\prime\prime}(1)\leq\delta h_{1}^{\prime}(1)+\frac{\delta^{2}}{2}h_{1}^{\prime\prime}(1)=\delta^{2}

since $h_{1}^{\prime\prime\prime}(y)\leq 0$ for every $y\in(1-\delta,1+\delta)$.

2. $h_{2}(x)=\frac{1}{2\sigma_{0}^{2}x^{2}}\leq\frac{1}{2\sigma_{0}^{2}(1-\delta)^{2}}$ ∎

Lemma 7.4.

With $\sigma_{\rho}=\log(1+e^{\rho})$ and $|\rho-\rho_{0}|<\delta\sigma_{0}$, where $\sigma_{0}=\log(1+e^{\rho_{0}})$:

1. $h_{1}(\rho)=\frac{1}{2}\log\frac{\sigma_{\rho}^{2}}{\sigma_{0}^{2}}-\frac{1}{2}\left(1-\frac{\sigma_{0}^{2}}{\sigma_{\rho}^{2}}\right)\leq\delta^{2}$
2. $h_{2}(\rho)=\frac{1}{2\sigma_{\rho}^{2}}\leq\frac{1}{2\sigma_{0}^{2}(1-\delta)^{2}}$
Proof.

$|\rho-\rho_{0}|<\delta\log(1+e^{\rho_{0}})$ implies

\log(1+e^{\rho})-\log(1+e^{\rho_{0}})\leq\delta\log(1+e^{\rho_{0}})

Similarly,

\log(1+e^{\rho})-\log(1+e^{\rho_{0}})\geq-\delta\log(1+e^{\rho_{0}})

Thus, $|\sigma_{\rho}/\sigma_{0}-1|<\delta$. The remaining part of the proof follows along the same lines as Lemma 7.3. ∎

Lemma 7.5.

With $q(\sigma^{2})=((n\sigma_{0}^{2})^{n}/\Gamma(n))(1/\sigma^{2})^{n+1}e^{-n\sigma_{0}^{2}/\sigma^{2}}$ and $h(\sigma^{2})=(1/2)(\log(\sigma^{2}/\sigma_{0}^{2})-(1-\sigma_{0}^{2}/\sigma^{2}))$, for every $0\leq\delta<1$, we have

\int h(\sigma^{2})q(\sigma^{2})d\sigma^{2}=o(n^{-\delta})
Proof.
\int h(\sigma^{2})q(\sigma^{2})d\sigma^{2}=\int\frac{1}{2}\left(\log\frac{\sigma^{2}}{\sigma_{0}^{2}}-\left(1-\frac{\sigma_{0}^{2}}{\sigma^{2}}\right)\right)\frac{(n\sigma_{0}^{2})^{n}}{\Gamma(n)}\left(\frac{1}{\sigma^{2}}\right)^{n+1}e^{-\frac{n\sigma_{0}^{2}}{\sigma^{2}}}d\sigma^{2}
=\frac{1}{2}\left(\log(n\sigma_{0}^{2})-\psi(n)-\log\sigma_{0}^{2}\right)-\frac{1}{2}\left(1-\frac{n\sigma_{0}^{2}}{n\sigma_{0}^{2}}\right)
=\frac{1}{2}\left(\log n-\log n+O(n^{-1})\right)=o(n^{-\delta})

where the second equality uses $E(\log(1/\sigma^{2}))=\psi(n)-\log(n\sigma_{0}^{2})$ and $E(\sigma_{0}^{2}/\sigma^{2})=1$ under the inverse-gamma $q$, and the last step holds because the digamma function satisfies $\psi(n)=\log n+O(n^{-1})$ (see Lemma 4 in Elezovic and Giordano [2000]). ∎
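The final step is easy to verify numerically; the sketch below evaluates $\frac{1}{2}(\log n-\psi(n))$ for a few $n$ and shows that $n$ times this quantity stabilizes (near $1/4$), consistent with the $O(n^{-1})$ claim.

```python
import numpy as np
from scipy.special import digamma

# Lemma 7.5 reduces to (1/2)(log n - psi(n)) = O(1/n), psi = digamma:
for n in [10, 100, 1000, 10000]:
    val = 0.5 * (np.log(n) - digamma(n))
    print(f"n = {n:6d}:  value = {val:.3e}   n * value = {n * val:.4f}")
```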

Lemma 7.6.

With $q(\sigma^{2})=((n\sigma_{0}^{2})^{n}/\Gamma(n))(1/\sigma^{2})^{n+1}e^{-n\sigma_{0}^{2}/\sigma^{2}}$ and $h(\sigma^{2})=1/(2\sigma^{2})$, for every $0\leq\delta<1$,

\int h(\sigma^{2})q(\sigma^{2})d\sigma^{2}=\frac{1}{2\sigma_{0}^{2}}
Proof.

\int h(\sigma^{2})q(\sigma^{2})d\sigma^{2} =\int\frac{1}{2\sigma^{2}}\frac{(n\sigma_{0}^{2})^{n}}{\Gamma(n)}\left(\frac{1}{\sigma^{2}}\right)^{n+1}e^{-\frac{n\sigma_{0}^{2}}{\sigma^{2}}}d\sigma^{2}
=\frac{1}{2}E\left(\frac{1}{\sigma^{2}}\right)=\frac{n}{2n\sigma_{0}^{2}}=\frac{1}{2\sigma_{0}^{2}}

since 1/\sigma^{2} follows a gamma distribution with shape n and rate n\sigma_{0}^{2}. ∎

Lemma 7.7.

With \sigma_{\rho}=\log(1+e^{\rho}) and \sigma_{0}=\log(1+e^{\rho_{0}}), let h(\rho)=(1/2)\log(\sigma_{\rho}^{2}/\sigma_{0}^{2})-(1/2)(1-\sigma_{0}^{2}/\sigma_{\rho}^{2}) and q(\rho)=\sqrt{n/(2\pi\nu^{2})}e^{-n(\rho-\rho_{0})^{2}/2\nu^{2}}. Then, for every 0\leq\delta<1, we have

\int h(\rho)q(\rho)d\rho=o(n^{-\delta})

Proof.

First note that h(\rho)\geq 0, thus it suffices to show \int h(\rho)q(\rho)d\rho\leq o(n^{-\delta}). In this direction,

\int h(\rho)q(\rho)d\rho=\underbrace{\int_{|\rho-\rho_{0}|<1/n^{\delta/2}}h(\rho)q(\rho)d\rho}_{①}+\underbrace{\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}h(\rho)q(\rho)d\rho}_{②}

We can apply a Taylor expansion to ① as

① =\int_{|\rho-\rho_{0}|<1/n^{\delta/2}}\left(h(\rho_{0})+(\rho-\rho_{0})h^{\prime}(\rho_{0})+\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})+o((\rho-\rho_{0})^{2})\right)q(\rho)d\rho
=\int_{|\rho-\rho_{0}|<1/n^{\delta/2}}\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})q(\rho)d\rho+o(n^{-\delta})

where the equality follows since h(\rho_{0})=0 and q(\rho) is symmetric around \rho=\rho_{0}, so the linear term integrates to zero.

It is easy to check that h^{\prime\prime}(\rho_{0})>0, which implies

\int_{|\rho-\rho_{0}|<1/n^{\delta/2}}\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})q(\rho)d\rho\leq\int\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})q(\rho)d\rho=\frac{h^{\prime\prime}(\rho_{0})\nu^{2}}{2n}=O(n^{-1})

Thus, for every 0\leq\delta<1, ①\leq O(n^{-1})+o(n^{-\delta})=o(n^{-\delta}).

For the remaining part of the proof, we shall make use of the Mills ratio approximation

1-\Phi(a_{n})\sim\frac{\phi(a_{n})}{a_{n}} (37)

where \Phi and \phi are the cdf and pdf of the standard normal distribution, respectively.
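As a quick illustration of how sharp (37) already is for moderate arguments, one may run the following sketch:

```python
# Numerical illustration of the Mills-ratio approximation (37).
from scipy.stats import norm

for a in (2.0, 5.0, 10.0):
    tail = norm.sf(a)              # 1 - Phi(a)
    mills = norm.pdf(a) / a        # phi(a)/a
    print(f"a={a:4.1f}: tail={tail:.3e}, phi(a)/a={mills:.3e}, ratio={tail/mills:.4f}")
```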

For ②,

② =\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}\left(\frac{1}{2}\log\frac{\sigma_{\rho}^{2}}{\sigma_{0}^{2}}-\frac{1}{2}\left(1-\frac{\sigma_{0}^{2}}{\sigma_{\rho}^{2}}\right)\right)\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq-\frac{1}{2}\log\sigma_{0}^{2}\underbrace{\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho}_{③}+\frac{1}{2}\underbrace{\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}\log\sigma_{\rho}^{2}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho}_{④}
+\sigma_{0}^{2}\underbrace{\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho}_{⑤}

Let c=\log(e-1); then c>0, and \sigma_{0}\geq 1 if and only if \rho_{0}\geq c.

If \rho_{0}\geq c, then -\log\sigma_{0}^{2}\leq 0 and the term involving ③ can be dropped. If \rho_{0}<c, then -\log\sigma_{0}^{2}>0 and

③=2\left(1-\Phi\left(\frac{\sqrt{n}}{\nu n^{\delta/2}}\right)\right)\sim O\left(\frac{1}{n^{\frac{1}{2}-\frac{\delta}{2}}}e^{-n^{1-\delta}}\right)=o(n^{-\delta}) (38)

For ④, we make use of the following result:

If \rho<c, \log\sigma_{\rho}<0. For \rho>c, \log\sigma_{\rho}\leq\log(2e^{\rho}). (39)

If \rho_{0}<c, then \rho_{0}-1/n^{\delta/2},\rho_{0}+1/n^{\delta/2}<c for n sufficiently large.

Using (39) and getting rid of negative terms, we get

④ \leq\int_{c}^{\infty}\log\sigma_{\rho}^{2}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho\leq\int_{c}^{\infty}2(\log 2+\rho)\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
=2\log 2\int_{\sqrt{n}(c-\rho_{0})/\nu}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^{2}}du+2\int_{\sqrt{n}(c-\rho_{0})/\nu}^{\infty}\left(\frac{u\nu}{\sqrt{n}}+\rho_{0}\right)\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^{2}}du
=(2\log 2+2\rho_{0})\Phi\left(-\frac{\sqrt{n}(c-\rho_{0})}{\nu}\right)+\frac{2\nu}{\sqrt{2\pi n}}e^{-\frac{n(c-\rho_{0})^{2}}{2\nu^{2}}}
=O\left(\frac{1}{\sqrt{n}}e^{-n}\right)+O\left(\frac{1}{\sqrt{n}}e^{-n}\right)=o(n^{-\delta})\hskip 28.45274pt\text{ follows from (37)}

If \rho_{0}>c, then \rho_{0}-1/n^{\delta/2},\rho_{0}+1/n^{\delta/2}>c for n sufficiently large.

Using (39) and getting rid of negative terms, we get

④ \leq\int_{c}^{\rho_{0}-1/n^{\delta/2}}\log\sigma_{\rho}^{2}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho+\int_{\rho_{0}+1/n^{\delta/2}}^{\infty}\log\sigma_{\rho}^{2}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq 2\int_{c}^{\rho_{0}-1/n^{\delta/2}}(\log 2+\rho)\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho+2\int_{\rho_{0}+1/n^{\delta/2}}^{\infty}(\log 2+\rho)\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
=(2\log 2+2\rho_{0})\left\{\Phi\left(\frac{-\sqrt{n}}{n^{\delta/2}\nu}\right)-\Phi\left(\frac{\sqrt{n}(c-\rho_{0})}{\nu}\right)+1-\Phi\left(\frac{\sqrt{n}}{n^{\delta/2}\nu}\right)\right\}
+\frac{2\nu}{\sqrt{2\pi n}}\left(e^{-\frac{n(c-\rho_{0})^{2}}{2\nu^{2}}}-e^{-\frac{n^{1-\delta}}{2\nu^{2}}}\right)+\frac{2\nu}{\sqrt{2\pi n}}\left(e^{-\frac{n^{1-\delta}}{2\nu^{2}}}\right)
④ \leq 2(2\log 2+2\rho_{0})\Phi\left(-\frac{\sqrt{n}}{n^{\delta/2}\nu}\right)+\frac{2\nu}{\sqrt{2\pi n}}e^{-\frac{n(c-\rho_{0})^{2}}{2\nu^{2}}}
=O\left(\frac{1}{\sqrt{n^{1-\delta}}}e^{-n^{1-\delta}}\right)+O\left(\frac{1}{\sqrt{n}}e^{-n}\right)=o(n^{-\delta})\hskip 28.45274pt\text{ follows from (37)}

If \rho_{0}=c, then \rho_{0}-1/n^{\delta/2}<c and \rho_{0}+1/n^{\delta/2}>c for n sufficiently large, thus

④ \leq\int_{\rho_{0}+1/n^{\delta/2}}^{\infty}\log\sigma_{\rho}^{2}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho\leq(2\log 2+2\rho_{0})\left\{1-\Phi\left(\frac{\sqrt{n}}{n^{\delta/2}\nu}\right)\right\}+\frac{2\nu}{\sqrt{2\pi n}}\left(e^{-\frac{n^{1-\delta}}{2\nu^{2}}}\right)
=O\left(\frac{1}{\sqrt{n}}e^{-n^{1-\delta}}\right)+O\left(\frac{1}{\sqrt{n^{1-\delta}}}e^{-n^{1-\delta}}\right)=o(n^{-\delta})\hskip 28.45274pt\text{ follows from (37)}

For ⑤, we shall make use of the following results:

e^{-2\rho}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}} =e^{-2\rho_{0}+\frac{2\nu^{2}}{n}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}\left(\rho-\left(\rho_{0}-\frac{2\nu^{2}}{n}\right)\right)^{2}}
\frac{1}{\sigma_{\rho}^{2}}\leq 3e^{-2\rho},\ \rho<0 \hskip 28.45274pt\frac{1}{\sigma_{\rho}^{2}}\leq\frac{1}{(\log 2)^{2}},\ \rho>0 (40)

where the first identity follows by completing the square in \rho.

If \rho_{0}<0, then \rho_{0}-1/n^{\delta/2}, \rho_{0}+1/n^{\delta/2}<0 for n sufficiently large. Thus, using (40), we get

⑤ =\int_{-\infty}^{\rho_{0}-1/n^{\delta/2}}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho+\int_{\rho_{0}+1/n^{\delta/2}}^{0}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
+\int_{0}^{\infty}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq 3\int_{-\infty}^{\rho_{0}-1/n^{\delta/2}}e^{-2\rho}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho+3\int_{\rho_{0}+1/n^{\delta/2}}^{0}e^{-2\rho}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
+\frac{1}{(\log 2)^{2}}\int_{0}^{\infty}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq 3e^{-2\rho_{0}+\frac{2\nu^{2}}{n}}\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}\left(\rho-\left(\rho_{0}-\frac{2\nu^{2}}{n}\right)\right)^{2}}d\rho+\frac{1}{(\log 2)^{2}}\int_{0}^{\infty}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq 6e^{-2\rho_{0}+\frac{2\nu^{2}}{n}}\Phi\left(-\frac{\sqrt{n}}{\nu}\left(\frac{1}{n^{\delta/2}}-\frac{2\nu^{2}}{n}\right)\right)+\frac{1}{(\log 2)^{2}}\Phi\left(\frac{\sqrt{n}\rho_{0}}{\nu}\right)
=O\left(\frac{1}{\sqrt{n^{1-\delta}}}e^{-n^{1-\delta}}\right)+O\left(\frac{1}{\sqrt{n}}e^{-n}\right)=o(n^{-\delta})\hskip 28.45274pt\text{ follows from (37)}

If \rho_{0}>0, then \rho_{0}-1/n^{\delta/2}, \rho_{0}+1/n^{\delta/2}>0 for n sufficiently large. Thus, using (40), we get

⑤ =\int_{-\infty}^{0}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho+\int_{0}^{\rho_{0}-1/n^{\delta/2}}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
+\int_{\rho_{0}+1/n^{\delta/2}}^{\infty}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq 3\int_{-\infty}^{0}e^{-2\rho}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho+\frac{1}{(\log 2)^{2}}\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq 3e^{-2\rho_{0}+\frac{2\nu^{2}}{n}}\int_{-\infty}^{0}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}\left(\rho-\left(\rho_{0}-\frac{2\nu^{2}}{n}\right)\right)^{2}}d\rho+\frac{1}{(\log 2)^{2}}\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
=3e^{-2\rho_{0}+\frac{2\nu^{2}}{n}}\Phi\left(-\frac{\sqrt{n}}{\nu}\left(\rho_{0}-\frac{2\nu^{2}}{n}\right)\right)+\frac{2}{(\log 2)^{2}}\Phi\left(-\frac{\sqrt{n}}{\nu n^{\delta/2}}\right)
=O\left(\frac{1}{\sqrt{n}}e^{-n}\right)+O\left(\frac{1}{\sqrt{n^{1-\delta}}}e^{-n^{1-\delta}}\right)=o(n^{-\delta})\hskip 28.45274pt\text{ follows from (37)}

If \rho_{0}=0, then \rho_{0}-1/n^{\delta/2}<0, \rho_{0}+1/n^{\delta/2}>0 for n sufficiently large. Thus, using (40), we get

⑤ =\int_{-\infty}^{\rho_{0}-1/n^{\delta/2}}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho+\int_{\rho_{0}+1/n^{\delta/2}}^{\infty}\frac{1}{\sigma_{\rho}^{2}}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq 3\int_{-\infty}^{\rho_{0}-1/n^{\delta/2}}e^{-2\rho}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho+\frac{1}{(\log 2)^{2}}\int_{\rho_{0}+1/n^{\delta/2}}^{\infty}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n}{2\nu^{2}}(\rho-\rho_{0})^{2}}d\rho
\leq 3e^{-2\rho_{0}+\frac{2\nu^{2}}{n}}\Phi\left(-\frac{\sqrt{n}}{\nu}\left(\frac{1}{n^{\delta/2}}-\frac{2\nu^{2}}{n}\right)\right)+\frac{1}{(\log 2)^{2}}\Phi\left(-\frac{\sqrt{n}}{\nu n^{\delta/2}}\right)
=O\left(\frac{1}{\sqrt{n^{1-\delta}}}e^{-n^{1-\delta}}\right)+O\left(\frac{1}{\sqrt{n^{1-\delta}}}e^{-n^{1-\delta}}\right)=o(n^{-\delta})\hskip 28.45274pt\text{ follows from (37)} ∎

Lemma 7.8.

With \sigma_{\rho}=\log(1+e^{\rho}) and \sigma_{0}=\log(1+e^{\rho_{0}}), let h(\rho)=1/(2\sigma_{\rho}^{2}) and q(\rho)=\sqrt{n/(2\pi\nu^{2})}e^{-n(\rho-\rho_{0})^{2}/2\nu^{2}}. Then, for every 0\leq\delta<1, we have

\int h(\rho)q(\rho)d\rho=\frac{1}{2\sigma_{0}^{2}}+o(n^{-\delta})

Proof.

\int h(\rho)q(\rho)d\rho=\underbrace{\int_{|\rho-\rho_{0}|<1/n^{\delta/2}}h(\rho)q(\rho)d\rho}_{①}+\underbrace{\int_{|\rho-\rho_{0}|>1/n^{\delta/2}}h(\rho)q(\rho)d\rho}_{②}

We can apply a Taylor expansion to ①,

① =\int_{|\rho-\rho_{0}|<1/n^{\delta/2}}\left(h(\rho_{0})+(\rho-\rho_{0})h^{\prime}(\rho_{0})+\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})+o((\rho-\rho_{0})^{2})\right)q(\rho)d\rho
=\frac{1}{2\sigma_{0}^{2}}+\int_{|\rho-\rho_{0}|<1/n^{\delta/2}}\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})q(\rho)d\rho+o(n^{-\delta})

where the equality follows since h(\rho_{0})=1/(2\sigma_{0}^{2}) and q(\rho) is symmetric around \rho_{0}.

Since (\rho-\rho_{0})^{2}\geq 0 and h^{\prime\prime}(\rho_{0})>0, it suffices to show \int_{|\rho-\rho_{0}|<1/n^{\delta/2}}\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})q(\rho)d\rho\leq o(n^{-\delta}).

In this direction,

\int_{|\rho-\rho_{0}|<1/n^{\delta/2}}\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})q(\rho)d\rho\leq\int\frac{(\rho-\rho_{0})^{2}}{2}h^{\prime\prime}(\rho_{0})q(\rho)d\rho=\frac{h^{\prime\prime}(\rho_{0})\nu^{2}}{2n}=O(n^{-1})=o(n^{-\delta})

Since h(\rho)>0, to prove ②=o(n^{-\delta}) it suffices to show ②\leq o(n^{-\delta}).

Note that ② is the same as ⑤ of Lemma 7.7 up to a constant. Thus, ②\leq o(n^{-\delta}), which completes the proof. ∎
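Both lemmas are easy to probe by simulation. The sketch below (with arbitrary assumed values of \rho_{0} and \nu) draws from q and checks that the Lemma 7.7 integral and the error term in Lemma 7.8 both shrink at roughly the O(1/n) rate:

```python
# Monte Carlo sketch of Lemmas 7.7 and 7.8; rho0 and nu are assumed values.
import numpy as np

rng = np.random.default_rng(0)
rho0, nu = 0.4, 1.0
sigma0 = np.log1p(np.exp(rho0))
for n in (10, 100, 1000, 10000):
    rho = rng.normal(rho0, nu / np.sqrt(n), size=1_000_000)
    s2 = np.log1p(np.exp(rho)) ** 2          # sigma_rho^2 (softplus squared)
    h77 = 0.5 * np.log(s2 / sigma0**2) - 0.5 * (1 - sigma0**2 / s2)
    h78 = 0.5 / s2
    print(f"n={n:6d}: E[h]={h77.mean():.2e}, "
          f"E[1/(2s^2)]-1/(2s0^2)={h78.mean()-0.5/sigma0**2:.2e}")
```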

Lemma 7.9.

Suppose condition (C1) and assumption (A1) hold for some 0<a<1 and 0\leq\delta<1-a. Let

h(\boldsymbol{\theta}_{n})=\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}

Then we have

\int h(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}=o(n^{-\delta}) (41)

provided

  1.
    Assumption (A2) holds with the same \delta as (A1) and

    q(\boldsymbol{\theta}_{n})=\prod_{i=1}^{K(n)}\sqrt{\frac{n}{2\pi\tau^{2}}}e^{-\frac{n}{2\tau^{2}}(\theta_{in}-\theta_{0in})^{2}}

  2.
    Assumption (A3) holds and

    q(\boldsymbol{\theta}_{n})=\prod_{i=1}^{K(n)}\sqrt{\frac{n^{v+1}}{2\pi\tau^{2}}}e^{-\frac{n^{v+1}}{2\tau^{2}}(\theta_{in}-\theta_{0in})^{2}}
Proof.

Note that since h(\boldsymbol{\theta}_{n})>0, to prove (41), it suffices to show \int h(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}\leq o(n^{-\delta}).

We begin by proving statement 1. of the lemma. Let A=\{\boldsymbol{\theta}_{n}:|\theta_{in}-\theta_{0in}|\leq 1/n^{\delta/2},\ i=1,\cdots,K(n)\}; then

\int h(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}=\underbrace{\int_{A}h(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}}_{①}+\underbrace{\int_{A^{c}}h(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}}_{②}

For ①, we do a Taylor expansion of h(\boldsymbol{\theta}_{n}) around \boldsymbol{\theta}_{0n} as

① =\int_{A}\left(h(\boldsymbol{\theta}_{0n})+(\boldsymbol{\theta}_{n}-\boldsymbol{\theta}_{0n})^{\top}\nabla h(\boldsymbol{\theta}_{0n})+\frac{1}{2}(\boldsymbol{\theta}_{n}-\boldsymbol{\theta}_{0n})^{\top}\nabla^{2}h(\boldsymbol{\theta}_{0n})(\boldsymbol{\theta}_{n}-\boldsymbol{\theta}_{0n})\right)q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}
+\int_{A}o(||\boldsymbol{\theta}_{n}-\boldsymbol{\theta}_{0n}||^{2})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}
=\underbrace{\int_{A}(\boldsymbol{\theta}_{n}-\boldsymbol{\theta}_{0n})^{\top}\nabla h(\boldsymbol{\theta}_{0n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}}_{③}+\frac{1}{2}\underbrace{\int_{A}(\boldsymbol{\theta}_{n}-\boldsymbol{\theta}_{0n})^{\top}\nabla^{2}h(\boldsymbol{\theta}_{0n})(\boldsymbol{\theta}_{n}-\boldsymbol{\theta}_{0n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}}_{④}+o(n^{-\delta})

where the last equality follows since h(\boldsymbol{\theta}_{0n})=o(n^{-\delta}) by assumption (A1).

With I=\{1,\cdots,K(n)\}, let \nabla h(\boldsymbol{\theta}_{0n})=(a_{i})_{i\in I} and \nabla^{2}h(\boldsymbol{\theta}_{0n})=((b_{ij}))_{i\in I,j\in I}. Then

③ =\sum_{i=1}^{K(n)}a_{i}\int_{|\theta_{in}-\theta_{i0n}|<1/n^{\delta/2}}(\theta_{in}-\theta_{i0n})q(\theta_{in})d\theta_{in}
=\sum_{i=1}^{K(n)}a_{i}\int_{\theta_{i0n}-1/n^{\delta/2}}^{\theta_{i0n}+1/n^{\delta/2}}(\theta_{in}-\theta_{i0n})\sqrt{\frac{n}{2\pi\tau^{2}}}e^{-\frac{n}{2\tau^{2}}(\theta_{in}-\theta_{i0n})^{2}}d\theta_{in}=\sum_{i=1}^{K(n)}a_{i}\int_{-\sqrt{n^{1-\delta}}/\tau}^{\sqrt{n^{1-\delta}}/\tau}\frac{u}{\sqrt{2\pi}}e^{-\frac{1}{2}u^{2}}du=0 (42)

since ue^{-u^{2}/2} is an odd function. Also,

④ =\sum_{i=1}^{K(n)}b_{ii}\int_{|{\theta}_{in}-{\theta}_{i0n}|\leq 1/n^{\delta/2}}(\theta_{in}-\theta_{i0n})^{2}q(\theta_{in})d\theta_{in}
+\sum_{i=1}^{K(n)}\sum_{j=1,i\neq j}^{K(n)}b_{ij}\int_{|{\theta}_{in}-{\theta}_{i0n}|\leq 1/n^{\delta/2}}(\theta_{in}-\theta_{i0n})q(\theta_{in})d\theta_{in}\int_{|{\theta}_{jn}-{\theta}_{j0n}|\leq 1/n^{\delta/2}}(\theta_{jn}-\theta_{j0n})q(\theta_{jn})d\theta_{jn}
=\sum_{i=1}^{K(n)}b_{ii}\int_{|{\theta}_{in}-{\theta}_{i0n}|\leq 1/n^{\delta/2}}(\theta_{in}-\theta_{i0n})^{2}q(\theta_{in})d\theta_{in}

where the cross terms vanish as a consequence of (42). Thus,

④ \leq\sum_{i=1}^{K(n)}|b_{ii}|\int(\theta_{in}-\theta_{i0n})^{2}q(\theta_{in})d\theta_{in}=\frac{\tau^{2}}{n}\sum_{i=1}^{K(n)}|b_{ii}|

We next bound the quantities |b_{ii}|. First note that

\nabla^{2}h(\boldsymbol{\theta}_{0n})=2\int\nabla f_{\boldsymbol{\theta_{0n}}}(\boldsymbol{x})\nabla{f_{\boldsymbol{\theta_{0n}}}(\boldsymbol{x})}^{\top}d\boldsymbol{x}+2\int(f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))\nabla^{2}{f_{\boldsymbol{\theta_{0n}}}(\boldsymbol{x})}d\boldsymbol{x}

Let \boldsymbol{\theta}_{0n}=[\beta_{00},\beta_{10},\cdots,\beta_{k_{n}0},\gamma_{110},\cdots,\gamma_{1p0},\cdots,\gamma_{k_{n}10},\cdots,\gamma_{k_{n}p0}]^{\top}. Collecting the diagonal entries of \nabla^{2}h(\boldsymbol{\theta}_{0n}) gives

\boldsymbol{b}=[2,\boldsymbol{c}_{0},\boldsymbol{c}_{1},\cdots,\boldsymbol{c}_{k_{n}}]^{\top}

where for i=1,\cdots,k_{n}, j=0,\cdots,p, we have

\boldsymbol{c}_{0i} =2\int(\psi(\boldsymbol{\gamma}_{i0}^{\top}\boldsymbol{x}))^{2}d\boldsymbol{x}
\boldsymbol{c}_{ij} =2\beta_{i0}^{2}\int(\psi^{\prime}(\boldsymbol{\gamma}_{i0}^{\top}\boldsymbol{x}))^{2}d\boldsymbol{x}+2\beta_{i0}\int(f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))\psi^{\prime\prime}(\boldsymbol{\gamma}_{i0}^{\top}\boldsymbol{x})d\boldsymbol{x},\>\>j=0
=2\beta_{i0}^{2}\int(\psi^{\prime}(\boldsymbol{\gamma}_{i0}^{\top}\boldsymbol{x}))^{2}x_{j}^{2}d\boldsymbol{x}+2\beta_{i0}\int(f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))\psi^{\prime\prime}(\boldsymbol{\gamma}_{i0}^{\top}\boldsymbol{x})x_{j}^{2}d\boldsymbol{x},\>\>j>0

Using the facts that |\psi(u)|,|\psi^{\prime}(u)|,|\psi^{\prime\prime}(u)|\leq 1, |x_{j}|\leq 1 and |\beta_{i0}|\leq 1+\beta_{i0}^{2}, we get

④ \leq\frac{\tau^{2}}{n}\left(2(k_{n}+1)+2(p+1)\sum_{i=1}^{k_{n}}\beta^{2}_{i0}+2(p+1)||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}\sum_{i=1}^{k_{n}}(1+\beta^{2}_{i0})\right)
\leq\frac{\tau^{2}}{n}\left(2(K(n)+1)+2(p+1)\sum_{i=1}^{K(n)}\theta_{i0n}^{2}+2(p+1)\left(K(n)+\sum_{i=1}^{K(n)}\theta_{i0n}^{2}\right)||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}\right)=o(n^{-\delta})

where the last equality is a consequence of assumptions (A1), (A2) and condition (C1).

For ②, note that

\int_{A^{c}}h(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n} \leq 2\int_{A^{c}}\int f^{2}_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})d\boldsymbol{x}\,q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}+2\int_{A^{c}}\int f_{0}^{2}(\boldsymbol{x})d\boldsymbol{x}\,q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}

First, note that |f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})|\leq\sum_{j=0}^{k_{n}}|\beta_{j}|\leq\sum_{j=0}^{k_{n}}|\beta_{j0}|+\sum_{j=0}^{k_{n}}|\beta_{j}-\beta_{j0}| since |\psi(u)|\leq 1. Thus,

\int_{A^{c}}h(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n} \leq 4\underbrace{\int_{A^{c}}(\sum_{j=0}^{k_{n}}|\beta_{j0}|)^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}}_{⑤}+4\underbrace{\int_{A^{c}}(\sum_{j=0}^{k_{n}}|\beta_{j}-\beta_{j0}|)^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}}_{⑥}+2\underbrace{\int f_{0}^{2}(\boldsymbol{x})d\boldsymbol{x}\int_{A^{c}}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}}_{⑦}

where both inequalities use (a+b)^{2}\leq 2(a^{2}+b^{2}).

First note that A^{c}=\cup_{i=1}^{K(n)}A_{i}^{c} where A_{i}=\{|\theta_{in}-\theta_{i0n}|\leq 1/n^{\delta/2}\}. Therefore,

Q(A^{c}) =Q(\cup_{i=1}^{K(n)}A_{i}^{c})\leq\sum_{i=1}^{K(n)}Q(A_{i}^{c})
=\sum_{i=1}^{K(n)}\int_{|{\theta}_{in}-{\theta}_{i0n}|>1/n^{\delta/2}}q(\theta_{in})d\theta_{in}=2K(n)\left(1-\Phi\left(\frac{\sqrt{n}}{\tau n^{\delta/2}}\right)\right)=O\left(\frac{n^{a}e^{-n^{1-\delta}}}{\sqrt{n^{1-\delta}}}\right) (43)

where the last asymptotic equality is a consequence of (37) and condition (C1).

For ⑦, note that \int f_{0}^{2}(\boldsymbol{x})d\boldsymbol{x}\leq M for some M>0. Therefore,

⑦=O\left(\frac{n^{a}}{\sqrt{n^{1-\delta}}}e^{-n^{1-\delta}}\right)=o(n^{-\delta})

for any 0\leq\delta<1.

For ⑤, note that \sum_{i=1}^{K(n)}\theta_{i0n}^{2}=o(n^{1-\delta}) by assumption (A2). Using this together with (43), we get

⑤=(\sum_{j=0}^{k_{n}}|\beta_{j0}|)^{2}Q(A^{c})\leq(k_{n}+1)\sum_{j=0}^{k_{n}}\beta_{j0}^{2}Q(A^{c})\leq K(n)\sum_{i=1}^{K(n)}\theta_{i0n}^{2}Q(A^{c})\leq o(n^{1-\delta})O\left(\frac{n^{2a}}{\sqrt{n^{1-\delta}}}e^{-n^{1-\delta}}\right)=o(n^{-\delta})

For ⑥, using the Cauchy-Schwarz inequality, we get

\int_{A^{c}}(\sum_{j=0}^{k_{n}}|\beta_{j}-\beta_{j0}|)^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n} \leq(k_{n}+1)\sum_{j=0}^{k_{n}}\int_{A^{c}}(\beta_{j}-\beta_{j0})^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}=O(k_{n}^{2}e^{-n^{1-\delta}})=O(n^{2a}e^{-n^{1-\delta}})=o(n^{-\delta})

where the fact \int_{A^{c}}(\beta_{j}-\beta_{j0})^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}=O(e^{-n^{1-\delta}}) is shown below. Let A_{\beta_{j}}=\{|\beta_{j}-\beta_{j0}|>1/n^{\delta/2}\}. Then

\int_{A^{c}}(\beta_{j}-\beta_{j0})^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n} =\int_{A^{c}\cap A_{\beta_{j}}}(\beta_{j}-\beta_{j0})^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}+\int_{A^{c}\cap A_{\beta_{j}}^{c}}(\beta_{j}-\beta_{j0})^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}
\leq\int_{A_{\beta_{j}}}(\beta_{j}-\beta_{j0})^{2}q(\beta_{j})d\beta_{j}+\frac{\tau^{2}}{n}\int_{\tilde{A}^{c}}q(\tilde{\boldsymbol{\theta}}_{n})d\tilde{\boldsymbol{\theta}}_{n} (44)

where \tilde{\boldsymbol{\theta}}_{n} includes all coordinates of \boldsymbol{\theta}_{n} except \beta_{j} and \tilde{A}^{c} is the union of all A_{i}^{c} except A_{\beta_{j}}^{c}. Next,

\int_{A_{\beta_{j}}}(\beta_{j}-\beta_{j0})^{2}q(\beta_{j})d\beta_{j} =\int_{|\beta_{j}-\beta_{j0}|>1/n^{\delta/2}}\sqrt{\frac{n}{2\pi\tau^{2}}}(\beta_{j}-\beta_{j0})^{2}e^{-\frac{n}{2\tau^{2}}(\beta_{j}-\beta_{j0})^{2}}d\beta_{j}
=\frac{2\tau^{2}}{n}\int_{\sqrt{n^{1-\delta}}/\tau}^{\infty}\frac{u^{2}}{\sqrt{2\pi}}e^{-\frac{1}{2}u^{2}}du=\frac{2\tau^{2}}{n}\left(\frac{\sqrt{n^{1-\delta}}}{\tau}\phi\left(\frac{\sqrt{n^{1-\delta}}}{\tau}\right)+1-\Phi\left(\frac{\sqrt{n^{1-\delta}}}{\tau}\right)\right)=O(e^{-n^{1-\delta}}) (45)

where, as in (38), polynomial factors and the constant 1/(2\tau^{2}) in the exponent have been absorbed into the O(\cdot) notation.

Using (43), we get

\int_{\tilde{A}^{c}}q(\tilde{\boldsymbol{\theta}}_{n})d\tilde{\boldsymbol{\theta}}_{n}=O\left(\frac{n^{a}e^{-n^{1-\delta}}}{\sqrt{n^{1-\delta}}}\right)

Using (45) and the above display in (44), we get

\int_{A^{c}}(\beta_{j}-\beta_{j0})^{2}q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}=O(e^{-n^{1-\delta}})+O\left(\frac{n^{a}}{n}\frac{e^{-n^{1-\delta}}}{\sqrt{n^{1-\delta}}}\right)=O(e^{-n^{1-\delta}}) (46)

The only differences for statement 2. are that \sum_{i=1}^{K(n)}\theta_{i0n}^{2}=O(n^{v}) and the variance \tau^{2}/n is replaced by \tau^{2}/n^{v+1}.

The proof is similar and the details have been omitted. ∎
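The exponential decay in (45) is easy to see numerically; the sketch below evaluates the truncated second moment in closed form (\tau and \delta are assumed illustrative values):

```python
# Sketch of the tail bound (45): for beta ~ N(beta0, tau^2/n), the second
# moment restricted to |beta - beta0| > n^{-delta/2} decays exponentially
# in n^{1-delta}.  tau and delta are assumed illustrative values.
import numpy as np
from scipy.stats import norm

tau, delta = 1.0, 0.4
for n in (10, 100, 1000):
    t = np.sqrt(n ** (1 - delta)) / tau           # threshold in standard units
    # E[Z^2; |Z| > t] = 2 (t*phi(t) + 1 - Phi(t)) for Z ~ N(0,1)
    trunc = 2 * (t * norm.pdf(t) + norm.sf(t)) * tau**2 / n
    print(f"n={n:5d}: E[(beta-beta0)^2; tail] = {trunc:.3e}, "
          f"e^(-n^(1-delta)/(2 tau^2)) = {np.exp(-n**(1-delta)/(2*tau**2)):.3e}")
```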

Lemma 7.10.

Suppose N_{\varepsilon}=\{\boldsymbol{\omega}_{n}:d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})<\varepsilon\} and p(\boldsymbol{\omega}_{n}) satisfies

\int_{N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-\tilde{\kappa}n^{1-\delta}},\>\>n\to\infty (47)

for every \kappa,\tilde{\kappa}>0 and some 0\leq\delta<1. Then,

\log\int\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}=o_{P_{0}^{n}}(n^{1-\delta}) (48)

Proof.

This proof uses ideas from the proof of Lemma 5 in Lee [2000]. By Markov’s inequality,

P_{0}^{n}\left(\left|\log\int\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\right|\geq\epsilon n^{1-\delta}\right) \leq\frac{1}{\epsilon n^{1-\delta}}E_{0}^{n}\left(\left|\log\int\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\right|\right)
=\frac{1}{\epsilon n^{1-\delta}}\int\left|\log\int\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\right|L_{0}d\mu
\leq\frac{1}{\epsilon n^{1-\delta}}\left(d_{KL}(L_{0},L^{*})+\frac{2}{e}\right) (49)

where L^{*}=\int L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} and the last inequality follows from Lemma 7.1. Further,

d_{KL}(L_{0},L^{*}) =E_{0}^{n}\left(\log\frac{L_{0}}{L^{*}}\right)=E_{0}^{n}\left(\log\frac{L_{0}}{\int L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}\right)
\leq E_{0}^{n}\left(\log\frac{L_{0}}{\int_{N_{\kappa/n^{\delta}}}L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}\right)
\leq-\log\int_{N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}+\sup_{\boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}}}d_{KL}(L_{0},L(\boldsymbol{\omega}_{n}))\hskip 14.22636pt\text{by Jensen's inequality}
\leq-\log e^{-\tilde{\kappa}n^{1-\delta}}+\kappa n^{1-\delta}=n^{1-\delta}(\kappa+\tilde{\kappa}) (50)

where the last inequality follows from (47) and the fact that d_{KL}(L_{0},L(\boldsymbol{\omega}_{n}))=nd_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})<\kappa n^{1-\delta} on N_{\kappa/n^{\delta}}.

Using (50) in (49), the result follows on letting \kappa\to 0 and \tilde{\kappa}\to 0. ∎

Lemma 7.11.

Suppose q satisfies

\int d_{KL}(l_{0},l(\boldsymbol{\omega}_{n}))q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}=o(n^{-\delta}),

then

\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}=o_{P_{0}^{n}}(n^{1-\delta})

Proof.

In this direction, note that

P_{0}^{n}\left(\left|\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}\right|>n^{1-\delta}\epsilon\right) \leq P_{0}^{n}\left(\left|\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}\right|\geq n^{1-\delta}\epsilon\right)
\leq\frac{1}{n^{1-\delta}\epsilon}E_{0}^{n}\left(\left|\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}\right|\right)

where the last step follows from Markov’s inequality. Continuing,

\leq\frac{1}{n^{1-\delta}\epsilon}E_{0}^{n}\left(\int q(\boldsymbol{\omega}_{n})\left|\log\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}\right|d\boldsymbol{\omega}_{n}\right)
=\frac{1}{n^{1-\delta}\epsilon}\int q(\boldsymbol{\omega}_{n})\int\left|\log\frac{L_{0}}{L(\boldsymbol{\omega}_{n})}\right|L_{0}d\mu d\boldsymbol{\omega}_{n}

Using Lemma 7.1, we get

\leq\frac{1}{n^{1-\delta}\epsilon}\int q(\boldsymbol{\omega}_{n})\left(d_{KL}(L_{0},L(\boldsymbol{\omega}_{n}))+\frac{2}{e}\right)d\boldsymbol{\omega}_{n}\to 0

since \int q(\boldsymbol{\omega}_{n})d_{KL}(L_{0},L(\boldsymbol{\omega}_{n}))d\boldsymbol{\omega}_{n}=n\int q(\boldsymbol{\omega}_{n})d_{KL}(l_{0},l(\boldsymbol{\omega}_{n}))d\boldsymbol{\omega}_{n}=o(n^{1-\delta}). ∎

Lemma 7.12.

Let H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})\leq K(n)\log\left(\frac{M_{n}}{u}\right); then

\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon O(\sqrt{K(n)\log M_{n}})

Proof.

This proof uses some ideas from the proof of Lemma 1 in Lee [2000].

\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du \leq\sqrt{K(n)}\int_{0}^{\varepsilon}\sqrt{\log\left(\frac{M_{n}}{u}\right)}du=\frac{K(n)^{1/2}M_{n}}{2}\int_{\sqrt{\log\frac{M_{n}}{\varepsilon}}}^{\infty}\nu^{2}e^{-\nu^{2}/2}d\nu
=\frac{K(n)^{1/2}M_{n}}{2}\left(\frac{\varepsilon}{M_{n}}\sqrt{\log\frac{M_{n}}{\varepsilon}}+\sqrt{2\pi}\int_{\sqrt{\log\frac{M_{n}}{\varepsilon}}}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\nu^{2}/2}d\nu\right)
\sim\frac{K(n)^{1/2}M_{n}}{2}\left(\frac{\varepsilon}{M_{n}}\sqrt{\log\frac{M_{n}}{\varepsilon}}+\sqrt{2\pi}\frac{\phi\left(\sqrt{\log\frac{M_{n}}{\varepsilon}}\right)}{\sqrt{\log\frac{M_{n}}{\varepsilon}}}\right)\>\>\text{ by (37)}
\leq\frac{\varepsilon}{2}\sqrt{K(n)}\sqrt{\log M_{n}-\log\varepsilon}\left(1+\frac{1}{\log\frac{M_{n}}{\varepsilon}}\right)=\varepsilon O(\sqrt{K(n)\log M_{n}}) ∎

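The bound is easy to illustrate by quadrature; in the sketch below (K, \varepsilon and the M_{n} values are arbitrary assumptions), the entropy integral tracks \varepsilon\sqrt{K\log M_{n}} ever more closely as M_{n} grows:

```python
# Numerical illustration of Lemma 7.12 (K, eps, M are assumed values).
import numpy as np
from scipy.integrate import quad

K, eps = 50, 0.1
for M in (1e2, 1e6, 1e12):
    integral, _ = quad(lambda u: np.sqrt(K * np.log(M / u)), 0, eps)
    bound = eps * np.sqrt(K * np.log(M))
    print(f"M={M:.0e}: integral={integral:.3f}, eps*sqrt(K log M)={bound:.3f}")
```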
Lemma 7.13.

For any \varepsilon>0, suppose

\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}

Then,

P_{0}^{n}\left(\sup_{\boldsymbol{\omega}_{n}\in\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}\geq e^{-n\varepsilon^{2}}\right)\to 0,\>\>n\to\infty (51)
Proof.

Note that

\int_{\varepsilon^{2}/8}^{\sqrt{2}\varepsilon}\sqrt{H(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du \leq\int_{0}^{\sqrt{2}\varepsilon}\sqrt{H(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq 2\varepsilon^{2}\sqrt{n}

Therefore, by Theorem 1 in Wong and Shen [1995], for some constant C>0, we have

P_{0}^{n}\left(\sup_{\boldsymbol{\omega}_{n}\in\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}\geq e^{-n\varepsilon^{2}}\right)\leq 4\exp(-nC\varepsilon^{2}) ∎

Lemma 7.14.

Suppose, for some r>0, p(\boldsymbol{\omega}_{n}) satisfies

\int_{\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq e^{-\kappa n^{r}},\>\>n\to\infty

for any \kappa>0. Then, for every \tilde{\kappa}<\kappa,

P_{0}^{n}\left(\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-\tilde{\kappa}n^{r}}\right)\to 0

Proof.

This proof uses ideas from the proof of Lemma 3 in Lee [2000]. By Markov’s inequality and Fubini’s theorem,

P_{0}^{n}\left(\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}>e^{-\tilde{\kappa}n^{r}}\right) \leq e^{\tilde{\kappa}n^{r}}E_{0}^{n}\left(\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\right)
=e^{\tilde{\kappa}n^{r}}\int\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}L_{0}d\mu
=e^{\tilde{\kappa}n^{r}}\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}
\leq e^{\tilde{\kappa}n^{r}}e^{-\kappa n^{r}}=e^{-(\kappa-\tilde{\kappa})n^{r}}\to 0,\>\>n\to\infty ∎

7.2 Lemmas and Propositions for Theorems 3.1 and 3.2

Lemma 7.15.

Let \widetilde{\mathcal{G}}_{n}=\{\sqrt{g}:g\in\mathcal{G}_{n}\} where \mathcal{G}_{n} is given by (10) with K(n)\sim n^{a} and C_{n}=e^{n^{b-a}}. Then,

\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}
Proof.

This proof uses some ideas from the proof of Lemma 2 in Lee [2000].

First, note that, by Lemma 4.1 in Pollard [1990],

N(\varepsilon,\mathcal{F}_{n},||.||_{\infty})\leq\left(\frac{3C_{n}}{\varepsilon}\right)^{K(n)}.

For \boldsymbol{\omega}_{1},\boldsymbol{\omega}_{2}\in\mathcal{F}_{n}, let \widetilde{L}(u)=\sqrt{L_{u\boldsymbol{\omega}_{1}+(1-u)\boldsymbol{\omega}_{2}}(\boldsymbol{x},y)}. Then,

\sqrt{L_{\boldsymbol{\omega}_{1}}(\boldsymbol{x},y)}-\sqrt{L_{\boldsymbol{\omega}_{2}}(\boldsymbol{x},y)} =\int_{0}^{1}\frac{d\widetilde{L}}{du}du=\int_{0}^{1}\sum_{i=1}^{K(n)}\frac{\partial{\widetilde{L}}}{\partial{\omega_{i}}}\frac{\partial{\omega_{i}}}{\partial{u}}du=\sum_{i=1}^{K(n)}(\omega_{1i}-\omega_{2i})\int_{0}^{1}\frac{\partial{\widetilde{L}}}{\partial{\omega_{i}}}du
\leq\sup_{i}|\omega_{1i}-\omega_{2i}|\int_{0}^{1}\sum_{i=1}^{K(n)}\sup_{i}\Big{|}\frac{\partial{\widetilde{L}}}{\partial{\omega_{i}}}\Big{|}du=K(n)\sup_{i}\Big{|}\frac{\partial{\widetilde{L}}}{\partial{\omega_{i}}}\Big{|}\,||\omega_{1}-\omega_{2}||_{\infty}
\leq F(\boldsymbol{x},y)||\omega_{1}-\omega_{2}||_{\infty} (52)

where the envelope F(\boldsymbol{x},y)=MK(n)C_{n}\sigma_{0}^{-3/2} for a constant M. This is because

\Big{|}\frac{\partial{\widetilde{L}}}{\partial{\beta_{j}}}\Big{|} \leq(8\pi e^{2})^{-1/4}\sigma_{0}^{-3/2},\>\>j=0,\cdots,k_{n}
\Big{|}\frac{\partial{\widetilde{L}}}{\partial{\gamma_{jh}}}\Big{|} \leq(8\pi e^{2})^{-1/4}C_{n}\sigma_{0}^{-3/2},\>\>j=1,\cdots,k_{n},\>h=0,\cdots,p

In view of (52) and Theorem 2.7.11 in van der Vaart et al. [1996], we have

N_{[]}(\varepsilon,\widetilde{\mathcal{G}}_{n},||.||_{2})\leq\left(\frac{MK(n)C_{n}^{2}}{\varepsilon}\right)^{K(n)}

for some constant M>0. Therefore,

H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})\lesssim K(n)\log\frac{K(n)C_{n}^{2}}{u}

Using Lemma 7.12 with M_{n}=K(n)C_{n}^{2}, we get

\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon O(\sqrt{K(n)\log(K(n)C_{n}^{2})})=\varepsilon O(\sqrt{n^{b}})

where the last equality holds since K(n)\sim n^{a} and C_{n}=e^{n^{b-a}}. Therefore, for n sufficiently large,

\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2} ∎

Lemma 7.16.

Let

\mathcal{F}_{n}=\Big{\{}\boldsymbol{\theta}_{n}:|\theta_{in}|\leq C_{n},i=1,\cdots,K(n)\Big{\}},\>\>\>K(n)\sim n^{a},\>C_{n}=e^{n^{b-a}}

and suppose one of the following holds:

  1.
    p(\boldsymbol{\omega}_{n}) satisfies (17).

  2.
    p(\boldsymbol{\omega}_{n}) satisfies (18).

Then for every \kappa>0,

\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq e^{-n\kappa},\>\>n\to\infty.
Proof.

This proof uses some ideas from the proof of Theorem 1 in Lee [2000]. Let \mathcal{F}_{in}=\{\theta_{in}:|\theta_{in}|\leq C_{n}\}, so that

\mathcal{F}_{n}=\cap_{i=1}^{K(n)}\mathcal{F}_{in}\implies\mathcal{F}_{n}^{c}=\cup_{i=1}^{K(n)}\mathcal{F}_{in}^{c}

We first prove the Lemma for the prior in 1.

\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} \leq\sum_{i=1}^{K(n)}\int_{\mathcal{F}_{in}^{c}}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}d\theta_{in}=2\sum_{i=1}^{K(n)}\int_{C_{n}}^{\infty}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}d\theta_{in}
=2K(n)\left(1-\Phi\left(\frac{C_{n}}{\zeta}\right)\right)\sim\frac{2K(n)\zeta}{C_{n}\sqrt{2\pi}}e^{-\frac{C_{n}^{2}}{2\zeta^{2}}}\hskip 28.45274pt\text{ by (37)}
\sim n^{a}\zeta e^{-n^{b-a}}e^{-e^{2n^{b-a}}/(2\zeta^{2})}\leq e^{-n\kappa},\>\>n\to\infty

We next prove the Lemma for the prior in 2. Analogous to the proof for the prior in 1., we get

\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} \leq 2K(n)\left(1-\Phi\left(\frac{C_{n}}{\zeta n^{u/2}}\right)\right)\sim\frac{2K(n)\zeta n^{u/2}}{C_{n}\sqrt{2\pi}}e^{-\frac{C_{n}^{2}}{2\zeta^{2}n^{u}}}\hskip 28.45274pt\text{ by (37)}
\sim n^{a}\zeta n^{u/2}e^{-n^{b-a}}e^{-e^{2n^{b-a}}/(2\zeta^{2}n^{u})}\leq e^{-n\kappa},\>\>n\to\infty ∎

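Because C_{n} grows double-exponentially in n^{b-a}, the Gaussian tail above collapses extremely fast; the log-scale sketch below (a, b, \zeta, \kappa are assumed illustrative values) makes this visible without numerical underflow, with the tail eventually dropping far below -n\kappa:

```python
# Log-scale sketch of Lemma 7.16's tail bound (a, b, zeta, kappa assumed).
import numpy as np
from scipy.stats import norm

a, b, zeta, kappa = 0.5, 0.7, 1.0, 1.0
for n in (100, 1000, 10000):
    K, C = n**a, np.exp(n ** (b - a))
    log_tail = np.log(2 * K) + norm.logsf(C / zeta)   # log of 2K(n)(1 - Phi(C_n/zeta))
    print(f"n={n:6d}: log prior mass outside F_n = {log_tail:12.1f}, -n*kappa = {-n*kappa:9.1f}")
```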
Proposition 7.17.

Suppose condition (C1) holds for some 0<a<1 and one of the following two holds.

  1.
    Suppose p(\boldsymbol{\omega}_{n}) satisfies (17).

  2.
    Suppose p(\boldsymbol{\omega}_{n}) satisfies (18).

Then,

\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq\log 2-n\varepsilon^{2}+o_{P_{0}^{n}}(1)
Proof.

This proof uses some ideas from the proof of Lemma 3 in Lee [2000]. We shall first show

P_{0}^{n}\left(\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq\log 2-n\varepsilon^{2}\right)\to 0,\>\>n\to\infty

P_{0}^{n}\left(\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq\log 2-n\varepsilon^{2}\right)=P_{0}^{n}\left(\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq 2e^{-n\varepsilon^{2}}\right)
\leq P_{0}^{n}\left(\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n\varepsilon^{2}}\right)+P_{0}^{n}\left(\int_{\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n\varepsilon^{2}}\right)

Let \mathcal{F}_{n}=\{\boldsymbol{\theta}_{n}:|\theta_{in}|\leq C_{n}=e^{n^{b-a}}\},\>0<a<b<1.

By Lemma 7.15,

\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}

Therefore, by Lemma 7.13, we have

P_{0}^{n}\left(\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n\varepsilon^{2}}\right)\to 0

In view of Lemma 7.16, for p(\boldsymbol{\omega}_{n}) as in (17) and (18),

\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq e^{-2n\varepsilon^{2}}

Therefore, using Lemma 7.14 with r=1, \kappa=2\varepsilon^{2} and \tilde{\kappa}=\varepsilon^{2}, we have

P_{0}^{n}\left(\int_{\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n\varepsilon^{2}}\right)\to 0

Finally, to complete the proof, let

A_{n}=\left\{\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq\log 2-n\varepsilon^{2}\right\}

then,

\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} =\left(\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\right)1_{A_{n}}+\left(\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\right)1_{A_{n}^{c}}
\leq(\log 2-n\varepsilon^{2})+\underbrace{\left(n\varepsilon^{2}-\log 2+\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\right)1_{A_{n}^{c}}}_{\tilde{A}_{n}}

Moreover,

P_{0}^{n}(|\tilde{A}_{n}|>\epsilon)\leq P_{0}^{n}(1_{A_{n}^{c}}=1)\to 0

as shown before. Thus, \tilde{A}_{n}=o_{P_{0}^{n}}(1). ∎

Proposition 7.18.

Suppose condition (C1) holds with some 0<a<1. Let f_{\boldsymbol{\theta}_{n}} be a neural network satisfying assumption (A1) for some 0\leq\delta<1-a. With \boldsymbol{\omega}_{n}=\boldsymbol{\theta}_{n}, define

N_{\kappa/n^{\delta}}=\{\boldsymbol{\omega}_{n}:(1/\sigma_{0}^{2})\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}<\kappa/n^{\delta}\} (53)

For every \tilde{\kappa}>0,

  1.
    Suppose (A2) holds with the same \delta as (A1). With p(\boldsymbol{\omega}_{n}) as in (17),

    \int_{\boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-\tilde{\kappa}n^{1-\delta}},\>\>n\to\infty.

  2.
    Suppose (A3) holds with some v>1. With p(\boldsymbol{\omega}_{n}) as in (18),

    \int_{\boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-\tilde{\kappa}n^{1-\delta}},\>\>n\to\infty
Proof.

This proof uses some ideas from the proof of Theorem 1 in Lee [2000].

By assumption (A1), let f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})=\beta_{00}+\sum_{j=1}^{k_{n}}\beta_{j0}\psi(\gamma_{j0}^{\top}\boldsymbol{x}) be a neural network such that

||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}^{2}\leq\frac{\kappa\sigma_{0}^{2}}{4n^{\delta}} (54)

Define the neighborhood M_{\kappa} as follows:

M_{\kappa}=\{\boldsymbol{\omega}_{n}:|{\theta}_{in}-{\theta}_{i0n}|<\sqrt{\kappa/(4n^{\delta}m_{n})}\sigma_{0},\>i=1,\cdots,K(n)\}

where m_{n}=8K(n)^{2}+8(p+1)^{2}(\sum_{i=1}^{K(n)}|\theta_{i0n}|)^{2}.

Note that m_{n}\geq 8k_{n}+8(p+1)^{2}(\sum_{j=1}^{k_{n}}|\beta_{j0}|)^{2}; thereby using Lemma 7.2 with \epsilon=\sqrt{\kappa/(4n^{\delta}m_{n})}\sigma_{0}, we get

\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x}))^{2}d\boldsymbol{x}\leq\frac{\kappa}{4n^{\delta}}\sigma_{0}^{2} (55)

for every \boldsymbol{\omega}_{n}\in M_{\kappa}. In view of (54) and (55), we have

\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}\leq 2||f_{\boldsymbol{\theta}_{n}}-f_{\boldsymbol{\theta}_{0n}}||_{2}^{2}+2||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}^{2}\leq\frac{\kappa\sigma_{0}^{2}}{n^{\delta}}\hskip 28.45274pt\text{ by (36)} (56)

Using (56) in (53), we get \boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}} for every \boldsymbol{\omega}_{n}\in M_{\kappa}. Therefore,

\int_{\boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}

We next show that

\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}>e^{-\tilde{\kappa}n^{1-\delta}}

For notational simplicity, let \delta_{n}=\sqrt{\kappa/(4n^{\delta}m_{n})}\sigma_{0}.

We first prove statement 1. of Proposition 7.18.

\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} =\prod_{i=1}^{K(n)}\int_{\theta_{i0n}-\delta_{n}}^{\theta_{i0n}+\delta_{n}}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}d\theta_{in}
=\prod_{i=1}^{K(n)}\frac{2\delta_{n}}{\zeta\sqrt{2\pi}}e^{-\frac{t_{i}^{2}}{2\zeta^{2}}},\>\>t_{i}\in[\theta_{i0n}-\delta_{n},\theta_{i0n}+\delta_{n}]\hskip 14.22636pt\text{by the mean value theorem}
=\exp\left(-K(n)\left(\frac{1}{2}\log\frac{\pi\zeta^{2}}{2}-\log\delta_{n}\right)-\sum_{i=1}^{K(n)}\frac{t_{i}^{2}}{2\zeta^{2}}\right)
\geq\exp\left(-K(n)\left(\frac{1}{2}\log\frac{\pi\zeta^{2}}{2}-\log\delta_{n}\right)-\sum_{i=1}^{K(n)}\frac{\max((\theta_{i0n}-\epsilon)^{2},(\theta_{i0n}+\epsilon)^{2})}{2\zeta^{2}}\right) (57)

for any \epsilon>0, since t_{i}\in[\theta_{i0n}-\epsilon,\theta_{i0n}+\epsilon] once \delta_{n}\leq\epsilon (recall that \delta_{n}\to 0).

Using assumption (A2) and condition (C1) together with (36), we get

\sum_{i=1}^{K(n)}\max((\theta_{i0n}-\epsilon)^{2},(\theta_{i0n}+\epsilon)^{2}) \leq 2\sum_{i=1}^{K(n)}{\theta}^{2}_{i0n}+2\epsilon^{2}K(n)\leq\tilde{\kappa}n^{1-\delta}
K(n)\left(\frac{1}{2}\log\frac{\pi\zeta^{2}}{2}-\log\delta_{n}\right) =K(n)\left(\frac{1}{2}\log\frac{\pi\zeta^{2}}{2}+\frac{1}{2}\delta\log n+\frac{1}{2}\log 4+\frac{1}{2}\log m_{n}-\frac{1}{2}\log\kappa-\log\sigma_{0}\right)
\leq\tilde{\kappa}n^{1-\delta} (58)

where the last inequality is a consequence of (C1) and the fact that \log m_{n}=O(\log n), as shown next:

\log m_{n}\leq\log(8K(n)^{2}+8(p+1)^{2}K(n)\sum_{i=1}^{K(n)}\theta_{i0n}^{2})\leq\log(V_{1}n^{2a}+V_{2}n^{a}n^{1-\delta})\leq V_{3}\log n.

where the first inequality is a consequence of the Cauchy-Schwarz inequality and the second is a consequence of condition (C1) and assumption (A2).

Therefore, substituting (58) into (57), we get

\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} \geq\exp(-\tilde{\kappa}n^{1-\delta})

We next prove statement 2. of Proposition 7.18.

\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} =\prod_{i=1}^{K(n)}\int_{\theta_{i0n}-\delta_{n}}^{\theta_{i0n}+\delta_{n}}\frac{1}{\sqrt{2\pi\zeta^{2}n^{u}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}n^{u}}}d\theta_{in}
=\left(\frac{2\delta_{n}}{\sqrt{2\pi\zeta^{2}n^{u}}}\right)^{K(n)}e^{-\sum_{i=1}^{K(n)}\frac{t_{i}^{2}}{2\zeta^{2}n^{u}}},\>\>t_{i}\in[\theta_{i0n}-\delta_{n},\theta_{i0n}+\delta_{n}],\hskip 14.22636pt\text{by the mean value theorem}
\geq\exp\left(-K(n)\left(\frac{1}{2}\log\frac{\pi\zeta^{2}}{2}+\frac{u}{2}\log n-\log\delta_{n}\right)-\sum_{i=1}^{K(n)}\frac{\max((\theta_{i0n}-\epsilon)^{2},(\theta_{i0n}+\epsilon)^{2})}{2\zeta^{2}n^{u}}\right) (59)

for any \epsilon>0, since t_{i}\in[\theta_{i0n}-\epsilon,\theta_{i0n}+\epsilon] once \delta_{n}\leq\epsilon.

Under assumption (A3) and condition (C1) together with (36), we have

\frac{1}{n^{u}}\sum_{i=1}^{K(n)}\max((\theta_{i0n}-\epsilon)^{2},(\theta_{i0n}+\epsilon)^{2})\leq\frac{2}{n^{u}}\left(\sum_{i=1}^{K(n)}{\theta}^{2}_{i0n}+\epsilon^{2}K(n)\right) \leq\tilde{\kappa}n^{1-\delta}
K(n)\left(\frac{1}{2}\log\frac{\pi\zeta^{2}}{2}+\frac{u}{2}\log n-\log\delta_{n}\right) \leq\tilde{\kappa}n^{1-\delta} (60)

where the last inequality holds by mimicking the argument in the proof of part 1.

Therefore, substituting (60) into (59), we get

\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} \geq\exp(-\tilde{\kappa}n^{1-\delta})

which completes the proof. ∎

Proposition 7.19.

Suppose condition (C1) and assumption (A1) hold for some 0<a<10<a<1 and 0δ<1a0\leq\delta<1-a.

  1. Suppose (A2) holds with the same \delta as (A1) and p(\boldsymbol{\omega}_{n}) satisfies (17).

  2. Suppose (A3) holds for some v>1 and p(\boldsymbol{\omega}_{n}) satisfies (18).

Then, there exists a q𝒬nq\in\mathcal{Q}_{n} with 𝒬n\mathcal{Q}_{n} as in (13) such that

dKL(q(.),π(.|𝒚n,𝑿n))=oP0n(n1δ)d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=o_{P_{0}^{n}}(n^{1-\delta}) (61)
Proof.
dKL(q(.),π(.|𝒚n,𝑿n))\displaystyle d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n})) =q(𝝎n)logq(𝝎n)𝑑𝝎nq(𝝎n)logπ(𝝎n|𝒚n,𝑿n)𝑑𝝎n\displaystyle=\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})d\boldsymbol{\omega}_{n}
=q(𝝎n)logq(𝝎n)𝑑𝝎nq(𝝎n)logL(𝝎n)p(𝝎n)L(𝝎n)p(𝝎n)𝑑𝝎nd𝝎n\displaystyle=\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})}{\int L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}d\boldsymbol{\omega}_{n}
=dKL(q(.),p(.))q(𝝎n)logL(𝝎n)L0d𝝎n+logp(𝝎n)L(𝝎n)L0𝑑𝝎n\displaystyle=\underbrace{d_{KL}(q(.),p(.))}_{①}\underbrace{-\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}}_{②}+\underbrace{\log\int p(\boldsymbol{\omega}_{n})\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}}_{③}

We first prove statement 1. of the proposition.

Here, we have

p(\boldsymbol{\omega}_{n})=\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}\hskip 14.22636ptq(\boldsymbol{\omega}_{n})=\prod_{i=1}^{K(n)}\sqrt{\frac{n}{2\pi\tau^{2}}}e^{-\frac{n}{2\tau^{2}}(\theta_{in}-\theta_{i0n})^{2}} (62)
\displaystyle d_{KL}(q(.),p(.)) =\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}
\displaystyle=\sum_{i=1}^{K(n)}\int\left(\frac{1}{2}\log n-\frac{1}{2}\log 2\pi-\log\tau-\frac{n(\theta_{in}-\theta_{i0n})^{2}}{2\tau^{2}}\right)\sqrt{\frac{n}{2\pi\tau^{2}}}e^{-\frac{n(\theta_{in}-\theta_{i0n})^{2}}{2\tau^{2}}}d\theta_{in}
\displaystyle\quad-\sum_{i=1}^{K(n)}\int\left(-\frac{1}{2}\log 2\pi-\log\zeta-\frac{\theta_{in}^{2}}{2\zeta^{2}}\right)\sqrt{\frac{n}{2\pi\tau^{2}}}e^{-\frac{n(\theta_{in}-\theta_{i0n})^{2}}{2\tau^{2}}}d\theta_{in}
\displaystyle=\frac{K(n)}{2}(\log n-\log 2\pi-2\log\tau-1)+\frac{K(n)}{2}(\log 2\pi+2\log\zeta)+\sum_{i=1}^{K(n)}\frac{\theta_{i0n}^{2}+\tau^{2}/n}{2\zeta^{2}} (63)

Thus,

①=\frac{K(n)}{2}\log n+K(n)\log\frac{\zeta}{\tau\sqrt{e}}+\frac{1}{2\zeta^{2}}\sum_{i=1}^{K(n)}\theta_{i0n}^{2}+\frac{K(n)\tau^{2}}{2\zeta^{2}n}=o(n^{1-\delta})

where the last equality is a consequence of condition (C1) and assumption (A2).
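As an illustrative numerical check (outside the formal argument), the per-coordinate term in ① is the KL divergence between the two univariate Gaussians N(\theta_{i0n},\tau^{2}/n) and N(0,\zeta^{2}). The sketch below, with hypothetical values of n, \tau, \zeta and \theta_{i0n}, verifies the closed form against direct quadrature; summing K(n) such terms gives the \frac{K(n)}{2}\log n leading behavior used above.

```python
# Illustrative check (hypothetical values): per-coordinate term in the bound above,
# KL( N(theta0, tau^2/n) || N(0, zeta^2) ), closed form vs. quadrature.
import numpy as np
from scipy import integrate

def kl_closed(mu, s2, zeta2):
    # closed-form KL between N(mu, s2) and N(0, zeta2)
    return 0.5 * (np.log(zeta2 / s2) + (s2 + mu ** 2) / zeta2 - 1.0)

def kl_numeric(mu, s2, zeta2):
    # direct quadrature of E_q[log q - log p]
    q = lambda t: np.exp(-(t - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    log_ratio = lambda t: (0.5 * np.log(zeta2 / s2)
                           - (t - mu) ** 2 / (2 * s2) + t ** 2 / (2 * zeta2))
    val, _ = integrate.quad(lambda t: q(t) * log_ratio(t), -np.inf, np.inf)
    return val

n, tau2, zeta2, theta0 = 1000, 1.0, 2.0, 0.3
print(kl_closed(theta0, tau2 / n, zeta2))   # ~ (1/2) log n plus constants
print(kl_numeric(theta0, tau2 / n, zeta2))  # agrees up to quadrature error
```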

For ②, note that

dKL(l0,l𝝎n)\displaystyle d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}}) =(12logσ02σ0212σ02(yf0(𝒙))2+12σ02(yf𝜽n(𝒙))2)12πσ02e(yf0(𝒙))22σ02𝑑y𝑑𝒙\displaystyle=\int\int\left(\frac{1}{2}\log\frac{\sigma_{0}^{2}}{\sigma_{0}^{2}}-\frac{1}{2\sigma_{0}^{2}}(y-f_{0}(\boldsymbol{x}))^{2}+\frac{1}{2\sigma_{0}^{2}}(y-f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x}))^{2}\right)\frac{1}{\sqrt{2\pi\sigma_{0}^{2}}}e^{-\frac{(y-f_{0}(\boldsymbol{x}))^{2}}{2\sigma_{0}^{2}}}dyd\boldsymbol{x}
=12σ02(f𝜽n(𝒙)f0(𝒙))2𝑑𝒙\displaystyle=\frac{1}{2\sigma_{0}^{2}}\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x} (64)

By Lemma 7.9 part 1., dKL(l0,l𝝎n)=o(nδ)d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})=o(n^{-\delta}). Therefore, by Lemma 7.11, =oP0n(n1δ)②=o_{P_{0}^{n}}(n^{1-\delta}).

Using part 1. of Proposition 7.18 in Lemma 7.10, we get =oP0n(n1δ)③=o_{P_{0}^{n}}(n^{1-\delta}).

Next, we prove statement 2. of the proposition.

Here, we have

p(\boldsymbol{\omega}_{n})=\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}n^{u}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}n^{u}}}\hskip 14.22636ptq(\boldsymbol{\theta}_{n})=\prod_{i=1}^{K(n)}\sqrt{\frac{n^{v+1}}{2\pi\tau^{2}}}e^{-\frac{n^{v+1}}{2\tau^{2}}(\theta_{in}-\theta_{i0n})^{2}} (65)
\displaystyle d_{KL}(q(.),p(.)) =\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}
\displaystyle=\frac{1}{2}\sum_{i=1}^{K(n)}\int\left(\log n^{v+1}-\log 2\pi-2\log\tau-\frac{(\theta_{in}-\theta_{i0n})^{2}}{\tau^{2}/n^{v+1}}\right)\sqrt{\frac{n^{v+1}}{2\pi\tau^{2}}}e^{-\frac{(\theta_{in}-\theta_{i0n})^{2}}{2\tau^{2}/n^{v+1}}}d\theta_{in}
\displaystyle\quad-\frac{1}{2}\sum_{i=1}^{K(n)}\int\left(-\log 2\pi-2\log\zeta-\log n^{u}-\frac{\theta_{in}^{2}}{\zeta^{2}n^{u}}\right)\sqrt{\frac{n^{v+1}}{2\pi\tau^{2}}}e^{-\frac{(\theta_{in}-\theta_{i0n})^{2}}{2\tau^{2}/n^{v+1}}}d\theta_{in}
\displaystyle=\frac{K(n)}{2}((v+1)\log n-\log 2\pi-2\log\tau-1)+\frac{K(n)}{2}(\log 2\pi+2\log\zeta+u\log n)
\displaystyle\quad+\sum_{i=1}^{K(n)}\frac{\theta_{i0n}^{2}+\frac{\tau^{2}}{n^{v+1}}}{2\zeta^{2}n^{u}} (66)

Thus,

①=(v+1+u)\frac{K(n)}{2}\log n+K(n)\log\frac{\zeta}{\tau\sqrt{e}}+\frac{1}{2\zeta^{2}n^{u}}\sum_{i=1}^{K(n)}\theta_{i0n}^{2}+\frac{K(n)\tau^{2}}{2\zeta^{2}n^{u+v+1}}=o(n^{1-\delta})

where the last equality is a consequence of condition (C1) and assumption (A3).

By Lemma 7.9 part 2., d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})=o(n^{-\delta}). Therefore, by Lemma 7.11, ②=o_{P_{0}^{n}}(n^{1-\delta}).

Using part 2. of Proposition 7.18 in Lemma 7.10, we get ③=o_{P_{0}^{n}}(n^{1-\delta}), which completes the proof. ∎

7.3 Lemmas and Propositions for Theorem 4.1

Lemma 7.20.

Let \widetilde{\mathcal{G}}_{n}=\{\sqrt{g}:g\in\mathcal{G}_{n}\} where \mathcal{G}_{n} is given by (27) with K(n)\sim n^{a}, C_{n}=e^{n^{b-a}}, D_{n}=e^{n^{b}}. Then,

1n0εH[](u,𝒢~n,||.||2)𝑑uε2\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}
Proof.

This proof uses some ideas from the proof of Lemma 2 in Lee [2000]. First, note that by Lemma 4.1 in Pollard [1990], we have

N(ε,n,||.||)(3Cnε)K(n)(3Dnε)N(\varepsilon,\mathcal{F}_{n},||.||_{\infty})\leq\Big{(}\frac{3C_{n}}{\varepsilon}\Big{)}^{K(n)}\Big{(}\frac{3D_{n}}{\varepsilon}\Big{)}

For 𝝎1,𝝎2n\boldsymbol{\omega}_{1},\boldsymbol{\omega}_{2}\in\mathcal{F}_{n}, let L~(u)=Lu𝝎1+(1u)𝝎2(𝒙,y)\widetilde{L}(u)=\sqrt{L_{u\boldsymbol{\omega}_{1}+(1-u)\boldsymbol{\omega}_{2}}(\boldsymbol{x},y)}.

Applying the mean value theorem to \widetilde{L}(u) on [0,1], we get

\sqrt{L_{\boldsymbol{\omega}_{1}}(\boldsymbol{x},y)}-\sqrt{L_{\boldsymbol{\omega}_{2}}(\boldsymbol{x},y)}\leq\underbrace{(K(n)+1)\sup_{i}\Big{|}\frac{\partial{\widetilde{L}}}{\partial{\omega_{i}}}\Big{|}}_{F(\boldsymbol{x},y)}||\omega_{1}-\omega_{2}||_{\infty}\leq F(\boldsymbol{x},y)||\omega_{1}-\omega_{2}||_{\infty} (67)

where the upper bound on F(𝒙,y)F(\boldsymbol{x},y) is calculated as:

|L~βj|\displaystyle|\frac{\partial{\widetilde{L}}}{\partial{\beta_{j}}}| (8πe2)1/4Cn3/2,j=0,,kn\displaystyle\leq(8\pi e^{2})^{-1/4}C_{n}^{3/2},j=0,\cdots,k_{n}
|L~γjh|\displaystyle|\frac{\partial{\widetilde{L}}}{\partial{\gamma_{jh}}}| (8πe2)1/4Cn5/2,j=0,,kn,h=0,,p\displaystyle\leq(8\pi e^{2})^{-1/4}C_{n}^{5/2},j=0,\cdots,k_{n},h=0,\cdots,p
|L~ρ|\displaystyle|\frac{\partial{\widetilde{L}}}{\partial{\rho}}| ((16π)1/4+(πe2/8)1/4)Cn5/2\displaystyle\leq((16\pi)^{-1/4}+(\pi e^{2}/8)^{-1/4})C_{n}^{5/2}

In view of (67) and Theorem 2.7.11 in van der Vaart et al. [1996], we have

N[](ε,𝒢~n,||.||2)(MK(n)Cn7/2ε)K(n)(MDnK(n)Cn5/2ε)N_{[]}(\varepsilon,\widetilde{\mathcal{G}}_{n},||.||_{2})\leq\Big{(}\frac{MK(n)C_{n}^{7/2}}{\varepsilon}\Big{)}^{K(n)}\Big{(}\frac{MD_{n}K(n)C_{n}^{5/2}}{\varepsilon}\Big{)}

for some constant M>0M>0. Therefore,

H[](ε,𝒢~n,||.||2)K(n)logK(n)Cn7/2(DnK(n)Cn5/2)1/K(n)εH_{[]}(\varepsilon,\widetilde{\mathcal{G}}_{n},||.||_{2})\lesssim K(n)\log\frac{K(n)C_{n}^{7/2}(D_{n}K(n)C_{n}^{5/2})^{1/K(n)}}{\varepsilon}

Using Lemma 7.12 with M_{n}=K(n)C_{n}^{7/2}(D_{n}K(n)C_{n}^{5/2})^{1/K(n)}, we get

\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\lesssim\varepsilon O\left(\sqrt{K(n)\log(K(n)C_{n}^{7/2}(D_{n}K(n)C_{n}^{5/2})^{1/K(n)})}\right)=\varepsilon O(\sqrt{n^{b}})

where the last equality holds since K(n)naK(n)\sim n^{a}, Cn=enbaC_{n}=e^{n^{b-a}}, Dn=enbD_{n}=e^{n^{b}}.

Therefore,

\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}

∎
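The role of Lemma 7.12 here can be spot-checked numerically. The sketch below assumes the lemma bounds the bracketing-entropy integral by a constant multiple of \varepsilon\sqrt{K(n)\log(M_{n}/\varepsilon)}; the values of K(n), M_{n} and \varepsilon are hypothetical.

```python
# Spot check (hypothetical K, M, eps; assumes Lemma 7.12 yields a bound of order
# eps * sqrt(K * log(M / eps)) for the entropy integral).
import numpy as np
from scipy import integrate

K, M, eps = 100.0, 1e8, 0.1
integral, _ = integrate.quad(lambda u: np.sqrt(K * np.log(M / u)), 0.0, eps)
bound = eps * np.sqrt(K * np.log(M / eps))
print(integral, bound)   # same order of magnitude
```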

Lemma 7.21.

Let

\mathcal{F}_{n}=\Big{\{}(\boldsymbol{\theta}_{n},\sigma):|\theta_{in}|\leq C_{n},i=1,\cdots,K(n),1/C_{n}\leq\sigma\leq D_{n}\Big{\}}

where K(n)\sim n^{a}, C_{n}=e^{n^{b-a}}, D_{n}=e^{n^{b}}, 0<a<b<1. Suppose p(\boldsymbol{\omega}_{n}) satisfies (28); then for any \kappa>0 and 0<r<b,

𝝎nncp(𝝎n)𝑑𝝎neκnr,n\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq e^{-\kappa n^{r}},n\to\infty
Proof.

This proof uses some ideas from the proof of Theorem 1 in Lee [2000].

Let in={θin:|θin|Cn}\mathcal{F}_{in}=\{\theta_{in}:|\theta_{in}|\leq C_{n}\} and 0n={σ:1/CnσDn}\mathcal{F}_{0n}=\{\sigma:1/C_{n}\leq\sigma\leq D_{n}\}.

n=0ni=1K(n)innc=0nci=1K(n)inc\mathcal{F}_{n}=\mathcal{F}_{0n}\cap_{i=1}^{K(n)}\mathcal{F}_{in}\implies\mathcal{F}_{n}^{c}=\mathcal{F}_{0n}^{c}\cup\cup_{i=1}^{K(n)}\mathcal{F}_{in}^{c}
𝝎nncp(𝝎n)𝑑𝝎n\displaystyle\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} 0ncλαΓ(α)(1σ2)α+1eλσ2𝑑σ2+i=1K(n)inc12πζ2eθin22ζ2𝑑θin\displaystyle\leq\int_{\mathcal{F}_{0n}^{c}}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma^{2}}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma^{2}}}d\sigma^{2}+\sum_{i=1}^{K(n)}\int_{\mathcal{F}_{in}^{c}}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}d\theta_{in}
\displaystyle=\int_{0}^{1/C_{n}^{2}}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma^{2}}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma^{2}}}d\sigma^{2}+\int_{D_{n}^{2}}^{\infty}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma^{2}}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma^{2}}}d\sigma^{2}+e^{-n\kappa}

where the last equality is a consequence of Lemma 7.16.

=01/CnλαΓ(α)(1σ)α+1eλσ𝑑σ+DnλαΓ(α)(1σ)α+1eλσ𝑑σ+enκ\displaystyle=\int_{0}^{1/C_{n}}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma}}d\sigma+\int_{D_{n}}^{\infty}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma}}d\sigma+e^{-n\kappa}
=CnλαΓ(α)uα1eu𝑑u+01/DnλαΓ(α)uα1eλu𝑑u+enκ\displaystyle=\int_{C_{n}}^{\infty}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}u^{\alpha-1}e^{-u}du+\int_{0}^{1/D_{n}}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}u^{\alpha-1}e^{-\lambda u}du+e^{-n\kappa}
CnλαΓ(α)eu/2𝑑u+01/DnλαΓ(α)uα1𝑑u+enκxαexex/2,x\displaystyle\lesssim\int_{C_{n}}^{\infty}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}e^{-u/2}du+\int_{0}^{1/D_{n}}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}u^{\alpha-1}du+e^{-n\kappa}\hskip 14.22636ptx^{\alpha}e^{-x}\leq e^{-x/2},x\to\infty
eenba/2+eαnb+enκeκnr\displaystyle\sim e^{-e^{n^{b-a}}/2}+e^{-\alpha n^{b}}+e^{-n\kappa}\leq e^{-\kappa n^{r}}

for any \kappa>0 and 0<r<b. ∎
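To get a feel for how small the prior mass outside \mathcal{F}_{n} is, the sketch below evaluates the two tail terms with scipy. The constants \alpha, \lambda, \zeta and the exponents a, b are hypothetical, chosen only for illustration.

```python
# Illustrative tail-mass computation for the truncated sieve F_n.
import numpy as np
from scipy import stats

a, b, n = 0.5, 0.8, 50
K_n = int(n ** a)                       # K(n) ~ n^a parameters
C_n = np.exp(n ** (b - a))              # weight truncation level
D_n = np.exp(n ** b)                    # upper truncation for sigma
alpha, lam, zeta = 2.0, 1.0, 1.0        # hypothetical prior hyperparameters

sigma2_prior = stats.invgamma(alpha, scale=lam)       # Inverse-Gamma prior on sigma^2
tail_sigma = sigma2_prior.cdf(1 / C_n ** 2) + sigma2_prior.sf(D_n ** 2)
tail_theta = 2 * K_n * stats.norm(0, zeta).sf(C_n)    # union bound over coordinates
print(tail_sigma + tail_theta)          # essentially zero, far below exp(-kappa n^r)
```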

Proposition 7.22.

Suppose condition (C1) holds with 0<a<10<a<1 and p(𝛚n)p(\boldsymbol{\omega}_{n}) satisfies (28). Then,

log𝒱εcL(𝝎n)L0p(𝝎n)𝑑𝝎nlog2nrε2+oP0n(1)\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq\log 2-n^{r}\varepsilon^{2}+o_{P_{0}^{n}}(1)

for every 0<r<10<r<1.

Proof.

This proof uses some ideas from the proof of Lemma 3 in Lee [2000]. We shall first show

P0n(log𝒱εcL(𝝎n)L0p(𝝎n)𝑑𝝎nlog2nrε2)0,nP_{0}^{n}\left(\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq\log 2-n^{r}\varepsilon^{2}\right)\to 0,\>\>n\to\infty
P0n(log𝒱εcL(𝝎n)L0p(𝝎n)𝑑𝝎nlog2nrε2)=P0n(𝒱εcL(𝝎n)L0p(𝝎n)𝑑𝝎n2enrε2)\displaystyle P_{0}^{n}\left(\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq\log 2-n^{r}\varepsilon^{2}\right)=P_{0}^{n}\left(\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq 2e^{-n^{r}\varepsilon^{2}}\right)
=P0n(𝒱εcnL(𝝎n)L0p(𝝎n)𝑑𝝎nenrε2)+P0n(𝒱εcncL(𝝎n)L0p(𝝎n)𝑑𝝎nenrε2)\displaystyle=P_{0}^{n}\left(\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n^{r}\varepsilon^{2}}\right)+P_{0}^{n}\left(\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n^{r}\varepsilon^{2}}\right)
P0n(𝒱εcnL(𝝎n)L0p(𝝎n)𝑑𝝎nenε2)+P0n(ncL(𝝎n)L0p(𝝎n)𝑑𝝎nenrε2)since enrε2enε2\displaystyle\leq P_{0}^{n}\left(\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n\varepsilon^{2}}\right)+P_{0}^{n}\left(\int_{\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n^{r}\varepsilon^{2}}\right)\hskip 8.53581pt\text{since }e^{-n^{r}\varepsilon^{2}}\geq e^{-n\varepsilon^{2}}

Take \mathcal{F}_{n} as in (27) with k_{n}\sim n^{a}, C_{n}=e^{n^{b-a}} and D_{n}=e^{n^{b}}, where 0<a<b<1.

By Lemma 7.20,

\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}

Therefore, by Lemma 7.13, we have

P0n(𝒱εcnL(𝝎n)L0p(𝝎n)𝑑𝝎nenε2)0P_{0}^{n}(\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n\varepsilon^{2}})\to 0

In view of Lemma 7.21, for p(\boldsymbol{\omega}_{n}) as in (28), for any 0<r<b,

𝝎nncp(𝝎n)𝑑𝝎ne2nrε2,n\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq e^{-2n^{r}\varepsilon^{2}},n\to\infty

Therefore, by Lemma 7.14 with the same r, \kappa=2\varepsilon^{2} and \tilde{\kappa}=\varepsilon^{2}, we have

P0n(ncL(𝝎n)L0p(𝝎n)𝑑𝝎nenrε2)0P_{0}^{n}(\int_{\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n^{r}\varepsilon^{2}})\to 0

Since b can be arbitrarily close to 1, the remaining part of the proof follows along the lines of the proof of Proposition 7.17. ∎

Proposition 7.23.

Suppose condition (C1) holds with some 0<a<1. Let f_{\boldsymbol{\theta}_{n}} be a neural network satisfying assumptions (A1) and (A2) for some 0\leq\delta<1-a. With \boldsymbol{\omega}_{n}=(\boldsymbol{\theta}_{n},\sigma^{2}), define,

N_{\kappa/n^{\delta}}=\left\{\boldsymbol{\omega}_{n}:d_{KL}(l_{0},l(\boldsymbol{\omega}_{n}))=\frac{1}{2}\log\frac{\sigma^{2}}{\sigma_{0}^{2}}-\frac{1}{2}\Big{(}1-\frac{\sigma_{0}^{2}}{\sigma^{2}}\Big{)}+\frac{1}{2\sigma^{2}}\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}<\frac{\kappa}{n^{\delta}}\right\} (68)

For every κ~>0\tilde{\kappa}>0, with p(𝛚n)p(\boldsymbol{\omega}_{n}) as in (28), we have

𝝎nNκ/nδp(𝝎n)𝑑𝝎neκ~n1δ,n.\int_{\boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-\tilde{\kappa}n^{1-\delta}},\>\>n\to\infty.
Proof.

This proof uses some ideas from the proof of Theorem 1 in Lee [2000].

By assumption (A1), let f𝜽0n(𝒙)=β00+j=1knβj0ψ(γj0𝒙)f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})=\beta_{00}+\sum_{j=1}^{k_{n}}\beta_{j0}\psi(\gamma_{j0}^{\top}\boldsymbol{x}) be a neural network such that

f𝜽0nf02κ8nδ||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}\leq\frac{\kappa}{8n^{\delta}} (69)

Define the neighborhood M_{\kappa} as follows:

Mκ={𝝎n:|σσ0|<κ/2nδσ0,|θinθi0n|<κ/(8nδmn)σ0,i=1,,K(n)}\displaystyle M_{\kappa}=\{\boldsymbol{\omega}_{n}:|\sigma-\sigma_{0}|<\sqrt{\kappa/2n^{\delta}}\sigma_{0},|{\theta}_{in}-{\theta}_{i0n}|<\sqrt{\kappa/(8n^{\delta}m_{n})}\sigma_{0},i=1,\cdots,K(n)\}

where m_{n}=8K(n)^{2}+8(p+1)^{2}(\sum_{j=1}^{K(n)}|\theta_{j0n}|)^{2}.

Note that m_{n}\geq 8k_{n}+8(p+1)^{2}(\sum_{j=1}^{k_{n}}|\beta_{j0}|)^{2}; thus, using Lemma 7.2 with \epsilon=\sqrt{\kappa/(8n^{\delta}m_{n})}\sigma_{0}, we get

(f𝜽n(𝒙)f𝜽0n(𝒙))2𝑑𝒙κ8nδσ02\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x}))^{2}d\boldsymbol{x}\leq\frac{\kappa}{8n^{\delta}}\sigma_{0}^{2} (70)

for any \boldsymbol{\omega}_{n}\in M_{\kappa}.

In view of (69) and (70) together with the elementary inequality (a+b)^{2}\leq 2a^{2}+2b^{2}, we have

\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}\leq 2||f_{\boldsymbol{\theta}_{n}}-f_{\boldsymbol{\theta}_{0n}}||_{2}^{2}+2||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}^{2}\leq\frac{\kappa\sigma_{0}^{2}}{2n^{\delta}} (71)

By Lemma 7.3,

12logσ2σ0212(1σ02σ2)\displaystyle\frac{1}{2}\log\frac{\sigma^{2}}{\sigma_{0}^{2}}-\frac{1}{2}\Big{(}1-\frac{\sigma_{0}^{2}}{\sigma^{2}}\Big{)} κ2nδ\displaystyle\leq\frac{\kappa}{2n^{\delta}}
12σ212σ02(1κ/2nδ)2\displaystyle\frac{1}{2\sigma^{2}}\leq\frac{1}{2\sigma_{0}^{2}(1-\sqrt{\kappa/2n^{\delta}})^{2}} 1σ02\displaystyle\leq\frac{1}{\sigma_{0}^{2}} (72)

Using (71) and (72) in (68), we get \boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}} for every \boldsymbol{\omega}_{n}\in M_{\kappa}. Therefore,

\int_{\boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}

We next show that,

𝝎nMκp(𝝎n)𝑑𝝎n>eκ~n1δ\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}>e^{-\tilde{\kappa}n^{1-\delta}}

For notational simplicity, let \delta_{1n}=\sqrt{\kappa/2n^{\delta}}\sigma_{0} and \delta_{2n}=\sqrt{\kappa/(8n^{\delta}m_{n})}\sigma_{0}.

𝝎nMκp(𝝎n)𝑑𝝎n\displaystyle\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} =(σ0δ1n)2(σ0+δ1n)2p(σ2)𝑑σ2i=1K(n)θi0nδ2nθi0n+δ2np(θin)𝑑θin\displaystyle=\int_{(\sigma_{0}-\delta_{1n})^{2}}^{(\sigma_{0}+\delta_{1n})^{2}}p(\sigma^{2})d\sigma^{2}\prod_{i=1}^{K(n)}\int_{\theta_{i0n}-\delta_{2n}}^{\theta_{i0n}+\delta_{2n}}p(\theta_{in})d\theta_{in}
(σ0δ1n)2(σ0+δ1n)2p(σ2)𝑑σ2e(κ~/2)n1δ\displaystyle\geq\int_{(\sigma_{0}-\delta_{1n})^{2}}^{(\sigma_{0}+\delta_{1n})^{2}}p(\sigma^{2})d\sigma^{2}e^{-(\tilde{\kappa}/2)n^{1-\delta}}

where the first to second step follows from part 1. of Proposition 7.18 since p(\boldsymbol{\theta}_{n}) satisfies (17). Next,

\displaystyle\int_{(\sigma_{0}-\delta_{1n})^{2}}^{(\sigma_{0}+\delta_{1n})^{2}}p(\sigma^{2})d\sigma^{2} =\int_{(\sigma_{0}-\delta_{1n})^{2}}^{(\sigma_{0}+\delta_{1n})^{2}}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma^{2}}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma^{2}}}d\sigma^{2}=\int_{\sigma_{0}-\delta_{1n}}^{\sigma_{0}+\delta_{1n}}\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma}}d\sigma
\displaystyle=2\delta_{1n}\underbrace{\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{t}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{t}}}_{f(t)},\>\>t\in[\sigma_{0}-\delta_{1n},\sigma_{0}+\delta_{1n}]\hskip 14.22636pt\text{by mean value theorem}
\displaystyle\geq\frac{\delta_{1n}\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma_{0}+\epsilon}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma_{0}-\epsilon}}
\displaystyle=\exp\left(-\left(-\log\delta_{1n}-\alpha\log\lambda+\log\Gamma(\alpha)+(\alpha+1)\log(\sigma_{0}+\epsilon)+\frac{\lambda}{\sigma_{0}-\epsilon}\right)\right) (73)

where the third inequality holds since, for any \epsilon>0, t\in[\sigma_{0}-\epsilon,\sigma_{0}+\epsilon] when \delta_{1n}\to 0. Now,

logδ1nαlogλ+logΓ(α)+(α+1)log(σ0+ϵ)+λσ0ϵ\displaystyle-\log\delta_{1n}-\alpha\log\lambda+\log\Gamma(\alpha)+(\alpha+1)\log(\sigma_{0}+\epsilon)+\frac{\lambda}{\sigma_{0}-\epsilon}
=12δlogn+12log212logκlogσ0αlogλ+logΓ(α)+(α+1)log(σ0+ϵ)+λσ0ϵ(κ~/2)n1δ\displaystyle=\frac{1}{2}\delta\log n+\frac{1}{2}\log 2-\frac{1}{2}\log\kappa-\log\sigma_{0}-\alpha\log\lambda+\log\Gamma(\alpha)+(\alpha+1)\log(\sigma_{0}+\epsilon)+\frac{\lambda}{\sigma_{0}-\epsilon}\leq(\tilde{\kappa}/2)n^{1-\delta} (74)

Using (74) in (73), we get

𝝎nMκp(𝝎n)𝑑𝝎neκ~n1δ\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-\tilde{\kappa}n^{1-\delta}}

which completes the proof. ∎
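The lower bound just proved can be illustrated numerically: the interval around \sigma_{0} shrinks only polynomially in n, so its Inverse-Gamma prior mass dwarfs the exponentially small threshold e^{-\tilde{\kappa}n^{1-\delta}}. A sketch with hypothetical \alpha, \lambda, \kappa, \tilde{\kappa} and \delta:

```python
# Illustrative check of the prior-mass lower bound in Proposition 7.23.
import numpy as np
from scipy import stats

alpha, lam, sigma0, kappa, delta = 2.0, 1.0, 1.0, 0.5, 0.2   # hypothetical values
prior = stats.invgamma(alpha, scale=lam)                     # prior on sigma^2
for n in [100, 1000, 10000]:
    d1n = np.sqrt(kappa / (2 * n ** delta)) * sigma0         # delta_{1n}
    mass = prior.cdf((sigma0 + d1n) ** 2) - prior.cdf((sigma0 - d1n) ** 2)
    print(n, mass, np.exp(-0.1 * n ** (1 - delta)))          # mass >> threshold
```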

Proposition 7.24.

Suppose condition (C1) and assumptions (A1) and (A2) hold for some 0<a<10<a<1 and 0δ<1a0\leq\delta<1-a. Suppose the prior p(𝛚n)p(\boldsymbol{\omega}_{n}) satisfies (28).

Then, there exists a q𝒬nq\in\mathcal{Q}_{n} with 𝒬n\mathcal{Q}_{n} as in (29) such that

dKL(q(.),π(.|𝒚n,𝑿n))=oP0n(n1δ)d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=o_{P_{0}^{n}}(n^{1-\delta}) (75)
Proof.
dKL(q(.),π(.|𝒚n,𝑿n))\displaystyle d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n})) =q(𝝎n)logq(𝝎n)𝑑𝝎nq(𝝎n)logπ(𝝎n|𝒚n,𝑿n)𝑑𝝎n\displaystyle=\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})d\boldsymbol{\omega}_{n}
=q(𝝎n)logq(𝝎n)𝑑𝝎nq(𝝎n)logL(𝝎n)p(𝝎n)L(𝝎n)p(𝝎n)𝑑𝝎nd𝝎n\displaystyle=\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})}{\int L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}d\boldsymbol{\omega}_{n}
=dKL(q(.),p(.))q(𝝎n)logL(𝝎n)L0d𝝎n+logp(𝝎n)L(𝝎n)L0𝑑𝝎n\displaystyle=\underbrace{d_{KL}(q(.),p(.))}_{①}\underbrace{-\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}}_{②}+\underbrace{\log\int p(\boldsymbol{\omega}_{n})\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}}_{③}

We first deal with ① as follows.

p(\boldsymbol{\omega}_{n})=\underbrace{\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\Big{(}\frac{1}{\sigma^{2}}\Big{)}^{\alpha+1}e^{-\frac{\lambda}{\sigma^{2}}}}_{p(\sigma^{2})}\underbrace{\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}}_{p(\boldsymbol{\theta}_{n})}\>\>\>\>q(\boldsymbol{\omega}_{n})=\underbrace{\frac{(n\sigma_{0}^{2})^{n}}{\Gamma(n)}\Big{(}\frac{1}{\sigma^{2}}\Big{)}^{n+1}e^{-\frac{n\sigma_{0}^{2}}{\sigma^{2}}}}_{q(\sigma^{2})}\underbrace{\prod_{i=1}^{K(n)}\sqrt{\frac{n}{2\pi\tau^{2}}}e^{-\frac{n(\theta_{in}-\theta_{i0n})^{2}}{2\tau^{2}}}}_{q(\boldsymbol{\theta}_{n})} (76)
dKL(q(.),p(.))=q(𝝎n)logq(𝝎n)d𝝎nq(𝝎n)logp(𝝎n)d𝝎n\displaystyle d_{KL}(q(.),p(.))=\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}
=q(σ2)logq(σ2)𝑑σ2q(σ2)logp(σ2)𝑑σ2+q(𝜽n)logq(𝜽n)𝑑𝜽nq(𝜽n)logp(𝜽n)𝑑𝜽n\displaystyle=\int q(\sigma^{2})\log q(\sigma^{2})d\sigma^{2}-\int q(\sigma^{2})\log p(\sigma^{2})d\sigma^{2}+\int q(\boldsymbol{\theta}_{n})\log q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}-\int q(\boldsymbol{\theta}_{n})\log p(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}
=q(σ2)logq(σ2)𝑑σ2q(σ2)logp(σ2)𝑑σ2+o(n1δ)\displaystyle=\int q(\sigma^{2})\log q(\sigma^{2})d\sigma^{2}-\int q(\sigma^{2})\log p(\sigma^{2})d\sigma^{2}+o(n^{1-\delta}) (77)

where the last equality is a consequence of Proposition 7.19. Simplifying further, we get

\displaystyle\int q(\sigma^{2})\log q(\sigma^{2})d\sigma^{2} =\int\left(n\log n\sigma_{0}^{2}-\log\Gamma(n)-(n+1)\log\sigma^{2}-\frac{n\sigma_{0}^{2}}{\sigma^{2}}\right)\frac{(n\sigma_{0}^{2})^{n}}{\Gamma(n)}\Big{(}\frac{1}{\sigma^{2}}\Big{)}^{n+1}e^{-\frac{n\sigma_{0}^{2}}{\sigma^{2}}}d\sigma^{2}
\displaystyle=n\log n\sigma_{0}^{2}-\log\Gamma(n)-(n+1)(\log n\sigma_{0}^{2}-\psi(n))-n
\displaystyle=-\log n\sigma_{0}^{2}+(n+1)\psi(n)-\log(n-1)!-n
\displaystyle=-\log\sigma_{0}^{2}-\log n+(n+1)\log n-(n-1)\log(n-1)+(n-1)-n+O(\log n)
\displaystyle=-\log\sigma_{0}^{2}+O(\log n)=o(n^{1-\delta})

where the equality in step 4 follows by approximating ψ(n)\psi(n) using Lemma 4 in Elezovic and Giordano [2000] and approximating (n1)!(n-1)! by Stirling’s formula.

\displaystyle\int q(\sigma^{2})\log p(\sigma^{2})d\sigma^{2} =\int\left(\alpha\log\lambda-\log\Gamma(\alpha)-(\alpha+1)\log\sigma^{2}-\frac{\lambda}{\sigma^{2}}\right)\frac{(n\sigma_{0}^{2})^{n}}{\Gamma(n)}\Big{(}\frac{1}{\sigma^{2}}\Big{)}^{n+1}e^{-\frac{n\sigma_{0}^{2}}{\sigma^{2}}}d\sigma^{2}
\displaystyle=\alpha\log\lambda-\log\Gamma(\alpha)-(\alpha+1)(\log n\sigma_{0}^{2}-\psi(n))-\frac{\lambda}{\sigma_{0}^{2}}
\displaystyle=\alpha\log\lambda-\log\Gamma(\alpha)-(\alpha+1)(\log n\sigma_{0}^{2}-\log n)-\frac{\lambda}{\sigma_{0}^{2}}+O(\log n)=o(n^{1-\delta})

where the last equality follows by approximating ψ(n)\psi(n) using Lemma 4 in Elezovic and Giordano [2000].
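Both approximations are easy to confirm numerically; the sketch below (with a hypothetical \sigma_{0}) shows that \int q(\sigma^{2})\log q(\sigma^{2})d\sigma^{2}-\frac{1}{2}\log n stabilizes at a constant, so the term is indeed O(\log n).

```python
# Numerical check of the digamma/Stirling simplification for
# q(sigma^2) = Inverse-Gamma(n, n * sigma0^2).
import numpy as np
from scipy.special import digamma, gammaln

def eq_log_q(n, sigma0_sq):
    # E_q[log q(sigma^2)] in closed form
    A = np.log(n * sigma0_sq)
    return n * A - gammaln(n) - (n + 1) * (A - digamma(n)) - n

for n in [10, 100, 1000, 10000]:
    print(n, eq_log_q(n, 1.5) - 0.5 * np.log(n))  # stabilizes at a constant
```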

For ②, note that

dKL(l0,l𝝎n)\displaystyle d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}}) =(12logσ2σ0212σ02(yf0(𝒙))2+12σ2(yf𝜽n(𝒙))2)12πσ02e(yf0(𝒙))22σ02𝑑y𝑑𝒙\displaystyle=\int\int\Big{(}\frac{1}{2}\log\frac{\sigma^{2}}{\sigma_{0}^{2}}-\frac{1}{2\sigma_{0}^{2}}(y-f_{0}(\boldsymbol{x}))^{2}+\frac{1}{2\sigma^{2}}(y-f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x}))^{2}\Big{)}\frac{1}{\sqrt{2\pi\sigma_{0}^{2}}}e^{-\frac{(y-f_{0}(\boldsymbol{x}))^{2}}{2\sigma_{0}^{2}}}dyd\boldsymbol{x}
=12logσ2σ0212+σ022σ2+12σ2(f𝜽n(𝒙)f0(𝒙))2𝑑𝒙\displaystyle=\frac{1}{2}\log\frac{\sigma^{2}}{\sigma_{0}^{2}}-\frac{1}{2}+\frac{\sigma_{0}^{2}}{2\sigma^{2}}+\frac{1}{2\sigma^{2}}\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x} (78)
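Identity (78) can be sanity-checked pointwise: fixing hypothetical values of f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x}), f_{0}(\boldsymbol{x}), \sigma and \sigma_{0}, numerical integration over y recovers the closed form.

```python
# Pointwise check of identity (78) at hypothetical values.
import numpy as np
from scipy import integrate

sigma0, sigma, f0, ftheta = 1.2, 0.9, 0.4, 1.1

def integrand(y):
    log_ratio = (0.5 * np.log(sigma ** 2 / sigma0 ** 2)
                 - (y - f0) ** 2 / (2 * sigma0 ** 2)
                 + (y - ftheta) ** 2 / (2 * sigma ** 2))
    dens = np.exp(-(y - f0) ** 2 / (2 * sigma0 ** 2)) / np.sqrt(2 * np.pi * sigma0 ** 2)
    return log_ratio * dens

numeric, _ = integrate.quad(integrand, -np.inf, np.inf)
closed = (0.5 * np.log(sigma ** 2 / sigma0 ** 2) - 0.5 + sigma0 ** 2 / (2 * sigma ** 2)
          + (ftheta - f0) ** 2 / (2 * sigma ** 2))
print(numeric, closed)   # agree up to quadrature error
```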

By Lemmas 7.5 and 7.6 and part 1 of Lemma 7.9, we have

dKL(l0,l𝝎n)q(𝝎n)𝑑𝝎n=o(nδ)\int d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}=o(n^{-\delta})

Therefore, by Lemma 7.11, ②=o_{P_{0}^{n}}(n^{1-\delta}).

Using Proposition 7.23 in Lemma 7.10, we get ③=o_{P_{0}^{n}}(n^{1-\delta}), which completes the proof. ∎

7.4 Lemmas and Propositions for Theorem 4.4

Lemma 7.25.

For 𝒢n\mathcal{G}_{n} as in (31), let 𝒢~n={g:g𝒢n}\widetilde{\mathcal{G}}_{n}=\{\sqrt{g}:g\in\mathcal{G}_{n}\}. If K(n)naK(n)\sim n^{a}, Cn=enbaC_{n}=e^{n^{b-a}}, 0<a<b<10<a<b<1, then

1n0εH[](u,𝒢~n,||.||2)𝑑uε2\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}
Proof.

First, by Lemma 4.1 in Pollard [1990],

N(ε,n,||.||)(3Cnε)K(n)(3logCnε)N(\varepsilon,\mathcal{F}_{n},||.||_{\infty})\leq\Big{(}\frac{3C_{n}}{\varepsilon}\Big{)}^{K(n)}\Big{(}\frac{3\log C_{n}}{\varepsilon}\Big{)}

For 𝝎1,𝝎2n\boldsymbol{\omega}_{1},\boldsymbol{\omega}_{2}\in\mathcal{F}_{n}, let L~(u)=Lu𝝎1+(1u)𝝎2(𝒙,y)\widetilde{L}(u)=\sqrt{L_{u\boldsymbol{\omega}_{1}+(1-u)\boldsymbol{\omega}_{2}}(\boldsymbol{x},y)}.

Applying the mean value theorem to \widetilde{L}(u) on [0,1], we get

L𝝎1(𝒙,y)L𝝎2(𝒙,y)\displaystyle\sqrt{L_{\boldsymbol{\omega}_{1}}(\boldsymbol{x},y)}-\sqrt{L_{\boldsymbol{\omega}_{2}}(\boldsymbol{x},y)} (K(n)+1)supi|L~ωi|F(𝒙,y)ω1ω2F(𝒙,y)ω1ω2\displaystyle\leq\underbrace{(K(n)+1)\sup_{i}\Big{|}\frac{\partial{\widetilde{L}}}{\partial{\omega_{i}}}\Big{|}}_{F(\boldsymbol{x},y)}||\omega_{1}-\omega_{2}||_{\infty}\leq F(\boldsymbol{x},y)||\omega_{1}-\omega_{2}||_{\infty} (79)

where the upper bound on F(𝒙,y)F(\boldsymbol{x},y) is calculated as:

|L~βj|\displaystyle|\frac{\partial{\widetilde{L}}}{\partial{\beta_{j}}}| 23/2(8πe2)1/4Cn3/2,j=0,,kn\displaystyle\leq 2^{3/2}(8\pi e^{2})^{-1/4}C_{n}^{3/2},j=0,\cdots,k_{n}
|L~γjh|\displaystyle|\frac{\partial{\widetilde{L}}}{\partial{\gamma_{jh}}}| 23/2(8πe2)1/4Cn5/2,j=0,,kn,h=0,,p\displaystyle\leq 2^{3/2}(8\pi e^{2})^{-1/4}C_{n}^{5/2},j=0,\cdots,k_{n},h=0,\cdots,p
|L~ρ|\displaystyle|\frac{\partial{\widetilde{L}}}{\partial{\rho}}| 23/2((16π)1/4+(πe2/8)1/4)Cn5/2\displaystyle\leq 2^{3/2}((16\pi)^{-1/4}+(\pi e^{2}/8)^{-1/4})C_{n}^{5/2}

since log(1+eρ)log(1+elogCn)1/Cn1/(2Cn)\log(1+e^{\rho})\geq\log(1+e^{-\log C_{n}})\sim 1/C_{n}\geq 1/(2C_{n}) and |log(1+eρ)/ρ|1|\partial{\log(1+e^{\rho})}/\partial{\rho}|\leq 1.
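Both softplus facts are immediate to verify numerically (the grid of t and \rho values below is illustrative):

```python
# log(1 + e^{-t}) ~ e^{-t}, hence log(1 + e^{rho}) >= 1/(2 C_n) at rho = -log C_n,
# and the derivative of log(1 + e^{rho}) is the sigmoid, which is at most 1.
import numpy as np

for t in [2.0, 5.0, 10.0]:
    print(np.log1p(np.exp(-t)) * np.exp(t))   # ratio tends to 1

rho = np.linspace(-10.0, 10.0, 5)
print(1.0 / (1.0 + np.exp(-rho)))             # sigmoid values, all <= 1
```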

In view of (79) and Theorem 2.7.11 in van der Vaart et al. [1996], we have

N[](ε,𝒢~n,||.||2)(MK(n)Cn7/2ε)K(n)(MK(n)Cn5/2logCnε)N_{[]}(\varepsilon,\widetilde{\mathcal{G}}_{n},||.||_{2})\leq\Big{(}\frac{MK(n)C_{n}^{7/2}}{\varepsilon}\Big{)}^{K(n)}\Big{(}\frac{MK(n)C_{n}^{5/2}\log C_{n}}{\varepsilon}\Big{)}

for some M>0M>0. Therefore,

H[](ε,𝒢~n,||.||2)K(n)logK(n)Cn7/2(K(n)Cn5/2logCn)1/K(n)εH_{[]}(\varepsilon,\widetilde{\mathcal{G}}_{n},||.||_{2})\lesssim K(n)\log\frac{K(n)C_{n}^{7/2}(K(n)C_{n}^{5/2}\log C_{n})^{1/K(n)}}{\varepsilon}

Using Lemma 7.12 with M_{n}=K(n)C_{n}^{7/2}(K(n)C_{n}^{5/2}\log C_{n})^{1/K(n)}, we get

\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon O\left(\sqrt{K(n)\log(K(n)C_{n}^{7/2}(K(n)C_{n}^{5/2}\log C_{n})^{1/K(n)})}\right)=\varepsilon O(\sqrt{n^{b}})

where the last equality holds since K(n)naK(n)\sim n^{a}, Cn=enbaC_{n}=e^{n^{b-a}}, 0<a<b<10<a<b<1.

Therefore,

\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}

∎

Lemma 7.26.

Let

n={(𝜽n,ρ):|θin|Cn,i=1,,K(n),|ρ|logCn}\mathcal{F}_{n}=\Big{\{}(\boldsymbol{\theta}_{n},\rho):|\theta_{in}|\leq C_{n},i=1,\cdots,K(n),|\rho|\leq\log C_{n}\Big{\}}

where K(n)naK(n)\sim n^{a}, Cn=enbaC_{n}=e^{n^{b-a}}, 0<a<1/20<a<1/2, a+1/2<b<1a+1/2<b<1. Then with

p(𝝎n)=12πη2eρ22η2i=1K(n)12πζ2eθin22ζ2p(\boldsymbol{\omega}_{n})=\frac{1}{\sqrt{2\pi\eta^{2}}}e^{-\frac{\rho^{2}}{2\eta^{2}}}\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}

we have for every κ>0\kappa>0

𝝎nncp(𝝎n)𝑑𝝎nenκ,n\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq e^{-n\kappa},\>\>n\to\infty
Proof.

Let in={θin:|θin|Cn}\mathcal{F}_{in}=\{\theta_{in}:|\theta_{in}|\leq C_{n}\} and 0n={ρ:|ρ|<logCn}\mathcal{F}_{0n}=\{\rho:|\rho|<\log C_{n}\}.

n=0ni=1K(n)innc=0nci=1K(n)inc\mathcal{F}_{n}=\mathcal{F}_{0n}\cap_{i=1}^{K(n)}\mathcal{F}_{in}\implies\mathcal{F}_{n}^{c}=\mathcal{F}_{0n}^{c}\cup\cup_{i=1}^{K(n)}\mathcal{F}_{in}^{c}
\displaystyle\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} \leq\int_{\mathcal{F}_{0n}^{c}}\frac{1}{\sqrt{2\pi\eta^{2}}}e^{-\frac{\rho^{2}}{2\eta^{2}}}d\rho+\sum_{i=1}^{K(n)}\int_{\mathcal{F}_{in}^{c}}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}d\theta_{in}\hskip 28.45274pt\text{by countable sub-additivity}
\displaystyle=2\int_{\log C_{n}}^{\infty}\frac{1}{\sqrt{2\pi\eta^{2}}}e^{-\frac{\rho^{2}}{2\eta^{2}}}d\rho+2\sum_{i=1}^{K(n)}\int_{C_{n}}^{\infty}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}d\theta_{in}
=2(1Φ(logCnη))+2K(n)(1Φ(Cnζ))\displaystyle=2\left(1-\Phi\left(\frac{\log C_{n}}{\eta}\right)\right)+2K(n)\left(1-\Phi\left(\frac{C_{n}}{\zeta}\right)\right)
\displaystyle\sim\frac{1}{\log C_{n}}e^{-\frac{(\log C_{n})^{2}}{2\eta^{2}}}+\frac{K(n)}{C_{n}}e^{-\frac{C_{n}^{2}}{2\zeta^{2}}}\leq e^{-n\kappa}\hskip 28.45274pt\text{by Mills' ratio}

since (logCn)2=n2(ba)>n(\log C_{n})^{2}=n^{2(b-a)}>n for a+1/2<b<1a+1/2<b<1 and Cn2=e2nba>nC_{n}^{2}=e^{2n^{b-a}}>n for 0<a<b<10<a<b<1. ∎
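The Mills-ratio step in the last display can be checked directly (\eta below is a hypothetical value): for N(0,\eta^{2}), 1-\Phi(x/\eta)\sim(\eta/x)\phi(x/\eta) as x\to\infty.

```python
# Mills-ratio check for the Gaussian tail (illustrative eta).
import numpy as np
from scipy import stats

eta = 1.3
for x in [5.0, 10.0, 20.0]:
    exact = stats.norm(0, eta).sf(x)                    # 1 - Phi(x / eta)
    mills = (eta ** 2 / x) * stats.norm(0, eta).pdf(x)  # (eta/x) * phi(x/eta)
    print(x, exact, mills, exact / mills)               # ratio tends to 1
```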

Proposition 7.27.

Suppose condition (C1) holds with 0<a<1/20<a<1/2 and p(𝛚n)p(\boldsymbol{\omega}_{n}) satisfies (32). Then,

log𝒱εcL(𝝎n)L0p(𝝎n)𝑑𝝎nlog2nε2+oP0n(1)\log\int_{\mathcal{V}_{\varepsilon}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq\log 2-n\varepsilon^{2}+o_{P_{0}^{n}}(1)
Proof.

Let n={𝝎n:|θin|Cn,|ρ|<logCn}\mathcal{F}_{n}=\{\boldsymbol{\omega}_{n}:|\theta_{in}|\leq C_{n},|\rho|<\log C_{n}\}. Let Cn=enbaC_{n}=e^{n^{b-a}} and K(n)naK(n)\sim n^{a} for 0<a<1/20<a<1/2.

By Lemma 7.25, we have

\frac{1}{\sqrt{n}}\int_{0}^{\varepsilon}\sqrt{H_{[]}(u,\widetilde{\mathcal{G}}_{n},||.||_{2})}du\leq\varepsilon^{2}

Therefore, by Lemma 7.13, we have

P0n(𝒱εcnL(𝝎n)L0p(𝝎n)𝑑𝝎nenε2)0P_{0}^{n}\left(\int_{\mathcal{V}_{\varepsilon}^{c}\cap\mathcal{F}_{n}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n\varepsilon^{2}}\right)\to 0

In view of Lemma 7.26, for p(𝝎n)p(\boldsymbol{\omega}_{n}) as in (32),

𝝎nncp(𝝎n)𝑑𝝎ne2nε2\int_{\boldsymbol{\omega}_{n}\in\mathcal{F}_{n}^{c}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\leq e^{-2n\varepsilon^{2}}

Therefore, by Lemma 7.14 with r=1r=1, κ=2ε2\kappa=2\varepsilon^{2} and κ~=ε2\tilde{\kappa}=\varepsilon^{2}, we have

P0n(ncL(𝝎n)L0p(𝝎n)𝑑𝝎nenε2)0P_{0}^{n}\left(\int_{\mathcal{F}_{n}^{c}}\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-n\varepsilon^{2}}\right)\to 0

The remaining part of the proof follows along the same lines as the proof of Proposition 7.17. ∎

Proposition 7.28.

Suppose condition (C1) holds with some 0<a<1. Let f_{\boldsymbol{\theta}_{n}} be a neural network satisfying assumptions (A1) and (A2) for some 0\leq\delta<1-a. With \boldsymbol{\omega}_{n}=(\boldsymbol{\theta}_{n},\rho), define,

N_{\kappa/n^{\delta}}=\left\{\boldsymbol{\omega}_{n}:d_{KL}(l_{0},l(\boldsymbol{\omega}_{n}))=\frac{1}{2}\log\frac{\sigma_{\rho}^{2}}{\sigma_{0}^{2}}-\frac{1}{2}\Big{(}1-\frac{\sigma_{0}^{2}}{\sigma_{\rho}^{2}}\Big{)}+\frac{1}{2\sigma_{\rho}^{2}}\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}<\frac{\kappa}{n^{\delta}}\right\} (80)

For every κ~>0\tilde{\kappa}>0, with p(𝛚n)p(\boldsymbol{\omega}_{n}) as in (32), we have

𝝎nNκ/nδp(𝝎n)𝑑𝝎neκ~n1δ,n.\int_{\boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-\tilde{\kappa}n^{1-\delta}},\>\>n\to\infty.
Proof.

This proof uses some ideas from the proof of Theorem 1 in Lee [2000].

By assumption (A1), let f_{\boldsymbol{\theta}_{0n}}(\boldsymbol{x})=\beta_{00}+\sum_{j=1}^{k_{n}}\beta_{j0}\psi(\gamma_{j0}^{\top}\boldsymbol{x}) satisfy

f𝜽0nf02κ8nδ||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}\leq\frac{\kappa}{8n^{\delta}} (81)

With \sigma_{0}=\log(1+e^{\rho_{0}}), define the neighborhood M_{\kappa} as follows:

Mκ={\displaystyle M_{\kappa}=\{ 𝝎n:|ρρ0|<κ/2nδσ0,|θinθi0n|<κ/(8nδmn)σ0,i=1,,K(n)}\displaystyle\boldsymbol{\omega}_{n}:|\rho-\rho_{0}|<\sqrt{\kappa/2n^{\delta}}\sigma_{0},|{\theta}_{in}-{\theta}_{i0n}|<\sqrt{\kappa/(8n^{\delta}m_{n})}\sigma_{0},i=1,\cdots,K(n)\}

where m_{n}=8K(n)^{2}+8(p+1)^{2}(\sum_{j=1}^{K(n)}|\theta_{j0n}|)^{2}. Note that m_{n}\geq 8k_{n}+8(p+1)^{2}(\sum_{j=1}^{k_{n}}|\beta_{j0}|)^{2}.

Thus, using Lemma 7.2 with \epsilon=\sqrt{\kappa/(8n^{\delta}m_{n})}\sigma_{0} and (36), we get

\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x}\leq 2||f_{\boldsymbol{\theta}_{n}}-f_{\boldsymbol{\theta}_{0n}}||_{2}^{2}+2||f_{\boldsymbol{\theta}_{0n}}-f_{0}||_{2}^{2}\leq\frac{\kappa\sigma_{0}^{2}}{2n^{\delta}} (82)

By Lemma 7.4,

12logσρ2σ0212(1σ02σρ2)\displaystyle\frac{1}{2}\log\frac{\sigma_{\rho}^{2}}{\sigma_{0}^{2}}-\frac{1}{2}\Big{(}1-\frac{\sigma_{0}^{2}}{\sigma_{\rho}^{2}}\Big{)} κ2nδ\displaystyle\leq\frac{\kappa}{2n^{\delta}}
12σρ212σ02(1κ/2nδ)2\displaystyle\frac{1}{2\sigma_{\rho}^{2}}\leq\frac{1}{2\sigma_{0}^{2}(1-\sqrt{\kappa/2n^{\delta}})^{2}} 1σ02\displaystyle\leq\frac{1}{\sigma_{0}^{2}} (83)

Using (82) and (83) in (80), we get \boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}} for every \boldsymbol{\omega}_{n}\in M_{\kappa}. Therefore,

𝝎nNκ/nδp(𝝎n)𝑑𝝎n𝝎nMκp(𝝎n)𝑑𝝎n\int_{\boldsymbol{\omega}_{n}\in N_{\kappa/n^{\delta}}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}

We next show that,

𝝎nMκp(𝝎n)𝑑𝝎n>eκ~n1δ\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}>e^{-\tilde{\kappa}n^{1-\delta}}

For notational simplicity, let \delta_{1n}=\sqrt{\kappa/2n^{\delta}}\sigma_{0} and \delta_{2n}=\sqrt{\kappa/(8n^{\delta}m_{n})}\sigma_{0}.

𝝎nMκp(𝝎n)𝑑𝝎n\displaystyle\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n} =ρ0δ1nρ0+δ1np(ρ)𝑑ρi=1K(n)θi0nδ2nθi0n+δ2np(θin)𝑑θin\displaystyle=\int_{\rho_{0}-\delta_{1n}}^{\rho_{0}+\delta_{1n}}p(\rho)d\rho\prod_{i=1}^{K(n)}\int_{\theta_{i0n}-\delta_{2n}}^{\theta_{i0n}+\delta_{2n}}p(\theta_{in})d\theta_{in}
ρ0δ1nρ0+δ1np(ρ)𝑑ρe(κ~/2)n1δ\displaystyle\geq\int_{\rho_{0}-\delta_{1n}}^{\rho_{0}+\delta_{1n}}p(\rho)d\rho e^{-(\tilde{\kappa}/2)n^{1-\delta}}

where the first to second step follows from part 1. of Proposition 7.18 since p(\boldsymbol{\theta}_{n}) satisfies (17). Next,

\displaystyle\int_{\rho_{0}-\delta_{1n}}^{\rho_{0}+\delta_{1n}}p(\rho)d\rho =\int_{\rho_{0}-\delta_{1n}}^{\rho_{0}+\delta_{1n}}\frac{1}{\sqrt{2\pi\eta^{2}}}e^{-\frac{\rho^{2}}{2\eta^{2}}}d\rho
=2δ1n12πη2et22η2,t[ρ0δ1n,ρ0+δ1n]by mean value theorem\displaystyle=2\delta_{1n}\frac{1}{\sqrt{2\pi\eta^{2}}}e^{-\frac{t^{2}}{2\eta^{2}}},t\in[\rho_{0}-\delta_{1n},\rho_{0}+\delta_{1n}]\hskip 28.45274pt\text{by mean value theorem}
2δ1n2πη2emax((ρ0ϵ)2,(ρ0+ϵ)2)2η2\displaystyle\geq\frac{2\delta_{1n}}{\sqrt{2\pi\eta^{2}}}e^{-\frac{\max((\rho_{0}-\epsilon)^{2},(\rho_{0}+\epsilon)^{2})}{2\eta^{2}}}
=exp((logδ1n+12logπ2+logη+max((ρ0ϵ)2,(ρ0+ϵ)2)2η2))\displaystyle=\exp\left(-\left(-\log\delta_{1n}+\frac{1}{2}\log\frac{\pi}{2}+\log\eta+\frac{\max((\rho_{0}-\epsilon)^{2},(\rho_{0}+\epsilon)^{2})}{2\eta^{2}}\right)\right) (84)

where the third inequality holds since, for any \epsilon>0, t\in[\rho_{0}-\epsilon,\rho_{0}+\epsilon] when \delta_{1n}\to 0. Now,

\displaystyle-\log\delta_{1n}+\frac{1}{2}\log\frac{\pi}{2}+\log\eta+\frac{\max((\rho_{0}-\epsilon)^{2},(\rho_{0}+\epsilon)^{2})}{2\eta^{2}}
\displaystyle=\frac{1}{2}\delta\log n+\frac{1}{2}\log 2+\frac{1}{2}\log\frac{\pi}{2}-\frac{1}{2}\log\kappa-\log\sigma_{0}+\log\eta+\frac{\max((\rho_{0}-\epsilon)^{2},(\rho_{0}+\epsilon)^{2})}{2\eta^{2}}\leq(\tilde{\kappa}/2)n^{1-\delta} (85)

Using (85) in (84), we get

𝝎nMκp(𝝎n)𝑑𝝎neκ~n1δ\int_{\boldsymbol{\omega}_{n}\in M_{\kappa}}p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}\geq e^{-\tilde{\kappa}n^{1-\delta}}

which completes the proof. ∎

Proposition 7.29.

Suppose condition (C1) and assumption (A1) hold for some 0<a<1/2 and 0\leq\delta<1-a. Suppose the prior p(\boldsymbol{\omega}_{n}) satisfies (32).

Then, there exists a q𝒬nq\in\mathcal{Q}_{n} with 𝒬n\mathcal{Q}_{n} as in (33), such that

dKL(q(.),π(.|𝒚n,𝑿n))=oP0n(n1δ)d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n}))=o_{P_{0}^{n}}(n^{1-\delta}) (86)
Proof.
dKL(q(.),π(.|𝒚n,𝑿n))\displaystyle d_{KL}(q(.),\pi(.|\boldsymbol{y}_{n},\boldsymbol{X}_{n})) =q(𝝎n)logq(𝝎n)𝑑𝝎nq(𝝎n)logπ(𝝎n|𝒚n,𝑿n)𝑑𝝎n\displaystyle=\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log\pi(\boldsymbol{\omega}_{n}|\boldsymbol{y}_{n},\boldsymbol{X}_{n})d\boldsymbol{\omega}_{n}
=q(𝝎n)logq(𝝎n)𝑑𝝎nq(𝝎n)logL(𝝎n)p(𝝎n)L(𝝎n)p(𝝎n)𝑑𝝎nd𝝎n\displaystyle=\int q(\boldsymbol{\omega}_{n})\log q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}-\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})}{\int L(\boldsymbol{\omega}_{n})p(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}}d\boldsymbol{\omega}_{n}
=dKL(q(.),p(.))q(𝝎n)logL(𝝎n)L0d𝝎n+logp(𝝎n)L(𝝎n)L0𝑑𝝎n\displaystyle=\underbrace{d_{KL}(q(.),p(.))}_{①}\underbrace{-\int q(\boldsymbol{\omega}_{n})\log\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}}_{②}+\underbrace{\log\int p(\boldsymbol{\omega}_{n})\frac{L(\boldsymbol{\omega}_{n})}{L_{0}}d\boldsymbol{\omega}_{n}}_{③}

We first deal with ① as follows.

p(\boldsymbol{\omega}_{n})=\underbrace{\frac{1}{\sqrt{2\pi\eta^{2}}}e^{-\frac{\rho^{2}}{2\eta^{2}}}}_{p(\rho)}\underbrace{\prod_{i=1}^{K(n)}\frac{1}{\sqrt{2\pi\zeta^{2}}}e^{-\frac{\theta_{in}^{2}}{2\zeta^{2}}}}_{p(\boldsymbol{\theta}_{n})}\>\>\>\>q(\boldsymbol{\omega}_{n})=\underbrace{\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n(\rho-\rho_{0})^{2}}{2\nu^{2}}}}_{q(\rho)}\underbrace{\prod_{i=1}^{K(n)}\sqrt{\frac{n}{2\pi\tau^{2}}}e^{-\frac{n(\theta_{in}-\theta_{i0n})^{2}}{2\tau^{2}}}}_{q(\boldsymbol{\theta}_{n})} (87)
dKL(q(.),p(.))\displaystyle d_{KL}(q(.),p(.)) =q(ρ)logq(ρ)𝑑ρq(ρ)logp(ρ)𝑑ρ+q(𝜽n)logq(𝜽n)𝑑𝜽nq(𝜽n)logp(𝜽n)𝑑𝜽n\displaystyle=\int q(\rho)\log q(\rho)d\rho-\int q(\rho)\log p(\rho)d\rho+\int q(\boldsymbol{\theta}_{n})\log q(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}-\int q(\boldsymbol{\theta}_{n})\log p(\boldsymbol{\theta}_{n})d\boldsymbol{\theta}_{n}
=q(ρ)logq(ρ)𝑑ρq(ρ)logp(ρ)𝑑ρ+o(n1δ)\displaystyle=\int q(\rho)\log q(\rho)d\rho-\int q(\rho)\log p(\rho)d\rho+o(n^{1-\delta}) (88)

where the last equality is a consequence of Proposition 7.19. Simplifying further, we get

\displaystyle\int q(\rho)\log q(\rho)d\rho-\int q(\rho)\log p(\rho)d\rho =\int\Big{(}\frac{1}{2}\log n-\frac{1}{2}\log 2\pi-\log\nu-\frac{n(\rho-\rho_{0})^{2}}{2\nu^{2}}\Big{)}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n(\rho-\rho_{0})^{2}}{2\nu^{2}}}d\rho
\displaystyle\quad-\int\Big{(}-\frac{1}{2}\log 2\pi-\log\eta-\frac{\rho^{2}}{2\eta^{2}}\Big{)}\sqrt{\frac{n}{2\pi\nu^{2}}}e^{-\frac{n(\rho-\rho_{0})^{2}}{2\nu^{2}}}d\rho
\displaystyle=\frac{1}{2}(\log n-\log 2\pi-2\log\nu-1)+\frac{1}{2}(\log 2\pi+2\log\eta)+\frac{\rho_{0}^{2}+\nu^{2}/n}{2\eta^{2}}
\displaystyle=o(n^{1-\delta})

For ②, note that

dKL(l0,l𝝎n)\displaystyle d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}}) =(12logσρ2σ0212σ02(yf0(𝒙))2+12σρ2(yf𝜽n(𝒙))2)12πσ02e(yf0(𝒙))22σ02𝑑y𝑑𝒙\displaystyle=\int\int\Big{(}\frac{1}{2}\log\frac{\sigma_{\rho}^{2}}{\sigma_{0}^{2}}-\frac{1}{2\sigma_{0}^{2}}(y-f_{0}(\boldsymbol{x}))^{2}+\frac{1}{2\sigma_{\rho}^{2}}(y-f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x}))^{2}\Big{)}\frac{1}{\sqrt{2\pi\sigma_{0}^{2}}}e^{-\frac{(y-f_{0}(\boldsymbol{x}))^{2}}{2\sigma_{0}^{2}}}dyd\boldsymbol{x}
=12logσρ2σ0212+σ022σρ2+12σρ2(f𝜽n(𝒙)f0(𝒙))2𝑑𝒙\displaystyle=\frac{1}{2}\log\frac{\sigma_{\rho}^{2}}{\sigma_{0}^{2}}-\frac{1}{2}+\frac{\sigma_{0}^{2}}{2\sigma_{\rho}^{2}}+\frac{1}{2\sigma_{\rho}^{2}}\int(f_{\boldsymbol{\theta}_{n}}(\boldsymbol{x})-f_{0}(\boldsymbol{x}))^{2}d\boldsymbol{x} (89)

By Lemmas 7.7 and 7.8 and part 1 of Lemma 7.9, we have

\int d_{KL}(l_{0},l_{\boldsymbol{\omega}_{n}})q(\boldsymbol{\omega}_{n})d\boldsymbol{\omega}_{n}=o(n^{-\delta})

Therefore, by Lemma 7.11, ②=o_{P_{0}^{n}}(n^{1-\delta}).

Using Proposition 7.28 in Lemma 7.10, we get ③=o_{P_{0}^{n}}(n^{1-\delta}), which completes the proof. ∎

References

  • Bishop [1997] C. M. Bishop, Bayesian Neural Networks, Journal of the Brazilian Computer Society 4 (1997).
  • Neal [1992] R. M. Neal, Bayesian training of backpropagation networks by the hybrid Monte Carlo method, 1992.
  • Lampinen and Vehtari [2001] J. Lampinen, A. Vehtari, Bayesian approach for neural networks–review and case studies, Neural Networks 14 (2001) 257–274.
  • Sun et al. [2017] S. Sun, C. Chen, L. Carin, Learning Structured Weight Uncertainty in Bayesian Neural Networks, in: A. Singh, J. Zhu (Eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, PMLR, Fort Lauderdale, FL, USA, 2017, pp. 1283–1292. URL: http://proceedings.mlr.press/v54/sun17b.html.
  • Mullachery et al. [2018] V. Mullachery, A. Khera, A. Husain, Bayesian neural networks, 2018. arXiv:1801.07710.
  • Hubin et al. [2018] A. Hubin, G. Storvik, F. Frommlet, Deep Bayesian regression models, 2018. arXiv:1806.02160.
  • Liang et al. [2018] F. Liang, Q. Li, L. Zhou, Bayesian neural networks for selection of drug sensitive genes, Journal of the American Statistical Association 113 (2018) 955–972.
  • Javid et al. [2020] K. Javid, W. Handley, M. P. Hobson, A. Lasenby, Compromise-free Bayesian neural networks, ArXiv abs/2004.12211 (2020).
  • Lee [2000] H. Lee, Consistency of posterior distributions for neural networks, Neural Networks 13 (2000) 629–642.
  • Barron et al. [1999] A. Barron, M. J. Schervish, L. Wasserman, The consistency of posterior distributions in nonparametric problems, Ann. Statist. 27 (1999) 536–561.
  • Neal [1996] R. M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag, New York, 1996. URL: https://books.google.com/books?id=OCenCW9qmp4C.
  • Lee [2004] H. K. H. Lee, Bayesian Nonparametrics via Neural Networks, Springer-Verlag, ASA-SIAM Series, 2004. URL: https://books.google.com/books?id=OCenCW9qmp4C.
  • Ghosh et al. [2004] M. Ghosh, T. Maiti, D. Kim, S. Chakraborty, A. Tewari, Hierarchical Bayesian neural networks, Journal of the American Statistical Association 99 (2004) 601–608.
  • Blei et al. [2017] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (2017) 859–877.
  • Logsdon et al. [2009] B. A. Logsdon, G. E. Hoffman, J. G. Mezey, A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis, BMC Bioinformatics 11 (2009) 58.
  • Graves [2011] A. Graves, Practical variational inference for neural networks, in: J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, Curran Associates, Inc., 2011, pp. 2348–2356. URL: http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.
  • Carbonetto and Stephens [2012] P. Carbonetto, M. Stephens, Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies, Bayesian Anal. 7 (2012) 73–108.
  • Blundell et al. [2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, 2015. arXiv:1505.05424.
  • Sun et al. [2019] S. Sun, G. Zhang, J. Shi, R. Grosse, Functional variational Bayesian neural networks, 2019. arXiv:1903.05779.
  • Wang and Blei [2019] Y. Wang, D. M. Blei, Frequentist consistency of variational Bayes, Journal of the American Statistical Association 114 (2019) 1147–1161.
  • Pati et al. [2017] D. Pati, A. Bhattacharya, Y. Yang, On statistical optimality of variational Bayes, 2017. arXiv:1712.08983.
  • Yang et al. [2017] Y. Yang, D. Pati, A. Bhattacharya, α-variational inference with statistical guarantees, 2017. arXiv:1710.03266.
  • Zhang and Gao [2017] F. Zhang, C. Gao, Convergence rates of variational posterior distributions, 2017. arXiv:1712.02519.
  • Hornik et al. [1989] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
  • Siegel and Xu [2019] J. W. Siegel, J. Xu, Approximation rates for neural networks with general activation functions, 2019. arXiv:1904.02311.
  • Shen [1997] X. Shen, On methods of sieves and penalization, Ann. Statist. 25 (1997) 2555–2591.
  • Shen et al. [2019] X. Shen, C. Jiang, L. Sakhanenko, Q. Lu, Asymptotic properties of neural network sieve estimators, 2019. arXiv:1906.00875.
  • White [1990] H. White, Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings, Neural Networks 3 (1990) 535–549.
  • Scheffe [1947] H. Scheffe, A useful convergence theorem for probability distributions, Ann. Math. Statist. 18 (1947) 434–438.
  • Elezovic and Giordano [2000] N. Elezovic, C. Giordano, The best bounds in Gautschi's inequality, Mathematical Inequalities and Applications 3 (2000).
  • Wong and Shen [1995] W. H. Wong, X. Shen, Probability inequalities for likelihood ratios and convergence rates of sieve MLEs, Ann. Statist. 23 (1995) 339–362.
  • Pollard [1990] D. Pollard, Empirical Processes: Theory and Applications, Conference Board of the Mathematical Sciences: NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathematical Statistics, 1990. URL: https://books.google.com/books?id=Prcsi29EU50C.
  • van der Vaart et al. [1996] A. van der Vaart, J. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics, Springer, 1996. URL: https://books.google.com/books?id=OCenCW9qmp4C.