Statistical Foundation of Variational Bayes Neural Networks
Abstract
Despite the popularity of Bayesian neural networks (BNNs) in recent years, their use is somewhat limited in complex and big data situations due to the computational cost associated with full posterior evaluations. Variational Bayes (VB) provides a useful alternative to circumvent the computational cost and time complexity associated with the generation of samples from the true posterior using Markov Chain Monte Carlo (MCMC) techniques. The efficacy of VB methods is well established in the machine learning literature. However, their potential broader impact is hindered by a lack of theoretical validity from a statistical perspective. In this paper, we establish the fundamental result of posterior consistency for the mean-field variational posterior (VP) for a feed-forward artificial neural network model. The paper underlines the conditions needed to guarantee that the VP concentrates around Hellinger neighborhoods of the true density function. Additionally, the role of the scale parameter and its influence on the convergence rates is discussed. The paper mainly relies on two results: (1) the rate at which the true posterior grows and (2) the rate at which the KL-distance between the posterior and variational posterior grows. The theory provides guidelines for building prior distributions for Bayesian NN models along with an assessment of the accuracy of the corresponding VB implementation.
keywords:
Neural networks, Variational posterior, Mean-field family, Hellinger neighborhood, Kullback-Leibler divergence, Sieve theory, Prior mass, Variational Bayes.

1 Introduction
Bayesian neural networks (BNNs) have been comprehensively studied in the works of Bishop [1997], Neal [1992], Lampinen and Vehtari [2001], etc. More recent developments which establish the efficacy of BNNs can be found in the works of Sun et al. [2017], Mullachery et al. [2018], Hubin et al. [2018], Liang et al. [2018], Javid et al. [2020] and the references therein. The theoretical foundation of BNNs by Lee [2000] widened the scope to a broader community. However, in the age of big data applications, the conventional Bayesian approach is computationally inefficient. Thus, alternative computational approaches such as variational Bayes (VB) have become popular among machine learning and applied researchers. Although there have been many works on algorithm development for VB in recent years, the theoretical advancement on estimation accuracy is rather limited. This article provides statistical validity of neural network models with variational inference, along with some theory-driven practical guidelines for implementation.
In this article, we mainly focus on feed-forward neural networks with a single hidden layer and a logistic activation function. Let the number of inputs be denoted by and the number of hidden nodes by , where the number of nodes is allowed to increase as a function of . The true regression function, is modeled as a neural network of the form
(1) |
where is the logistic activation function. With a Gaussian prior on each of the parameters, Lee [2000] establishes the posterior consistency of neural networks under the simple setup where the scale parameter is fixed at 1. The results in Lee [2000] mainly exploit Barron et al. [1999], a fundamental contribution that laid down the framework for posterior consistency in nonparametric regression settings. In this paper, we closely mimic the regression model of Lee [2000] by assuming where is the true regression function and follows .
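To make the data-generating model concrete, the following minimal sketch simulates responses from a single-hidden-layer network with logistic activation plus Gaussian noise, in the spirit of (1); all symbol names (k_n, beta, gamma, sigma, etc.) are illustrative choices and not the paper's notation.

```python
# Illustrative sketch only; the parameter names below are assumptions, not the paper's notation.
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    # logistic (sigmoid) activation used in the hidden layer
    return 1.0 / (1.0 + np.exp(-z))

def nn_regression(x, beta0, beta, gamma0, gamma):
    """One-hidden-layer network: beta0 + sum_j beta_j * logistic(gamma0_j + gamma_j' x)."""
    hidden = logistic(gamma0 + x @ gamma.T)        # shape (n, k_n)
    return beta0 + hidden @ beta

# a toy "true" network playing the role of the regression function f0 in y = f0(x) + eps
p, k_n, n, sigma = 3, 5, 200, 0.5
beta0, beta = 0.2, rng.normal(size=k_n)
gamma0, gamma = rng.normal(size=k_n), rng.normal(size=(k_n, p))

X = rng.uniform(size=(n, p))                       # features (uniform, as assumed later in Section 2)
y = nn_regression(X, beta0, beta, gamma0, gamma) + sigma * rng.normal(size=n)
```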
The joint posterior distribution of a neural network model is generally evaluated by popular Markov Chain Monte Carlo (MCMC) sampling techniques, like Gibbs sampling, Metropolis-Hastings, etc. (see Neal [1996], Lee [2004], and Ghosh et al. [2004] for more details). Despite the versatility and popularity of the MCMC based approach, Bayesian estimation suffers from computational cost, scalability and time constraints, along with other implementation issues such as the choice of proposal densities and the generation of sample paths. Variational Bayes emerged as an important alternative to overcome the drawbacks of the MCMC implementation (see Blei et al. [2017]). Many recent works have discussed the application of variational inference to Bayesian neural networks, e.g., Logsdon et al. [2009], Graves [2011], Carbonetto and Stephens [2012], Blundell et al. [2015], Sun et al. [2019]. Although there is a plethora of literature implementing variational inference for neural networks, the theoretical properties of the variational posterior in BNNs remain relatively unexplored, and this limits the use of this powerful computational tool beyond the machine learning community.
Some of the previous works that focused on theoretical properties of the variational posterior include the frequentist consistency of variational inference in parametric models in the presence of latent variables (see Wang and Blei [2019]). Optimal risk bounds for mean-field variational Bayes for Gaussian mixture (GM) and latent Dirichlet allocation (LDA) models have been discussed in Pati et al. [2017]. The work of Yang et al. [2017] provides Bayes risk bounds for variational inference in GM and LDA models. A more recent work, Zhang and Gao [2017], discusses variational posterior consistency rates in Gaussian sequence models, infinite exponential families and piece-wise constant models. In order to evaluate the validity of a posterior in non-parametric models, one must establish its consistency and rates of contraction. To the best of our knowledge, the problem of posterior consistency has not been studied in the context of variational Bayes neural network models.
Our contribution: Our theoretical development of posterior consistency, an essential property in nonparametric Bayesian statistics, provides confidence in using variational Bayes neural network models across disciplines. Our theoretical results help to assess the estimation accuracy for a given training sample and model complexity. Specifically, we establish the conditions needed for the variational posterior consistency of feed-forward neural networks. We establish that a simple Gaussian mean-field approximation is good enough to achieve consistency for the variational posterior. In this direction, we show that the -Hellinger neighborhood of the true density function receives probability close to 1 under the variational posterior. For the true posterior density (Lee [2000]), the posterior probability of an -Hellinger neighborhood grows at the rate . In contrast, we show that for the variational posterior this rate becomes . The reason for this difference is two-fold: (1) the KL-distance between the variational posterior and the true posterior does not grow at a rate greater than for some , and (2) the posterior probability of the -Hellinger neighborhood grows at the rate ; thus, the variational posterior probability must grow at the rate , otherwise the rate of growth of the KL-distance cannot be controlled. We also give the conditions on the approximating neural network and the rate of growth in the number of nodes needed to ensure that the variational posterior achieves consistency. As a last contribution, we show that the VB estimator of the regression function converges to the true regression function.
Further, our investigation shows that although the variational posterior (VP) is asymptotically consistent, the posterior probability of Hellinger neighborhoods does not converge to 1 as fast as under the true posterior. In addition, one requires that the absolute values of the parameters in the approximating neural network function grow at a controlled rate (less than for some ), a condition not needed in an MCMC based implementation. When the absolute values of the parameters grow as a polynomial function of (), one can choose a flatter prior (a prior whose variance increases with ) in order to guarantee VP consistency.
VP consistency has been established irrespective of whether is known or unknown, and the differences in practice have been discussed. It has been shown that one must guard against using Gaussian distributions as a variational family for . Since the KL-distance between the variational posterior and the true posterior must be controlled, one must ensure that quantities like and are well defined under the variational distribution of . We thereby discuss two variational families on : (1) an inverse-gamma distribution, and (2) a normal distribution on the log-transformed . While the second approach may seem intuitively appealing if one were to use fully Gaussian variational families, it comes with a drawback. Indeed, under the reparametrized , the variational posterior is consistent if the rate of growth in the number of nodes is slower than under the original parametrization. However, a slower growth in the number of nodes makes it more and more difficult to find an approximating neural network which converges fast enough to the true function.
The outline of the paper is as follows. In Section 2, we present the notation and the terminology of consistency for the variational posterior. In Section 3, we present the consistency results when the scale parameter is known. In Section 4, we present the consistency results for an unknown scale parameter under two sets of variational families. In Section 5, we show that the Bayes estimates obtained from the variational posterior converge to the true regression function and scale parameter. Finally, Section 6 presents a discussion and conclusions from our current work.
2 Model and Assumptions
Suppose the true regression model has the form:
where are i.i.d. random variables and the feature vector with . For the purposes of this paper, we assume that the number of covariates is fixed.
Thus, the true conditional density of is
(2) |
which implies the true likelihood function is
(3) |
Universal approximation: By Hornik et al. [1989], for every function such that , there exists a neural network such that . This led to the ubiquitous use of neural networks as a modeling approximation to a wide class of regression functions.
In this paper, we assume that the true regression function can be approximated by a neural network
(4) |
where , the number of nodes, increases as a function of , while , the number of covariates, is fixed. Thus, the total number of parameters grows at the same rate as the number of nodes, i.e. .
Suppose there exists a neural network such that
(5) |
Note that if is a neural network function itself, then (A1) holds trivially for all irrespective of the choice of . Theorem 2 of Siegel and Xu [2019] showed that with , can be chosen between . Mimicking the steps of Theorem 2, Siegel and Xu [2019], it can be shown that with , can be chosen anywhere in the range . For a given choice of , whether (A1) holds or not depends on the entropy of the true function. Assumptions of similar form can also be found in Shen [1997] (see conditions C and ) and Shen et al. [2019] (see condition C3).
Note that condition (A1) characterizes the rate at which a neural network function approaches the true function. The next set of conditions characterizes the rate at which the coefficients of the approximating neural network solution grow. Suppose one of the following two conditions holds:
(6) | ||||
(7) |
Note that condition (A2) ensures that the sum of squares of the coefficients grows at a rate slower than . White [1990] proved consistency properties of feed-forward neural networks with , which implies , i.e. . Blei et al. [2017] studied the consistency properties for parametric models, wherein one requires that be bounded (see Relations (44) and (53) in Blei et al. [2017]). With a normal prior of the form , the same condition reduces to being bounded at a suitable rate. Indeed, condition (A2) guarantees that the rate of growth of the KL-distance between the true and the variational posterior is well controlled.
Condition (A3) is a relaxed version of (A2), where the sum of squares of the coefficients is allowed to grow at a rate polynomial in . A standard prior independent of might fail to guarantee convergence. We thereby assume a flatter prior whose variance increases with in order to allow for consistency through variational Bayes. Note that if is a neural network function itself, conditions (A2) and (A3) hold trivially.
Kullback-Leibler divergence: Let and be two probability distributions, with density and respectively, then
Hellinger distance: Let and be two probability distributions with density and respectively, then
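For reference, in generic notation (writing P_1, P_2 for the two distributions and p_1, p_2 for their densities, rather than the paper's own symbols), these two quantities take the standard forms below; note that some authors include an extra factor of 1/2 in the Hellinger distance.

```latex
d_{\mathrm{KL}}(P_1, P_2) = \int p_1(z)\,\log\frac{p_1(z)}{p_2(z)}\,dz,
\qquad
d_{H}(P_1, P_2) = \left( \int \left( \sqrt{p_1(z)} - \sqrt{p_2(z)} \right)^2 dz \right)^{1/2}.
```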
Distribution of the feature vector: In order to establish posterior consistency, we assume that the feature vector . Although this is not a requirement for the model, it simplifies the steps of the proof since the joint density function of (Y,X) simplifies as
(8) |
Thus, it suffices to deal with the conditional density of .
3 Consistency of variational posterior with known
In this section, we begin with the simple model where the scale parameter is known. For the simple Gaussian mean-field family in (13), we establish that the variational posterior is consistent as long as assumption (A1) holds together with either (A2) or (A3). We also discuss how the rates contrast with those in Lee [2000], which established the consistency of the true posterior.
Likelihood:
(11) |
Posterior: Let denote the prior on . Then, the posterior is given by
(12) |
Variational Family: Variational family for is given by
(13) |
Let the variational posterior be denoted by
(14) |
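As a purely illustrative sketch of how the mean-field family in (13) is used in practice, the snippet below evaluates a Monte Carlo estimate of the evidence lower bound (ELBO), whose maximizer over the variational means and standard deviations defines the variational posterior in (14). It assumes a standard normal prior on each parameter (as in Theorem 3.1), a known noise scale, and a flattened parameter vector; none of these choices reproduce the paper's exact notation.

```python
# Minimal mean-field VB sketch (not the paper's implementation): Gaussian q over a
# flattened parameter vector theta = (beta0, beta, gamma0, gamma), N(0, I) prior, known sigma.
import numpy as np

rng = np.random.default_rng(1)

def unpack(theta, p, k):
    """Split a flat parameter vector into the pieces of the one-hidden-layer network."""
    beta0, beta = theta[0], theta[1:1 + k]
    gamma0 = theta[1 + k:1 + 2 * k]
    gamma = theta[1 + 2 * k:].reshape(k, p)
    return beta0, beta, gamma0, gamma

def log_likelihood(theta, X, y, sigma, p, k):
    beta0, beta, gamma0, gamma = unpack(theta, p, k)
    f = beta0 + (1.0 / (1.0 + np.exp(-(gamma0 + X @ gamma.T)))) @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((y - f) ** 2) / (2 * sigma**2)

def elbo(m, log_s, X, y, sigma, p, k, n_mc=50):
    """Monte Carlo ELBO for q(theta) = prod_i N(m_i, s_i^2) against a N(0, I) prior."""
    s = np.exp(log_s)
    # closed-form KL( N(m, diag(s^2)) || N(0, I) ), summed over coordinates
    kl = 0.5 * np.sum(s**2 + m**2 - 1.0 - 2.0 * log_s)
    # Monte Carlo estimate of E_q[ log p(y | X, theta) ] via the reparametrization theta = m + s * eps
    eps = rng.normal(size=(n_mc, len(m)))
    exp_loglik = np.mean([log_likelihood(m + s * e, X, y, sigma, p, k) for e in eps])
    return exp_loglik - kl
```

In practice, the variational parameters (m, log_s) would be optimized by stochastic gradient ascent on this objective, which is equivalent, up to an additive constant, to minimizing the KL divergence from the variational family to the true posterior.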
Hellinger neighborhood: Define the neighborhood of the true density as
(15) |
where the Hellinger distance given by
Note that the above simplified form of the Hellinger distance is due to (8).
In the following two theorems, for two classes of priors, we establish the posterior consistency of , i.e. the variational posterior concentrates in small Hellinger neighborhoods of the true density . Note that assumptions (A2) and (A3) impose a restriction on the rate of growth of the sum of squares of the coefficients of the approximating neural network solution. With (A2), we show that a standard normal prior on all the parameters works. However, under the weaker assumption (A3), a normal prior whose variance increases with is needed. Additionally, we show that for the variational posterior to achieve consistency, the number of parameters, or equivalently the number of nodes, needs to grow in a controlled fashion.
Theorem 3.1.
Suppose the number of nodes satisfy
(16) |
In addition, suppose assumptions (A1) and (A2) hold for some .
Then, with normal prior for each entry in as follows
(17) |
we have
Note that conditions (16) and (17) agree with those assumed in Theorem 1 of Lee [2000]. Since , the variational posterior is consistent with as small as 0. Indeed, imposes the least restriction on the convergence rate and coefficient growth rate of the true function (see assumptions (A1) and (A2)). As grows, restrictions on the approximating neural network function increase, but this guarantees faster convergence of the variational posterior. Expanding upon the Bayesian posterior consistency established in Lee [2000], one can show that for any (see Relation (88) in Lee [2000]). Thus, the probability of the Hellinger neighborhood grows at the rate for the variational posterior, in contrast to for the true posterior. For parametric models, the rate of growth of the variational posterior was found to be (see the second equation on page 38 of Blei et al. [2017]). Note that the consistency of the true posterior requires no assumptions on the approximating neural network function, whereas for the variational posterior, both assumptions (A1) and (A2) must be satisfied to guarantee convergence.
Theorem 3.2.
Suppose the number of nodes satisfy condition (C1). In addition, suppose assumptions (A1) and (A3) hold for some and .
Then, with normal prior for each entry in as follows
(18) |
we have
Observe that the consistency rate in Theorem 3.2 agrees with the one in Theorem 3.1. In order to prove both Theorems 3.1 and 3.2, a crucial step is to show that . To this end, we show that for some . Indeed, this choice of varies in order to adjust for the changing nature of the prior from (17) to (18) (see statements (1) and (2) in Lemma 7.9).
We next present the proof of Theorems 3.1 and 3.2. The first crucial step of the proof is to establish that the is bounded below by a quantity which is determined by the rate of consistency of the true posterior (see the quantities and in the proof below). The second crucial step is to show that is bounded above at a rate which can be greater than the rate of its lower bound if and only if the variational posterior is consistent.
In the above proof, we have assumed , . If , there is nothing to prove. If , then following the steps of the proof, we would get , which is a contradiction.
The main step in the above proof is (25), which we discuss next. The quantity is indeed decomposed into two parts
Whereas the first term is controlled using the Hellinger bracketing entropy of , the second term is controlled by the fact that the prior gives negligible probability outside . Thus, the main factor influencing is a suitable choice of the sequence of spaces . Indeed, our choice of is the same as that in Lee [2000] with and . Such a choice allows one to control the Hellinger bracketing entropy of while controlling the prior mass for at the same time.
The second quantity is controlled by the rate at which the prior gives mass to shrinking KL neighborhoods of the true density . Indeed, the quantity appears again when computing bounds on for some (see in Proposition 7.19). If , can be controlled even without assumptions (A1) and (A2). However, if , assumptions (A1) and (A2) are needed in order to guarantee that grows at a rate less than .
4 Consistency of variational posterior with unknown
In this section, we assume that the scale parameter is unknown. In this case, our approximating variational family is slightly different from (14). Whereas we still assume a mean-field Gaussian family on , our approximating family for cannot be Gaussian. An important criterion to guarantee the consistency of the variational posterior is to ensure that is well bounded (see Lemma 7.11). When is unknown, involves terms like and , both of whose integrals are undefined under a normally distributed . We thereby adopt two versions of for : firstly, an inverse gamma distribution on , and secondly, a normal distribution on the log-transformed (see Sections 4.1 and 4.2 respectively). Both transforms have their respective advantages in terms of determining the rate of consistency of the variational posterior. In this section, we work only with assumption (A2); assumption (A3) can be handled in a way exactly similar to Section 3.
4.1 Inverse-gamma prior on
Sieve Theory: Let where and are defined in (4), then
(26) |
The sieve is defined as follows.
(27) |
The definitions for likelihood, posterior and Hellinger neighborhood agree with those given in (11), (12) and (15) as in Section 3.
Prior distribution: We propose a normal prior on each and an inverse gamma prior on .
(28) |
Variational Family: Variational family for is given by
(29) |
The variational posterior has the same definition as in (14).
The following theorem shows that when the parameter is unknown, the variational posterior is still consistent; however, the rate decreases by an amount of .
Theorem 4.1.
Suppose the number of nodes satisfy condition (C1). In addition, suppose assumptions (A1) and (A2) hold for some . Then for any .
Note that by Theorem 3.1, the posterior is consistent iff , which is indeed the case as long as . Whether such a exists or not depends on the entropy of the function (see the discussion section in Shen et al. [2019]). Mimicking the steps of Theorem 2 of Siegel and Xu [2019], it can be shown that with , , can be chosen anywhere in the range .
Similar to the proof of Theorem 3.1, the quantity is indeed decomposed into two parts
Whereas the first term is controlled using the Hellinger bracketing entropy of at the rate , the second term is controlled by the prior probability of at , . Since the prior probability of is now controlled at a slightly smaller rate than that in Theorem 3.1, an additional term appears in the overall consistency rate of the variational posterior.
Remark 4.2.
With and as in (27), we choose and to prove the posterior consistency statement of Theorem 4.1. By suitably choosing as a function of , one may be able to refine the proof to obtain a rate of instead of . However, the proof becomes more involved, and such a dependent choice of has been avoided for the purposes of this paper.
Remark 4.3.
When is unknown, in order to control at a rate less than , has the same form as in the proof of Theorem 3.1. However, we cannot choose a normally distributed for . The convergence of is determined by the term , which involves terms like and (see (7.3)). The expectation of these terms is not defined under a normal distribution but is well defined under an inverse gamma distribution; hence the inverse-gamma variational family for .
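A quick numerical sanity check of this point (with illustrative shape and scale values a and b that are not the paper's notation): under an inverse-gamma variational factor, both expectations are finite and available in closed form, so the relevant KL terms remain well defined.

```python
# Sanity check with illustrative parameters: for sigma^2 ~ InvGamma(a, scale=b),
# E[1/sigma^2] = a/b and E[log sigma^2] = log b - digamma(a), both finite.
import numpy as np
from scipy import stats
from scipy.special import digamma

a, b = 3.0, 2.0
draws = stats.invgamma(a, scale=b).rvs(size=200_000, random_state=0)

print(np.mean(1.0 / draws), a / b)                      # Monte Carlo vs closed form
print(np.mean(np.log(draws)), np.log(b) - digamma(a))   # Monte Carlo vs closed form
```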
4.2 Normal prior on log transformed
Given the wide popularity of the Gaussian mean-field approximation, we next use a normal variational distribution on the log-transformed and compare and contrast it with the case where an inverse-gamma variational distribution is used on the scale parameter. In Section 3.3 of Blei et al. [2017], it has been posited that a Gaussian VB posterior can be used to approximate a wide class of posteriors. However, as mentioned in Section 4.1, a normal would cause to be undefined. One way out of this impasse is to reparametrize as and use a normal prior for . In the following section, we show that this approach may work but comes with the disadvantage that the number of nodes needs to grow at a rate smaller than . The main drawback of this approach is that if the number of nodes does not grow sufficiently fast, it may be difficult to find a neural network which approximates the true function well.
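For instance, in generic notation (not necessarily the paper's), if the scale is reparametrized as \eta = \log\sigma^2 and assigned a Gaussian variational factor q(\eta) = N(\mu, s^2), then the log-normal moment formula makes the previously problematic expectations finite:

```latex
\mathbb{E}_q[\log\sigma^2] = \mu,
\qquad
\mathbb{E}_q[\sigma^{-2}] = \mathbb{E}_q\!\left[e^{-\eta}\right] = \exp\!\left(-\mu + \tfrac{s^2}{2}\right).
```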
Sieve Theory: Let where and are same as defined in (4). With , we have
(30) |
The sieve is defined as follows.
(31) |
The definitions for likelihood, posterior and Hellinger neighborhood agree with those given in (11), (12) and (15) as in Section 3.
Prior distribution: We propose a normal prior on each and as follows
(32) |
Variational Family: Variational family for is given by
(33) |
The variational posterior has the same definition as in (14).
In the following theorem we show that even with reparametrized as the variational posterior is consistent.
Theorem 4.4.
Suppose the number of nodes satisfy condition (C1) with . In addition, suppose assumptions (A1) and (A2) hold for . Then,
Proof.
Remark 4.5.
With and as in (31), we choose where . In order to ensure that the prior gives small mass outside , one requires for some . With a normal prior on and , which is less than provided or . Hence the requirement of a slow growth in the number of nodes.
5 Consistency of variational Bayes
In this section, we show that if the variational posterior is consistent, then the variational Bayes estimators of and converge to the true and . The proof uses ideas from Barron et al. [1999] and Corollary 1 in Lee [2000]. Let
(34) |
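Purely as an illustration of how such a variational Bayes estimate can be computed in practice (the exact estimator in (34) is not reproduced here), a posterior-mean-type prediction averages network outputs over draws from the fitted mean-field distribution with means m and standard deviations s; the flattened parameter layout below is a hypothetical choice.

```python
# Hypothetical Monte Carlo plug-in: average the network output over parameter draws from
# a fitted mean-field q; this is an illustration, not the estimator defined in (34).
import numpy as np

def vb_predict(X_new, m, s, p, k, n_mc=200, seed=2):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_mc):
        theta = m + s * rng.normal(size=len(m))            # draw a parameter vector from q
        beta0, beta = theta[0], theta[1:1 + k]
        gamma0 = theta[1 + k:1 + 2 * k]
        gamma = theta[1 + 2 * k:].reshape(k, p)
        hidden = 1.0 / (1.0 + np.exp(-(gamma0 + X_new @ gamma.T)))
        preds.append(beta0 + hidden @ beta)
    return np.mean(preds, axis=0)                          # variational posterior-mean prediction
```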
Corollary 5.1 (Variational Bayes consistency).
Suppose and are defined as in (5), then
(35) |
Proof.
Let
Taking , we get . Now,
Now, let us consider the form of
Since , .
Note that and , thus .
Since , thus
We shall next show
We shall instead show that for any sequence , there exists a further subsequence such that
Since , there exists a sub-sequence s.t.
This implies
(for details see proof of Corollary 1 in Lee [2000]).
Thus, using Scheffe’s theorem in Scheffe [1947], we have
which implies
Since , applying Slutsky, we get
∎
6 Discussion
In this paper, we have highlighted the conditions which guarantee that the variational posterior of feed-forward neural networks is consistent. A variational family as simple as a Gaussian mean-field family is good enough to ensure that the variational posterior is consistent, provided the entropy of the true function is well behaved. In other words, has an approximating neural network solution which approximates at a fast enough rate, while ensuring that the number of nodes and the norm of the NN parameters grow in a controlled manner. Conditions of this form are often needed when one tries to establish the consistency of neural networks in a frequentist setup (see condition C3 in Shen et al. [2019]). Whereas the variational posterior presents a scalable alternative to MCMC, unlike MCMC its consistency cannot be guaranteed without certain restrictions on the entropy of the true function. Two other main contributions of the paper are that (1) a Gaussian family may not always be the best choice for a variational family (see Section 4), and (2) one may need a prior with variance growing in when the rate of growth in the norm of the approximating NN is high (see Theorem 3.2).
Although we have quantified the consistency of the variational posterior, the rate of contraction of the variational posterior still needs to be explored. We suspect that this rate would be closely related to the rate of contraction of the true posterior under mild assumptions on the entropy of the function . By following the ideas of the proofs in this paper, one may be able to quantify conditions on the entropy of when one uses a deep neural network instead of a one-hidden-layer neural network in order to guarantee the consistency of the variational posterior. Similarly, the effect of hierarchical priors and hyperparameters on the rate of convergence of the variational posterior needs to be explored.
7 Appendix
7.1 General Lemmas
Lemma 7.1.
Let and be any two density functions. Then
Proof.
Proof is same as proof of Lemma 4 in Lee [2000]. ∎
Lemma 7.2.
Let be a fixed neural network satisfying
Then,
Proof.
This proof uses some ideas from Lemma 6 in Lee [2000]. Note that
Therefore,
Let , , then
Since for small , it follows that .
Using
(36) |
the proof follows. ∎
Lemma 7.3.
With
-
1.
-
2.
Proof.
Let , then
-
1.
where . The function satisfies
since for every .
-
2.
∎
Lemma 7.4.
With and , .
-
1.
-
2.
Proof.
implies
Similarly,
Thus, . The remaining part of the proof follows on the same lines as Lemma 7.3. ∎
Lemma 7.5.
With and , for every , we have
Proof.
where the last step holds because (see Lemma 4 in Elezovic and Giordano [2000]). ∎
Lemma 7.6.
With and , for every ,
Proof.
∎
Lemma 7.7.
With and , let and . Then, for every , we have
Proof.
First note that , thus it suffices to show . In this direction,
We can apply Taylor expansion to as
where the equality follows since and is symmetric around .
It is easy to check , which implies
Thus, for every , .
For the remaining part of the proof, we shall make use of the Mills ratio approximation as follows.
(37) |
where and are the cdf and pdf of standard normal distribution respectively.
For ,
Let , then .
If , then and can be dropped. If , then
(38) |
For ④, we make use of the following result
If , . For , . | (39) |
If , then for sufficiently large.
Using (39) and getting rid of negative terms, we get
If , then for sufficiently large.
If , then and for sufficiently large, thus
For , we shall make use of the following result:
(40) |
If , then , for sufficiently large. Thus, using (7.1), we get
If , then , for sufficiently large. Thus, using (7.1), we get
If , then , for sufficiently large. Thus, using (7.1), we get
∎
Lemma 7.8.
With and , let and . Then, for every , we have
Proof.
We can apply Taylor expansion to ,
where the equality follows since and is symmetric around .
Since and , it suffices to show .
In this direction,
Since , to prove it suffices to show .
Note that is the same as of Lemma 7.7, except for a constant. Thus, , which completes the proof.
∎
Lemma 7.9.
Suppose condition (C1) and assumption (A1) holds for some and . Let
we have
(41) |
provided
-
1.
Assumption (A2) holds with same as (A1) and
-
2.
Assumption (A3) holds and
Proof.
Note that since , to prove (41), it suffices to show .
We begin by proving statement 1. of the lemma. Let , then
For , we do a Taylor expansion of around as
where the last equality follows since by assumption (A1).
With , let and
(42) |
since is an odd function. Also,
where second equality to third equality is a consequence of (7.1). Thus,
We next try to bound the quantities . First note that
Let . Then,
where for , , we have
Using the fact that and we get
where the last equality is a consequence of assumptions (A1), (A2) and condition (C1).
For , note that
First, note that since . Thus,
where the last equality follows since using .
First note that where . Therefore,
(43) |
where the last asymptotic equality is a consequence of (37) and condition (C1).
For , note that for some . Therefore,
for any .
For , note that by assumption (A2). Using this together with (7.1), we get
For , using the Cauchy-Schwarz inequality, we get
where the fact is shown below. Now, let
(44) |
where includes all coordinates of except and is the union of all except .
(45) |
Using (7.1), we get
Using (7.1) and (7.1) in (7.1), we get
(46) |
The only difference with statement 2. is that and .
The proof is similar and details have been omitted.
∎
Lemma 7.10.
Suppose and satisfies
(47) |
for every and for some . Then,
(48) |
provided .
Proof.
Lemma 7.11.
Suppose satisfies
then
In this direction, note that
where the last result follows from Markov’s Inequality
Using Lemma 7.1, we get
since .
Lemma 7.12.
Let then
Proof.
Lemma 7.13.
For any , suppose
Then,
(51) |
Lemma 7.14.
Suppose, for some , satisfies
for any . Then, for every .
Proof.
7.2 Lemmas and Propositions for Theorem 3.1 and 3.2
Lemma 7.15.
Let, where is given by (10) with and . Then,
Proof.
This proof uses some ideas from the proof of Lemma 2 in Lee [2000].
First, note that, by Lemma 4.1 in Pollard [1990],
For , let . Then,
(52) |
where the upper bound for a constant . This is because
In view of (7.2) and Theorem 2.7.11 in van der Vaart et al. [1996], we have
for some constant . Therefore,
Using Lemma 7.12 with , we get
where the last equality holds since and . Therefore,
∎
Proof.
This proof uses some ideas from the proof of Theorem 1 in Lee [2000]. Let ,
We first prove the Lemma for prior in 1.
We next prove the Lemma for prior in 2. Analogous to the proof for prior in 1. we get,
∎
Proposition 7.17.
Proof.
In view of Lemma 7.16, for as in (17) and (18),
Therefore, using Lemma 7.14 with , and , we have
Finally to complete the proof, let
then,
as shown before. Thus, .
∎
Proposition 7.18.
Proof.
This proof uses some ideas from the proof of Theorem 1 in Lee [2000].
By assumption (A1), let be a neural network such that
(54) |
Define neighborhood as follows
where .
Note that , thereby using Lemma 7.2 with , we get,
(55) |
for every . In view of (54) and (55), we have
(56) |
We next show that,
For notation simplicity, let
We first prove statement 1. of Proposition 7.18.
(57) |
for any since when .
Using assumption (A2) and condition (C1) together with (36), we get
(58) |
where the last inequality is a consequence of (C1) and the fact that as shown next.
where the first inequality is a consequence of the Cauchy-Schwarz inequality and the second inequality is a consequence of condition (C1) and assumption (A2).
Under assumption (A3) and condition (C1) together with (36), we have
(60) |
where the last inequality holds by mimicking the argument in the proof of part 1.
Proposition 7.19.
Proof.
We first prove statement 1. of the proposition.
Here, we have
(62) |
(63) |
Thus,
where the last equality is a consequence of condition (C1) and assumption (A2).
Next we prove statement 2. of the proposition.
Here, we have
(65) |
(66) |
Thus,
where the last equality is a consequence of condition (C1) and assumption (A3).
∎
7.3 Lemmas and Propositions for Theorem 4.1
Lemma 7.20.
Let, where is given by (27) with , , . Then,
Proof.
This proof uses some ideas from the proof of Lemma 2 in Lee [2000]. First, note that by Lemma 4.1 in Pollard [1990], we have
For , let .
Using (7.2), we get
(67) |
where the upper bound on is calculated as:
In view of (7.2) and Theorem 2.7.11 in van der Vaart et al. [1996], we have
for some constant . Therefore,
Using Lemma 7.12 with , we get
where the last equality holds since , , .
Therefore,
∎
Lemma 7.21.
Proposition 7.22.
Proof.
This proof uses some ideas from the proof of Lemma 3 in Lee [2000]. We shall first show
With as in (27) with , and where
In view of Lemma 7.16, for as in (28), for any ,
Therefore, by Lemma 7.14 with , and , we have
Since can be arbitrarily close to 1, the remaining part of the proof follows along the lines of Proposition 7.17.
∎
Proposition 7.23.
Suppose condition (C1) holds with some . Let be a neural network satisfying assumption (A1) and (A2) for some . With , define,
(68) |
For every , with as in (28), we have
Proof.
This proof uses some ideas from the proof of Theorem 1 in Lee [2000].
By assumption (A1), let be a neural network such that
(69) |
Define neighborhood as follows
where .
By Lemma 7.3,
(72) |
We next show that,
For notation simplicity, let and
Proposition 7.24.
Suppose condition (C1) and assumptions (A1) and (A2) hold for some and . Suppose the prior satisfies (28).
Then, there exists a with as in (29) such that
(75) |
Proof.
We first deal with as follows
(76) |
(77) |
where the last inequality is a consequence of Proposition 7.19. Simplifying further, we get
where the equality in step 4 follows by approximating using Lemma 4 in Elezovic and Giordano [2000] and approximating by Stirling’s formula.
where the last equality follows by approximating using Lemma 4 in Elezovic and Giordano [2000].
Therefore, by Lemma 7.11, .
∎
7.4 Lemmas and Propositions for Theorem 4.4
Lemma 7.25.
For as in (31), let . If , , , then
Proof.
First, by Lemma 4.1 in Pollard [1990],
For , let .
In view of (79) and Theorem 2.7.11 in van der Vaart et al. [1996], we have
for some . Therefore,
Using Lemma 7.12 with , we get
where the last equality holds since , , .
Therefore,
∎
Lemma 7.26.
Let
where , , , . Then with
we have for every
Proof.
Let and .
since for and for . ∎
Proposition 7.27.
Suppose condition (C1) holds with and satisfies (32). Then,
Proof.
Let . Let and for .
Proposition 7.28.
Suppose condition (C1) holds with some . Let be a neural network satisfying assumption (A1) and (A2) for some . With , define,
(80) |
For every , with as in (32), we have
Proof.
This proof uses some ideas from the proof of Theorem 1 in Lee [2000].
By assumption (A1), let satisfy
(81) |
With , define neighborhood as follows
where . Note that .
We next show that,
References
- Bishop [1997] C. M. Bishop, Bayesian Neural Networks, Journal of the Brazilian Computer Society 4 (1997).
- Neal [1992] R. M. Neal, Bayesian training of backpropagation networks by the hybrid monte-carlo method, 1992.
- Lampinen and Vehtari [2001] J. Lampinen, A. Vehtari, Bayesian approach for neural networks–review and case studies, Neural networks : the official journal of the International Neural Network Society 14 3 (2001) 257–74.
- Sun et al. [2017] S. Sun, C. Chen, L. Carin, Learning Structured Weight Uncertainty in Bayesian Neural Networks, in: A. Singh, J. Zhu (Eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, PMLR, Fort Lauderdale, FL, USA, 2017, pp. 1283–1292. URL: http://proceedings.mlr.press/v54/sun17b.html.
- Mullachery et al. [2018] V. Mullachery, A. Khera, A. Husain, Bayesian neural networks, 2018. arXiv:1801.07710.
- Hubin et al. [2018] A. Hubin, G. Storvik, F. Frommlet, Deep bayesian regression models, 2018. arXiv:1806.02160.
- Liang et al. [2018] F. Liang, Q. Li, L. Zhou, Bayesian neural networks for selection of drug sensitive genes, Journal of the American Statistical Association 113 (2018) 955--972.
- Javid et al. [2020] K. Javid, W. Handley, M. P. Hobson, A. Lasenby, Compromise-free bayesian neural networks, ArXiv abs/2004.12211 (2020).
- Lee [2000] H. Lee, Consistency of posterior distributions for neural networks, Neural Networks 13 (2000) 629 -- 642.
- Barron et al. [1999] A. Barron, M. J. Schervish, L. Wasserman, The consistency of posterior distributions in nonparametric problems, Ann. Statist. 27 (1999) 536--561.
- Neal [1996] R. M. Neal, Bayesian Learning for Neural Networks, Springer, New York, 1996. URL: https://books.google.com/books?id=OCenCW9qmp4C.
- Lee [2004] H. K. H. Lee, Bayesian Nonparametrics via Neural Networks, Springer-Verlag, ASA-SIAM Series, 2004. URL: https://books.google.com/books?id=OCenCW9qmp4C.
- Ghosh et al. [2004] M. Ghosh, T. Maiti, D. Kim, S. Chakraborty, A. Tewari, Hierarchical bayesian neural networks, Journal of the American Statistical Association 99 (2004) 601--608.
- Blei et al. [2017] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (2017) 859–877.
- Logsdon et al. [2009] B. A. Logsdon, G. E. Hoffman, J. G. Mezey, A variational bayes algorithm for fast and accurate multiple locus genome-wide association analysis, BMC Bioinformatics 11 (2009) 58 -- 58.
- Graves [2011] A. Graves, Practical variational inference for neural networks, in: J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, Curran Associates, Inc., 2011, pp. 2348--2356. URL: http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.
- Carbonetto and Stephens [2012] P. Carbonetto, M. Stephens, Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies, Bayesian Anal. 7 (2012) 73--108.
- Blundell et al. [2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, 2015. arXiv:1505.05424.
- Sun et al. [2019] S. Sun, G. Zhang, J. Shi, R. Grosse, Functional variational bayesian neural networks, 2019. arXiv:1903.05779.
- Wang and Blei [2019] Y. Wang, D. M. Blei, Frequentist consistency of variational bayes, Journal of the American Statistical Association 114 (2019) 1147--1161.
- Pati et al. [2017] D. Pati, A. Bhattacharya, Y. Yang, On statistical optimality of variational bayes, 2017. arXiv:1712.08983.
- Yang et al. [2017] Y. Yang, D. Pati, A. Bhattacharya, α-variational inference with statistical guarantees, 2017. arXiv:1710.03266.
- Zhang and Gao [2017] F. Zhang, C. Gao, Convergence rates of variational posterior distributions, 2017. arXiv:1712.02519.
- Hornik et al. [1989] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359 -- 366.
- Siegel and Xu [2019] J. W. Siegel, J. Xu, Approximation rates for neural networks with general activation functions, 2019. arXiv:1904.02311.
- Shen [1997] X. Shen, On methods of sieves and penalization, Ann. Statist. 25 (1997) 2555--2591.
- Shen et al. [2019] X. Shen, C. Jiang, L. Sakhanenko, Q. Lu, Asymptotic properties of neural network sieve estimators, 2019. arXiv:1906.00875.
- White [1990] H. White, Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings, Neural Networks 3 (1990) 535 -- 549.
- Scheffe [1947] H. Scheffe, A useful convergence theorem for probability distributions, Ann. Math. Statist. 18 (1947) 434--438.
- Elezovic and Giordano [2000] N. Elezovic, C. Giordano, The best bounds in Gautschi's inequality, Mathematical Inequalities and Applications 3 (2000).
- Wong and Shen [1995] W. H. Wong, X. Shen, Probability inequalities for likelihood ratios and convergence rates of sieve mles, Ann. Statist. 23 (1995) 339--362.
- Pollard [1990] D. Pollard, Empirical Processes: Theory and Applications, Conference Board of the Mathematical Science: NSF-CBMS regional conference series in probability and statistics, Institute of Mathematical Statistics, 1990. URL: https://books.google.com/books?id=Prcsi29EU50C.
- van der Vaart et al. [1996] A. van der Vaart, A. van der Vaart, A. van der Vaart, J. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics, Springer, 1996. URL: https://books.google.com/books?id=OCenCW9qmp4C.