Statistical Foundation of Variational Bayes Neural Networks
Abstract
Despite the popularity of Bayesian neural networks (BNNs) in recent years, their use is somewhat limited in complex and big data situations due to the computational cost associated with full posterior evaluations. Variational Bayes (VB) provides a useful alternative to circumvent the computational cost and time complexity associated with the generation of samples from the true posterior using Markov Chain Monte Carlo (MCMC) techniques. The efficacy of VB methods is well established in the machine learning literature. However, their potential broader impact is hindered by a lack of theoretical validity from a statistical perspective. In this paper, we establish the fundamental result of posterior consistency for the mean-field variational posterior (VP) for a feed-forward artificial neural network model. The paper underlines the conditions needed to guarantee that the VP concentrates around Hellinger neighborhoods of the true density function. Additionally, the role of the scale parameter and its influence on the convergence rates is discussed. The paper mainly relies on two results: (1) the rate at which the true posterior grows and (2) the rate at which the KL-distance between the posterior and variational posterior grows. The theory provides guidelines for building prior distributions for Bayesian NN models along with an assessment of the accuracy of the corresponding VB implementation.
keywords:
Neural networks, Variational posterior, Mean-field family, Hellinger neighborhood, Kullback-Leibler divergence, Sieve theory, Prior mass, Variational Bayes.

1 Introduction
Bayesian neural networks (BNNs) have been comprehensively studied in the works of Bishop [1997], Neal [1992], Lampinen and Vehtari [2001], etc. More recent developments which establish the efficacy of BNNs can be found in the works of Sun et al. [2017], Mullachery et al. [2018], Hubin et al. [2018], Liang et al. [2018], Javid et al. [2020] and the references therein. The theoretical foundation of BNNs by Lee [2000] widened the scope to a broader community. However, in the age of big data applications, the conventional Bayesian approach is computationally inefficient. Thus, alternative computational approaches such as variational Bayes (VB) have become popular among machine learning and applied researchers. Although there have been many works on algorithm development for VB in recent years, the theoretical advancement on estimation accuracy is rather limited. This article provides statistical validity of neural network models with variational inference, along with some theory-driven practical guidelines for implementation.
In this article, we mainly focus on feed-forward neural networks with a single hidden layer and a logistic activation function. Let the number of inputs be denoted by and the number of hidden nodes by , where the number of nodes is allowed to increase as a function of . The true regression function, is modeled as a neural network of the form
(1) |
where is the logistic activation function. With a Gaussian prior on each of the parameters, Lee [2000] establishes the posterior consistency of neural networks under the simple setup where the scale parameter is fixed at 1. The results in Lee [2000] mainly exploit Barron et al. [1999], a fundamental contribution that laid down the framework for posterior consistency in nonparametric regression settings. In this paper, we closely mimic the regression model of Lee [2000] by assuming where is the true regression function and follows .
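To make the data-generating model concrete, the following minimal sketch simulates responses from a single-hidden-layer network with logistic activation plus Gaussian noise, in the spirit of (1); all symbol names (k_n, beta, gamma, sigma, etc.) are illustrative choices and not the paper's notation.

```python
# Illustrative sketch only; the parameter names below are assumptions, not the paper's notation.
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    # logistic (sigmoid) activation used in the hidden layer
    return 1.0 / (1.0 + np.exp(-z))

def nn_regression(x, beta0, beta, gamma0, gamma):
    """One-hidden-layer network: beta0 + sum_j beta_j * logistic(gamma0_j + gamma_j' x)."""
    hidden = logistic(gamma0 + x @ gamma.T)        # shape (n, k_n)
    return beta0 + hidden @ beta

# a toy "true" network playing the role of the regression function f0 in y = f0(x) + eps
p, k_n, n, sigma = 3, 5, 200, 0.5
beta0, beta = 0.2, rng.normal(size=k_n)
gamma0, gamma = rng.normal(size=k_n), rng.normal(size=(k_n, p))

X = rng.uniform(size=(n, p))                       # features (uniform, as assumed later in Section 2)
y = nn_regression(X, beta0, beta, gamma0, gamma) + sigma * rng.normal(size=n)
```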
The joint posterior distribution of a neural network model is generally evaluated by popular Markov Chain Monte Carlo (MCMC) sampling techniques, like Gibbs sampling, Metropolis-Hastings, etc. (see Neal [1996], Lee [2004], and Ghosh et al. [2004] for more details). Despite the versatility and popularity of the MCMC based approach, Bayesian estimation suffers from computational cost, scalability and time constraints, along with other implementation issues such as the choice of proposal densities and the generation of sample paths. Variational Bayes emerged as an important alternative to overcome the drawbacks of the MCMC implementation (see Blei et al. [2017]). Many recent works have discussed the application of variational inference to Bayesian neural networks, e.g., Logsdon et al. [2009], Graves [2011], Carbonetto and Stephens [2012], Blundell et al. [2015], Sun et al. [2019]. Although there is a plethora of literature implementing variational inference for neural networks, the theoretical properties of the variational posterior in BNNs remain relatively unexplored, and this limits the use of this powerful computational tool beyond the machine learning community.
Some of the previous works that focused on theoretical properties of the variational posterior include the frequentist consistency of variational inference in parametric models in the presence of latent variables (see Wang and Blei [2019]). Optimal risk bounds for mean-field variational Bayes for Gaussian mixture (GM) and latent Dirichlet allocation (LDA) models have been discussed in Pati et al. [2017]. The work of Yang et al. [2017] provides Bayes risk bounds for variational inference in GM and LDA models. A more recent work, Zhang and Gao [2017], discusses variational posterior consistency rates in Gaussian sequence models, infinite exponential families and piece-wise constant models. In order to evaluate the validity of a posterior in non-parametric models, one must establish its consistency and rates of contraction. To the best of our knowledge, the problem of posterior consistency has not been studied in the context of variational Bayes neural network models.
Our contribution: Our theoretical development of posterior consistency, an essential property in nonparametric Bayesian statistics, provides confidence in using variational Bayes neural network models across disciplines. Our theoretical results help to assess the estimation accuracy for a given training sample and model complexity. Specifically, we establish the conditions needed for the variational posterior consistency of feed-forward neural networks. We establish that a simple Gaussian mean-field approximation is good enough to achieve consistency for the variational posterior. In this direction, we show that the -Hellinger neighborhood of the true density function receives probability close to 1 under the variational posterior. For the true posterior density (Lee [2000]), the posterior probability of an -Hellinger neighborhood grows at the rate . In contrast, we show that for the variational posterior this rate becomes . The reason for this difference is two-fold: (1) the KL-distance between the variational posterior and the true posterior does not grow at a rate greater than for some , and (2) the posterior probability of the -Hellinger neighborhood grows at the rate ; thus, the variational posterior probability must grow at the rate , otherwise the rate of growth of the KL-distance cannot be controlled. We also give the conditions on the approximating neural network and the rate of growth in the number of nodes needed to ensure that the variational posterior achieves consistency. As a last contribution, we show that the VB estimator of the regression function converges to the true regression function.
Further, our investigation shows that although the variational posterior (VP) is asymptotically consistent, the posterior probability of Hellinger neighborhoods does not converge to 1 as fast as under the true posterior. In addition, one requires that the absolute values of the parameters in the approximating neural network function grow at a controlled rate (less than for some ), a condition not needed in an MCMC based implementation. When the absolute values of the parameters grow as a polynomial function of (), one can choose a flatter prior (a prior whose variance increases with ) in order to guarantee VP consistency.
VP consistency has been established irrespective of whether is known or unknown, and the differences in practice have been discussed. It has been shown that one must guard against using Gaussian distributions as a variational family for . Since the KL-distance between the variational posterior and the true posterior must be controlled, one must ensure that quantities like and are well defined under the variational distribution of . We thereby discuss two variational families on : (1) an inverse-gamma distribution, and (2) a normal distribution on the log-transformed . While the second approach may seem intuitively appealing if one were to use fully Gaussian variational families, it comes with a drawback. Indeed, under the reparametrized , the variational posterior is consistent if the rate of growth in the number of nodes is slower than under the original parametrization. However, a slower growth in the number of nodes makes it more and more difficult to find an approximating neural network which converges fast enough to the true function.
The outline of the paper is as follows. In Section 2, we present the notation and the terminology of consistency for the variational posterior. In Section 3, we present the consistency results when the scale parameter is known. In Section 4, we present the consistency results for an unknown scale parameter under two sets of variational families. In Section 5, we show that the Bayes estimates obtained from the variational posterior converge to the true regression function and scale parameter. Finally, Section 6 presents a discussion and conclusions from our current work.
2 Model and Assumptions
Suppose the true regression model has the form:
where are i.i.d. random variables and the feature vector with . For the purposes of this paper, we assume that the number of covariates is fixed.
Thus, the true conditional density of is
(2) |
which implies the true likelihood function is
(3) |
Universal approximation: By Hornik et al. [1989], for every function such that , there exists a neural network such that . This led to the ubiquitous use of neural networks as a modeling approximation to a wide class of regression functions.
In this paper, we assume that the true regression function can be approximated by a neural network
(4) |
where , the number of nodes, increases as a function of , while , the number of covariates, is fixed. Thus, the total number of parameters grows at the same rate as the number of nodes, i.e. .
Suppose there exists a neural network such that
(5) |
Note that if is a neural network function itself, then (A1) holds trivially for all irrespective of the choice of . Theorem 2 of Siegel and Xu [2019] showed that with , can be chosen between . Mimicking the steps of Theorem 2, Siegel and Xu [2019], it can be shown that with , can be chosen anywhere in the range . For a given choice of , whether (A1) holds or not depends on the entropy of the true function. Assumptions of similar form can also be found in Shen [1997] (see conditions C and ) and Shen et al. [2019] (see condition C3).
Note that condition (A1) characterizes the rate at which a neural network function approaches the true function. The next set of conditions characterizes the rate at which the coefficients of the approximating neural network solution grow. Suppose one of the following two conditions holds:
(6) | ||||
(7) |
Note that condition (A2) ensures that the sum of squares of the coefficients grows at a rate slower than . White [1990] proved consistency properties of feed-forward neural networks with , which implies , i.e. . Blei et al. [2017] studied the consistency properties for parametric models, wherein one requires that be bounded (see Relations (44) and (53) in Blei et al. [2017]). With a normal prior of the form , the same condition reduces to being bounded at a suitable rate. Indeed, condition (A2) guarantees that the rate of growth of the KL-distance between the true and the variational posterior is well controlled.
Condition (A3) is a relaxed version of (A2), where the sum of squares of the coefficients is allowed to grow at a rate polynomial in . A standard prior independent of might fail to guarantee convergence. We thereby assume a flatter prior whose variance increases with in order to allow for consistency through variational Bayes. Note that if is a neural network function itself, conditions (A2) and (A3) hold trivially.
Kullback-Leibler divergence: Let and be two probability distributions, with density and respectively, then
Hellinger distance: Let and be two probability distributions with density and respectively, then
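For reference, in generic notation (writing P_1, P_2 for the two distributions and p_1, p_2 for their densities, rather than the paper's own symbols), these two quantities take the standard forms below; note that some authors include an extra factor of 1/2 in the Hellinger distance.

```latex
d_{\mathrm{KL}}(P_1, P_2) = \int p_1(z)\,\log\frac{p_1(z)}{p_2(z)}\,dz,
\qquad
d_{H}(P_1, P_2) = \left( \int \left( \sqrt{p_1(z)} - \sqrt{p_2(z)} \right)^2 dz \right)^{1/2}.
```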
Distribution of the feature vector: In order to establish posterior consistency, we assume that the feature vector . Although this is not a requirement for the model, it simplifies the steps of the proof since the joint density function of (Y,X) simplifies as
(8) |
Thus, it suffices to deal with the conditional density of .
3 Consistency of variational posterior with known
In this section, we begin with the simple model where the scale parameter is known. For the simple Gaussian mean-field family in (13), we establish that the variational posterior is consistent as long as assumption (A1) holds together with either (A2) or (A3). We also discuss how the rates contrast with those in Lee [2000], which established the consistency of the true posterior.
Likelihood:
(11) |
Posterior: Let denote the prior on . Then, the posterior is given by
(12) |
Variational Family: Variational family for is given by
(13) |
Let the variational posterior be denoted by
(14) |
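As a purely illustrative sketch of how the mean-field family in (13) is used in practice, the snippet below evaluates a Monte Carlo estimate of the evidence lower bound (ELBO), whose maximizer over the variational means and standard deviations defines the variational posterior in (14). It assumes a standard normal prior on each parameter (as in Theorem 3.1), a known noise scale, and a flattened parameter vector; none of these choices reproduce the paper's exact notation.

```python
# Minimal mean-field VB sketch (not the paper's implementation): Gaussian q over a
# flattened parameter vector theta = (beta0, beta, gamma0, gamma), N(0, I) prior, known sigma.
import numpy as np

rng = np.random.default_rng(1)

def unpack(theta, p, k):
    """Split a flat parameter vector into the pieces of the one-hidden-layer network."""
    beta0, beta = theta[0], theta[1:1 + k]
    gamma0 = theta[1 + k:1 + 2 * k]
    gamma = theta[1 + 2 * k:].reshape(k, p)
    return beta0, beta, gamma0, gamma

def log_likelihood(theta, X, y, sigma, p, k):
    beta0, beta, gamma0, gamma = unpack(theta, p, k)
    f = beta0 + (1.0 / (1.0 + np.exp(-(gamma0 + X @ gamma.T)))) @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((y - f) ** 2) / (2 * sigma**2)

def elbo(m, log_s, X, y, sigma, p, k, n_mc=50):
    """Monte Carlo ELBO for q(theta) = prod_i N(m_i, s_i^2) against a N(0, I) prior."""
    s = np.exp(log_s)
    # closed-form KL( N(m, diag(s^2)) || N(0, I) ), summed over coordinates
    kl = 0.5 * np.sum(s**2 + m**2 - 1.0 - 2.0 * log_s)
    # Monte Carlo estimate of E_q[ log p(y | X, theta) ] via the reparametrization theta = m + s * eps
    eps = rng.normal(size=(n_mc, len(m)))
    exp_loglik = np.mean([log_likelihood(m + s * e, X, y, sigma, p, k) for e in eps])
    return exp_loglik - kl
```

In practice, the variational parameters (m, log_s) would be optimized by stochastic gradient ascent on this objective, which is equivalent, up to an additive constant, to minimizing the KL divergence from the variational family to the true posterior.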
Hellinger neighborhood: Define the neighborhood of the true density as
(15) |
where the Hellinger distance given by
Note that the above simplified form of the Hellinger distance is due to (8).
In the following two theorems, for two classes of priors, we establish the posterior consistency of , i.e. the variational posterior concentrates in small Hellinger neighborhoods of the true density . Note that assumptions (A2) and (A3) impose a restriction on the rate of growth of the sum of squares of the coefficients of the approximating neural network solution. With (A2), we show that a standard normal prior on all the parameters works. However, under the weaker assumption (A3), a normal prior whose variance increases with is needed. Additionally, we show that for the variational posterior to achieve consistency, the number of parameters, or equivalently the number of nodes, needs to grow in a controlled fashion.
Theorem 3.1.
Suppose the number of nodes satisfy
(16) |
In addition, suppose assumptions (A1) and (A2) hold for some .
Then, with normal prior for each entry in as follows
(17) |
we have
Note that conditions (16) and (17) agree with those assumed in Theorem 1 of Lee [2000]. Since , the variational posterior is consistent with as small as 0. Indeed, imposes the least restriction on the convergence rate and coefficient growth rate of the true function (see assumptions (A1) and (A2)). As grows, restrictions on the approximating neural network function increase, but this guarantees faster convergence of the variational posterior. Expanding upon the Bayesian posterior consistency established in Lee [2000], one can show that for any (see Relation (88) in Lee [2000]). Thus, the probability of the Hellinger neighborhood grows at the rate for the variational posterior, in contrast to for the true posterior. For parametric models, the rate of growth of the variational posterior was found to be (see the second equation on page 38 of Blei et al. [2017]). Note that the consistency of the true posterior requires no assumptions on the approximating neural network function, whereas for the variational posterior, both assumptions (A1) and (A2) must be satisfied to guarantee convergence.
Theorem 3.2.
Suppose the number of nodes satisfy condition (C1). In addition, suppose assumptions (A1) and (A3) hold for some and .
Then, with normal prior for each entry in as follows
(18) |
we have
Observe that the consistency rate in Theorem 3.2 agrees with the one in Theorem 3.1. In order to prove both Theorems 3.1 and 3.2, a crucial step is to show that . To this end, we show that for some . Indeed, this choice of varies in order to adjust for the changing nature of the prior from (17) to (18) (see statements (1) and (2) in Lemma 7.9).
We next present the proof of Theorems 3.1 and 3.2. The first crucial step of the proof is to establish that the is bounded below by a quantity which is determined by the rate of consistency of the true posterior (see the quantities and in the proof below). The second crucial step is to show that is bounded above at a rate which can be greater than the rate of its lower bound if and only if the variational posterior is consistent.
In the above proof, we have assumed , . If , there is nothing to prove. If , then following the steps of the proof, we would get , which is a contradiction.
The main step in the above proof is (25), which we discuss next. The quantity is indeed decomposed into two parts
Whereas the first term is controlled using the Hellinger bracketing entropy of , the second term is controlled by the fact that the prior gives negligible probability outside . Thus, the main factor influencing is a suitable choice of the sequence of spaces . Indeed, our choice of is the same as that in Lee [2000] with and . Such a choice allows one to control the Hellinger bracketing entropy of while controlling the prior mass for at the same time.
The second quantity is controlled by the rate at which the prior gives mass to shrinking KL neighborhoods of the true density . Indeed, the quantity appears again when computing bounds on for some (see in Proposition 7.19). If , can be controlled even without assumptions (A1) and (A2). However, if , assumptions (A1) and (A2) are needed in order to guarantee that grows at a rate less than .
4 Consistency of variational posterior with unknown
In this section, we assume that the scale parameter is unknown. In this case, our approximating variational family is slightly different from (14). Whereas we still assume a mean-field Gaussian family on , our approximating family for cannot be Gaussian. An important criterion to guarantee the consistency of the variational posterior is to ensure that is well bounded (see Lemma 7.11). When is unknown, involves terms like and , both of whose integrals are undefined under a normally distributed . We thereby adopt two versions of for : firstly, an inverse gamma distribution on , and secondly, a normal distribution on the log-transformed (see Sections 4.1 and 4.2 respectively). Both transforms have their respective advantages in terms of determining the rate of consistency of the variational posterior. In this section, we work only with assumption (A2); assumption (A3) can be handled in a way exactly similar to Section 3.
4.1 Inverse-gamma prior on
Sieve Theory: Let where and are defined in (4), then
(26) |
The sieve is defined as follows.
(27) |
The definitions for likelihood, posterior and Hellinger neighborhood agree with those given in (11), (12) and (15) as in Section 3.
Prior distribution: We propose a normal prior on each and an inverse gamma prior on .
(28) |
Variational Family: Variational family for is given by
(29) |
The variational posterior has the same definition as in (14).
The following theorem shows that when the parameter is unknown, the variational posterior is still consistent; however, the rate decreases by an amount of .
Theorem 4.1.
Suppose the number of nodes satisfy condition (C1). In addition, suppose assumptions (A1) and (A2) hold for some . Then for any .
Note that by Theorem 3.1, the posterior is consistent iff , which is indeed the case as long as . Whether such a exists or not depends on the entropy of the function (see the discussion section in Shen et al. [2019]). Mimicking the steps of Theorem 2 of Siegel and Xu [2019], it can be shown that with , , can be chosen anywhere in the range .
Similar to the proof of Theorem 3.1, the quantity is indeed decomposed into two parts
Whereas the first term is controlled using the Hellinger bracketing entropy of at the rate , the second term is controlled by the prior probability of at , . Since the prior probability of is now controlled at a slightly smaller rate than that in Theorem 3.1, an additional term appears in the overall consistency rate of the variational posterior.
Remark 4.2.
With and as in (27), we choose and to prove the posterior consistency statement of Theorem 4.1. By suitably choosing as a function of , one may be able to refine the proof to obtain a rate of instead of . However, the proof becomes more involved, and such a dependent choice of has been avoided for the purposes of this paper.
Remark 4.3.
When is unknown, in order to control at a rate less than , has the same form as in the proof of Theorem 3.1. However, we cannot choose a normally distributed for . The convergence of is determined by the term , which involves terms like and (see (7.3)). The expectation of these terms is not defined under a normal distribution but is well defined under an inverse gamma distribution; hence the inverse-gamma variational family for .
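A quick numerical sanity check of this point (with illustrative shape and scale values a and b that are not the paper's notation): under an inverse-gamma variational factor, both expectations are finite and available in closed form, so the relevant KL terms remain well defined.

```python
# Sanity check with illustrative parameters: for sigma^2 ~ InvGamma(a, scale=b),
# E[1/sigma^2] = a/b and E[log sigma^2] = log b - digamma(a), both finite.
import numpy as np
from scipy import stats
from scipy.special import digamma

a, b = 3.0, 2.0
draws = stats.invgamma(a, scale=b).rvs(size=200_000, random_state=0)

print(np.mean(1.0 / draws), a / b)                      # Monte Carlo vs closed form
print(np.mean(np.log(draws)), np.log(b) - digamma(a))   # Monte Carlo vs closed form
```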
4.2 Normal prior on log transformed
Given the wide popularity of the Gaussian mean-field approximation, we next use a normal variational distribution on the log-transformed and compare and contrast it with the case where an inverse-gamma variational distribution is used on the scale parameter. In Section 3.3 of Blei et al. [2017], it has been posited that a Gaussian VB posterior can be used to approximate a wide class of posteriors. However, as mentioned in Section 4.1, a normal would cause to be undefined. One way out of this impasse is to reparametrize as and use a normal prior for . In the following section, we show that this approach may work but comes with the disadvantage that the number of nodes needs to grow at a rate smaller than . The main drawback of this approach is that if the number of nodes does not grow sufficiently fast, it may be difficult to find a neural network which approximates the true function well.
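For instance, in generic notation (not necessarily the paper's), if the scale is reparametrized as \eta = \log\sigma^2 and assigned a Gaussian variational factor q(\eta) = N(\mu, s^2), then the log-normal moment formula makes the previously problematic expectations finite:

```latex
\mathbb{E}_q[\log\sigma^2] = \mu,
\qquad
\mathbb{E}_q[\sigma^{-2}] = \mathbb{E}_q\!\left[e^{-\eta}\right] = \exp\!\left(-\mu + \tfrac{s^2}{2}\right).
```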
Sieve Theory: Let where and are same as defined in (4). With , we have
(30) |
The sieve is defined as follows.
(31) |
The definitions for likelihood, posterior and Hellinger neighborhood agree with those given in (11), (12) and (15) as in Section 3.
Prior distribution: We propose a normal prior on each and as follows
(32) |
Variational Family: Variational family for is given by
(33) |
The variational posterior has the same definition as in (14).
In the following theorem we show that even with reparametrized as the variational posterior is consistent.
Theorem 4.4.
Suppose the number of nodes satisfy condition (C1) with . In addition, suppose assumptions (A1) and (A2) hold for . Then,
Proof.
Remark 4.5.
With and as in (31), we choose where . In order to ensure that the prior gives small mass outside , one requires for some . With a normal prior on and , which is less than provided or . Hence the requirement of a slow growth in the number of nodes.
5 Consistency of variational Bayes
In this section, we show that if the variational posterior is consistent, then the variational Bayes estimators of and converge to the true and . The proof uses ideas from Barron et al. [1999] and Corollary 1 in Lee [2000]. Let
(34) |
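Purely as an illustration of how such a variational Bayes estimate can be computed in practice (the exact estimator in (34) is not reproduced here), a posterior-mean-type prediction averages network outputs over draws from the fitted mean-field distribution with means m and standard deviations s; the flattened parameter layout below is a hypothetical choice.

```python
# Hypothetical Monte Carlo plug-in: average the network output over parameter draws from
# a fitted mean-field q; this is an illustration, not the estimator defined in (34).
import numpy as np

def vb_predict(X_new, m, s, p, k, n_mc=200, seed=2):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_mc):
        theta = m + s * rng.normal(size=len(m))            # draw a parameter vector from q
        beta0, beta = theta[0], theta[1:1 + k]
        gamma0 = theta[1 + k:1 + 2 * k]
        gamma = theta[1 + 2 * k:].reshape(k, p)
        hidden = 1.0 / (1.0 + np.exp(-(gamma0 + X_new @ gamma.T)))
        preds.append(beta0 + hidden @ beta)
    return np.mean(preds, axis=0)                          # variational posterior-mean prediction
```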
Corollary 5.1 (Variational Bayes consistency).
Suppose and are defined as in (5), then
(35) |
Proof.
Let
Taking , we get . Now,
Now, let us consider the form of
Since , .
Note that and , thus .
Since , thus
We shall next show
We shall instead show that for any sequence , there exists a further subsequence such that
Since , there exists a sub-sequence s.t.
This implies
(for details see proof of Corollary 1 in Lee [2000]).
Thus, using Scheffe’s theorem in Scheffe [1947], we have
which implies
Since , applying Slutsky, we get
∎
6 Discussion
In this paper, we have highlighted the conditions which guarantee that the variational posterior of feed-forward neural networks is consistent. A variational family as simple as a Gaussian mean-field family is good enough to ensure that the variational posterior is consistent, provided the entropy of the true function is well behaved. In other words, has an approximating neural network solution which approximates at a fast enough rate, while ensuring that the number of nodes and the norm of the NN parameters grow in a controlled manner. Conditions of this form are often needed when one tries to establish the consistency of neural networks in a frequentist setup (see condition C3 in Shen et al. [2019]). Whereas the variational posterior presents a scalable alternative to MCMC, unlike MCMC its consistency cannot be guaranteed without certain restrictions on the entropy of the true function. Two other main contributions of the paper are that (1) a Gaussian family may not always be the best choice for a variational family (see Section 4), and (2) one may need a prior with variance growing in when the rate of growth in the norm of the approximating NN is high (see Theorem 3.2).
Although we have quantified the consistency of the variational posterior, the rate of contraction of the variational posterior still needs to be explored. We suspect that this rate would be closely related to the rate of contraction of the true posterior under mild assumptions on the entropy of the function . By following the ideas of the proofs in this paper, one may be able to quantify conditions on the entropy of when one uses a deep neural network instead of a one-hidden-layer neural network in order to guarantee the consistency of the variational posterior. Similarly, the effect of hierarchical priors and hyperparameters on the rate of convergence of the variational posterior needs to be explored.
7 Appendix
7.1 General Lemmas
Lemma 7.1.
Let and be any two density functions. Then
Proof.
Proof is same as proof of Lemma 4 in Lee [2000]. ∎
Lemma 7.2.
Let be a fixed neural network satisfying
Then,
Proof.
This proof uses some ideas from Lemma 6 in Lee [2000]. Note that
Therefore,
Let , , then
Since for small , it follows that .
Using
(36) |
the proof follows. ∎
Lemma 7.3.
With
-
1.
-
2.
Proof.
Let , then
-
1.
where . The function satisfies
since for every .
-
2.
∎
Lemma 7.4.
With and , .
-
1.
-
2.
Proof.
implies
Similarly,
Thus, . The remaining part of the proof follows on the same lines as Lemma 7.3. ∎
Lemma 7.5.
With and , for every , we have
Proof.
where the last step holds because (see Lemma 4 in Elezovic and Giordano [2000]). ∎
Lemma 7.6.
With and , for every ,
Proof.
∎
Lemma 7.7.
With and , let and . Then, for every , we have
Proof.
First note that , thus it suffices to show . In this direction,
We can apply Taylor expansion to as
where the equality follows since and is symmetric around .
It is easy to check , which implies
Thus, for every , .
For the remaining part of the proof, we shall make use of the Mills ratio approximation as follows.
(37) |
where and are the cdf and pdf of standard normal distribution respectively.
For ,
Let , then .
If , then and can be dropped. If , then
(38) |
For ④, we make use of the following result
If , . For , . | (39) |
If , then for sufficiently large.
Using (39) and getting rid of negative terms, we get
If , then for sufficiently large.
If , then and for sufficiently large, thus
For , we shall make use of the following result:
(40) |
If , then , for sufficiently large. Thus, using (7.1), we get
If , then , for sufficiently large. Thus, using (7.1), we get
If , then , for sufficiently large. Thus, using (7.1), we get
∎
Lemma 7.8.
With and , let and . Then, for every , we have
Proof.
We can apply Taylor expansion to ,
where the equality follows since and is symmetric around .
Since and , it suffices to show .
In this direction,
Since , to prove it suffices to show .
Note that is the same as of Lemma 7.7, except for a constant. Thus, , which completes the proof.
∎
Lemma 7.9.
Suppose condition (C1) and assumption (A1) holds for some and . Let
we have
(41) |
provided
-
1.
Assumption (A2) holds with same as (A1) and
-
2.
Assumption (A3) holds and
Proof.
Note that since , to prove (41), it suffices to show .
We begin by proving statement 1. of the lemma. Let , then
For , we do a Taylor expansion of around as
where the last equality follows since by assumption (A1).
With , let and
(42) |
since is an odd function. Also,
where second equality to third equality is a consequence of (7.1). Thus,
We next try to bound the quantities . First note that
Let . Then,
where for , , we have
Using the fact that and we get
where the last equality is a consequence of assumptions (A1), (A2) and condition (C1).
For , note that
First, note that since . Thus,
where the last equality follows since using .
First note that where . Therefore,
(43) |
where the last asymptotic equality is a consequence of (37) and condition (C1).
For , note that for some . Therefore,
for any .
For , note that by assumption (A2). Using this together with (7.1), we get
For , using the Cauchy-Schwarz inequality, we get
where the fact is shown below. Now, let
(44) |
where includes all coordinates of except and is the union of all except .
(45) |
Using (7.1), we get
Using (7.1) and (7.1) in (7.1), we get
(46) |
The only difference with statement 2. is that and .
The proof is similar and details have been omitted.
∎
Lemma 7.10.
Suppose and satisfies
(47) |
for every and for some . Then,
(48) |
provided .
Proof.
Lemma 7.11.
Suppose satisfies
then
In this direction, note that
where the last result follows from Markov’s Inequality
Using Lemma 7.1, we get
since .
Lemma 7.12.
Let then
Proof.
Lemma 7.13.
For any , suppose
Then,
(51) |
Lemma 7.14.
Suppose, for some , satisfies
for any . Then, for every .
Proof.
7.2 Lemmas and Propositions for Theorem 3.1 and 3.2
Lemma 7.15.
Let, where is given by (10) with and . Then,
Proof.
This proof uses some ideas from the proof of Lemma 2 in Lee [2000].
First, note that, by Lemma 4.1 in Pollard [1990],
For , let . Then,
(52) |
where the upper bound for a constant . This is because
In view of (7.2) and Theorem 2.7.11 in van der Vaart et al. [1996], we have
for some constant . Therefore,
Using Lemma 7.12 with , we get
where the last equality holds since and . Therefore,
∎
Proof.
This proof uses some ideas from the proof of Theorem 1 in Lee [2000]. Let ,
We first prove the Lemma for prior in 1.
We next prove the Lemma for prior in 2. Analogous to the proof for prior in 1. we get,
∎
Proposition 7.17.
Proof.
In view of Lemma 7.16, for as in (17) and (18),
Therefore, using Lemma 7.14 with , and , we have
Finally to complete the proof, let
then,
as shown before. Thus, .
∎
Proposition 7.18.
Proof.
This proof uses some ideas from the proof of Theorem 1 in Lee [2000].
By assumption (A1), let be a neural network such that
(54) |
Define neighborhood as follows
where .
Note that , thereby using Lemma 7.2 with , we get,
(55) |
for every . In view of (54) and (55), we have
(56) |
We next show that,
For notation simplicity, let
We first prove statement 1. of Proposition 7.18.
(57) |
for any since when .
Using assumption (A2) and condition (C1) together with (36), we get
(58) |
where the last inequality is a consequence of (C1) and the fact that as shown next.
where the first inequality is a consequence of the Cauchy-Schwarz inequality and the second inequality is a consequence of condition (C1) and assumption (A2).
Under assumption (A3) and condition (C1) together with (36), we have
(60) |
where the last inequality holds by mimicking the argument in the proof of part 1.
Proposition 7.19.
Proof.
We first prove statement 1. of the proposition.
Here, we have
(62) |
(63) |
Thus,
where the last equality is a consequence of condition (C1) and assumption (A2).
Next we prove statement 2. of the proposition.
Here, we have
(65) |
(66) |
Thus,
where the last equality is a consequence of condition (C1) and assumption (A3).
∎
7.3 Lemmas and Propositions for Theorem 4.1
Lemma 7.20.
Let, where is given by (27) with , , . Then,
Proof.
This proof uses some ideas from the proof of Lemma 2 in Lee [2000]. First, note that by Lemma 4.1 in Pollard [1990], we have
For , let .
Using (7.2), we get
(67) |
where the upper bound on is calculated as:
In view of (7.2) and Theorem 2.7.11 in van der Vaart et al. [1996], we have
for some constant . Therefore,
Using Lemma 7.12 with , we get
where the last equality holds since , , .
Therefore,
∎
Lemma 7.21.
Proposition 7.22.
Proof.
This proof uses some ideas from the proof of Lemma 3 in Lee [2000]. We shall first show
With as in (27) with , and where
In view of Lemma 7.16, for as in (28), for any ,
Therefore, by Lemma 7.14 with , and , we have
Since can be arbitrarily close to 1, the remaining part of the proof follows along the lines of Proposition 7.17.
∎
Proposition 7.23.
Suppose condition (C1) holds with some . Let be a neural network satisfying assumption (A1) and (A2) for some . With , define,
(68) |
For every , with as in (28), we have
Proof.
This proof uses some ideas from the proof of Theorem 1 in Lee [2000].
By assumption (A1), let be a neural network such that
(69) |
Define neighborhood as follows
where .
By Lemma 7.3,
(72) |
We next show that,
For notation simplicity, let and
Proposition 7.24.
Suppose condition (C1) and assumptions (A1) and (A2) hold for some and . Suppose the prior satisfies (28).
Then, there exists a with as in (29) such that
(75) |
Proof.
We first deal with as follows
(76) |
(77) |
where the last inequality is a consequence of Proposition 7.19. Simplifying further, we get
where the equality in step 4 follows by approximating using Lemma 4 in Elezovic and Giordano [2000] and approximating by Stirling’s formula.
where the last equality follows by approximating using Lemma 4 in Elezovic and Giordano [2000].
Therefore, by Lemma 7.11, .
∎
7.4 Lemmas and Propositions for Theorem 4.4
Lemma 7.25.
For as in (31), let . If , , , then
Proof.
First, by Lemma 4.1 in Pollard [1990],
For , let .
In view of (79) and Theorem 2.7.11 in van der Vaart et al. [1996], we have
for some . Therefore,
Using Lemma 7.12 with , we get
where the last equality holds since , , .
Therefore,
∎
Lemma 7.26.
Let
where , , , . Then with
we have for every
Proof.
Let and .
since for and for . ∎
Proposition 7.27.
Suppose condition (C1) holds with and satisfies (32). Then,
Proof.
Let . Let and for .
Proposition 7.28.
Suppose condition (C1) holds with some . Let be a neural network satisfying assumption (A1) and (A2) for some . With , define,
(80) |
For every , with as in (32), we have
Proof.
This proof uses some ideas from the proof of Theorem 1 in Lee [2000].
By assumption (A1), let satisfy
(81) |
With , define neighborhood as follows
where . Note that .
We next show that,
References
- Bishop [1997] C. M. Bishop, Bayesian Neural Networks, Journal of the Brazilian Computer Society 4 (1997).
- Neal [1992] R. M. Neal, Bayesian training of backpropagation networks by the hybrid monte-carlo method, 1992.
- Lampinen and Vehtari [2001] J. Lampinen, A. Vehtari, Bayesian approach for neural networks–review and case studies, Neural networks : the official journal of the International Neural Network Society 14 3 (2001) 257–74.
- Sun et al. [2017] S. Sun, C. Chen, L. Carin, Learning Structured Weight Uncertainty in Bayesian Neural Networks, in: A. Singh, J. Zhu (Eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, PMLR, Fort Lauderdale, FL, USA, 2017, pp. 1283–1292. URL: http://proceedings.mlr.press/v54/sun17b.html.
- Mullachery et al. [2018] V. Mullachery, A. Khera, A. Husain, Bayesian neural networks, 2018. arXiv:1801.07710.
- Hubin et al. [2018] A. Hubin, G. Storvik, F. Frommlet, Deep bayesian regression models, 2018. arXiv:1806.02160.
- Liang et al. [2018] F. Liang, Q. Li, L. Zhou, Bayesian neural networks for selection of drug sensitive genes, Journal of the American Statistical Association 113 (2018) 955--972.
- Javid et al. [2020] K. Javid, W. Handley, M. P. Hobson, A. Lasenby, Compromise-free bayesian neural networks, ArXiv abs/2004.12211 (2020).
- Lee [2000] H. Lee, Consistency of posterior distributions for neural networks, Neural Networks 13 (2000) 629 -- 642.
- Barron et al. [1999] A. Barron, M. J. Schervish, L. Wasserman, The consistency of posterior distributions in nonparametric problems, Ann. Statist. 27 (1999) 536--561.
- Neal [1996] R. M. Neal, Bayesian Learning for Neural Networks, Springer, New York, 1996. URL: https://books.google.com/books?id=OCenCW9qmp4C.
- Lee [2004] H. K. H. Lee, Bayesian Nonparametrics via Neural Networks, Springer-Verlag, ASA-SIAM Series, 2004. URL: https://books.google.com/books?id=OCenCW9qmp4C.
- Ghosh et al. [2004] M. Ghosh, T. Maiti, D. Kim, S. Chakraborty, A. Tewari, Hierarchical bayesian neural networks, Journal of the American Statistical Association 99 (2004) 601--608.
- Blei et al. [2017] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (2017) 859–877.
- Logsdon et al. [2009] B. A. Logsdon, G. E. Hoffman, J. G. Mezey, A variational bayes algorithm for fast and accurate multiple locus genome-wide association analysis, BMC Bioinformatics 11 (2009) 58 -- 58.
- Graves [2011] A. Graves, Practical variational inference for neural networks, in: J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, Curran Associates, Inc., 2011, pp. 2348--2356. URL: http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.
- Carbonetto and Stephens [2012] P. Carbonetto, M. Stephens, Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies, Bayesian Anal. 7 (2012) 73--108.
- Blundell et al. [2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, 2015. arXiv:1505.05424.
- Sun et al. [2019] S. Sun, G. Zhang, J. Shi, R. Grosse, Functional variational bayesian neural networks, 2019. arXiv:1903.05779.
- Wang and Blei [2019] Y. Wang, D. M. Blei, Frequentist consistency of variational bayes, Journal of the American Statistical Association 114 (2019) 1147--1161.
- Pati et al. [2017] D. Pati, A. Bhattacharya, Y. Yang, On statistical optimality of variational bayes, 2017. arXiv:1712.08983.
- Yang et al. [2017] Y. Yang, D. Pati, A. Bhattacharya, α-variational inference with statistical guarantees, 2017. arXiv:1710.03266.
- Zhang and Gao [2017] F. Zhang, C. Gao, Convergence rates of variational posterior distributions, 2017. arXiv:1712.02519.
- Hornik et al. [1989] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359 -- 366.
- Siegel and Xu [2019] J. W. Siegel, J. Xu, Approximation rates for neural networks with general activation functions, 2019. arXiv:1904.02311.
- Shen [1997] X. Shen, On methods of sieves and penalization, Ann. Statist. 25 (1997) 2555--2591.
- Shen et al. [2019] X. Shen, C. Jiang, L. Sakhanenko, Q. Lu, Asymptotic properties of neural network sieve estimators, 2019. arXiv:1906.00875.
- White [1990] H. White, Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings, Neural Networks 3 (1990) 535 -- 549.
- Scheffe [1947] H. Scheffe, A useful convergence theorem for probability distributions, Ann. Math. Statist. 18 (1947) 434--438.
- Elezovic and Giordano [2000] N. Elezovic, C. Giordano, The best bounds in Gautschi's inequality, Mathematical Inequalities and Applications 3 (2000).
- Wong and Shen [1995] W. H. Wong, X. Shen, Probability inequalities for likelihood ratios and convergence rates of sieve mles, Ann. Statist. 23 (1995) 339--362.
- Pollard [1990] D. Pollard, Empirical Processes: Theory and Applications, Conference Board of the Mathematical Science: NSF-CBMS regional conference series in probability and statistics, Institute of Mathematical Statistics, 1990. URL: https://books.google.com/books?id=Prcsi29EU50C.
- van der Vaart et al. [1996] A. van der Vaart, A. van der Vaart, A. van der Vaart, J. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics, Springer, 1996. URL: https://books.google.com/books?id=OCenCW9qmp4C.