
Stochastic Neural Networks with Infinite Width are Deterministic

Liu Ziyin1, Hanlin Zhang2, Xiangming Meng1, Yuting Lu1, Eric Xing2, Masahito Ueda1
1The University of Tokyo
2Carnegie Mellon University
Abstract

This work theoretically studies stochastic neural networks, a main type of neural network in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to a model regularizes it by introducing an averaging effect. Two common examples to which our theory applies are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps us better understand how stochasticity affects the learning of neural networks and can potentially inform the design of better architectures for practical problems.

1 Introduction

Applications of neural networks have achieved great success in various fields. A major extension of standard neural networks is to make them stochastic, namely, to make the output a random function of the input. In a broad sense, stochastic neural networks include neural networks trained with dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016), Bayesian networks (Mackay, 1992), variational autoencoders (VAE) (Kingma and Welling, 2013), and generative adversarial networks (Goodfellow et al., 2014). In this work, we formulate a rather broad definition of a stochastic neural network in Section 3. There are many reasons why one wants to make a neural network stochastic. Two main reasons are (1) regularization and (2) distribution modeling. Since neural networks with stochastic latent layers are more difficult to train, stochasticity is believed to help regularize the model and prevent memorization of samples (Srivastava et al., 2014). The second reason is easier to understand from the perspective of latent variable models. By making the network stochastic, one implicitly assumes that there exist latent random variables that generate the data through some unknown function. Therefore, by sampling these latent variables, we perform Monte Carlo sampling from the underlying data distribution, which allows us to model that distribution with a neural network. This line of reasoning is often invoked to motivate the VAE and the GAN. Stochastic networks are therefore of both practical and theoretical importance. However, most existing works on stochastic networks are empirical in nature, and almost no theory exists to explain how stochastic networks work from a unified perspective.

In this work, we theoretically study stochastic neural networks. We prove that as the width of an optimized stochastic network increases to infinity, its predictive variance on the training set decreases to zero, a result we believe to have important practical and theoretical implications. Two common examples to which our theory applies are dropout networks and VAEs. See Figure 1 for an illustration of this effect. Along with this proof, we propose a novel theoretical framework and lay out a potential path for future theoretical research on stochastic networks. This framework allows us to abstract away the specific definitions of contemporary neural network architectures and makes our result applicable to a family of functions that includes many common neural networks as a strict subset.

Figure 1: Distribution of the prediction of a trained neural network with dropout. We see that as the hidden width $d$ increases, the spread of the prediction decreases. Left: $d=10$. Middle: $d=50$. Right: $d=500$. See Section 4.1 for a detailed description of the experimental setup.

2 Related Works

For most tasks, the quantity we would like to model is the conditional probability $p_{w}(y|x)$ of the target $y$ for a given input $x$. This implies that good Bayesian inference can only be made if the statistics of $y$ are correctly modeled. For example, stochastic neural networks are expected to have well-calibrated uncertainty estimates, a trait that is highly desirable for safe and reliable practical applications (Wilson and Izmailov, 2020; Gawlikowski et al., 2021; Izmailov et al., 2021). This expectation means that a well-trained stochastic network should have a predictive variance that matches the actual level of randomness in the labeling. However, despite the extensive exploration of empirical techniques, almost no work theoretically studies the capability of a neural network to correctly model the statistics of the data distribution, and our work fills an important gap in this field of research. Two applications we consider in this work are dropout (Srivastava et al., 2014), which can be interpreted as a stochastic technique for approximate Bayesian inference, and the VAE (Kingma and Welling, 2013), which is among the main Bayesian deep learning methods in use.

Theoretically, while a unified approach is lacking, some previous works separately study different stochastic techniques in deep learning. A series of recent works approaches the VAE loss theoretically (Dai and Wipf, 2019). Another line of recent works analyzes linear models trained with the VAE objective to study the commonly encountered mode-collapse problem of VAEs (Lucas et al., 2019; Koehler et al., 2021). In the case of dropout, Gal and Ghahramani (2016) establish the connection between the dropout technique and Bayesian learning. Another series of works extensively studied the dropout technique with linear networks (Cavazza et al., 2018; Mianjy and Arora, 2019; Arora et al., 2020) and showed that dropout effectively controls the rank of the learned solution and approximates a data-dependent $L_{2}$ regularization.

The investigation of neural network behaviour in the infinite-width limit is also relevant to our work (Neal, 1996; Lee et al., 2018; Jacot et al., 2018; Matthews et al., 2018; Lee et al., 2019; Novak et al., 2018; Garriga-Alonso et al., 2018; Allen-Zhu et al., 2018; Khan et al., 2019; Agrawal et al., 2020), which establishes an equivalence between neural networks and Gaussian processes (GPs) (Rasmussen, 2003). Meanwhile, several theoretical studies (Jacot et al., 2018; Lee et al., 2018; Chizat et al., 2018) demonstrated that for infinite-width networks, the distribution of functions induced by gradient descent during training can also be described as a GP with the neural tangent kernel (NTK) (Jacot et al., 2018), which facilitates the study of deep neural networks with theoretical tools from kernel methods. Our work studies the property of the global minimum of an infinite-width network and does not rely on the NNGP or NTK formalism. Moreover, neither line of research is directly relevant for studying trained stochastic networks at the global minimum.

3 Main Result

In this section, we present and discuss our main result. Notation-wise, let $W$ denote a matrix and $W_{j:}$ denote its $j$-th row viewed as a vector. Let $v$ be any vector, and let $v_{j}$ denote its $j$-th element; however, when $v$ involves a complicated expression, we denote its $j$-th element as $[v]_{j}$ for clarity.

3.1 Problem Setting

We first introduce two basic assumptions of the network structure.

Assumption 1.

(Neural networks can be decomposed into Lipschitz-continuous blocks.) Let $f$ be a neural network. We assume that there exist functions $g^{1}$ and $g^{2}$ such that $f=g^{2}\circ g^{1}$, where $\circ$ denotes functional composition. Additionally, both $g^{1}$ and $g^{2}$ are Lipschitz-continuous.

Throughout this work, the component functions $g^{1},\ g^{2}$ of a network $f$ are called blocks, which can be seen as a generalization of layers. It is appropriate to call $g^{1}$ the input block and $g^{2}$ the output block. Because the Lipschitz constant of a neural network can be upper-bounded by the product of the largest singular value of each weight matrix times the Lipschitz constant of the nonlinearity, the assumption that every block is Lipschitz-continuous applies to all existing networks with fixed weights and Lipschitz-continuous activation functions (such as ReLU, tanh, Swish (Ramachandran et al., 2017), and Snake (Ziyin et al., 2020)).
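For concreteness, the following is a minimal sketch (our own illustration in PyTorch, not part of the paper) of how such an upper bound can be computed for a small feedforward block; the architecture and activation choice are arbitrary.

```python
import torch
import torch.nn as nn

def lipschitz_upper_bound(block: nn.Sequential, act_lipschitz: float = 1.0) -> float:
    """Upper bound: product of the spectral norms of the weight matrices and
    the Lipschitz constants of the activations (1 for ReLU and tanh)."""
    bound = 1.0
    for layer in block:
        if isinstance(layer, nn.Linear):
            # largest singular value of the weight matrix
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
        else:
            bound *= act_lipschitz
    return bound

block = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
print(lipschitz_upper_bound(block))
```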

If we restrict ourselves to feedforward architectures, we can discuss the meaning of an "increasing width" without much ambiguity. However, since the definition of blocks (layers) in our work is abstract, it is not immediately clear what it means to "increase the width of a block." The following definition makes it clear that one needs to specify a sequence of blocks to define an increasing width. (Note that this definition of "width" also makes it possible to define different ways of "increasing" the width and is thus more general than the standard procedure of simply increasing the output dimension of the corresponding linear transformation.)

Definition 1.

(Models with an increasing width.) Each block of a neural network $f$ is labeled with two indices $d_{1},d_{2}\in\mathbb{Z}^{+}$. Let $f=g^{2}\circ g^{1}$; we write $g^{i}={}^{d_{2},d_{1}}g^{i}$ if for all $x$, ${}^{d_{2},d_{1}}g^{i}(x)\in\mathbb{R}^{d_{2}}$ and $x\in\mathbb{R}^{d_{1}}$. Moreover, to every block $g$, there corresponds a countable set of blocks $\{^{i,j}g\}_{i,j\in\mathbb{Z}^{+}}$; for a block $g$, its corresponding block set is denoted as $\mathcal{S}(g)=\{^{i,j}g\}_{i,j\in\mathbb{Z}^{+}}$. Also, to every sequence of blocks $g^{1},g^{2},\dots$, there corresponds a sequence of parameter sets $w^{1},w^{2},\dots$ such that $g^{i}=g^{i}_{w^{i}}$ is parametrized by $w^{i}$. The corresponding parameter set of a block $g$ is denoted as $w(g)$.

Note that if $f=g^{2}\circ g^{1}$, $g^{1}={}^{d_{2},d_{1}}g^{1}$, and $g^{2}={}^{d_{4},d_{3}}g^{2}$, then $d_{2}$ must be equal to $d_{3}$; namely, specifying $g^{1}$ constrains the input dimension of the next block. It is appropriate to call $d_{2}$ the width of the block ${}^{d_{2},d_{1}}g$. Since each block is parametrized by its own parameter set, the union of all parameter sets is the parameter set of the neural network $f$: $f=f_{w}$. Since every block comes with its respective indices and is equipped with its own parameter set, we omit specifying the indices and the parameter set when unnecessary. The next assumption specifies what it means to have a larger width.

Assumption 2.

(A model with larger width can express a model with smaller width.) Let $g$ be a block and $\mathcal{S}(g)$ its block set. Each block $g=g_{w}$ in $\mathcal{S}(g)$ is associated with a set of parameters $w$ such that for any pair of functions ${}^{d_{1},d_{2}}g,{}^{d_{1}^{\prime},d_{2}}g^{\prime}\in\mathcal{S}(g)$, any fixed $w$, and any mapping $m$ from $\{1,\dots,d_{1}^{\prime}\}\to\{1,\dots,d_{1}\}$, there exist parameters $w^{\prime}$ such that $[{}^{d_{1},d_{2}}g_{w}(x)]_{m(l)}=[{}^{d_{1}^{\prime},d_{2}}g^{\prime}_{w^{\prime}}(x)]_{l}$ for all $x$ and $l$.

This assumption can be seen as a constraint on the types of block sets $\mathcal{S}(g)$ we can choose. As a concrete example, the following proposition shows that the block set induced by a linear layer with arbitrary input and output dimensions followed by an element-wise nonlinearity satisfies our assumption.

Proposition 1.

Let ${}^{d_{2},d_{1}}g_{W,b}(x)=\sigma(Wx+b)$, where $\sigma$ is an element-wise function, $W\in\mathbb{R}^{d_{2}\times d_{1}}$, and $b\in\mathbb{R}^{d_{2}}$. Then, $\mathcal{S}(g)=\left\{{}^{i,j}g_{W,b}(x)\right\}_{i,j\in\mathbb{Z}^{+}}$ satisfies Assumption 2.

Proof. Consider two functions ${}^{d_{1},d_{2}}g_{\{W,b\}}(x)$ and ${}^{d_{1}^{\prime},d_{2}}g_{\{W^{\prime},b^{\prime}\}}(x)$ in $\mathcal{S}(g)$. Let $m$ be an arbitrary mapping from $\{1,\dots,d_{1}^{\prime}\}\to\{1,\dots,d_{1}\}$. It suffices to show that there exist $W^{\prime}$ and $b^{\prime}$ such that $[^{d_{1},d_{2}}g_{\{W,b\}}(x)]_{m(l)}-[^{d_{1}^{\prime},d_{2}}g_{\{W^{\prime},b^{\prime}\}}(x)]_{l}=0$ for all $x$ and $l$. For a matrix $M$, we use $M_{j:}$ to denote the $j$-th row of $M$. By definition, this condition is equivalent to

\sigma(W_{m(l):}x+b_{m(l)})=\sigma(W^{\prime}_{l:}x+b^{\prime}_{l}), (1)

which is achieved by setting $b^{\prime}_{l}=b_{m(l)}$ and $W^{\prime}_{l:}=W_{m(l):}$, where $W_{m(l):}$ is the $m(l)$-th row of $W$. $\square$
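The construction in this proof can be checked numerically. The following short script (our own sanity check, with arbitrary dimensions and the index convention of the proof, i.e., output dimension first) verifies that copying rows of $W$ and the corresponding entries of $b$ reproduces the required coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d1_prime, d2 = 7, 4, 5                  # output widths (large, small) and input dim
W, b = rng.normal(size=(d1, d2)), rng.normal(size=d1)
m = rng.integers(0, d1, size=d1_prime)      # an arbitrary mapping {1,...,d1'} -> {1,...,d1}

W_prime, b_prime = W[m], b[m]               # copy the m(l)-th rows and bias entries

sigma = np.tanh                             # any element-wise nonlinearity
x = rng.normal(size=d2)
g_large = sigma(W @ x + b)                  # ^{d1,d2}g_{W,b}(x)
g_small = sigma(W_prime @ x + b_prime)      # ^{d1',d2}g_{W',b'}(x)
assert np.allclose(g_large[m], g_small)     # [g(x)]_{m(l)} == [g'(x)]_l for all l
```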

Now, we are ready to define a stochastic neural network.

Definition 2.

(Stochastic Neural Networks) A neural network $f=g^{2}\circ g^{1}$ is said to be a stochastic neural network with stochastic block $g^{1}$ if $g^{1}=g^{1}(x,\epsilon)$ is a function of $x$ and a random vector $\epsilon$, and the corresponding deterministic function $f^{\prime}:=g^{2}\circ h^{1}$ satisfies Assumption 2, where $h^{1}=\mathbb{E}_{\epsilon}\circ g^{1}$.

Namely, a stochastic network becomes a proper neural network when averaged over the noise of the stochastic block. To proceed, we make the following assumption about the randomness in the stochastic layer.

Assumption 3.

(Uncorrelated noise) For a stochastic block $g$, $\text{Cov}[g(x,\epsilon),g(x,\epsilon)]=\Sigma$, where $\Sigma$ is a diagonal matrix and $\Sigma_{ii}<\infty$ for all $i$.

This assumption applies to standard stochastic techniques in deep learning, such as dropout or the reparametrization trick used in approximate Bayesian deep learning. Lastly, we assume the following condition for the architecture.

Assumption 4.

(The stochastic block is followed by a linear transformation.) Let $f=g^{2}\circ g^{1}$ be the stochastic neural network under consideration, and let $g^{1}$ be the stochastic layer. We assume that for all ${}^{i,j}g\in\mathcal{S}(g^{2})$, ${}^{i,j}g_{w}=g^{\prime}_{w^{\prime}}(Wx+b)$ for a fixed function $g^{\prime}:\mathbb{R}^{d}\to\mathbb{R}^{i}$ with parameter set $w^{\prime}$, where $W\in\mathbb{R}^{d\times j}$ and the bias $b\in\mathbb{R}^{d}$ for a fixed integer $d$. In our main result, we further assume that $b=0$ for notational conciseness.

In other words, we assume that the second block $g^{2}$ can always be decomposed as $g^{\prime}\circ M$, where $M$ is an optimizable linear transformation. This is the only architectural assumption we make. In principle, it could be replaced by weaker assumptions; however, we leave this as important future work because Assumption 4 is sufficient for the purpose of this work and is general enough for the applications we consider (such as dropout and the VAE). We also stress that the condition that $g^{2}$ starts with a linear transformation does not mean that the first actual layer of $g^{2}$ is linear. Instead, $M$ can be followed by an arbitrary Lipschitz activation function, as is usual in practice; in our definition, if such an activation exists, it is absorbed into the definition of $g^{\prime}$.

The more restrictive part of Assumption 4 is that the function $g^{\prime}$ has a fixed input dimension (like a "bottleneck"). In practice, when one scales up the model, one often wants to scale up the width of all other layers simultaneously. For the first block, this is allowed by Assumption 2. (For example, if $g^{1}$ is a multilayer perceptron, it is easy to check that Assumption 2 is satisfied if one increases the intermediate layers of $g^{1}$ simultaneously.) We note that this bottleneck assumption is mainly for notational conciseness. In Appendix A.3, we show that one can also extend the result to the case where the input dimension (and the intermediate dimensions) of $g^{\prime}$ also increase as one increases the width of the stochastic layer.

Notation Summary. To summarize, we require a network $f$ to be decomposable into two blocks, $f=g^{2}\circ g^{1}$, where $g^{1}$ is a stochastic block. Each block is associated with its indices, which specify its input and output dimensions, and a parameter set that we optimize over. For example, we can write a block $g$ as $g={}^{d_{2},d_{1}}g^{i}_{w}$ to specify that $g$ is the $i$-th block of a neural network, is a mapping from $\mathbb{R}^{d_{1}}$ to $\mathbb{R}^{d_{2}}$, and has parameters $w$. However, for notational conciseness and better readability, we omit some of these specifications when the context is clear. For the parameters $w$, we slightly abuse notation. Sometimes, we view $w$ as a set and discuss its unions and subsets; for example, let $f_{w}=g^{2}_{w^{2}}\circ g^{1}_{w^{1}}$; then, we say that the parameter set $w$ of $f$ is the union of the parameter sets of $g^{1}$ and $g^{2}$: $w=w^{1}\cup w^{2}$. Alternatively, we also view $w$ as a vector in a subset of the real space so that we can look for the minimizer $w$ in such a space (in expressions such as $\min_{w}L(w)$).

3.2 Convergence without Prior

In this work, we restrict our theoretical result to the MSE loss. Consider an arbitrary training set $\{(x_{i},y_{i})\}_{i=1}^{N}$. When training the deterministic network, we want to find

w_{*}=\arg\min_{w}\sum_{i=1}^{N}[f_{w}(x_{i})-y_{i}]^{2}. (2)

It is convenient to write $[f_{w}(x_{i})-y_{i}]^{2}$ as $L_{i}(w)$.

An overparametrized network can be defined as a network that can achieve zero training loss on such a dataset.

Definition 3.

A neural network $f_{w}$ is said to be overparametrized for a non-negative differentiable loss function $\sum_{i}L_{i}$ if there exists $w_{*}$ such that $\sum_{i}L_{i}(w_{*})=0$. For a stochastic neural network $f$ with stochastic block $g^{1}$, $f$ is said to be overparametrized if $f^{\prime}:=g^{2}\circ\mathbb{E}_{\epsilon}\circ g^{1}$ is overparametrized, where $\mathbb{E}_{\epsilon}$ is the expectation operation that averages over $\epsilon$. (We note that our result can be easily generalized to some cases where zero training loss is not achievable. For example, when there is degeneracy in the data, i.e., when the same $x$ comes with two different labels, zero training loss is not reachable, but our result can be generalized to such a case.)

Namely, for a stochastic network, we say that it is overparametrized if its deterministic part is overparametrized. When there is no degeneracy in the data (if $x_{i}=x_{j}$, then $y_{i}=y_{j}$), zero training loss can be achieved by a wide enough neural network; this definition is thus essentially equivalent to assuming that there is no data degeneracy and is not a strong limitation of our result.

With a stochastic block, the training loss becomes (due to the sampling of the hidden representation)

\mathbb{E}_{\epsilon}\left[\sum_{i}^{N}L_{i}\right]=\sum_{i=1}^{N}\mathbb{E}_{\epsilon}\left[(f_{w}(x_{i},\epsilon)-y_{i})^{2}\right]. (3)

Note that this loss function can still be reduced to $0$ if $f_{w}(x_{i})=y_{i}$ for all $i$ with probability $1$.
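In practice, the expectation over $\epsilon$ in Eq. (3) is approximated by Monte Carlo sampling of the latent noise. The following is a minimal sketch (our own, in PyTorch) of such an estimator; the model `f` is assumed to resample its internal noise on every forward call, as a dropout layer in training mode does.

```python
import torch

def stochastic_mse(f, X, Y, n_samples: int = 32):
    """Monte Carlo estimate of sum_i E_eps[(f(x_i, eps) - y_i)^2] in Eq. (3).
    `f` is assumed to draw fresh latent noise on every forward call."""
    preds = torch.stack([f(X) for _ in range(n_samples)])   # (n_samples, N, ...)
    return ((preds - Y) ** 2).mean(dim=0).sum()
```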

With these definitions at hand, we are ready to state our main result.

Theorem 1.

Let the neural network under consideration satisfy Assumptions 1, 2, 3, and 4, and assume that the loss function is given by Eq. (3). Let $\{{}^{d_{1}}f\}_{d_{1}\in\mathbb{Z}^{+}}$ be a sequence of stochastic networks such that, for fixed integers $d_{2},d_{0}$, ${}^{d_{1}}f={}^{d_{2},d_{1}}g^{2}\circ{}^{d_{1},d_{0}}g^{1}$ with stochastic block ${}^{d_{1},d_{0}}g^{1}_{w(g^{1})}\in\mathcal{S}(g^{1})$. Let ${}^{d_{1}}f$ be overparametrized for all $d_{1}\geq d^{*}$ for some $d^{*}>0$. Let $w_{*}=\arg\min_{w}\sum_{i}^{N}\mathbb{E}_{\epsilon}[L({}^{d_{1}}f_{w}(x_{i},\epsilon),y_{i})]$ be a global minimum of the loss function. Then, for all $x$ in the training set,

\lim_{k\to\infty}\text{Var}_{\epsilon}\left[{}^{kd^{*}}f_{w_{*}}(x,\epsilon)\right]=0. (4)

Proof Sketch. The full proof is given in Appendix Section A.1. In the proof, we denote the term $L({}^{d_{1}}f_{w}(x_{i},\epsilon),y_{i})$ as $L_{i}^{d_{1}}(w)$. Let $w_{*}$ be the global minimizer of $\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w)]$. Then, for any $w$, by definition of the global minimum,

0\leq\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w_{*})\right]\leq\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w)\right]. (5)

If $\lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w)]=0$, we have $\lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w_{*})]=0$, which implies that $\lim_{d_{1}\to\infty}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w_{*})]=0$ for all $j$. By the bias-variance decomposition of the MSE, this, in turn, implies that $\lim_{d_{1}\to\infty}\text{Var}[{}^{d_{1}}f_{w_{*}}(x_{j})]=0$ for all $j$. Therefore, it is sufficient to construct a sequence of $w$ such that $\lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w)]=0$. The rest of the proof shows that, under the architectural assumptions we made, such a network can indeed be constructed. In particular, the architectural assumptions allow us to make independent copies of the output of the stochastic block, and the linear transformation that follows the stochastic block allows us to average over these copies to recover the mean with a vanishing variance, which can then be shown to achieve zero loss. $\square$

Remark.

Note that the parameter set $w$ that we optimize is different for different ${}^{d_{1}}f$; we do not attach an index $d_{1}$ to $w$ in the proof for notational conciseness. In plain words, our main result states that for all $x$ in the training set, an overparametrized stochastic neural network at its global minimum has zero predictive variance in the limit $d_{1}\to\infty$. The condition that the width is a multiple of $d^{*}$ is not essential and is only used to keep the proof concise; one can prove the same result without requiring $d_{1}=kd^{*}$. One might also find the restriction to the MSE loss restrictive. However, it is worth noting that our result is quite strong because it applies to an arbitrary distribution of the latent noise $\epsilon$ whose second moment exists. It is possible to prove a similar result for a broader class of loss functions if we place more restrictions on the distribution of the latent noise (such as the existence of higher moments).

At a high level, one might wonder why the optimized model achieves zero variance. Our results suggest that the form of the loss function may be crucial. The MSE loss can be decomposed into a bias term and a variance term:

\sum_{i}^{N}L_{i}=\text{bias}+\text{variance}. (6)

Minimizing the MSE loss involves minimizing both the bias term and the variance term, and the key step of our proof shows that a neural network with sufficient width can reduce the variance to zero. We thus conjecture that convergence to a zero-variance model may hold for a broad class of loss functions. For example, one possible candidate for this function class is the set of convex loss functions, which favor a mean solution over a solution with variance (by Jensen's inequality); a neural network is then encouraged to converge to such solutions so as to minimize the variance. However, identifying this class of loss functions is beyond the scope of the present work, and we leave it as an important future step. Lastly, we also stress that the main results are not a trivial consequence of the MSE being convex. When the model is linear, it is rather straightforward to show that the variance reduces to zero in the large-width limit because taking the expectation of the model output is equivalent to taking the expectation of the latent noise. However, this is not trivial to prove for a genuine neural network because the network $f$ is, in general, a nonlinear and nonconvex function of the latent noise $\epsilon$.
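The bias-variance decomposition that Eq. (6) refers to can be checked numerically. The following short script (our own illustration with arbitrary numbers) verifies that the Monte Carlo MSE matches the sum of the squared bias and the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.3                                        # a fixed target
f = 0.8 + 0.5 * rng.standard_normal(100_000)   # stochastic predictions at a fixed input
mse = np.mean((f - y) ** 2)
bias_sq = (f.mean() - y) ** 2
var = f.var()
print(mse, bias_sq + var)                      # the two agree up to Monte Carlo error
```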

3.3 Application to Dropout

Definition 4.

A stochastic block $g(x)$ is said to be a $p$-dropout layer if $[g(x)]_{j}=\epsilon_{j}[h(x)]_{j}$, where $h(x)$ is a deterministic block and the $\epsilon_{j}$ are independent random variables such that $\epsilon_{j}=1/p$ with probability $p$ and $\epsilon_{j}=0$ with probability $1-p$.

Since the noise of dropout is independent, one can immediately apply the main theorem and obtain the following corollary.

Corollary 1.

For any $0<p<1$, an optimized stochastic network with an infinite-width $p$-dropout layer has zero variance on the training set.

Our result thus formally proves the intuitive hypothesis in the original dropout paper (Srivastava et al., 2014) that applying dropout during training encourages an averaging effect in the latent space.
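For concreteness, a $p$-dropout layer in the sense of Definition 4 can be sketched as follows (our own illustration in PyTorch). Note that the convention here keeps a unit with probability $p$ and rescales it by $1/p$, whereas `torch.nn.Dropout(p)` drops a unit with probability $p$; the module below therefore corresponds to `torch.nn.Dropout(1 - p)` up to the sampling convention.

```python
import torch

class PDropout(torch.nn.Module):
    """p-dropout in the sense of Definition 4: each unit is kept (and rescaled
    by 1/p) with probability p, and set to zero with probability 1 - p."""
    def __init__(self, p: float):
        super().__init__()
        self.p = p

    def forward(self, h):
        if not self.training:
            return h                  # at test time E[eps_j] = 1, so pass through
        eps = torch.bernoulli(torch.full_like(h, self.p)) / self.p
        return eps * h
```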

3.4 Convergence with a Prior

We now extend our result to the case where a soft prior constraint exists in the loss function. The model setting of this section is equivalent to a special type of Bayesian latent variable model (e.g., a VAE) whose prior strength decreases towards zero as the width of the model increases. (In the context of the VAE, many previous works have noticed that the VAE has problems with its prediction variance (Dai and Wipf, 2019; Wang and Ziyin, 2022); however, none of the previous works discusses the behavior of a VAE as its width tends to infinity.) The result in this section is more general and involved than Theorem 1 because the soft prior constraint matches the latent variable to the prior distribution in addition to the MSE loss. The existence of the prior term regularizes the model and prevents a perfect fitting of the original MSE loss, and the main message of this section is that even in the presence of such a (weak) regularization term, a vanishing variance can still emerge, though relatively slowly.

While it is natural that a latent variable model $f=g^{2}\circ g^{1}$ can be decomposed into two blocks, where $g^{2}$ is the decoder and $g^{1}$ is the encoder, we require one additional condition: the stochastic block ends with a linear transformation layer (which is satisfied by a standard VAE encoder (Kingma and Welling, 2013), for example).

Definition 5.

A stochastic block $g(x)$ is said to be an encoder block if $g(x)=W_{1}h(x)+\epsilon\odot(W_{2}h(x)+b)$, where $\odot$ is the Hadamard product, $W_{1}$ and $W_{2}$ are linear transformations, $b$ is the bias of $W_{2}$, $h(x)$ is a deterministic block, and the components $\epsilon_{j}$ of $\epsilon$ are uncorrelated random variables with zero mean and unit variance.

Note that we explicitly require the weight matrix $W_{2}$ to come with a bias. For the other linear transformations, we have omitted the bias term for notational simplicity. One can check that this definition is consistent with Definition 2; namely, a network with an encoder block is indeed a type of stochastic network. When the training loss consists only of the MSE, it follows as an immediate corollary of Theorem 1 that the variance converges to zero. However, the prior term complicates the problem. The following definition specifies the type of prior loss under consideration.
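For concreteness, the following is a minimal PyTorch sketch (our own illustration) of an encoder block in the sense of Definition 5, using a standard Gaussian $\epsilon$, which has zero mean and unit variance; the deterministic block `h` and the dimensions are left as arguments.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Encoder block in the sense of Definition 5:
    g(x) = W1 h(x) + eps * (W2 h(x) + b), with eps ~ N(0, I)."""
    def __init__(self, h: nn.Module, h_dim: int, latent_dim: int):
        super().__init__()
        self.h = h                                           # deterministic block h(x)
        self.W1 = nn.Linear(h_dim, latent_dim, bias=False)   # mean head W1
        self.W2 = nn.Linear(h_dim, latent_dim, bias=True)    # scale head W2 (with bias b)

    def forward(self, x):
        hx = self.h(x)
        mean = self.W1(hx)
        eps = torch.randn_like(mean)                         # zero mean, unit variance
        return mean + eps * self.W2(hx)                      # reparametrization trick
```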

Definition 6.

(Prior-Regularized Loss.) Let $f=g^{2}\circ g^{1}$, where $g^{1}(x)=W_{1}h(x)+\epsilon\odot(W_{2}h(x)+b)$ is an encoder block. A loss function $\ell$ is said to be a prior-regularized loss function if $\ell=\sum_{i}L_{i}+\ell_{\rm prior}$, where $\sum_{i}L_{i}$ is given by Eq. (3) and $\ell_{\rm prior}=\frac{1}{d_{1}^{\alpha}}\sum_{j}\left[\ell_{\rm mean}([W_{1}h(x_{i})]_{j})+\ell_{\rm var}([W_{2}h(x_{i})+b]_{j})\right]$, where $\alpha>0$, and $\ell_{\rm mean}\geq 0$ and $\ell_{\rm var}\geq 0$ are differentiable functions that are equal to zero if, for all $x_{i}$, $[W_{1}h(x_{i})]_{j}=0$ and $[W_{2}h(x_{i})+b]_{j}=1$.

We have abstracted away the actual details of the definition of the prior loss. For our purpose, it is sufficient to say that the condition $[W_{1}h(x_{i})]_{j}=0$ means that the loss function encourages the posterior to have zero mean, and $[W_{2}h(x_{i})+b]_{j}=1$ encourages unit variance. As an example, one can check that the standard ELBO loss for the VAE satisfies this definition. With this architecture, we prove a similar result. The proof is given in Section A.2.

Theorem 2.

Assume that the neural network under consideration satisfies Assumptions 1, 2, and 4, that the stochastic block is an encoder block satisfying Assumption 3, and that the loss function is a prior-regularized loss with parameter $\alpha>0$. Let $d_{2},d_{0}$ be fixed integers, ${}^{d_{1}}f={}^{d_{2},d_{1}}g^{2}\circ{}^{d_{1},d_{0}}g^{1}$, and let $\{^{d_{1}}f\}_{d_{1}\in\mathbb{Z}^{+}}$ be a sequence of stochastic networks with stochastic block ${}^{d_{1},d_{0}}g^{1}_{w(g^{1})}\in\mathcal{S}(g^{1})$. Let ${}^{d_{1}}f$ be overparametrized for all $d_{1}\geq d^{*}$ for some $d^{*}>0$. Let $w_{*}=\arg\min_{w}\left(\sum_{i}^{N}\mathbb{E}_{\epsilon}[L({}^{d_{1}}f_{w}(x_{i},\epsilon),y_{i})]+\ell_{\rm prior}\right)$ be a global minimum of the prior-regularized loss. Then, for all $x$ in the training set,

\lim_{k\to\infty}\text{Var}_{\epsilon}\left[{}^{kd^{*}}f_{w_{*}}(x,\epsilon)\right]=0. (7)
Remark.

One should be careful when interpreting this result. Strictly speaking, it shows that, conditioned on a fixed input to the encoder, an optimized latent variable model (such as a VAE) outputs a deterministic prediction, independent of the sampling of the hidden representation. This does not mean that generation is deterministic, because it is often assumed that the outputs of the decoder obey another independent distribution, say a Gaussian with a predetermined variance. However, practitioners rarely inject a separate Gaussian noise into the generated data (Doersch, 2021), so our result is relevant for practical situations. Also, the condition $\alpha>0$ implies that $\ell_{\rm prior}$ cannot increase too fast with the hidden width.

One might wonder whether a vanishing prior strength makes this setting trivial. The proof shows that it is far from trivial and that even a prior with vanishing strength can have a strong influence on how fast the variance decays towards zero. The proof suggests that if the prior strength decays as $1/d_{1}^{\alpha}$, the variance should decay roughly as $d_{1}^{-\min(1,\alpha)/2}$. Namely, the smaller the $\alpha$, the slower the convergence towards $0$. Additionally, the fastest exponent at which the variance can decay is $-0.5$, significantly slower than in the case of Theorem 1, where the proof suggests an exponent of $-1$. This means that even a vanishing regularization strength can have a strong impact on the prediction variance of the model. Quantitatively, as $\alpha$ approaches zero, the variance can decay arbitrarily slowly. Qualitatively, our result implies that having a regularization term or not qualitatively changes the nature of the global minima of the stochastic model. Also, we note that our problem setting is formally equivalent to a generalized $\beta$-VAE (Higgins et al., 2016) with $\beta\sim 1/\text{width}^{\alpha}$ (generalized because the MSE is not necessarily a reconstruction loss). While practitioners rarely scale $\beta$ towards zero, we stress that the main value of Theorem 2 is the theoretical insight it offers into the qualitative distinction between having a prior term and not having one.

(a) Feedforward Network with Dropout (Adam).
(b) VAE (Adam).
Figure 2: Scaling of the prediction variance of different models as the width of the stochastic layer tends to infinity. For both the dropout network and the VAE, we see that the prediction variance decreases towards 0 as the width increases. For completeness, the prediction variance over an independently sampled test set is also shown.

4 Numerical Simulations

We perform experiments with nonlinear neural networks to demonstrate the vanishing-variance effect under study. The first part describes the illustrative experiment presented at the beginning of the paper. The second part experimentally demonstrates that a dropout network and a VAE have a vanishing prediction variance on the training set as the width tends to infinity. Additional experiments with weight decay and SGD are presented in the appendix.

4.1 Illustration

In this experiment, we let the target function be $y=\sin(x)+\eta$ for $x$ uniformly sampled from the domain $[-3,3]$. The target is corrupted by a weak noise $\eta\sim\mathcal{N}(0,0.1)$. The model is a feedforward network with tanh activation and $1\to d\to 1$ neurons, where $d\in\{10,50,500\}$, and dropout with probability $p=0.1$ is applied to the hidden layer. See Figure 1.
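A minimal PyTorch sketch of this setup is given below (our own reimplementation; the optimizer settings, number of steps, and exact noise scale are assumptions where the text does not specify them).

```python
import torch
import torch.nn as nn

def run_illustration(d: int, p: float = 0.1, n: int = 1000, steps: int = 3000):
    """Train a 1 -> d -> 1 tanh network with dropout on y = sin(x) + noise and
    return samples of its stochastic prediction at one fixed training input."""
    x = torch.rand(n, 1) * 6 - 3                   # x ~ Uniform[-3, 3]
    y = torch.sin(x) + 0.1 * torch.randn(n, 1)     # weak Gaussian label noise
    model = nn.Sequential(nn.Linear(1, d), nn.Tanh(), nn.Dropout(p), nn.Linear(d, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        ((model(x) - y) ** 2).mean().backward()
        opt.step()
    with torch.no_grad():                          # model stays in train mode,
        preds = torch.stack([model(x[:1]) for _ in range(3000)])   # so dropout is active
    return preds.squeeze()

for d in (10, 50, 500):
    print(d, run_illustration(d).var().item())     # the spread shrinks as d grows
```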

4.2 Dropout

We now systematically explore the prediction variance of a model trained with dropout. As an extension of the previous experiment, we let both the input and the target be vectors, $x,\ y\in\mathbb{R}^{d}$. The target function is $y_{i}=x_{i}^{3}+\eta_{i}$ for $i\in\{1,\dots,d\}$, where $x_{i},\eta_{i}\sim\mathcal{N}(0,1)$ and the noise $\eta\in\mathbb{R}^{d}$ is also a vector. We let $d=20$ and sample $1000$ data point pairs for training. The model is a three-layer MLP with ReLU activation functions and $20\to d_{h}\to 20$ neurons, where $d_{h}$ is the width of the hidden layer. Dropout is applied to the post-activation values of the hidden layer. In the experiments, we set the dropout probability $p$ to $0.1$, and we independently sample outputs $3000$ times to estimate the prediction variance. Training proceeds with Adam for $4500$ steps, with an initial learning rate of $0.01$ that is decreased by a factor of $10$ every $1500$ steps. See Figure 2(a) for a log-log plot of width vs. prediction variance. We see that the prediction variance of the model on the training set decreases towards zero as the width increases, as our theory predicts. For completeness, we also plot the prediction variance for an independently sampled test set; for this task, the prediction variance on the test points agrees well with that on the training set. A linear regression on the slope of the tail of the width-variance curve shows that the variance decreases roughly as $d^{-0.7}$, close to what our proof suggests ($d^{-1}$); we hypothesize that the exponent is slightly smaller than $1$ in magnitude because training is stopped at a finite time and the model has not fully reached the global minimum. (Moreover, with a finite learning rate, SGD is a biased estimator of a minimum (Ziyin et al., 2021a).)
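A condensed sketch of this pipeline, including the log-log slope fit, is given below (our own reimplementation; the set of widths, the reduced number of training steps, and the omission of the learning-rate schedule are simplifications).

```python
import numpy as np
import torch
import torch.nn as nn

d, N, p = 20, 1000, 0.1
X = torch.randn(N, d)
Y = X ** 3 + torch.randn(N, d)                     # y_i = x_i^3 + eta_i

def train_dropout_mlp(width: int, steps: int = 2000) -> nn.Module:
    model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Dropout(p), nn.Linear(width, d))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        ((model(X) - Y) ** 2).mean().backward()
        opt.step()
    return model

def prediction_variance(model: nn.Module, n_samples: int = 1000) -> float:
    model.train()                                  # keep dropout active when predicting
    with torch.no_grad():
        preds = torch.stack([model(X) for _ in range(n_samples)])
    return preds.var(dim=0).mean().item()

widths = [64, 128, 256, 512, 1024]
variances = [prediction_variance(train_dropout_mlp(w)) for w in widths]
slope, _ = np.polyfit(np.log(widths), np.log(variances), deg=1)
print(slope)                                       # the text reports a tail slope of about -0.7
```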

4.3 Variational Autoencoder

In this section, we conduct experiments on a $\beta$-VAE with latent Gaussian noise. The input data $x\in\mathbb{R}^{d}$ is sampled from a standard Gaussian distribution with $d=20$. We generate $100$ data points for training. The VAE employs a standard encoder-decoder architecture with ReLU nonlinearity. The encoder is a two-layer feedforward network with $20\to 32\to 2\times d_{h}$ neurons. The decoder is also a two-layer feedforward network with architecture $d_{h}\to 32\to 20$. Note that our theory requires the prior term $\ell_{\rm prior}$ not to increase with the width; we therefore choose $\beta=0.1/d_{h}$. The training objective is the negative of the standard Evidence Lower Bound (ELBO), composed of the reconstruction error and the KL divergence between the approximate posterior and the standard Gaussian prior. We independently sample outputs $100$ times to estimate the prediction variance. The results in Fig. 2(b) show that the variances on both the training and the test set decrease as the width increases and follow the same pattern.
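A minimal sketch of this $\beta$-VAE is given below (our own reimplementation; the learning rate, the number of training steps, and the plain squared-error form of the reconstruction term are assumptions where the text does not specify them).

```python
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    """Encoder 20 -> 32 -> 2*d_h (mean and log-variance), decoder d_h -> 32 -> 20."""
    def __init__(self, d_h: int, d: int = 20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 2 * d_h))
        self.decoder = nn.Sequential(nn.Linear(d_h, 32), nn.ReLU(), nn.Linear(32, d))

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()      # reparametrization
        return self.decoder(z), mu, log_var

def neg_elbo(model, x, beta):
    recon, mu, log_var = model(x)
    rec = ((recon - x) ** 2).sum(dim=-1).mean()                    # reconstruction error
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1 - log_var).sum(dim=-1).mean()
    return rec + beta * kl                                         # beta-weighted negative ELBO

d_h = 256
model = BetaVAE(d_h)
x_train = torch.randn(100, 20)                                     # training data ~ N(0, I)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5000):
    opt.zero_grad()
    neg_elbo(model, x_train, beta=0.1 / d_h).backward()
    opt.step()
```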

4.4 Experiments with Weight Decay

Weight decay often has the Bayesian interpretation of imposing a normal prior over the weights, and it is sometimes believed to prevent the model from making a deterministic prediction (Gal and Ghahramani, 2016). We therefore also perform the same experiments as in the previous two subsections with a weight decay strength of $\lambda=5\times 10^{-4}$. See Appendix Sec. B.1 for the results. We observe that the experimental results are similar and that applying weight decay is not sufficient to prevent the model from reaching a vanishing variance. Conventionally, we expect the prediction variance of a dropout net to be of order $\lambda$ and not directly dependent on the width (Gal and Ghahramani, 2016); however, this experiment suggests that the width may wield a stronger influence on the prediction variance than the weight decay strength. Since a model cannot reach the actual global minimum when weight decay is applied, this experiment is beyond the applicability of our theory; the result therefore suggests the conjecture that, even at a local minimum, stochastic models may still be highly likely to reach a solution with vanishing variance, and proving or disproving this conjecture is an important future task.

5 Discussion

In this work, we theoretically studied the prediction variance of stochastic neural networks. We showed that, when the loss function satisfies certain conditions and under mild architectural conditions, the prediction variance of a stochastic network on the training set tends to zero as the width of the stochastic layer tends to infinity. Our theory offers a precise mathematical explanation for a frequently quoted anecdotal problem of stochastic networks: neural networks can be so expressive that adding noise to the latent layers is not sufficient to make them model a distribution well (Higgins et al., 2016; Burgess et al., 2018; Dai and Wipf, 2019). Our result also offers a formal justification of the original intuition behind the dropout technique. From a practical point of view, our result suggests that it is generally nontrivial to train a model whose prediction variance matches the true variance of the data. One potential fix is to design a loss function that encourages a nonvanishing prediction variance, which is an interesting future problem. There are two major limitations of the present approach. First, our result applies only to data points in the training set, and it is unclear what we can say about the test set. Even though experiments on nonlinear neural networks suggest that the test variance may also decrease towards zero, it is unclear under what conditions this is the case, and it is an important topic to investigate in the future. The second limitation is that we have only studied the global minimum, and it is important to study whether, or under what conditions, the variance also vanishes at a local minimum.

References

  • Agrawal et al., (2020) Agrawal, D., Papamarkou, T., and Hinkle, J. (2020). Wide neural networks with bottlenecks are deep gaussian processes. Journal of Machine Learning Research, 21(175).
  • Allen-Zhu et al., (2018) Allen-Zhu, Z., Li, Y., and Liang, Y. (2018). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.
  • Arora et al., (2020) Arora, R., Bartlett, P., Mianjy, P., and Srebro, N. (2020). Dropout: Explicit forms and capacity control.
  • Burgess et al., (2018) Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. (2018). Understanding disentangling in β\beta-vae. arXiv preprint arXiv:1804.03599.
  • Cavazza et al., (2018) Cavazza, J., Morerio, P., Haeffele, B., Lane, C., Murino, V., and Vidal, R. (2018). Dropout as a low-rank regularizer for matrix factorization. In International Conference on Artificial Intelligence and Statistics, pages 435–444. PMLR.
  • Chizat et al., (2018) Chizat, L., Oyallon, E., and Bach, F. (2018). On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956.
  • Dai and Wipf, (2019) Dai, B. and Wipf, D. (2019). Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789.
  • Doersch, (2021) Doersch, C. (2021). Tutorial on variational autoencoders.
  • Gal and Ghahramani, (2016) Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR.
  • Garriga-Alonso et al., (2018) Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. (2018). Deep convolutional networks as shallow gaussian processes. arXiv preprint arXiv:1808.05587.
  • Gawlikowski et al., (2021) Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R., et al. (2021). A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342.
  • Goodfellow et al., (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
  • Higgins et al., (2016) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2016). beta-vae: Learning basic visual concepts with a constrained variational framework.
  • Izmailov et al., (2021) Izmailov, P., Vikram, S., Hoffman, M. D., and Wilson, A. G. (2021). What are bayesian neural network posteriors really like? arXiv preprint arXiv:2104.14421.
  • Jacot et al., (2018) Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572.
  • Khan et al., (2019) Khan, M. E. E., Immer, A., Abedi, E., and Korzepa, M. (2019). Approximate inference turns deep networks into gaussian processes. Advances in Neural Information Processing Systems, 32.
  • Kingma and Welling, (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Koehler et al., (2021) Koehler, F., Mehta, V., Risteski, A., and Zhou, C. (2021). Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias. arXiv preprint arXiv:2112.06868.
  • Lee et al., (2018) Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2018). Deep neural networks as gaussian processes. In International Conference on Learning Representations.
  • Lee et al., (2019) Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32:8572–8583.
  • Liu et al., (2021) Liu, K., Ziyin, L., and Ueda, M. (2021). Noise and fluctuation of finite learning rate stochastic gradient descent.
  • Lucas et al., (2019) Lucas, J., Tucker, G., Grosse, R., and Norouzi, M. (2019). Don’t blame the elbo! a linear vae perspective on posterior collapse.
  • Mackay, (1992) Mackay, D. J. C. (1992). Bayesian methods for adaptive models. PhD thesis, California Institute of Technology.
  • Matthews et al., (2018) Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.
  • Mianjy and Arora, (2019) Mianjy, P. and Arora, R. (2019). On dropout and nuclear norm regularization. In International Conference on Machine Learning, pages 4575–4584. PMLR.
  • Neal, (1996) Neal, R. (1996). Bayesian learning for neural networks. Lecture Notes in Statistics.
  • Novak et al., (2018) Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. (2018). Bayesian deep convolutional networks with many channels are gaussian processes. arXiv preprint arXiv:1810.05148.
  • Ramachandran et al., (2017) Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions.
  • Rasmussen, (2003) Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer school on machine learning, pages 63–71. Springer.
  • Srivastava et al., (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.
  • Wang and Ziyin, (2022) Wang, Z. and Ziyin, L. (2022). Posterior collapse of a linear latent variable model. arXiv preprint arXiv:2205.04009.
  • Wilson and Izmailov, (2020) Wilson, A. G. and Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791.
  • Ziyin et al., (2020) Ziyin, L., Hartwig, T., and Ueda, M. (2020). Neural networks fail to learn periodic functions and how to fix it. arXiv preprint arXiv:2006.08195.
  • (34) Ziyin, L., Li, B., Simon, J. B., and Ueda, M. (2021a). Sgd may never escape saddle points.
  • (35) Ziyin, L., Liu, K., Mori, T., and Ueda, M. (2021b). Strength of minibatch noise in sgd. arXiv preprint arXiv:2102.05375.

Appendix A Proofs and Theoretical Concerns

A.1 Proof of Theorem 1

Proof. In the proof, we denote the term $L({}^{d_{1}}f_{w}(x_{i},\epsilon),y_{i})$ as $L_{i}^{d_{1}}(w)$. Let $w_{*}$ be the global minimizer of $\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w)]$. Then, for any $w$, we have by definition

0\leq\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w_{*})\right]\leq\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w)\right]. (8)

If $\lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w)]=0$, we have $\lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w_{*})]=0$, which implies that $\lim_{d_{1}\to\infty}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w_{*})]=0$ for all $j$. By the bias-variance decomposition of the MSE, this, in turn, implies that $\lim_{d_{1}\to\infty}\text{Var}[{}^{d_{1}}f_{w_{*}}(x_{j})]=0$ for all $j$. Therefore, it is sufficient to construct a sequence of $w$ such that $\lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w)]=0$.

Now, we construct such a $w$. Let ${}^{d_{1}}f^{\prime}$ denote the deterministic counterpart of ${}^{d_{1}}f$. By definition and by Assumption 1, we have

{}^{d_{1}}f(x) ={}^{d_{2},d_{1}}g^{2}_{w^{2}}\circ{}^{d_{1},d_{0}}g^{1}_{w^{1}}(x); (9)
{}^{d_{1}}f^{\prime}(x) ={}^{d_{2},d_{1}}g^{2}_{w^{2}}\circ\mathbb{E}_{\epsilon}\circ{}^{d_{1},d_{0}}g^{1}_{w^{1}}(x). (10)

By the architecture assumption (Assumption 4), there exists a function $h_{v}$ parametrized by a parameter set $v$ and a linear transformation $M$ such that we can further decompose the two networks as

f(x) =h_{v}\circ M\circ g^{1}_{w_{1}}(x); (11)
f^{\prime}(x) =h_{v}\circ M\circ\mathbb{E}_{\epsilon}\circ g^{1}_{w_{1}}(x), (12)

where $M\in\mathbb{R}^{d\times d_{1}}$ for a fixed integer $d$ is a linear transformation belonging to the parameter set of $g^{2}$. Note that, by definition, the parameter set of the $g^{2}$ block is $w^{2}=v\cup M$.

Let $u_{*}$ be a global minimum of the training loss of ${}^{d^{*}}f^{\prime}$:

u_{*}:=(v_{*},M_{*},w^{1}_{*})=\arg\min_{v,M,w^{1}}\sum_{j}^{N}L({}^{d^{*}}f^{\prime}_{w}(x_{j}),y_{j}), (13)

and, by the assumption of overparametrization, we also have ${}^{d^{*}}f^{\prime}_{u_{*}}(x_{j})=y_{j}$ for all $j$.

We now specify the parameters of $f_{w}$ for $k>1$. By Assumption 2, we can find parameters $w_{1}^{\prime}$ such that $\mathbb{E}[{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{j}=\mathbb{E}[{}^{d^{*},d_{0}}g^{1}_{w^{1}_{*}}(x)]_{j\bmod d^{*}}$ for $j=1,\dots,kd^{*}$. Namely, we choose the parameters such that the expected output of the stochastic block consists of $k$ identical copies of the expected output of the overparametrized deterministic model with width $d^{*}$.

Since $M$ is a linear transformation, one can factorize it as a product of two matrices, $M=AG$, where $A\in\mathbb{R}^{d\times d^{*}}$ and $G\in\mathbb{R}^{d^{*}\times d_{1}}$:

f(x) =h_{v}\circ A\circ G\circ g^{1}_{w_{1}}(x); (14)
f^{\prime}(x) =h_{v}\circ A\circ G\circ\mathbb{E}_{\epsilon}\circ g^{1}_{w_{1}}(x). (15)

Now, note that by definition, the function $h_{v}\circ A$ for any ${}^{kd^{*}}f$ coincides with the $g^{2}$ block of ${}^{d^{*}}f^{\prime}$. Namely, $h_{v}\circ A={}^{d_{2},d^{*}}g^{2}_{w^{2}}$ with $w^{2}=v\cup A$, and we let $v=v_{*}$ and $A=M_{*}$.

Now, the last step is to specify $G$. We let $G^{*}_{ij}=\frac{1}{k}\delta_{i,j\bmod d^{*}}$, where $\delta_{i,j}=1$ if $i=j$ and $0$ otherwise. Namely, $G^{*}$ is nothing but an averaging matrix that sums the copies and rescales by a factor of $1/k$.

To summarize, our specification defines the following stochastic neural network:

{}^{kd^{*}}f(x)=h_{v_{*}}\circ M_{*}\circ G^{*}\circ g^{1}_{w_{1}^{\prime}}(x). (16)

By the definition of $G^{*}$ and $w_{1}^{\prime}$, $\mathbb{E}_{\epsilon}[G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]=\mathbb{E}_{\epsilon}[{}^{d^{*},d_{0}}g^{1}_{w^{1}_{*}}(x)]$, and the $[G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{j}$ are uncorrelated for different $j$. Moreover, because each coordinate $[{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{i}$ has variance $\Sigma_{ii}<\infty$ by assumption, $[G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{j}$ has variance of order $1/k$.

Now, as $k\to\infty$,

G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)\to_{L^{2}}\mathbb{E}_{\epsilon}[G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)], (17)

where $\to_{L^{2}}$ denotes convergence in mean square. Because $h_{v}\circ A$ is a Lipschitz-continuous function by Assumption 1,

h_{v_{*}}\circ M_{*}\circ G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)\to_{L^{2}}h_{v_{*}}\circ M_{*}\circ\mathbb{E}_{\epsilon}\circ{}^{d^{*},d_{0}}g^{1}_{w^{1}_{*}}(x)\quad\text{as }k\to\infty. (18)

This implies that the expectation of the constructed model with increasing width converges to that of the overparametrized deterministic model (convergence in mean square implies convergence in mean). Therefore, defining our model as ${}^{d_{1}}f^{*}=h_{v_{*}}\circ M_{*}\circ G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}$ with $d_{1}=kd^{*}$, we obtain

\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i})]\to{}^{d^{*}}f^{\prime}(x_{i}). (19)

Therefore, we have, by the bias-variance decomposition for MSE:

\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i},\epsilon)-y_{i}]^{2}=\left[\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i},\epsilon)]-y_{i}\right]^{2}+\text{Var}[{}^{d_{1}}f^{*}(x_{i},\epsilon)]. (20)

Both terms converge to $0$ for all $i$, and so their sum converges to $0$. This finishes the proof. $\square$
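The averaging step at the heart of this construction can be illustrated numerically. The following short script (our own illustration with Gaussian noise and arbitrary dimensions) builds the matrix $G^{*}$ and verifies that averaging $k$ uncorrelated copies preserves the mean while shrinking the per-coordinate variance by a factor of roughly $1/k$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_star, k = 4, 50
mean = rng.normal(size=d_star)                       # E[g^1(x)] for a fixed input x

# Stochastic block output of width k * d_star: k noisy copies of `mean`.
samples = np.tile(mean, k) + rng.normal(size=(100_000, k * d_star))

# G*_{ij} = (1/k) * delta_{i, j mod d*}: sums the copies and rescales by 1/k.
G = np.zeros((d_star, k * d_star))
for j in range(k * d_star):
    G[j % d_star, j] = 1.0 / k

averaged = samples @ G.T
print(averaged.var(axis=0))            # close to 1/k, while each copy has unit variance
print(averaged.mean(axis=0) - mean)    # the mean is preserved
```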

A.2 Proof of Theorem 2

Before the proof, we first comment that the proof is quite similar to the previous case. The difference lies in how we construct the model so as to reduce the training loss to zero as the width increases, now in the presence of the prior term.

Proof. In the proof, we denote the term $L({}^{d_{1}}f_{w}(x_{j},\epsilon),y_{j})$ as $L_{j}^{d_{1}}(w)$ and the full prior-regularized loss as $\ell^{d_{1}}(w)=\sum_{j}^{N}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w)]+\ell_{\rm prior}$. Let $w_{*}$ be the global minimizer of $\ell^{d_{1}}(w)$. Then, for any $w$, we have by definition

0\leq\ell^{d_{1}}(w_{*})\leq\ell^{d_{1}}(w). (21)

If $\lim_{d_{1}\to\infty}\ell^{d_{1}}(w)=0$, we have $\lim_{d_{1}\to\infty}\ell^{d_{1}}(w_{*})=0$, which implies that $\lim_{d_{1}\to\infty}\mathbb{E}_{\epsilon}[L_{j}^{d_{1}}(w_{*})]=0$ for all $j$ because both the reconstruction term and the prior term are non-negative. By the bias-variance decomposition of the MSE, this, in turn, implies that $\lim_{d_{1}\to\infty}\text{Var}[{}^{d_{1}}f_{w_{*}}(x_{j})]=0$ for all $j$. Therefore, it is sufficient to construct a sequence of $w$ such that $\lim_{d_{1}\to\infty}\ell^{d_{1}}(w)=0$.

Let ${}^{d_{1}}f^{\prime}$ denote the deterministic counterpart of ${}^{d_{1}}f$, and let $u_{*}$ be a global minimum of the training loss of ${}^{d_{1}}f^{\prime}$. By the definition of a neural network, we can write

{}^{d_{1}}f(x) ={}^{d_{2},d_{1}}g^{2}_{w^{2}}\circ{}^{d_{1},d_{0}}g^{1}_{w^{1}}(x); (22)
{}^{d_{1}}f^{\prime}(x) ={}^{d_{2},d_{1}}g^{2}_{w^{2}}\circ\mathbb{E}_{\epsilon}\circ{}^{d_{1},d_{0}}g^{1}_{w^{1}}(x). (23)

By the assumption of overparametrization, ${}^{d_{1}}f^{\prime}_{u_{*}}(x_{j})=y_{j}$ for all $j$.

By the architecture assumption, there exists a function $h_{v}$ parametrized by a parameter set $v$ such that we can further decompose the two networks as

f(x) =h_{v}\circ M\circ Z\circ g_{w_{1}}(x); (24)
f^{\prime}(x) =h_{v}\circ M\circ\mathbb{E}_{\epsilon}\circ Z\circ g_{w_{1}}(x), (25)

where $M\in\mathbb{R}^{d\times d_{1}}$ for a fixed integer $d$ is a linear transformation belonging to the parameter set of $g^{2}$, and $Z(x)=T_{1}x+\epsilon\odot(T_{2}x+b)$ is the linear stochastic layer. Now, the parameter set of the network $f$ is $w=v\cup M\cup T_{1}\cup T_{2}\cup b\cup w_{1}$.

By definition, the loss takes the form

\sum_{i}^{N}L_{i}+\frac{1}{d_{1}^{\alpha}}\sum_{j}^{d_{1}}\ell_{\rm mean}^{(j)}+\frac{1}{d_{1}^{\alpha}}\sum_{j}^{d_{1}}\ell_{\rm var}^{(j)}. (26)

We first set $T_{2}=0$ and $b=\mathbf{1}$ (the all-ones vector), which immediately minimizes the variance part of the loss: $\ell_{\rm var}^{(j)}=0$ for all $j$. By assumption, for $d_{1}\geq d^{*}$, $f^{\prime}(x)$ is overparametrized, and one can find $(v_{*},M_{*},w^{1}_{*},T_{1}^{*})=\arg\min_{v,M,w^{1},T_{1}}\sum_{j}^{N}L^{d^{*}}_{j}(w)$ such that $\sum_{i}^{N}L_{i}=0$.

For $k>1$, we let $[T_{1}^{\prime}]_{j:}=a[T_{1}^{*}]_{(j\bmod d^{*}):}$ for a positive scalar $a$. Namely, we copy the rows of the matrix $T_{1}^{*}$ so that the expected output of the stochastic block consists of $k$ identical copies of the output of the overparametrized deterministic model with width $d^{*}$, rescaled by a factor of $a$. (This factor of $a$ is one crucial difference from the previous proof; it will be crucial for reducing $\ell_{\rm mean}$ to $0$.)

Since $M$ is a linear transformation, one can factorize it as a product of two matrices, $M=AG$, where $A\in\mathbb{R}^{d\times d^{*}}$ and $G\in\mathbb{R}^{d^{*}\times d_{1}}$:

f(x) =h_{v}\circ A\circ G\circ g^{1}_{w_{1}}(x); (27)
f^{\prime}(x) =h_{v}\circ A\circ G\circ\mathbb{E}_{\epsilon}\circ g^{1}_{w_{1}}(x). (28)

Now, by definition, the function $h_{v}\circ A={}^{d_{2},d^{*}}g^{2}_{w^{2}}$ with $w^{2}=v\cup A$, and we let $v=v_{*}$ and $A=M_{*}/a$. Again, notice that we have rescaled the matrix by a factor of $1/a$.

Now, the last step is to specify $G$. We let $G^{*}_{ij}=\frac{1}{k}\delta_{i,j\bmod d^{*}}$, where $\delta_{i,j}=1$ if $i=j$ and $0$ otherwise. Namely, $G^{*}$ sums the copies and rescales by a factor of $1/k$; this transformation has an averaging effect.

To summarize, our specification defines the following stochastic neural network:

f(x)=h_{v_{*}}\circ A_{*}\circ G^{*}\circ g^{1}_{w_{1}^{\prime}}(x). (29)

By definition of G^{*} and w_{1}^{\prime}, \mathbb{E}_{\epsilon}[G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]=a\,\mathbb{E}_{\epsilon}[{}^{d^{*},d_{0}}g^{1}_{w_{1}^{*}}(x)], and the coordinates [G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{j} are independent for different j. Moreover, because [{}^{d^{*},d_{0}}g^{1}_{w_{1}^{*}}(x)]_{j} has variance \Sigma_{jj} by assumption, [G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{j} has variance \Sigma_{jj}/(ak). We let a=k^{-\gamma}, where 1>\gamma>1-\alpha; therefore, [G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{j} has variance \Sigma_{jj}/k^{1-\gamma}, which vanishes as k increases.

At the same time, because \ell_{\rm mean}^{(j)} is a differentiable function of [T_{1}x]_{j},

\frac{1}{(kd^{*})^{\alpha}}\sum_{j}^{kd^{*}}\ell_{\rm mean}^{(j)}\sim\frac{1}{(kd^{*})^{\alpha}}\sum_{j}^{kd^{*}}k^{-\gamma}+O(k^{-2\gamma}) (30)
=k^{1-\gamma-\alpha}(d^{*})^{1-\alpha} (31)
\to 0, (32)

where the last line follows from the condition \gamma>1-\alpha, which holds by assumption. (With this construction, the variance part of the loss scales as k^{-(1-\gamma)} and the prior part of the loss scales as k^{-(\gamma+\alpha-1)}; the sum of the two terms is minimized when 1-\gamma=\gamma+\alpha-1, i.e., \gamma=1-\alpha/2.)
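Spelling out this balancing argument as a short calculation (a sketch under the stated scalings, treating \Sigma_{jj} and d^{*} as constants):

\begin{align*}
\text{variance part of the loss:}\quad & \propto k^{-(1-\gamma)},\\
\text{prior part of the loss:}\quad & \propto (d^{*})^{1-\alpha}\,k^{-(\gamma+\alpha-1)},\\
\text{both exponents are positive iff}\quad & 1-\alpha<\gamma<1,\\
\text{the rates balance when}\quad & 1-\gamma=\gamma+\alpha-1\;\Longleftrightarrow\;\gamma=1-\tfrac{\alpha}{2}.
\end{align*}

With \gamma=1-\alpha/2, both terms decay at the common rate k^{-\alpha/2}.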

Now, as k\to\infty,

G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)\to_{L^{2}}\mathbb{E}_{\epsilon}[G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)], (33)

where \to_{L^{2}} denotes convergence in mean square. Because h_{v}\circ A is a Lipschitz continuous function,

h_{v_{*}}\circ A_{*}\circ G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)\to_{L^{2}}h_{v_{*}}\circ M_{*}\circ\mathbb{E}_{\epsilon}\circ{}^{d^{*},d_{0}}g^{1}_{w_{1}^{*}}(x)\quad\text{as }k\to\infty. (34)

This implies that the expectation of the constructed model with increasing width converges to that of the overparametrized deterministic model (convergence in mean square implies convergence in mean). Therefore, defining our model as {}^{d_{1}}f^{*}=h_{v_{*}}\circ A_{*}\circ G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}, we have

\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i})]\to{}^{d^{*}}f^{\prime}(x_{i}). (35)

By the bias-variance decomposition for the MSE, we then have

\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i},\epsilon)-y_{i}]^{2}=\left[\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i},\epsilon)]-y_{i}\right]^{2}+\text{Var}[{}^{d_{1}}f^{*}(x_{i},\epsilon)], (36)

which converges to 0. This finishes the proof. \square
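As a quick numerical sanity check of the bias-variance identity in Eq. (36), the following toy example uses made-up numbers and is not tied to any model in the paper.

import numpy as np

rng = np.random.default_rng(0)
y = 1.0
f = 1.3 + 0.5 * rng.standard_normal(100000)   # random scalar predictions

lhs = np.mean((f - y) ** 2)                   # E[(f - y)^2]
rhs = (f.mean() - y) ** 2 + f.var()           # bias^2 + variance
print(lhs, rhs)                               # equal up to Monte-Carlo error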

A.3 Removing the Bottleneck Constraint for g^{2}

In this section, we prove a version of Theorem 1 that demonstrates how one can remove the bottleneck constraint on g^{2}. A similarly generalized version of Theorem 2 can be proved by following the same steps, so we leave that as an exercise for the reader. To begin, we first need an extended version of Assumption 2, stated below as Assumption 5.

Assumption 5.

(A model with larger width can express a model with smaller width II.) Additionally, let x^{\prime} denote a subset of the coordinates of x (namely, {\rm dim}(x^{\prime})\leq{\rm dim}(x) and for all i\in[1,{\rm dim}(x^{\prime})], there exists j such that x_{j}=x_{i}^{\prime}). Let g be a block and \mathcal{S}(g) its block set. Each block g=g_{w} in \mathcal{S}(g) is associated with a set of parameters w such that for any pair of functions {}^{d_{1},{\rm dim}(x)}g,{}^{d_{1}^{\prime},{\rm dim}(x^{\prime})}g^{\prime}\in\mathcal{S}(g), any fixed w^{\prime}, and any mapping m from \{1,...,d_{1}^{\prime}\} to \{1,...,d_{1}\}, there exist parameters w such that {}^{d_{1},{\rm dim}(x)}g_{w}(x)={}^{d_{1}^{\prime},{\rm dim}(x^{\prime})}g^{\prime}_{w^{\prime}}(x^{\prime}) for all x.

The original Assumption 2 only specifies what it means to have a larger output dimension. This extended version additionally specifies what it means for a block to have a larger input dimension. We note that this additional condition is quite general and is satisfied by the usual structures, such as a fully connected layer. Like the original Assumption 2, this assumption agrees with the standard intuitive understanding of what it means to have a larger width.
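For instance, a fully connected layer satisfies this property: a layer with more input and output units can reproduce a narrower layer by reading only the coordinates of x^{\prime} and copying rows through the map m. The following numpy sketch illustrates this; placing x^{\prime} in the leading coordinates is an assumption made for simplicity.

import numpy as np

def embed_fc(W_small, b_small, dim_x, d1, m):
    """Embed a small fully connected layer (W_small, b_small) into a wider one.

    Row j of the wide layer copies row m(j) of the small layer and ignores
    the input coordinates beyond dim(x'), so the wide layer reproduces the
    small one on the subset x'.
    """
    d1_small, dim_x_small = W_small.shape
    W = np.zeros((d1, dim_x))
    b = np.zeros(d1)
    for j in range(d1):
        W[j, :dim_x_small] = W_small[m(j)]
        b[j] = b_small[m(j)]
    return W, b

rng = np.random.default_rng(0)
W_s, b_s = rng.standard_normal((2, 3)), rng.standard_normal(2)
W, b = embed_fc(W_s, b_s, dim_x=5, d1=6, m=lambda j: j % 2)

x = rng.standard_normal(5)
x_prime = x[:3]                      # x' is a subset of the coordinates of x
assert np.allclose((W @ x + b)[:2], W_s @ x_prime + b_s)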

With this additional assumption, we can remove the bottleneck requirement in the original Assumption 4. Formally, we now require the following weak condition for the architecture.

Assumption 6.

(g^{2} can be further decomposed into two blocks.) Let f=g^{2}\circ g^{1} be the stochastic neural network under consideration, and let g^{1} be the stochastic layer. We assume that every {}^{i,j}g\in\mathcal{S}(g^{2}) can be written as {}^{i,j}g_{w}(x)=g^{\prime}_{w^{\prime}}(Wx) for a block g^{\prime} with block set \mathcal{S}(g^{\prime}), where W is a linear transformation with the standard block set (see Proposition 1).

This generalized assumption effectively means that the model can be decomposed into three blocks:

f(x)={}^{d_{2},D}g^{\prime}\circ{}^{D,d_{1}}W\circ{}^{d_{1},d_{0}}g^{1}(x). (37)

With these extended assumptions, we can prove a more general version of Theorem 1. In comparison with the original Theorem 1, this theorem effectively allows one to simultaneously increase the width of all the layers that g^{1} and g^{2} implicitly contain.

Theorem 3.

Let the neural network under consideration satisfy Assumptions 1, 2, 3, 5, and 6, and assume that the loss function is given by Eq. (3). Let \{{}^{d_{1}}f\}_{d_{1}\in\mathbb{Z}^{+}} be a sequence of stochastic networks such that, for fixed integers d_{2} and d_{0}, {}^{d_{1}}f={}^{d_{2},d_{1}}g^{2}\circ{}^{d_{1},d_{0}}g^{1} with stochastic block {}^{d_{1},d_{0}}g^{1}_{w(g_{1})}\in\mathcal{S}(g^{1}). Additionally, let D=D(d_{1}) be a monotonically increasing function of d_{1} such that, for {}^{d_{1}}f={}^{d_{2},d_{1}}g^{2}\circ{}^{d_{1},d_{0}}g^{1},

{}^{d_{2},d_{1}}g^{2}={}^{d_{2},D(d_{1})}g^{\prime}\circ{}^{D(d_{1}),d_{1}}W. (38)

Let {}^{d_{1}}f be overparametrized for all d_{1}\geq d^{*} for some d^{*}>0. Let w_{*}=\arg\min_{w}\sum_{i}^{N}\mathbb{E}[L({}^{d_{1}}f_{w}(x_{i},\epsilon),y_{i})] be a global minimum of the loss function. Then, for all x in the training set,

\lim_{k\to\infty}\text{Var}_{\epsilon}\left[{}^{kd^{*}}f_{w_{*}}(x,\epsilon)\right]=0. (39)

Proof. In the proof, we denote the term L({}^{d_{1}}f_{w}(x_{i},\epsilon),y_{i}) by L_{i}^{d_{1}}(w). Let w_{*} be the global minimizer of \sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w)\right]. Then, for any w, we have by definition

0\leq\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w_{*})\right]\leq\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w)\right]. (40)

If \lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w)\right]=0, we have \lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w_{*})\right]=0, which implies that \lim_{d_{1}\to\infty}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w_{*})\right]=0 for all j. By the bias-variance decomposition of the MSE, this, in turn, implies that \lim_{d_{1}\to\infty}\text{Var}[{}^{d_{1}}f_{w_{*}}(x_{j})]=0 for all j. Therefore, it is sufficient to construct a sequence of w such that \lim_{d_{1}\to\infty}\sum_{j}^{N}\mathbb{E}_{\epsilon}\left[L_{j}^{d_{1}}(w)\right]=0.

Now, we construct such a w. Let {}^{d_{1}}f^{\prime} denote the deterministic counterpart of {}^{d_{1}}f. By definition and by Assumption 1, we have

{}^{d_{1}}f(x)={}^{d_{2},d_{1}}g^{2}_{w^{2}}\circ{}^{d_{1},d_{0}}g^{1}_{w^{1}}(x); (41)
{}^{d_{1}}f^{\prime}(x)={}^{d_{2},d_{1}}g^{2}_{w^{2}}\circ\mathbb{E}_{\epsilon}\circ{}^{d_{1},d_{0}}g^{1}_{w^{1}}(x). (42)

By the architecture assumption (Assumption 6), there exists a function h_{v} parametrized by a parameter set v and a linear transformation M such that we can further decompose the two neural networks as

f(x)=h_{v}\circ M\circ g^{1}_{w_{1}}(x); (43)
f^{\prime}(x)=h_{v}\circ M\circ\mathbb{E}\circ g^{1}_{w_{1}}(x), (44)

where M\in\mathbb{R}^{D(d_{1})\times d_{1}} is a linear transformation belonging to the parameter set of g^{2}. Note that, by definition, the parameter set of the g^{2} block is w_{2}=v\cup M.

Let u_{*} be a global minimum of the width-d^{*} deterministic model {}^{d^{*}}f^{\prime}:

u_{*}:=(v_{*},M_{*},w^{1}_{*})=\arg\min_{v,M,w_{1}}\sum_{i}^{N}L({}^{d^{*}}f^{\prime}_{w}(x_{i}),y_{i}), (45)

and, by the assumption of overparametrization, we also have {}^{d^{*}}f^{\prime}_{u_{*}}(x_{j})=y_{j} for all j.

We now specify the parameters of f_{w} for k>1. By Assumption 2, we can find parameters w_{1}^{\prime} such that \mathbb{E}[{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)_{j}]=\mathbb{E}[{}^{d^{*},d_{0}}g^{1}_{w_{*}^{1}}(x)]_{j\bmod d^{*}} for j=1,...,kd^{*}. Namely, we choose the parameters such that the expected output of the stochastic block consists of k identical copies of the expected output of the overparametrized deterministic model with width d^{*}.

Since M is a linear transformation, we factorize it as a product of two matrices, M=AG, where A\in\mathbb{R}^{D(d_{1})\times d^{*}} and G\in\mathbb{R}^{d^{*}\times d_{1}}:

f(x)=h_{v}\circ A\circ G\circ g^{1}_{w_{1}}(x); (46)
f^{\prime}(x)=h_{v}\circ A\circ G\circ\mathbb{E}\circ g^{1}_{w_{1}}(x). (47)

Now, note that by Assumption 5, for any x and any subset x^{\prime} of its coordinates, there exists v^{\prime} such that h_{v^{\prime}}(x)=h^{\prime}_{v_{*}}(x^{\prime}), where h^{\prime} is the corresponding block of {}^{d^{*}}f^{\prime}. For A, we let its first D(d^{*}) rows be the same as M_{*} and set the remaining rows to 0:

A=\begin{pmatrix}M_{*}\\ 0\end{pmatrix}. (48)

With this choice, it follows that for any x\in\mathbb{R}^{d^{*}}, h_{v^{\prime}}\circ A(x)=g^{2}(x), where g^{2} is the corresponding block of {}^{d^{*}}f^{\prime} evaluated at u_{*}.
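The zero-padding construction above can be illustrated with a small numpy sketch; the wider block below is a toy stand-in that, consistent with Assumption 5, simply ignores the padded coordinates.

import numpy as np

rng = np.random.default_rng(0)
d_star, D_small, D_big = 3, 5, 9

M_star = rng.standard_normal((D_small, d_star))
# A = (M*; 0): pad M* with zero rows so the extra output coordinates are 0.
A = np.vstack([M_star, np.zeros((D_big - D_small, d_star))])

h_small = lambda z: np.tanh(z).sum()      # stands in for h'_{v_*}
h_big = lambda z: h_small(z[:D_small])    # a wider block reading only the first D_small inputs

z = rng.standard_normal(d_star)
assert np.isclose(h_big(A @ z), h_small(M_star @ z))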

Now, the last step is to specify G. We let G^{*}_{ij}=\frac{1}{k}\delta_{i,(j\bmod d^{*})}, where \delta_{i,j}=1 if i=j and 0 otherwise. Namely, G^{*} is nothing but an averaging matrix that sums the k copies corresponding to each coordinate and rescales them by a factor of 1/k.

To summarize, our specification defines the following stochastic neural network:

{}^{kd^{*}}f(x)=h_{v_{*}}\circ M_{*}\circ G^{*}\circ g^{1}_{w_{1}^{\prime}}(x). (49)

By definition of G^{*} and w_{1}^{\prime}, \mathbb{E}_{\epsilon}[G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]=\mathbb{E}_{\epsilon}[{}^{d^{*},d_{0}}g^{1}_{w_{*}^{1}}(x)], and the coordinates [G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{j} are independent for different j. Moreover, because [{}^{d^{*},d_{0}}g^{1}_{w_{*}^{1}}(x)]_{j} has variance \Sigma_{jj} by assumption, [G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)]_{j} has variance \Sigma_{jj}/k.

Now, as k\to\infty,

G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)\to_{L^{2}}\mathbb{E}_{\epsilon}[G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)], (50)

where \to_{L^{2}} denotes convergence in mean square. Because h_{v}\circ A is a Lipschitz continuous function by Assumption 1,

h_{v_{*}}\circ M_{*}\circ G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}(x)\to_{L^{2}}h_{v_{*}}\circ M_{*}\circ\mathbb{E}_{\epsilon}\circ{}^{d^{*},d_{0}}g^{1}_{w_{1}^{*}}(x)\quad\text{as }k\to\infty. (51)

This implies that the expectation of the constructed model with increasing width converges to that of the overparametrized deterministic model (convergence in mean square implies convergence in mean). Therefore, defining our model as {}^{d_{1}}f^{*}=h_{v_{*}}\circ M_{*}\circ G^{*}\circ{}^{kd^{*},d_{0}}g^{1}_{w_{1}^{\prime}}, we obtain

\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i})]\to{}^{d^{*}}f^{\prime}(x_{i}). (52)

Therefore, we have, by the bias-variance decomposition for MSE:

\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i},\epsilon)-y_{i}]^{2}=\left[\mathbb{E}_{\epsilon}[{}^{d_{1}}f^{*}(x_{i},\epsilon)]-y_{i}\right]^{2}+\text{Var}[{}^{d_{1}}f^{*}(x_{i},\epsilon)]. (53)

Both terms converge to 0 for all i, and so their sum converges to 0. This finishes the proof. \square

Appendix B Additional Experiments

B.1 Weight Decay

This experiment is described in the main text. See Figure 3. We see that even with weight decay, the prediction variance drops towards zero unhindered.

Figure 3: Variance vs. the width of NNs using weight decay. (a) Dropout. (b) VAE.

B.2 Training with SGD

Since our result only depends on the global minimum of the loss function, one expects the prediction variance to also decrease under a different optimization procedure. In this section, we perform the same experiment with SGD. See Figure 4. We see that, for dropout, the result is similar to the case with Adam. For VAE, the result is a little more subtle in the tail, where the decrease in variance slows down. We hypothesize that this is because SGD fluctuates more and becomes less stable as the width of the hidden layer increases (Liu et al., 2021; Ziyin et al., 2021b), which causes the prediction variance to increase and partially offsets the averaging effect.

Figure 4: Empirical scaling of the prediction variance of different models as the width of the stochastic layer extends to infinity. (a) Feedforward network with dropout (SGD). (b) VAE (SGD). For both the dropout network and the VAE, the prediction variance decreases towards 0 as the width increases.