Generalization and Memorization: The Bias Potential Model

Hongkang Yang (hongkang@princeton.edu), Program in Applied and Computational Mathematics, Princeton University; Weinan E (weinan@math.princeton.edu), Program in Applied and Computational Mathematics and Department of Mathematics, Princeton University
Abstract

Models for learning probability distributions, such as generative models and density estimators, behave quite differently from models for learning functions. One example is found in the memorization phenomenon, namely the ultimate convergence to the empirical distribution, that occurs in generative adversarial networks (GANs). For this reason, the issue of generalization is more subtle than for supervised learning. For the bias potential model, we show that dimension-independent generalization accuracy is achievable if early stopping is adopted, even though, in the long term, the model either memorizes the samples or diverges.

Keywords: probability distribution, machine learning, generalization error, curse of dimensionality, early stopping.

1 Introduction

Distribution learning models such as GANs have achieved immense popularity owing to their empirical success in learning complex high-dimensional probability distributions, and they have found diverse applications such as generating images [7] and paintings [21], writing articles [8], composing music [31], editing photos [5], designing new drugs [33] and new materials [29], generating spin configurations [52] and modeling quantum gases [11], to name a few.

As a mathematical problem, distribution learning is much less understood. Arguably, the most fundamental question is the generalization ability of these models. One puzzling issue is the following.

  0. Generalization vs. memorization:

    Let $Q_*$ be the target distribution, and $Q_*^{(n)}$ the empirical distribution associated with $n$ sample points. Let $Q(f)$ be the probability distribution generated by some machine learning model parametrized by the function $f$ in some hypothesis space $\mathcal{F}$. It has been argued, for example in the case of GAN, that as training proceeds, one has [23]

    \lim_{t\rightarrow\infty}Q(f(t)) = Q_*^{(n)}    (1)

    where $f(t)$ is the parameter we obtain at training step $t$. We refer to (1) as the "memorization phenomenon". When it happens, the model learned does not give us anything other than the samples we already have.

    This is in sharp contrast to supervised learning, where models are typically trained till interpolation and can generalize well to unseen data both in practice [51] and in theory [20].

Despite this, these distribution-learning models perform surprisingly well in practice, coming close to the unseen target $Q_*$ and allowing us to generate new samples. This counterintuitive behavior calls for a closer examination of their training dynamics, beyond the statement (1).

There are many other mysteries for distribution learning, and we list a few below.

  1. Curse of dimensionality:

    The superb performance of these models (e.g. on generating high-resolution, lifelike and diverse images [7, 14, 45]) indicates that they can approximate the target $Q_*$ with satisfactorily small error. Yet, in theory, this should not be possible, because estimating a general distribution in $\mathbb{R}^d$ with error $\leq\epsilon$ requires $n=\epsilon^{-\Omega(d)}$ samples (discussed below), which becomes astronomical for real-world tasks. For instance, BigGAN [7] was trained on the ILSVRC dataset [39] with $\leq 10^7$ images of resolution $512\times 512$, but the theoretical sample size should be like $\gg 10^{512\times 512}$.

    Of course, for restricted distribution families like the Gaussians, the sample complexity is only $n=\text{poly}(d)$. Yet, one is really interested in complex distributions, such as the distribution of facial images, that a priori do not belong to any known family, so these tasks require the models to possess not only a dimension-independent sample complexity but also the universal approximation property.

  2. The fragility of the training process:

    It is well-known that distribution-learning models like GANs and VAE (variational autoencoder) are difficult to train. They are especially vulnerable to issues like mode collapse [12, 40], instability and oscillation [34], and vanishing gradient [1]. The current treatment is to find by trial-and-error a delicate combination of the right architectures and hyper-parameters [34]. The need to understand these issues calls for a mathematical treatment.

This paper offers a partial answer to these questions. We focus on the bias potential model, an expressive distribution-learning model that is relatively transparent, and uncover the mechanisms for its generalization ability.

Specifically, we establish a dimension-independent a priori generalization error estimate with early stopping. With appropriate function spaces $f\in\mathcal{F}$, the training process consists of two regimes:

  • First, by implicit regularization, the training trajectory $Q(f(t))$ comes very close to the unseen target $Q_*$, and this is when early stopping should be performed.

  • Afterwards, $Q(f(t))$ either converges to the sample distribution $Q_*^{(n)}$ or it diverges.

This paper is structured as follows. In Section 2, we introduce the bias potential model and pose it as a continuous calculus of variations problem. Section 3 analyzes the training behavior of this model and presents this paper’s main results on generalization error and memorization. Section 4 presents some numerical examples. Section 5 contains all the proofs. Section 6 concludes this paper with remarks on future directions.

Notation: denote vectors by bold letters $\mathbf{x}$. Let $C(K)$ be the space of continuous functions over some subset $K\subseteq\mathbb{R}^d$, equipped with the supremum norm. Let $\mathcal{P}(K),\mathcal{P}_{ac}(K),\mathcal{P}_2(K)$ be the space of probability measures over $K$, the subset of absolutely continuous measures, and the subset of measures with finite second moments, respectively. Denote the support of a distribution $Q\in\mathcal{P}(K)$ by $\text{sprt}\,Q$. Let $W_2$ be the Wasserstein metric over $\mathcal{P}_2(K)$.

1.1 Related works

  • Generalization ability: Among distribution-learning models, GANs have attracted the most attention and their generalization ability has been discussed in [2, 53, 3, 24] from the perspective of the neural network-based distances. For trained models, dimension-independent generalization error estimates have been obtained only for certain restricted models, such as GANs whose generators are linear maps or one-layer networks [49, 27, 22].

  • Curse of dimensionality (CoD): If the sampling error is measured by the Wasserstein metric $W_2$, then for any absolutely continuous $Q_*$ and any $\delta>0$, it always holds that [48]

    W_2(Q_*^{(n)},Q_*) \gtrsim n^{-\frac{1}{d-\delta}}

    To achieve an error of $\epsilon$, the required sample size is $n=\epsilon^{-\Omega(d)}$.

    If the sampling error is measured by KL divergence, then $KL(Q_*\|Q_*^{(n)})=\infty$ since $Q_*^{(n)}$ is singular. If kernel smoothing is applied to $Q_*^{(n)}$, it is known that the error scales like $O(n^{-\frac{4}{d+4}})$ [47] (technically, the norm used in [47] is the $L^2$ difference between densities, but one should expect CoD to likewise be present for KL divergence).

  • Continuous perspective: [19, 16] provide a framework to study supervised learning as continuous calculus of variations problems, with emphasis on the role of the function representation, e.g. continuous neural networks [18]. In particular, the function representation largely determines the trainability [13, 38] and generalization ability [17, 15, 20] of a supervised-learning model. This framework can be applied to studying distribution learning in general, and we use it to analyze the bias potential model.

  • Exponential family:

    The density function of the bias potential model is an instance of an exponential family. These distributions have long been applied to density estimation [4, 10] with theoretical guarantees [50, 43]. Yet, existing theories focus only on black-box estimators, rather than the training process. It has also been popular to adopt a mixture of exponential distributions [26, 25, 36], but mixtures will not be covered in this paper.

2 Bias Potential Model

This section introduces the bias potential model, a simple distribution-learning model proposed by [46, 6] and also known as “variationally enhanced sampling”.

To pose a supervised learning model as a calculus of variations problem, one needs to consider four factors: function representation, training objective, training rule, and discretization [19]. For distribution learning, there is the additional factor of distribution representation, namely how probability distributions are represented through functions. These are general issues for any distribution learning model. For future reference, we go through these components in some detail.

2.1 Distribution representation

The bias potential model adopts the following representation:

Q=1ZeVP,Z=𝔼P[eV]Q=\frac{1}{Z}e^{-V}P,\quad Z=\mathbb{E}_{P}[e^{-V}] (2)

where $V$ is some potential function and $P$ is some base distribution. This representation commonly appears in statistical mechanics as the Boltzmann distribution. It is suitable for density estimation, and can also be applied to generative modeling via sampling techniques like MCMC, Langevin diffusion [37], hit-and-run [28], etc.

Typically the partition function $Z$ can be ignored, since it is not involved in the training objectives or in most of the sampling algorithms.
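For concreteness, the following is a minimal NumPy sketch of this representation. Only the unnormalized density $e^{-V}$ is ever evaluated: a random-walk Metropolis chain (one of the sampling options mentioned above; the step size, chain length, and the callable `V` below are illustrative assumptions) uses only density ratios, so $Z$ indeed never appears.

```python
import numpy as np

def unnormalized_density(x, V):
    """Unnormalized density e^{-V(x)} relative to the base P;
    the partition function Z is never computed."""
    return np.exp(-V(x))

def metropolis_sample(V, d, n_steps=10_000, step=0.2, seed=None):
    """Draw one approximate sample from Q = e^{-V} P / Z with P uniform
    on [-1,1]^d, via random-walk Metropolis; only density ratios are used."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=d)
    for _ in range(n_steps):
        y = x + step * rng.normal(size=d)
        if np.all(np.abs(y) <= 1.0):           # proposal must stay in sprt P
            # log acceptance ratio: log e^{-V(y)} - log e^{-V(x)} = V(x) - V(y)
            if np.log(rng.uniform()) < V(x) - V(y):
                x = y
    return x
```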

2.2 Training objective

Since the representation (2) is defined by a density function, it is natural to define a density-based training objective. Given a target distribution $Q_*$, one convenient choice is the backward KL divergence

KL(Q_*\|Q) = \mathbb{E}_{Q_*}[\log Q_*-\log P] + \mathbb{E}_{Q_*}[V] + \log\mathbb{E}_P[e^{-V}]

An alternative way introduced in [46] is to define the “biased distribution”

P_* = \frac{e^{V}Q_*}{\mathbb{E}_{Q_*}[e^{V}]}

so that $Q=Q_*$ iff $P=P_*$. Then, we can define an objective by the forward KL

KL(P\|P_*) = \mathbb{E}_P[\log P-\log Q_*] - \mathbb{E}_P[V] + \log\mathbb{E}_{Q_*}[e^{V}]

Removing constant terms, we obtain the following objectives

L^-(V) := \mathbb{E}_{Q_*}[V] + \log\mathbb{E}_P[e^{-V}]
L^+(V) := -\mathbb{E}_P[V] + \log\mathbb{E}_{Q_*}[e^{V}]    (3)

Both objectives are convex in $V$ (Lemma 5.2). Suppose $Q_*$ can be written as (2) with potential $V_*$; then $Q=Q_*$ iff $V=V_*+c$ for some constant $c$ iff $L^+(V)$ or $L^-(V)$ attains its minimum, so we have a unique global minimizer up to constants. Otherwise, if $Q_*$ does not have the form (2), then the minimizer does not exist.

In practice, when $Q_*$ is available only through its samples $Q_*^{(n)}$, we simply substitute every expectation $\mathbb{E}_{Q_*}$ in (3) by $\mathbb{E}_{Q_*^{(n)}}$.
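As an illustration, this empirical objective can be evaluated directly from samples. The sketch below assumes, besides the sample set from $Q_*^{(n)}$, an auxiliary set of Monte-Carlo samples from the base distribution $P$ used to estimate the log-partition term (an implementation convenience, not part of the model itself):

```python
import numpy as np

def empirical_loss_minus(V, x_target, x_base):
    """L^{(n)-}(V) = E_{Q*^(n)}[V] + log E_P[e^{-V}].
    x_target: (n,d) samples of Q*^(n); x_base: (M,d) samples of P.
    V maps a (k,d) array to a (k,) array of potential values."""
    term1 = np.mean(V(x_target))
    vals = -V(x_base)
    # log-sum-exp for a numerically stable log E_P[e^{-V}]
    term2 = vals.max() + np.log(np.mean(np.exp(vals - vals.max())))
    return term1 + term2
```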

2.3 Function representation

A good function representation (or function space $\mathcal{F}$) should have two conflicting properties:

  1. $\mathcal{F}$ is expressive, so that the distributions generated by $f\in\mathcal{F}$ satisfy the universal approximation property.

  2. $\mathcal{F}$ has small complexity, so that the generalization gap is small.

One approach is to adopt an integral transform-based representation [19],

V(\mathbf{x}) = \mathbb{E}_{\rho(\theta)}\big[\phi(\mathbf{x};\theta)\big]

for some feature function $\phi(\cdot;\theta)$ and parameter distribution $\rho$. Then, $V$ can be approximated with Monte-Carlo rate by

V_m(\mathbf{x}) = \frac{1}{m}\sum_{j=1}^{m}\phi(\mathbf{x};\theta_j)    (4)

where $\{\theta_j\}$ are i.i.d. samples from $\rho(\theta)$.
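As a minimal sketch of the discretization (4), specialized to the feature $\phi(\mathbf{x};(a,\mathbf{w},b))=a\,\sigma(\mathbf{w}\cdot\mathbf{x}+b)$ with ReLU $\sigma$ used later in Section 3 (the Gaussian feature sampling below is purely illustrative):

```python
import numpy as np

def make_potential(theta):
    """V_m(x) = (1/m) sum_j a_j ReLU(w_j . x + b_j), as in (4)."""
    a, W, b = theta                 # shapes: (m,), (m,d), (m,)
    def V(x):                       # x: (k,d) -> (k,)
        pre = x @ W.T + b           # (k,m) pre-activations w_j . x + b_j
        return np.mean(a * np.maximum(pre, 0.0), axis=1)
    return V

# usage: m i.i.d. features from an illustrative Gaussian rho
m, d = 500, 2
rng = np.random.default_rng(0)
V_m = make_potential((rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)))
```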

Let us consider function representations built from neural networks:

  • 2-layer neural networks: Define the continuous 2-layer network by

    V(\mathbf{x}) = \mathbb{E}_{\rho(a,\mathbf{w},b)}\big[a\,\sigma(\mathbf{w}\cdot\mathbf{x}+b)\big]    (5)

    with an activation function $\sigma:\mathbb{R}\to\mathbb{R}$ and weights $\mathbf{w}\in\mathbb{R}^d$, $a,b\in\mathbb{R}$. The natural functional norm is the Barron norm [18]:

    \|V\|_{\mathcal{B}} := \inf_{\rho}\|\rho\|_P,\quad \|\rho\|_P^2 := \mathbb{E}_{\rho(a,\mathbf{w},b)}\big[a^2(\|\mathbf{w}\|^2+b^2)\big]    (6)

    where $\rho$ ranges over all parameter distributions that satisfy (5), and $\|\rho\|_P$ is known as the path norm.

  • Random feature model: Rewrite (5) as

    V(\mathbf{x}) = \mathbb{E}_{\rho_0(\mathbf{w},b)}\big[a(\mathbf{w},b)\,\sigma(\mathbf{w}\cdot\mathbf{x}+b)\big]    (7)

    with fixed parameter distribution $\rho_0(\mathbf{w},b)$ and

    a(\mathbf{w},b) := \frac{d\int a\,d\rho(a,\mathbf{w},b)}{d\int\rho(a,\mathbf{w},b)\,da}

    The natural functional norm is the RKHS (reproducing kernel Hilbert space) norm [18, 35]:

    \|V\|_{\mathcal{H}}^2 := \mathbb{E}_{\rho_0}\big[a(\mathbf{w},b)^2\big] = \|a\|_{L^2(\rho_0)}^2    (8)

    It corresponds to the RKHS $\mathcal{H}$ induced by the kernel

    k(\mathbf{x},\mathbf{x}') = \mathbb{E}_{\rho_0(\mathbf{w},b)}[\sigma(\mathbf{w}\cdot\mathbf{x}+b)\,\sigma(\mathbf{w}\cdot\mathbf{x}'+b)]

It is straightforward to establish the universal approximation theorem for these two representations, and we provide such results below. Denote by $\mathcal{P}_{ac}(K)\cap C(K)$ the distributions over $K$ with continuous density functions, and by $\|\cdot\|_{TV}$ the total variation distance, which is equivalent to the $L^1$ norm when restricted to $\mathcal{P}_{ac}(\mathbb{R}^d)$.

Proposition 2.1 (Universal approximation).

Let $K\subseteq\mathbb{R}^d$ be any compact set with positive Lebesgue measure, let $P$ be the uniform distribution over $K$, and let $\mathcal{V}$ be any class of functions that is dense in $C(K)$. Then, the class of probability distributions (2) generated by $V\in\mathcal{V}$ and $P$ is

  • dense in $\mathcal{P}(K)$ under the Wasserstein metric $W_p$ ($1\leq p<\infty$),

  • dense in $\mathcal{P}_{ac}(K)$ under the total variation norm $\|\cdot\|_{TV}$,

  • dense in $\mathcal{P}_{ac}(K)\cap C(K)$ under KL divergence.

Given Assumption 5.1, this result applies if $\mathcal{V}$ is the Barron space $\{\|V\|_{\mathcal{B}}<\infty\}$ or the RKHS $\{\|V\|_{\mathcal{H}}<\infty\}$.

The Monte-Carlo approximation (4) suggests that these continuous models can be approximated efficiently by finite neural networks. Specifically, we can establish the following a priori error estimates:

Proposition 2.2 (Efficient approximation).

Suppose that the base distribution $P$ is compactly supported in a ball $B_R(0)$, and the activation function $\sigma$ is Lipschitz with $\sigma(0)=0$. Given $\|V\|_{\mathcal{B}}<\infty$, for every $m\in\mathbb{N}$, there exists a finite 2-layer network $V_m$ with $m$ neurons that satisfies

KL(Q\|Q_m) \leq \frac{\|V\|_{\mathcal{B}}}{\sqrt{m}}\cdot 2\sqrt{3}\,\|\sigma\|_{Lip}\sqrt{R^2+1}
\|V_m\|_{\mathcal{B}} \leq \sqrt{2}\,\|V\|_{\mathcal{B}}

where $Q_m$ is the distribution generated by $V_m$. Similarly, assume that the fixed parameter distribution $\rho_0$ in (7) is compactly supported in a ball $B_r(0)$; then, given $\|V\|_{\mathcal{H}}<\infty$, for every $m$ there exists $V_m$ such that

KL(Q\|Q_m) \leq \frac{\|V\|_{\mathcal{H}}}{\sqrt{m}}\cdot 2\sqrt{3}\,\|\sigma\|_{Lip}\sqrt{R^2+1}\,r
\|V_m\|_{\mathcal{H}} \leq \sqrt{2}\,\|V\|_{\mathcal{H}}

2.4 Training rule

We consider the simplest training rule, the gradient flow.

For continuous function representations, there are generally two kinds of flows:

  • Non-conservative gradient flow:

    For the random feature model (7), we can train the function $a(\mathbf{w},b)$ using its variational gradient

    \partial_t a(\mathbf{w},b) = -\frac{\delta L}{\delta a}(\mathbf{w},b)

    Specifically, for the training objectives $L^{\pm}(V)$ in (3), the corresponding flows are defined by

    \frac{d}{dt}a = -\frac{\delta L^+}{\delta a} = -\mathbb{E}_{P_*-P}[\sigma(\mathbf{w}\cdot\mathbf{x}+b)]
    \frac{d}{dt}a = -\frac{\delta L^-}{\delta a} = -\mathbb{E}_{Q_*-Q}[\sigma(\mathbf{w}\cdot\mathbf{x}+b)]
  • Conservative gradient flow:

    For the 2-layer neural network (5), we train the parameter distribution $\rho(a,\mathbf{w},b)$. Its gradient flow is constrained by the conservation of local mass and obeys the continuity equation (in the weak sense):

    \partial_t\rho - \nabla\cdot\Big(\rho\,\nabla\frac{\delta L}{\delta\rho}\Big) = 0    (9)

    With the objectives (3), the gradient fields are given by

    \nabla\frac{\delta L^+}{\delta\rho} = \nabla_{(a,\mathbf{w},b)}\,\mathbb{E}_{P_*-P}\big[a\,\sigma(\mathbf{w}\cdot\mathbf{x}+b)\big] = \mathbb{E}_{P_*-P}\begin{bmatrix}\sigma(\mathbf{w}\cdot\mathbf{x}+b)\\ a\,\sigma'(\mathbf{w}\cdot\mathbf{x}+b)\,\mathbf{x}\\ a\,\sigma'(\mathbf{w}\cdot\mathbf{x}+b)\end{bmatrix}
    \nabla\frac{\delta L^-}{\delta\rho} = \nabla_{(a,\mathbf{w},b)}\,\mathbb{E}_{Q_*-Q}\big[a\,\sigma(\mathbf{w}\cdot\mathbf{x}+b)\big] = \mathbb{E}_{Q_*-Q}\begin{bmatrix}\sigma(\mathbf{w}\cdot\mathbf{x}+b)\\ a\,\sigma'(\mathbf{w}\cdot\mathbf{x}+b)\,\mathbf{x}\\ a\,\sigma'(\mathbf{w}\cdot\mathbf{x}+b)\end{bmatrix}    (10)

2.5 Discretization

So far we have only discussed the continuous formulation of distribution learning models. In practice, we implement these continuous models using discretized versions, with the hope that the discretized models inherit these properties up to a controllable discretization error.

Let us focus on the discretization in the parameter space, and in particular the most popular "particle discretization", since this is the analog of Monte-Carlo for dynamic problems. Consider the parameter distribution $\rho(a,\mathbf{w},b)$ of the 2-layer net (5) and its approximation by the empirical distribution

\rho^{(m)} = \frac{1}{m}\sum_{j=1}^{m}\delta_{(a_j,\mathbf{w}_j,b_j)}

where the particles $\{(a_j,\mathbf{w}_j,b_j)\}$ are i.i.d. samples of $\rho$. The potential function represented by this empirical distribution is given by

V_m(\mathbf{x}) = \mathbb{E}_{\rho^{(m)}}\big[a\,\sigma(\mathbf{w}\cdot\mathbf{x}+b)\big] = \frac{1}{m}\sum_{j=1}^{m} a_j\,\sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j)

Suppose we train $\rho^{(m)}$ by the conservative gradient flow (9, 10) with the objective $L^-$. The continuity equation (9) implies that, for any smooth test function $f(a,\mathbf{w},b)$, we have

\frac{d}{dt}\int f\,d\rho^{(m)} = -\int\nabla f\cdot\nabla\frac{\delta L}{\delta\rho}\,d\rho^{(m)} = -\frac{1}{m}\sum_{j=1}^{m}\nabla f(a_j,\mathbf{w}_j,b_j)^T\cdot\nabla\frac{\delta L}{\delta\rho}(a_j,\mathbf{w}_j,b_j)

Meanwhile, we also have

\frac{d}{dt}\int f\,d\rho^{(m)} = \frac{1}{m}\sum_{j=1}^{m}\nabla f(a_j,\mathbf{w}_j,b_j)^T\cdot\frac{d}{dt}\begin{bmatrix}a_j\\ \mathbf{w}_j\\ b_j\end{bmatrix}

Thus we have recovered the gradient flow for finite scaled 2-layer networks:

\frac{d}{dt}\begin{bmatrix}a_j\\ \mathbf{w}_j\\ b_j\end{bmatrix} = -\nabla\frac{\delta L^-(V_m)}{\delta\rho}(a_j,\mathbf{w}_j,b_j) = -m\cdot\frac{\partial L^-(V_m)}{\partial(a_j,\mathbf{w}_j,b_j)} = -\mathbb{E}_{Q_*-Q}\begin{bmatrix}\sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j)\\ a_j\,\sigma'(\mathbf{w}_j\cdot\mathbf{x}+b_j)\,\mathbf{x}\\ a_j\,\sigma'(\mathbf{w}_j\cdot\mathbf{x}+b_j)\end{bmatrix}

This example shows that the particle discretization of continuous 2-layer networks (5) leads to the same result as the mean-field modeling of 2-layer nets [30, 38].
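To make the correspondence concrete, here is a sketch of one forward-Euler step of this finite-network flow for $L^-$. Expectations under $Q_*$ are replaced by averages over data samples; expectations under $Q$ are estimated by self-normalized weights $e^{-V_m}$ on samples from $P$, which is one convenient estimator among several (all function and variable names below are illustrative):

```python
import numpy as np

def grad_step(theta, x_data, x_base, lr=0.1):
    """One Euler step of the particle gradient flow for L^-:
    E_{Q*} ~ average over x_data, E_Q ~ e^{-V_m}-weighted average over x_base ~ P."""
    a, W, b = theta                           # shapes: (m,), (m,d), (m,)
    def features(x):
        pre = x @ W.T + b                     # (k,m)
        return np.maximum(pre, 0.0), (pre > 0.0).astype(float)  # sigma, sigma'
    act_b, dact_b = features(x_base)
    w = np.exp(-act_b @ a / len(a))           # e^{-V_m}; normalizing cancels Z
    w /= w.sum()
    act_d, dact_d = features(x_data)
    e_sig = act_d.mean(0) - w @ act_b                             # E_{Q*-Q}[sigma_j]
    e_dsig_x = dact_d.T @ x_data / len(x_data) - (dact_b * w[:, None]).T @ x_base
    e_dsig = dact_d.mean(0) - w @ dact_b                          # E_{Q*-Q}[sigma'_j]
    # velocity field from the display above, applied per particle j
    return (a - lr * e_sig,
            W - lr * a[:, None] * e_dsig_x,
            b - lr * a * e_dsig)
```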

3 Training Dynamics

This section studies the training behavior of the bias potential model and presents the main result of this paper, on the relation between generalization and memorization: When trained on a finite sample set,

  • With early stopping, the model achieves a dimension-independent generalization error rate.

  • As $t\to\infty$, the model necessarily memorizes the samples unless it diverges.

3.1 Trainability

We begin with the training dynamics on the population loss. First, we consider the random feature model (7) and establish global convergence:

Proposition 3.1 (Trainability).

Suppose that the target distribution $Q_*$ is generated by a potential $V_*$ ($\|V_*\|_{\mathcal{H}}<\infty$). Suppose that our distribution $Q_t$ is generated by potential $V_t$ with parameter function $a_t$ trained by gradient flow on either of the objectives (3). Then,

L^{\pm}(V_t) - L^{\pm}(V_*) \leq \frac{\|V_*-V_0\|_{\mathcal{H}}^2}{2t}

Next, for 2-layer neural networks, we show that whenever the conservative gradient flow converges, it must converge to the global minimizer. In particular, it will not be trapped at bad local minima and thus avoids mode collapse. This result is analogous to the global optimality guarantees for supervised learning and regression problems [13, 38].

Proposition 3.2.

Assume that the distribution $Q_t$ is generated by potential $V_t$, a 2-layer network with parameter distribution $\rho_t$ trained by gradient flow on either of the objectives (3), and that Assumption 5.2 holds. If the flow $\rho_t$ converges in the $W_1$ metric (or any $W_p$, $1\leq p\leq\infty$) to some $\rho_\infty$ as $t\to\infty$, then $\rho_\infty$ is a global minimizer of $L^{\pm}$: letting $V_\infty$ be the corresponding 2-layer network, we have

Q_* = Q_\infty = \frac{e^{-V_\infty}P}{\mathbb{E}_P[e^{-V_\infty}]}

3.2 Generalization ability

Now we consider the most important issue for the model, the generalization error, and prove that a dimension-independent a priori error rate is achievable within a convenient early-stopping time interval.

We study the training dynamics on the empirical loss. For convenience, we make the following assumptions:

  • Let the base distribution $P$ in (2) be supported on $[-1,1]^d$ (the $l^\infty$ ball). Without loss of generality, we use the $l^\infty$ norm on $[-1,1]^d$.

  • Let the objective $L$ be $L^-(V)$ from (3) (the analysis of $L^+$ would be more involved). Recall that if the target $Q_*$ is generated by a potential $V_*$, then

    L(V) - L(V_*) = KL(Q_*\|Q)

    Denote by $L^{(n)}$ the empirical loss that corresponds to $Q_*^{(n)}$:

    L^{(n)}(V) = \mathbb{E}_{Q_*^{(n)}}[V] + \log\mathbb{E}_P[e^{-V}] = L(V) + \mathbb{E}_{Q_*^{(n)}-Q_*}[V]
  • Model $V$ by the random feature model (7) with RKHS norm $\|V\|_{\mathcal{H}}=\|a\|_{L^2(\rho_0)}$ from (8). Assume that the activation function $\sigma$ is ReLU, and that the fixed parameter distribution $\rho_0$ is supported inside the $l^1$ ball, that is, $\|\mathbf{w}\|_1+|b|\leq 1$ for $\rho_0$-almost all $(\mathbf{w},b)$. Denote $(\mathbf{x},1)$ by $\tilde{\mathbf{x}}$ and $(\mathbf{w},b)$ by $\mathbf{w}$, so the activation can be written as $\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})$.

    Remark 1 (Universal approximation).

    If we further assume that $\rho_0$ covers all directions (e.g. $\rho_0$ is uniform over the $l^1$ sphere $\{\|\mathbf{w}\|_1+|b|=1\}$) and $P$ is uniform over some $K\subseteq[-1,1]^d$, then Proposition 2.1 implies that this model enjoys universal approximation over distributions on $K$.

  • Training rule: We train $a$ by gradient flow (Section 2.4). Let $a_t,V_t,Q_t$ and $a_t^{(n)},V_t^{(n)},Q_t^{(n)}$ be the training trajectories under $L$ and $L^{(n)}$, respectively. Assume the same initialization $a_0=a_0^{(n)}$.

Theorem 3.3 (Generalization ability).

Suppose $Q_*$ is generated by a potential function $V_*$ ($\|V_*\|_{\mathcal{H}}<\infty$). For any $\delta\in(0,1)$, with probability $1-\delta$ over the sampling of $Q_*^{(n)}$, the testing error of $Q_t^{(n)}$ is bounded by

KL\big(Q_*\|Q_t^{(n)}\big) \leq \frac{\|V_*-V_0\|_{\mathcal{H}}^2}{2t} + 2\Big(4\sqrt{\frac{2\log 2d}{n}} + \sqrt{\frac{2\log(2/\delta)}{n}}\Big)\,t
Corollary 3.4.

Given the conditions of Theorem 3.3, if we choose an early-stopping time $T$ such that

T = \Theta\Big(\|V_*-V_0\|_{\mathcal{H}}\big(\tfrac{n}{\log d}\big)^{1/4}\Big)

then the testing error obeys

KL\big(Q_*\|Q_T^{(n)}\big) \lesssim \|V_*-V_0\|_{\mathcal{H}}\Big(\frac{\log d}{n}\Big)^{1/4}
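The choice of $T$ comes from balancing the two terms of Theorem 3.3. Writing the bound as $\frac{A}{2t}+2Bt$ with the shorthands $A=\|V_*-V_0\|_{\mathcal{H}}^2$ and $B=4\sqrt{\frac{2\log 2d}{n}}+\sqrt{\frac{2\log(2/\delta)}{n}}$ (introduced here only for this calculation), minimization over $t$ gives

-\frac{A}{2t^2} + 2B = 0 \quad\Longrightarrow\quad T = \frac{1}{2}\sqrt{\frac{A}{B}} = \Theta\Big(\|V_*-V_0\|_{\mathcal{H}}\big(\tfrac{n}{\log d}\big)^{1/4}\Big)

and substituting back yields $\frac{A}{2T}+2BT = 2\sqrt{AB} \lesssim \|V_*-V_0\|_{\mathcal{H}}\big(\frac{\log d}{n}\big)^{1/4}$, treating $\delta$ as fixed so that $B=\Theta(\sqrt{\log d/n})$.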

This rate is significant in that it is dimension-independent up to a negligible $(\log d)^{1/4}$ factor. Although the rate $n^{-1/4}$ is slower than the desirable Monte-Carlo rate $n^{-1/2}$, it is much better than the rate $n^{-1/d}$, and we believe there is room for improvement. In addition, the early-stopping interval is reached within a dimension-independent time, and its width is at least of order $n^{1/4}$.

This result is enabled by the function representation of the model, specifically:

  1. Learnability: If the target $Q_*$ lives in the right space for our function representation, then the optimization rate (for the population loss $L(V_t)-L(V_*)$) is fast and dimension-independent. In this case, the right space consists of distributions generated by random feature models, and the $O(1/t)$ rate is provided by Proposition 3.1.

  2. Insensitivity to high-dimensional structures: The function representations have small Rademacher complexity, so they are insensitive to the empirical error $Q_*-Q_*^{(n)}$, and the resulting deviation of the training trajectory $Q_t^{(n)}-Q_t$ scales as $O(n^{-1/2})$ instead of $O(n^{-1/d})$. This result is provided by Lemmas 3.5 and 3.6 below.

Lemma 3.5.

For any distribution $Q_*$ supported on $[-1,1]^d$ and any $\delta\in(0,1)$, with probability $1-\delta$ over the i.i.d. sampling of the empirical distribution $Q_*^{(n)}$, we have

\sup_{\|\mathbf{w}\|_1\leq 1}\mathbb{E}_{Q_*-Q_*^{(n)}}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})] \leq 4\sqrt{\frac{2\log 2d}{n}} + \sqrt{\frac{2\log(2/\delta)}{n}}
Lemma 3.6.

Let $L$ be a convex Fréchet-differentiable function over a Hilbert space $H$ with Lipschitz constant $l$. Let $h$ be a Fréchet-differentiable function with Lipschitz constant $\epsilon$. Define two gradient flow trajectories $x_t,y_t$:

x_0 = y_0,\quad \frac{d}{dt}x_t = -\nabla L(x_t),\quad \frac{d}{dt}y_t = -\nabla\widetilde{L}(y_t)

where $\widetilde{L}=L+h$ represents a perturbed objective. Then, for all time $t\geq 0$,

L(y_t) - L(x_t) \leq l\epsilon t

Numerical examples for the training process and generalization error are provided in Section 4.

3.3 Memorization

Although the model enjoys good generalization accuracy with early stopping, we show that in the long term the solution $Q_t^{(n)}$ necessarily deteriorates.

Proposition 3.7 (Memorization).

Under the conditions of Theorem 3.3 and Remark 1,

  1. If the trajectory $Q_t^{(n)}$ has only one weak limit, then $Q_t^{(n)}$ converges weakly to the empirical distribution $Q_*^{(n)}$.

  2. The true target distribution $Q_*$ can never be a limit point of $Q_t^{(n)}$. The generalization error and the potential function's norm both diverge:

    \lim_{t\to\infty}KL(Q_*\|Q_t^{(n)}) = \lim_{t\to\infty}\|V_t^{(n)}\|_{\mathcal{H}} = \infty

Hence, the model either memorizes the samples or diverges (has more than one limit point, all of which are degenerate), even though this may not manifest within realistic training time.

The proof is based on the following observation.

Lemma 3.8.

Let $K\subseteq\mathbb{R}^d$ be a compact set with positive Lebesgue measure, let the base distribution $P$ be uniform over $K$, and let $k$ be a continuous and integrally strictly positive definite kernel on $K$. Given any target distribution $Q'\in\mathcal{P}(K)$ and any initialization $V_0\in C(K)$, train the potential $V_t$ by

\frac{d}{dt}V_t(\mathbf{x}) = \mathbb{E}_{(Q_t-Q')(\mathbf{x}')}[k(\mathbf{x},\mathbf{x}')]

If $Q_t$ has only one weak limit, then $Q_t$ converges weakly to $Q'$. Otherwise, none of the limit points cover the support of $Q'$.

A numerical demonstration of memorization is provided in Section 4.

3.4 Regularization

Instead of early stopping, one can also consider explicit regularization: with the empirical loss $L^{(n)}$, define the problem

\min_{\|V\|\leq R} L^{(n)}(V)

for some appropriate functional norm $\|\cdot\|$ and adjustable bound $R$. For the special case of random feature models (8), this problem becomes

\min_{\|a\|_{L^2(\rho_0)}\leq R} L^{(n)}(a)    (11)

where $L^{(n)}(a)$ denotes $L^{(n)}(V)$ with potential $V$ generated by $a$.

By convexity, $L^{(n)}(a_t^{(n)})$ always converges to the minimum value as $t\to\infty$ if $a_t^{(n)}$ is trained by gradient flow constrained to the ball $\{\|a\|_{L^2(\rho_0)}\leq R\}$. Denote the minimizer of (11) by $a_R^{(n)}$ (which exists by Lemma 5.7) and the corresponding distribution by $Q_R^{(n)}$.

Proposition 3.9.

Given the conditions of Theorem 3.3, choose any $R\geq\|V_*\|_{\mathcal{H}}$. With probability $1-\delta$ over the sampling of $Q_*^{(n)}$, the minimizer $a_R^{(n)}$ satisfies

KL(Q_*\|Q_R^{(n)}) \lesssim \frac{\sqrt{\log d}+\sqrt{\log 1/\delta}}{\sqrt{n}}\,R

This result extends straightforwardly to the case when $V,V_*$ are implemented as 2-layer networks or deep residual networks, equipped with the norms defined in [18]. The proof only involves the Rademacher complexity, and it is known that the complexity of these function classes scales as $O(\frac{R}{\sqrt{n}})$ [18].

4 Numerical Experiments

Corollary 3.4 and Proposition 3.7 tell us that the training process roughly consists of two phases: the first phase in which a dimension-independent generalization error rate is reached, and a second phase in which the model deteriorates into memorization or divergence. We now examine how these happen in practice.

4.1 Dimension-independent error rate

The key aspect of the generalization estimate of Corollary 3.4 is that its sample complexity $O(n^{-\alpha})$ ($\alpha\geq 1/4$) is dimension-independent.

To verify dimension-independence, we estimate the exponent $\alpha$ for varying dimension $d$. We adopt the set-up of Theorem 3.3 and train our model $Q_t^{(n)}$ by SGD on a finite sample set $Q_*^{(n)}$. Specifically, $P$ is uniform over $[-1,1]^d$; the target and trained distributions $Q_*,Q_t^{(n)}$ are generated by the potentials $V_*,V_t^{(n)}$; these potentials are random feature functions (7) with $\rho_0$ uniform over the $l^1$ sphere $\{\|\mathbf{w}\|_1+|b|=1\}$, with parameter functions $a_*,a_t^{(n)}$ and ReLU activation. The samples $Q_*^{(n)}$ are obtained by Projected Langevin Monte Carlo [9]. We approximate $\rho_0$ using $m=500$ samples (particle discretization) and set $a_*\equiv 50$. We initialize training with $a_0^{(n)}\equiv 0$ and train $a_t^{(n)}$ by scaled gradient descent with learning rate $0.5m$.
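A condensed sketch of this set-up, with $d=2$ for readability and full-batch gradient descent for simplicity: the grid-based evaluation of the KL divergence mirrors the numerical integration described below, and drawing the training samples from the grid density is a simple stand-in for the Projected Langevin Monte Carlo step (all concrete numbers here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 2, 500, 100

# rho_0: m features uniform on the l1 sphere ||w||_1 + |b| = 1
# (exponential magnitudes with random signs, normalized in l1)
g = rng.exponential(size=(m, d + 1)) * rng.choice([-1.0, 1.0], size=(m, d + 1))
g /= np.abs(g).sum(axis=1, keepdims=True)
W, b = g[:, :d], g[:, d]

def V(a, x):                       # random feature potential, x: (k,d) -> (k,)
    return np.maximum(x @ W.T + b, 0.0) @ a / m

grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 64)] * d), -1).reshape(-1, d)
a_star = np.full(m, 50.0)                                 # target a_* = 50
q_star = np.exp(-V(a_star, grid)); q_star /= q_star.sum()
x_data = grid[rng.choice(len(grid), size=n, p=q_star)]    # stand-in for PLMC

a = np.zeros(m)                                # initialization a_0 = 0
act = np.maximum(grid @ W.T + b, 0.0)          # (k,m) fixed features on the grid
act_data = np.maximum(x_data @ W.T + b, 0.0).mean(0)
for t in range(2000):
    q = np.exp(-act @ a / m); q /= q.sum()     # model density on the grid
    # grad of L^(n) in a is (1/m)(E_{Q*^(n)}[sigma] - E_Q[sigma]);
    # learning rate 0.5 m makes the update 0.5 * (difference of expectations)
    a -= 0.5 * (act_data - q @ act)
    kl = np.sum(q_star * np.log(q_star / q))   # track its minimum to locate T_o
```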

The generalization error is measured by $KL(Q_*\|Q_t^{(n)})$. Denote the optimal stopping time by

T_o = \text{argmin}_{t>0}\,KL(Q_*\|Q_t^{(n)})

and the corresponding optimal error by $L_o$. The most difficult part of this experiment turned out to be the computation of the KL divergence: Monte-Carlo approximation led to excessive variance. Therefore, we computed it by numerical integration on a uniform grid over $[-1,1]^d$. This limits the experiments to low dimensions.

For each $d\leq 5$, we estimate $\alpha$ by linear regression between $\log n$ and $\log L_o$. The sample size $n$ ranges over $\{25,50,100,200\}$, and each setting is repeated 20 times with a new sample set $Q_*^{(n)}$. Also, we estimate the dependence of $T_o$ on $n$ by linear regression between $\log n$ and $\log T_o$. Here are the results:

Dimension $d$                  | 1     | 2     | 3     | 4     | 5
Exponent $-\alpha$ of $L_o$    | -0.74 | -0.71 | -0.81 | -0.74 | -0.87
Exponent of $T_o$              | 0.30  | 0.29  | 0.27  | 0.26  | 0.31

Table 1: Upper row: empirically, the exponent $\alpha$ of the sample complexity is dimension-independent. Lower row: the optimal stopping time grows with $n$.

Our experiments suggest that the generalization error of the early-stopping solution scales as $n^{-0.8}$ and is dimension-independent, and that the optimal early-stopping time grows roughly as $n^{0.3}$. This error is much better than the upper bound $O(n^{-1/4})$ given by Corollary 3.4, indicating that our analysis has much room for improvement.
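For reference, the exponent fits reported in Table 1 are ordinary log-log linear regressions. A sketch with placeholder error values (illustrative only, not the measured data):

```python
import numpy as np

ns = np.array([25, 50, 100, 200])             # sample sizes used above
L_o = np.array([0.21, 0.13, 0.075, 0.046])    # placeholder optimal errors

# slope of log L_o vs. log n estimates the exponent of L_o ~ n^{-alpha}
slope, _ = np.polyfit(np.log(ns), np.log(L_o), 1)
print(f"fitted exponent of L_o: {slope:.2f}")  # ~ -0.7 for these placeholders
```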

Shown in Figure 1 is the generalization error $KL(Q_*\|Q_t^{(n)})$ during training, for dimension $d=5$.

Figure 1: Generalization error curves with log axes. The solid curves are averages over 20 trials, and the shaded regions are $\pm 1$ standard deviation. The results for other $d$ are similar.

All error curves go through a rapid descent, followed by a slower but steady ascent due to memorization. In fact, the convergence rate prior to the optimal stopping time appears to be exponential. Note that if exponential convergence indeed holds, then the generalization error estimate of Corollary 3.4 can be improved to $O(n^{-1/2}\log n)$.

4.2 Deterioration and memorization

Proposition 3.7 indicates that as $t\to\infty$ the model either memorizes the sample points or diverges. Our experiments show that in practice we obtain memorization.

We adopt the same set-up as in Section 4.1. Since memorization occurs very slowly with SGD, we accelerate training using Adam. Figure 2 shows the result for $d=1$, $n=25$.

Figure 2 (six panels, from top left to bottom right): initialization; the optimal stopping time at iteration 160; and long-time solutions at iterations $10^3,10^4,10^5$ and $10^6$. The orange curve is the density of the target distribution $Q_*$, the blue curves are $Q_t^{(n)}$, and the red dots are the samples $Q_*^{(n)}$.

We see that there is a time interval during which the trained model closely fits the target distribution, but it eventually concentrates around the samples, and this memorization process does not seem to halt within realistic training time.

Figure 3 suggests that memorization is correlated with the growth of the function norm of the potential.

Figure 3: Left: generalization error with the Adam optimizer. Right: RKHS norm $\|V_t^{(n)}\|_{\mathcal{H}}$.

5 Proofs

5.1 Proof of the Universal Approximation Property

Assumption 5.1.

For 2-layer neural networks (5), assume that the activation function $\sigma:\mathbb{R}\to\mathbb{R}$ is continuous and not a polynomial.

For the random feature model (7), assume that the activation function is continuous, non-polynomial, and grows at most linearly at infinity, $\sigma(x)=O(|x|)$. In addition, assume that the fixed parameter distribution $\rho_0(\mathbf{w},b)$ has full support over $\mathbb{R}^{d+1}$. (See Theorem 1 of [44] for more general conditions.) Alternatively, one can assume that $\sigma$ is ReLU and $\rho_0$ covers all directions, that is, for all $(\mathbf{w},b)\neq\mathbf{0}$, we have $\lambda(\mathbf{w},b)\in\text{sprt}\,\rho_0$ for some $\lambda>0$.

By Theorem 3.1 of [32] and Theorem 1 and Proposition 1 of [44], the Barron space $\mathcal{B}$ and RKHS $\mathcal{H}$ defined by (6) and (8) are dense in the space of continuous functions over any compact subset of $\mathbb{R}^d$.

Proof of Proposition 2.1.

Denote the set of distributions generated by $\mathcal{V}$ by

\mathcal{Q} = \{Q\in\mathcal{P}(K)~|~Q\text{ is given by (2) with }V\in\mathcal{V}\}

First, consider $Q_*\in\mathcal{P}_{ac}(K)\cap C(K)$ whose density function is strictly positive, $Q_*(\mathbf{x})\geq\epsilon>0$ over $K$. Then $\log Q_*\in C(K)$. Let $\{V_m\}\subseteq\mathcal{V}$ be a sequence that approximates $\log Q_*$ in the supremum norm, and let $Q_m$ be the distributions (2) generated by $V_m$. It follows that

\lim_{m\to\infty}KL(Q_*\|Q_m) \leq \lim_{m\to\infty}\|\log Q_*-\log Q_m\|_{C(K)} = 0

For the general case $Q_*\in\mathcal{P}_{ac}(K)\cap C(K)$, define for any $\epsilon\in(0,1)$

Q_*^\epsilon = (1-\epsilon)Q_* + \epsilon P

For each $m\in\mathbb{N}$, let $Q_m$ be a distribution generated by some $V_m\in\mathcal{V}$ such that $\|\log Q_*^{1/m}-V_m\|_{C(K)}<1/m$. Then,

\lim_{m\to\infty}KL(Q_*\|Q_m) = \lim_{m\to\infty}KL(Q_*\|Q_*^{1/m}) + \mathbb{E}_{Q_*}\log\frac{Q_*^{1/m}(\mathbf{x})}{Q_m(\mathbf{x})}
\leq \lim_{m\to\infty}\frac{1}{m}KL(Q_*\|P) + \|\log Q_*^{1/m}-V_m\|_{C(K)} = 0

where the inequality follows from the convexity of KL. Hence, the set $\mathcal{Q}$ is dense in $\mathcal{P}_{ac}(K)\cap C(K)$ under KL divergence.

Next, consider the total variation norm. Since $\mathcal{P}_{ac}(K)\cap C(K)$ is dense in $\mathcal{P}_{ac}(K)$ under $\|\cdot\|_{TV}$, and since Pinsker's inequality bounds $\|\cdot\|_{TV}$ from above by the KL divergence, we conclude that $\mathcal{Q}$ is also dense in $(\mathcal{P}_{ac}(K),\|\cdot\|_{TV})$.

Now consider the $W_1$ metric. $\|\cdot\|_{TV}$ can be seen as an optimal transport cost with cost function $c(\mathbf{x},\mathbf{x}')=\mathbf{1}_{\mathbf{x}\neq\mathbf{x}'}$, so for any $Q_1,Q_2\in\mathcal{P}(K)$,

W_1(Q_1,Q_2) \leq \text{diam}(K)\,\|Q_1-Q_2\|_{TV}

Since $\mathcal{P}_{ac}(K)$ is dense in $\mathcal{P}(K)$ under the $W_1$ metric, we conclude that $\mathcal{Q}$ is dense in $(\mathcal{P}(K),W_1)$.

Finally, note that for any $p\in[1,\infty)$,

W_p \lesssim \text{diam}(K)^{1-1/p}\,W_1^{1/p}

so $\mathcal{Q}$ is dense in $(\mathcal{P}(K),W_p)$. ∎

5.2 Estimating the Approximation Error

Lemma 5.1.

For any base distribution $P$ and any potential functions $V_1,V_2$,

\big|\log\mathbb{E}_P[e^{-V_1}] - \log\mathbb{E}_P[e^{-V_2}]\big| \leq \|V_1-V_2\|_{L^\infty(P)}

Proof.

Denote $V_{\max}=\max(V_1,V_2)$ and $V_{\min}=\min(V_1,V_2)$. Then,

\big|\log\mathbb{E}_P[e^{-V_1}] - \log\mathbb{E}_P[e^{-V_2}]\big|
\leq \log\mathbb{E}_P[e^{-V_{\min}}] - \log\mathbb{E}_P[e^{-V_{\max}}]
\leq \log\big(\|e^{-V_{\max}}\|_{L^1(P)}\,\|e^{V_{\max}-V_{\min}}\|_{L^\infty(P)}\big) - \log\|e^{-V_{\max}}\|_{L^1(P)}
= \log\|e^{V_{\max}-V_{\min}}\|_{L^\infty(P)}
= \|V_1-V_2\|_{L^\infty(P)}   ∎

Proof of Proposition 2.2.

The proof follows the standard argument of Monte-Carlo estimation (Theorem 4 of [18]). First, consider the case $\|V\|_{\mathcal{B}}<\infty$. For any $\epsilon\in(0,0.01)$, let $\rho$ be a parameter distribution of $V$ with path norm $\|\rho\|_P<(1+\epsilon)\|V\|_{\mathcal{B}}$. Define the finite neural network

V_m(\mathbf{x}) = \frac{1}{m}\sum_{j=1}^{m}a_j\,\sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j) =: \frac{1}{m}\sum_{j=1}^{m}\phi(\mathbf{x};\theta_j)

where $\theta_j=(a_j,\mathbf{w}_j,b_j)$ are i.i.d. samples from $\rho$. Denote $\Theta=(\theta_j)_{j=1}^m$.

Let $Q_m$ be the distribution generated by $V_m$. The approximation error is given by

KL(Q\|Q_m) = \mathbb{E}_Q\big[V_m-V\big] + \big(\log\mathbb{E}_P[e^{-V_m}] - \log\mathbb{E}_P[e^{-V}]\big)

By Lemma 5.1,

KL(Q\|Q_m) \leq \|V_m-V\|_{L^\infty(Q)} + \|V_m-V\|_{L^\infty(P)} \leq 2\|V_m-V\|_{L^\infty(P)}

Given that $\text{sprt}\,P\subseteq B_R(0)$, we can bound

\mathbb{E}_\Theta\big[\|V-V_m\|_{L^\infty(P)}^2\big] \leq \mathbb{E}_\Theta\Big[\sup_{\|\mathbf{x}\|\leq R}\Big(\frac{1}{m}\sum_{j=1}^m\phi(\mathbf{x};\theta_j)-\mathbb{E}_{\theta\sim\rho}[\phi(\mathbf{x};\theta)]\Big)^2\Big]
\leq \mathbb{E}_\Theta\Big[\frac{1}{m^2}\sup_{\|\mathbf{x}\|\leq R}\sum_{j=1}^m\big(\phi(\mathbf{x};\theta_j)-\mathbb{E}_{\theta'}[\phi(\mathbf{x};\theta')]\big)^2\Big]
= \mathbb{E}_{\theta\sim\rho}\Big[\frac{1}{m}\sup_{\|\mathbf{x}\|\leq R}\big(\phi(\mathbf{x};\theta)-\mathbb{E}_{\theta'}[\phi(\mathbf{x};\theta')]\big)^2\Big]
\leq \mathbb{E}_{\theta\sim\rho}\Big[\frac{1}{m}\sup_{\|\mathbf{x}\|\leq R}\phi(\mathbf{x};\theta)^2\Big]
\leq \mathbb{E}_{\theta\sim\rho}\Big[\frac{1}{m}\sup_{\|\mathbf{x}\|\leq R}a^2\|\sigma\|_{Lip}^2(\|\mathbf{w}\|^2+b^2)(\|\mathbf{x}\|^2+1)\Big]
\leq \frac{1}{m}\|\rho\|_P^2\,(R^2+1)\,\|\sigma\|_{Lip}^2
\leq \frac{1}{m}(1+\epsilon)^2\|V\|_{\mathcal{B}}^2\,(R^2+1)\,\|\sigma\|_{Lip}^2

Meanwhile, denote the empirical measure on $\Theta=(\theta_j)$ by $\rho^{(m)}=\frac{1}{m}\sum_{j=1}^m\delta_{\theta_j}$. Then, its expected path norm is bounded by

\mathbb{E}_\Theta\big[\|\rho^{(m)}\|_P^2\big] = \frac{1}{m}\sum_{j=1}^m\mathbb{E}_{\theta_j}\big[a_j^2(\|\mathbf{w}_j\|^2+b_j^2)\big] = \|\rho\|_P^2 \leq (1+\epsilon)^2\|V\|_{\mathcal{B}}^2

Define the events

E_1 := \Big\{\Theta~\big|~\|V-V_m\|_{L^\infty(P)}^2 \leq 3\cdot\frac{1}{m}\|V\|_{\mathcal{B}}^2(R^2+1)\|\sigma\|_{Lip}^2\Big\}
E_2 := \Big\{\Theta~\big|~\|\rho^{(m)}\|_P^2 \leq 2\|V\|_{\mathcal{B}}^2\Big\}

By Markov's inequality,

\mathbb{P}(E_1) = 1-\mathbb{P}(E_1^C) \geq 1-\frac{\mathbb{E}\big[\|V-V_m\|_{L^\infty(P)}^2\big]}{\frac{3}{m}\|V\|_{\mathcal{B}}^2(R^2+1)\|\sigma\|_{Lip}^2} \geq 1-\frac{(1+\epsilon)^2}{3}
\mathbb{P}(E_2) = 1-\mathbb{P}(E_2^C) \geq 1-\frac{\mathbb{E}\big[\|\rho^{(m)}\|_P^2\big]}{2\|V\|_{\mathcal{B}}^2} \geq 1-\frac{(1+\epsilon)^2}{2}

Since $\epsilon\in(0,0.01)$,

\mathbb{P}(E_1\cap E_2) \geq \mathbb{P}(E_1)+\mathbb{P}(E_2)-1 \geq \frac{1-10\epsilon-5\epsilon^2}{6} > 0

Hence, there exists $\Theta=(\theta_j)_{j=1}^m$ such that

KL(Q\|Q_m) \leq 2\|V_m-V\|_{L^\infty(P)} \leq \frac{2\sqrt{3}\,\|V\|_{\mathcal{B}}}{\sqrt{m}}\,\|\sigma\|_{Lip}\sqrt{R^2+1}
\|V_m\|_{\mathcal{B}} \leq \|\rho^{(m)}\|_P \leq \sqrt{2}\,\|V\|_{\mathcal{B}}

The argument for the case $\|V\|_{\mathcal{H}}<\infty$ is the same. ∎

5.3 Proof of Trainability

Lemma 5.2.

The objectives $L^+,L^-$ from (3) are convex in $V$.

Proof.

It suffices to show that $\log\mathbb{E}_P[e^V]$ is convex: given any two potential functions $V_1,V_2$ and any $t\in(0,1)$, Hölder's inequality implies that

\log\mathbb{E}_P\big[e^{tV_1+(1-t)V_2}\big] = \log\mathbb{E}_P\big[(e^{V_1})^t(e^{V_2})^{1-t}\big]
\leq \log\Big(\big\|(e^{V_1})^t\big\|_{L^{1/t}(P)}\,\big\|(e^{V_2})^{1-t}\big\|_{L^{1/(1-t)}(P)}\Big)
= \log\big(\mathbb{E}_P[e^{V_1}]^t\,\mathbb{E}_P[e^{V_2}]^{1-t}\big)
= t\log\mathbb{E}_P[e^{V_1}] + (1-t)\log\mathbb{E}_P[e^{V_2}]   ∎

Proof of Proposition 3.1.

For the target potential function $V_*$, denote its parameter function by $a_*\in L^2(\rho_0)$. Let the objective $L$ be either $L^+$ or $L^-$. The mapping

a \mapsto V = \mathbb{E}_{\rho_0}\big[a(\mathbf{w},b)\,\sigma(\mathbf{w}\cdot\mathbf{x}+b)\big]

is linear, while $L$ is convex in $V$ by Lemma 5.2, so $L$ is convex in $a\in L^2(\rho_0)$ and we simply write the objective as $L(a)$. Define the Lyapunov function

E(t) = t\,\big(L(a_t)-L(a_*)\big) + \frac{1}{2}\|a_*-a_t\|_{L^2(\rho_0)}^2

Then,

\frac{d}{dt}E(t) = \big(L(a_t)-L(a_*)\big) + t\cdot\frac{d}{dt}L(a_t) + \big\langle a_t-a_*,~\frac{d}{dt}a_t\big\rangle_{L^2(\rho_0)}
\leq \big(L(a_t)-L(a_*)\big) - \big\langle a_t-a_*,~\nabla L(a_t)\big\rangle_{L^2(\rho_0)}

By convexity, for any $a_1,a_2$,

L(a_1) + \langle a_2-a_1,~\nabla L(a_1)\rangle \leq L(a_2)

Hence, $\frac{d}{dt}E\leq 0$. We conclude that $E(t)\leq E(0)$, or equivalently

t\,\big(L(a_t)-L(a_*)\big) + \frac{1}{2}\|a_*-a_t\|_{L^2(\rho_0)}^2 \leq \frac{1}{2}\|a_*-a_0\|_{L^2(\rho_0)}^2   ∎

Assumption 5.2.

We make the following assumptions on the activation function $\sigma(\mathbf{w}\cdot\mathbf{x}+b)$, the initialization $\rho_0$ of $\rho_t$, and the base distribution $P$:

  1. The weights $(\mathbf{w},b)$ are restricted to the sphere $\mathbb{S}^d\subseteq\mathbb{R}^{d+1}$.

  2. The activation is universal in the sense that for any distributions $P,Q$,

    P=Q \text{ iff } \forall(\mathbf{w},b)\in\mathbb{S}^d,~\mathbb{E}_{P-Q}\big[\sigma(\mathbf{w}\cdot\mathbf{x}+b)\big]=0

  3. $\sigma$ is continuously differentiable with a Lipschitz derivative $\sigma'$. (For instance, $\sigma$ might be sigmoid or mollified ReLU.)

  4. The initialization $\rho_0=\rho_0(a,\mathbf{w},b)\in\mathcal{P}(\mathbb{R}\times\mathbb{S}^d)$ has full support over $\mathbb{S}^d$. Specifically, the support of $\rho_0$ contains a submanifold that separates the two components $(-\infty,-\overline{a})\times\mathbb{S}^d$ and $(\overline{a},\infty)\times\mathbb{S}^d$, for some $\overline{a}$.

  5. $P$ is compactly supported.

Proof of Proposition 3.2.

The proof follows the arguments of [13, 38]. For convenience, denote $(\mathbf{x},1)$ by $\tilde{\mathbf{x}}$ and $(\mathbf{w},b)$ by $\mathbf{w}$, so the activation is simply $\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})$. Denote the training objective by $L$ ($L=L^+$ or $L=L^-$). From a particle perspective, the flow (10) can be written as

\dot{a}_t = -\mathbb{E}_{\Delta_t}\big[\sigma(\mathbf{w}_t\cdot\tilde{\mathbf{x}})\big]
\dot{\mathbf{w}}_t = -a_t\,\mathbb{E}_{\Delta_t}\big[\sigma'(\mathbf{w}_t\cdot\tilde{\mathbf{x}})\,\tilde{\mathbf{x}}\big]    (12)

where $\Delta_t=P_*-P$ if $L=L^+$ and $\Delta_t=Q_*-Q_t$ if $L=L^-$.

Since the velocity field (12) is locally Lipschitz over $\mathbb{R}\times\mathbb{S}^d$, the induced flow is a family of locally Lipschitz diffeomorphisms, and thus preserves the submanifold given by Assumption 5.2. Denote by $\hat{\rho}_t$ and $\hat{\rho}_\infty$ the projections of $\rho_t,\rho_\infty$ onto $\mathbb{S}^d$. It follows that $\hat{\rho}_t$ has full support over $\mathbb{S}^d$ for all time $t<\infty$.

Since $\rho_\infty$ is a stationary point of $L$, the velocity field (12) vanishes $\rho_\infty$-almost everywhere. In particular, for all $\mathbf{w}$ in the support of $\hat{\rho}_\infty$,

g(\mathbf{w}) := \mathbb{E}_{\Delta_\infty}\big[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})\big] = 0

We show that this condition holds for all $\mathbf{w}\in\mathbb{S}^d$. Denote $S=\mathbb{S}^d-\text{sprt}\,\hat{\rho}_\infty$. Assume to the contrary that $g(\mathbf{w})$ does not vanish on $S$. Let $\mathbf{w}_*\in S$ be a maximizer of $|g(\mathbf{w})|$. Without loss of generality, let $g(\mathbf{w}_*)>0$; the same reasoning applies to $g(\mathbf{w}_*)<0$.

Since $\rho_t\to\rho_\infty$ in $W_1$, the bias potential $V_t$ converges to $V_\infty$ uniformly over the compact support of $P$. Since all $\Delta_t$ are supported on $\text{sprt}\,P$, the velocity field (12) converges locally uniformly to

\begin{bmatrix}-\mathbb{E}_{\Delta_\infty}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]\\ -a\,\mathbb{E}_{\Delta_\infty}[\sigma'(\mathbf{w}\cdot\tilde{\mathbf{x}})\,\tilde{\mathbf{x}}]\end{bmatrix} = \begin{bmatrix}-g(\mathbf{w})\\ -a\,g'(\mathbf{w})\end{bmatrix}

For $t$ sufficiently large, we can study the flow with this approximate field. Let $(a,\mathbf{w})$ be any point with $\mathbf{w}$ sufficiently close to $\mathbf{w}_*$, and consider a trajectory $(a_t,\mathbf{w}_t)$ initialized from $a_{t_0}=a$, $\mathbf{w}_{t_0}=\mathbf{w}$ at a large time $t_0$. If $a<0$, then $a_t$ becomes increasingly negative, while $\mathbf{w}_t$ follows a gradient ascent on $g$ and converges to $\mathbf{w}_*$ (or a nearby maximizer). Otherwise $a\geq 0$, but if $\mathbf{w}$ is sufficiently close to $\mathbf{w}_*$, then $\dot{\mathbf{w}}_t=O(g'(\mathbf{w}))$ is very small (since $g'(\mathbf{w}_*)=0$ and $g'$ is Lipschitz in $\mathbf{w}$), so $\mathbf{w}_t$ stays around $\mathbf{w}_*$ and $g(\mathbf{w}_t)$ remains positive. Then, $a_t$ eventually becomes negative, and $\mathbf{w}_t$ converges to $\mathbf{w}_*$.

Since $\hat{\rho}_t$ has positive mass in any neighborhood of $\mathbf{w}_*$ at time $t_0$, this mass will remain in $S$ as $t\to\infty$. This is a contradiction, since $S$ is disjoint from $\text{sprt}\,\hat{\rho}_\infty$. It follows that $g(\mathbf{w})$ vanishes on all of $\mathbb{S}^d$: for any $\mathbf{w}\in\mathbb{S}^d$,

\mathbb{E}_{\Delta_\infty}\big[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})\big] = 0

By Assumption 5.2, we conclude that $\Delta_\infty=0$, or equivalently $Q_\infty=Q_*$ and $V_\infty=V_*$ (up to an additive constant). ∎

5.4 Proof of Generalization Ability

Proof of Lemma 3.5.

Theorem 6 of [18] implies that, given any $n$ points with $l^\infty$ norm $\leq 1$, the Rademacher complexity of the class $\{\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}}),\|\mathbf{w}\|_1\leq 1\}$ is bounded by

Rad_n \leq 2\sqrt{\frac{2\log 2d}{n}}

Since $|\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})|\leq 1$ for all $\|\mathbf{w}\|_1\leq 1$, $\|\mathbf{x}\|_\infty\leq 1$, we can apply Theorem 26.5 of [41] to conclude that

\forall\|\mathbf{w}\|_1\leq 1,~\mathbb{E}_{Q_*-Q_*^{(n)}}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})] \leq 2Rad_n + \sqrt{\frac{2\log(2/\delta)}{n}}

with probability $1-\delta$ over the sampling of $Q_*^{(n)}$. ∎

Proof of Lemma 3.6.

Denote the inner product and norm of $H$ by $\langle x,y\rangle$ and $\|x\|$. Then,

\frac{d}{dt}\|y_t-x_t\| \leq -\Big\langle\frac{y_t-x_t}{\|y_t-x_t\|},~\nabla L(y_t)-\nabla L(x_t)+\nabla h(y_t)\Big\rangle

Since $L$ is convex, $\langle y-x,~\nabla L(y)-\nabla L(x)\rangle\geq 0$ for any $x,y\in H$. Therefore,

\frac{d}{dt}\|y_t-x_t\| \leq -\Big\langle\frac{y_t-x_t}{\|y_t-x_t\|},~\nabla h(y_t)\Big\rangle \leq \|\nabla h(y_t)\| \leq \epsilon

so that $\|y_t-x_t\|\leq\epsilon t$. By Lipschitz continuity, $L(y_t)-L(x_t)\leq l\epsilon t$. ∎

Proof of Theorem 3.3.

For any time $T$, the testing error can be decomposed as

KL\big(Q_*\|Q_T^{(n)}\big) = L(V_T^{(n)})-L(V_*) = \big(L(V_T^{(n)})-L(V_T)\big) + \big(L(V_T)-L(V_*)\big)

The second term is bounded by Proposition 3.1, while the first term can be bounded by Lemma 3.6. The Hilbert space $H$ in Lemma 3.6 corresponds to the space of parameter functions $L^2(\rho_0)$ of the random feature model; the convex objective corresponds to $L$ over $a\in L^2(\rho_0)$,

L(a) = \mathbb{E}_{Q_*}[V] + \log\mathbb{E}_P[e^{-V}],\quad V(\mathbf{x}) = \mathbb{E}_{\rho_0(\mathbf{w})}[a(\mathbf{w})\,\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]

and the perturbation $h$ corresponds to $L^{(n)}-L$,

L^{(n)}(a) - L(a) = \mathbb{E}_{Q_*^{(n)}-Q_*}[V]

The remaining task is to estimate the constants $l$ and $\epsilon$.

First, we have $l\leq 2$. For any $a\in L^2(\rho_0)$, let $Q$ be the modeled distribution; then

\|\nabla L(a)\|_{L^2(\rho_0)} = \|\mathbb{E}_{Q_*-Q}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]\|_{L^2(\rho_0(\mathbf{w}))}
\leq \sup_{\|\mathbf{w}\|_1\leq 1}|\mathbb{E}_{Q_*-Q}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]|
\leq \sup_{\|\mathbf{w}\|_1\leq 1}|\mathbb{E}_{Q_*}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]| + \sup_{\|\mathbf{w}\|_1\leq 1}|\mathbb{E}_Q[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]|
\leq 2

where in the last step we use that, since all distributions are supported on $[-1,1]^d$, $\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})\leq\|\mathbf{w}\|_1\|\tilde{\mathbf{x}}\|_\infty\leq 1$.

Next, the estimate of ϵ\epsilon has been provided by Lemma 3.5, because for any aL2(ρ0)a\in L^{2}(\rho_{0}),

h(a)L2(ρ0)\displaystyle\|\nabla h(a)\|_{L^{2}(\rho_{0})} =L(n)(a)L(a)L2\displaystyle=\|\nabla L^{(n)}(a)-\nabla L(a)\|_{L^{2}}
=𝔼QQ(n)[σ(𝐰𝐱~)]L2\displaystyle=\|\mathbb{E}_{Q_{*}-Q_{*}^{(n)}}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]\|_{L^{2}}
𝔼QQ(n)[σ(𝐰𝐱~)]L(ρ0)\displaystyle\leq\|\mathbb{E}_{Q_{*}-Q_{*}^{(n)}}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]\|_{L^{\infty}(\rho_{0})}
\leq\sup_{\|\mathbf{w}\|_{1}\leq 1}|\mathbb{E}_{Q_{*}-Q_{*}^{(n)}}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]|

which Lemma 3.5 bounds by 4√(2 log 2d / n) + √(2 log(2/δ) / n) with probability 1 − δ. Combining the two estimates yields the claimed bound. ∎

5.5 Proof of Memorization

To prove Proposition 3.7 and Lemma 3.8, we begin with a few useful lemmas.

Let (K)\mathcal{M}(K) be the space of finite signed measures on KK. We say that a kernel kk is integrally strictly positive definite if

\forall m\in\mathcal{M}(K),\quad\mathbb{E}_{m(\mathbf{x})}\mathbb{E}_{m(\mathbf{x}^{\prime})}[k(\mathbf{x},\mathbf{x}^{\prime})]=0\implies m=0

Equip (K)\mathcal{M}(K) with the inner product

m1,m2(K),m1,m2k:=𝔼m1(𝐱)𝔼m2(𝐱)[k(𝐱,𝐱)]\forall m_{1},m_{2}\in\mathcal{M}(K),~{}\langle m_{1},m_{2}\rangle_{k}:=\mathbb{E}_{m_{1}(\mathbf{x})}\mathbb{E}_{m_{2}(\mathbf{x}^{\prime})}[k(\mathbf{x},\mathbf{x}^{\prime})]

from which we define the MMD (maximum mean discrepancy) distance k\|\cdot\|_{k}

m1m2k2=m1m2,m1m2k\|m_{1}-m_{2}\|_{k}^{2}=\langle m_{1}-m_{2},m_{1}-m_{2}\rangle_{k}

Let k\mathcal{H}_{k} be the RKHS generated by kk with inner product ,k\langle,\rangle_{\mathcal{H}_{k}}. Then the MMD inner product is the RKHS inner product on the mean embeddings fi=𝔼mi(𝐱)[k(𝐱,)]f_{i}=\mathbb{E}_{m_{i}(\mathbf{x})}[k(\mathbf{x},\cdot)],

m1,m2k\displaystyle\langle m_{1},m_{2}\rangle_{k} =f1,f2k\displaystyle=\langle f_{1},f_{2}\rangle_{\mathcal{H}_{k}}
m1m2k\displaystyle\|m_{1}-m_{2}\|_{k} =supfk1𝔼m1m2[f]\displaystyle=\sup_{\|f\|_{\mathcal{H}_{k}}\leq 1}\mathbb{E}_{m_{1}-m_{2}}[f]
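For concreteness, here is a minimal sketch (ours, not part of the argument) that computes the squared MMD between two empirical measures by expanding ‖m₁ − m₂‖²_k into three kernel double sums; the Gaussian RBF kernel used below is a standard example of an integrally strictly positive definite kernel.

```python
import numpy as np

# Squared MMD between two empirical measures, m1 = (1/n) sum_i delta_{x_i}
# and m2 = (1/m) sum_j delta_{y_j}, via the expansion
# ||m1 - m2||_k^2 = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')].
def gaussian_kernel(X, Y, bandwidth=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    return (gaussian_kernel(X, X, bandwidth).mean()
            - 2 * gaussian_kernel(X, Y, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))      # samples from Q
Y = rng.normal(0.5, 1.0, size=(500, 2))      # samples from Q', shifted mean
print("MMD^2(Q, Q'):", mmd_squared(X, Y))    # strictly positive since Q != Q'
print("MMD^2(Q, Q) :", mmd_squared(X, X))    # exactly zero
```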
Lemma 5.3.

When restricted to the subset 𝒫(K)\mathcal{P}(K), the MMD distance k\|\cdot\|_{k} induces the weak topology and thus (𝒫(K),k)(\mathcal{P}(K),\|\cdot\|_{k}) is compact.

Proof.

By Lemma 2.1 of [42], the MMD distance metrizes the weak topology of 𝒫(K)\mathcal{P}(K), which is compact by Prokhorov’s theorem. ∎

As 𝒫(K)\mathcal{P}(K) is a convex subset of (K)\mathcal{M}(K), we can define the tangent cone at each point Q𝒫(K)Q\in\mathcal{P}(K) by

TQ𝒫(K)={λΔ|λ0,Δ=Δ+Δ,Δ±𝒫(K),ΔQ}T_{Q}\mathcal{P}(K)=\big{\{}\lambda\Delta~{}\big{|}~{}\lambda\geq 0,~{}\Delta=\Delta^{+}-\Delta^{-},~{}\Delta^{\pm}\in\mathcal{P}(K),~{}\Delta^{-}\ll Q\big{\}}

and equip it with the MMD norm, Δk2=𝔼Δ2[k]\|\Delta\|_{k}^{2}=\mathbb{E}_{\Delta^{2}}[k].

Given the gradient flow VtV_{t} defined in Lemma 3.8, the distribution QtQ_{t} evolves by

ddtQt(𝐱)\displaystyle\frac{d}{dt}Q_{t}(\mathbf{x}) =(v(𝐱;Qt)𝔼Qt(𝐱)[v(𝐱;Qt)])Qt(𝐱)\displaystyle=\big{(}v(\mathbf{x};Q_{t})-\mathbb{E}_{Q_{t}(\mathbf{x}^{\prime})}[v(\mathbf{x}^{\prime};Q_{t})]\big{)}Q_{t}(\mathbf{x})
v(𝐱;Q)\displaystyle v(\mathbf{x};Q) :=𝔼(QQ)(𝐱)[k(𝐱,𝐱)]\displaystyle:=\mathbb{E}_{(Q^{\prime}-Q)(\mathbf{x}^{\prime})}[k(\mathbf{x},\mathbf{x}^{\prime})]

We can extend this flow to a dynamical system on 𝒫(K)\mathcal{P}(K) in positive time t0t\geq 0, defined by

ddtQt=v¯(Qt)Qtv¯(Q)=v(;Q)𝔼Q(𝐱)[v(𝐱;Q)]\displaystyle\begin{split}\frac{d}{dt}Q_{t}&=\overline{v}(Q_{t})Q_{t}\\ \overline{v}(Q)&=v(\cdot~{};Q)-\mathbb{E}_{Q(\mathbf{x}^{\prime})}[v(\mathbf{x}^{\prime};Q)]\end{split} (13)

Each v¯(Q)Q\overline{v}(Q)Q is a tangent vector in TQ𝒫(K)T_{Q}\mathcal{P}(K).

Note that we can rewrite v and v̄ in terms of the RKHS inner product: let f, f′ be the mean embeddings of Q, Q′. Then,

v(𝐱;Q)\displaystyle v(\mathbf{x};Q) =k(𝐱,),ffk\displaystyle=\big{\langle}k(\mathbf{x},\cdot),~{}f^{\prime}-f\big{\rangle}_{\mathcal{H}_{k}}
v¯(𝐱;Q)\displaystyle\overline{v}(\mathbf{x};Q) =k(𝐱,)f,ffk\displaystyle=\big{\langle}k(\mathbf{x},\cdot)-f,~{}f^{\prime}-f\big{\rangle}_{\mathcal{H}_{k}}

It follows that vv and v¯\overline{v} are uniformly continuous over the compact space K×(𝒫(K),k)K\times(\mathcal{P}(K),\|\cdot\|_{k}).

Lemma 5.4.

Given any initialization Q0𝒫(K)Q_{0}\in\mathcal{P}(K), there exists a unique solution QtQ_{t}, t0t\geq 0 to the dynamics (13).

Proof.

The integral form of (13) can be written as

t0,Qt=Q0+0tv¯(Qs)Qs𝑑s\forall t\geq 0,~{}Q_{t}=Q_{0}+\int_{0}^{t}\overline{v}(Q_{s})Q_{s}ds (14)

where we adopt the Bochner integral on ((K),,k)(\mathcal{M}(K),\langle,\rangle_{k}). In the spirit of the classical Picard-Lindelöf theorem, we consider the vector space C([0,T],(K))C([0,T],\mathcal{M}(K)) equipped with sup-norm

|ϕ|=supt[0,T]ϕ(t)k|||\phi|||=\sup_{t\in[0,T]}\|\phi(t)\|_{k}

On the complete subspace C([0,T],𝒫(K))C([0,T],\mathcal{P}(K)), define the operator ϕF(ϕ)\phi\mapsto F(\phi) by

F(ϕ)t=ϕ0+0tv¯(ϕs)ϕs𝑑sF(\phi)_{t}=\phi_{0}+\int_{0}^{t}\overline{v}(\phi_{s})\phi_{s}ds

Define the sequence ϕ0Q0\phi^{0}\equiv Q_{0} and ϕn+1=F(ϕn)\phi^{n+1}=F(\phi^{n}).

Note that the tangent field (13) is Lipschitz

Q1,Q2,v¯(Q1)Q1v¯(Q2)Q2kcQ1Q2k\forall Q_{1},Q_{2},\quad\|\overline{v}(Q_{1})Q_{1}-\overline{v}(Q_{2})Q_{2}\|_{k}\leq c\|Q_{1}-Q_{2}\|_{k}

with c ≤ 4(‖k‖²_{C(K×K)} + ‖k‖_{C(K×K)}). Then, with T ≤ 1/(2c),

|ϕn+1ϕn|\displaystyle|||\phi^{n+1}-\phi^{n}||| supt[0,T]0tv¯(ϕsn)ϕsnv¯(ϕsn1)ϕsn1dsk\displaystyle\leq\sup_{t\in[0,T]}\big{\|}\int_{0}^{t}\overline{v}(\phi^{n}_{s})\phi^{n}_{s}-\overline{v}(\phi^{n-1}_{s})\phi^{n-1}_{s}ds\big{\|}_{k}
0Tv¯(ϕtn)ϕtnv¯(ϕtn1)ϕtn1k𝑑t\displaystyle\leq\int_{0}^{T}\|\overline{v}(\phi^{n}_{t})\phi^{n}_{t}-\overline{v}(\phi^{n-1}_{t})\phi^{n-1}_{t}\|_{k}dt
cTsupt[0,T]ϕtnϕtn1k\displaystyle\leq cT\sup_{t\in[0,T]}\|\phi^{n}_{t}-\phi^{n-1}_{t}\|_{k}
12|ϕnϕn1|\displaystyle\leq\frac{1}{2}|||\phi^{n}-\phi^{n-1}|||

By the completeness of (C([0,T],𝒫(K)), |||·|||) and the Banach fixed-point theorem, the sequence ϕⁿ converges to the unique solution ϕ of (14) on [0,T]. We can then extend this solution iteratively to [T,2T],[2T,3T],… and obtain a unique solution on [0,∞). ∎

Denote the set of fixed points of (13) by

𝒫o={Q𝒫(K)|v¯(Q)Q=0}\mathcal{P}_{o}=\{Q\in\mathcal{P}(K)~{}|~{}\overline{v}(Q)Q=0\}

Also, define the set of distributions whose support covers that of the target distribution Q′:

𝒫={Q𝒫(K)|sprtQsprtQ}\mathcal{P}_{*}=\{Q\in\mathcal{P}(K)~{}|~{}\text{sprt}Q^{\prime}\subseteq\text{sprt}Q\}
Lemma 5.5.

We have the following inclusion

𝒫o{Q}(𝒫(K)𝒫)\mathcal{P}_{o}\subseteq\{Q^{\prime}\}\cup(\mathcal{P}(K)-\mathcal{P}_{*})

Given any initialization Q₀ ∈ 𝒫(K), let Q_t, t ≥ 0 be the trajectory defined by Lemma 5.4, and let 𝒬 be its set of limit points in the MMD metric,

\mathcal{Q}=\bigcap_{T>0}\overline{\{Q_{t},\,t\geq T\}}^{\|\cdot\|_{k}}

then 𝒬𝒫o\mathcal{Q}\subseteq\mathcal{P}_{o}.

Proof.

For any fixed point Q𝒫oQ\in\mathcal{P}_{o}, we have v¯(𝐱;Q)=0\overline{v}(\mathbf{x};Q)=0 for QQ-almost every 𝐱\mathbf{x}. By continuity, we have

𝐱sprtQ,v(𝐱;Q)=𝔼Q(𝐱)[v(𝐱;Q)]\forall\mathbf{x}\in\text{sprt}Q,\quad v(\mathbf{x};Q)=\mathbb{E}_{Q(\mathbf{x}^{\prime})}[v(\mathbf{x}^{\prime};Q)] (15)

If we further suppose that Q𝒫Q\in\mathcal{P}_{*}, then this equality holds for QQ^{\prime}-almost all 𝐱\mathbf{x}, so

\begin{aligned}
0&=\mathbb{E}_{(Q^{\prime}-Q)(\mathbf{x})}[v(\mathbf{x};Q)]\\
&=\mathbb{E}_{(Q^{\prime}-Q)^{2}(\mathbf{x},\mathbf{x}^{\prime})}[k(\mathbf{x},\mathbf{x}^{\prime})]\\
&=\|Q-Q^{\prime}\|_{k}^{2}
\end{aligned}

Since kk is integrally strictly positive definite, we have Q=QQ=Q^{\prime}. It follows that

𝒫o𝒫={Q}\mathcal{P}_{o}\cap\mathcal{P}_{*}=\{Q^{\prime}\}

or equivalently 𝒫o{Q}(𝒫(K)𝒫)\mathcal{P}_{o}\subseteq\{Q^{\prime}\}\cup(\mathcal{P}(K)-\mathcal{P}_{*}).

Meanwhile, the squared MMD distance ‖Q_t − Q′‖²_k is non-increasing along any trajectory Q_t of (13):

ddt12QtQk2=𝔼Qt(𝐱)𝔼(QtQ)(𝐱)[k(𝐱,𝐱)v¯(𝐱;Qt)]=𝔼Qt(𝐱)[v¯(𝐱;Qt)2]0\displaystyle\begin{split}\frac{d}{dt}\frac{1}{2}\|Q_{t}-Q^{\prime}\|^{2}_{k}&=\mathbb{E}_{Q_{t}(\mathbf{x})}\mathbb{E}_{(Q_{t}-Q^{\prime})(\mathbf{x}^{\prime})}\big{[}k(\mathbf{x},\mathbf{x}^{\prime})~{}\overline{v}(\mathbf{x};Q_{t})\big{]}\\ &=-\mathbb{E}_{Q_{t}(\mathbf{x})}\big{[}\overline{v}(\mathbf{x};Q_{t})^{2}\big{]}\\ &\leq 0\end{split} (16)

Define the extended sublevel sets for every c>0c>0,

𝒫c:={Q𝒫(K)|QQkc or Q𝒫o}\mathcal{P}_{c}:=\{Q\in\mathcal{P}(K)~{}|~{}\|Q-Q^{\prime}\|_{k}\leq c\text{ or }Q\in\mathcal{P}_{o}\}

By Lemma 5.3, the space (𝒫(K), ‖·‖_k) is compact, so the set of limit points 𝒬 of the trajectory Q_t is nonempty. Since the inequality (16) is strict whenever Q_t ∉ 𝒫_o, these limit points all belong to

\bigcap_{c>0}\mathcal{P}_{c}=\mathcal{P}_{o}

∎
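To make the dynamics concrete, here is a minimal discretization (ours; the kernel and constants are illustrative) of (13) on a finite state space: a measure becomes a probability vector q, the kernel a matrix K, and (13) reads dq/dt = (v − ⟨q, v⟩)q with v = K(q′ − q). The sketch also checks the Lyapunov property (16) numerically.

```python
import numpy as np

# Forward-Euler simulation of the birth-death dynamics (13) on a grid,
# with target q' and a Gaussian kernel matrix; the squared MMD
# (q - q')^T K (q - q') must be non-increasing, per (16).
rng = np.random.default_rng(0)
N, dt, steps = 50, 0.05, 4000

xs = np.linspace(-1.0, 1.0, N)
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / 0.1)             # kernel matrix

q_target = np.exp(-xs ** 2 / 0.2); q_target /= q_target.sum()   # Q'
q = np.full(N, 1.0 / N)                                         # uniform Q_0

def mmd2(p):
    return (p - q_target) @ K @ (p - q_target)

prev = mmd2(q)
for _ in range(steps):
    v = K @ (q_target - q)                # v(x; Q) = E_{(Q'-Q)}[k(x, .)]
    q = q + dt * (v - q @ v) * q          # Euler step of (13)
    q = np.clip(q, 0.0, None); q /= q.sum()    # guard against float drift
    cur = mmd2(q)
    assert cur <= prev + 1e-9, "Lyapunov property (16) violated"
    prev = cur

print("final MMD^2:", prev)               # decays toward 0: Q_t -> Q'
```

Note that the Euler step preserves total mass exactly, since Σᵢ qᵢ v̄ᵢ = 0; the clipping and renormalization only guard against floating-point drift.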

Lemma 5.6.

Given any initialization Q0𝒫Q_{0}\in\mathcal{P}_{*}, if the limit point set 𝒬\mathcal{Q} contains only one point QQ_{\infty}, then Q𝒫Q_{\infty}\in\mathcal{P}_{*} and thus Q=QQ_{\infty}=Q^{\prime}. Else, 𝒬\mathcal{Q} is contained in 𝒫(K)𝒫\mathcal{P}(K)-\mathcal{P}_{*}.

Proof.

For any open subset AA that intersects sprtQ0\text{sprt}Q_{0}, we have Q0(A)>0Q_{0}(A)>0. Also

ddtQt(A)\displaystyle\frac{d}{dt}Q_{t}(A) =𝔼Qt[𝟏A(𝐱)v¯(𝐱;Qt)]\displaystyle=\mathbb{E}_{Q_{t}}[\mathbf{1}_{A}(\mathbf{x})\overline{v}(\mathbf{x};Q_{t})]
Qt(A)v¯(Qt)L(Qt)\displaystyle\geq-Q_{t}(A)\|\overline{v}(Q_{t})\|_{L^{\infty}(Q_{t})}
4kC(K×K)Qt(A)\displaystyle\geq-4\|k\|_{C(K\times K)}Q_{t}(A)

So Qt(A)Q_{t}(A) remains positive for all finite tt. It follows that sprtQ0sprtQt\text{sprt}Q_{0}\subseteq\text{sprt}Q_{t} and Qt𝒫Q_{t}\in\mathcal{P}_{*} for all tt.

First, consider the case 𝒬 = {Q_∞}. Assume for contradiction that Q_∞ = Q̃ for some Q̃ ∈ 𝒫_o − {Q′} ⊆ 𝒫(K) − 𝒫_*. Since v̄ is centered by definition,

\mathbb{E}_{\tilde{Q}}[\overline{v}(\mathbf{x};\tilde{Q})]=0

and thus

𝔼Q[v¯(𝐱;Q~)]\displaystyle\mathbb{E}_{Q^{\prime}}[\overline{v}(\mathbf{x};\tilde{Q})] =𝔼QQ~[v¯(𝐱;Q~)]\displaystyle=\mathbb{E}_{Q^{\prime}-\tilde{Q}}[\overline{v}(\mathbf{x};\tilde{Q})]
=𝔼QQ~[v(𝐱;Q~)]\displaystyle=\mathbb{E}_{Q^{\prime}-\tilde{Q}}[v(\mathbf{x};\tilde{Q})]
=QQ~k2\displaystyle=\|Q^{\prime}-\tilde{Q}\|_{k}^{2}
>0\displaystyle>0

In particular, there exist a measurable subset S_o ⊆ sprtQ′ with Q′(S_o) > 0 and some δ > 0 such that

𝐱So,v¯(𝐱;Q~)>2δ\forall\mathbf{x}\in S_{o},~{}\overline{v}(\mathbf{x};\tilde{Q})>2\delta

By the continuity of v̄(·; Q̃), there exists an open set S ⊇ S_o whose closure S̄ satisfies

𝐱S¯,v¯(𝐱;Q~)δ\forall\mathbf{x}\in\overline{S},~{}\overline{v}(\mathbf{x};\tilde{Q})\geq\delta

Meanwhile, since S is open with Q′(S) > 0 and sprtQ′ ⊆ sprtQ_t, we have Q_t(S) > 0 for all t. On the other hand, (15) implies that S̄ is disjoint from sprtQ̃, since v̄(·; Q̃) vanishes on sprtQ̃ but is at least δ on S̄.

Since v¯\overline{v} is continuous over (𝐱,Q)K×(𝒫(K),k)(\mathbf{x},Q)\in K\times(\mathcal{P}(K),\|\cdot\|_{k}) and S¯\overline{S} is compact, there exists some neighborhood Br(Q~)={Q𝒫(K)|QQ~k<r}B_{r}(\tilde{Q})=\{Q\in\mathcal{P}(K)~{}|~{}\|Q-\tilde{Q}\|_{k}<r\} such that

\forall Q\in B_{r}(\tilde{Q}),~\forall\mathbf{x}\in\overline{S},\quad\overline{v}(\mathbf{x};Q)\geq 0

Since the trajectory QtQ_{t} converges in the MMD distance k\|\cdot\|_{k} to Q~\tilde{Q}, there exists some time t0t_{0} such that for all tt0t\geq t_{0}, QtBr(Q~)Q_{t}\in B_{r}(\tilde{Q}). It follows that

ddtQt(S¯)=𝔼Qt[𝟏S¯(𝐱)v¯(𝐱;Qt)]0\frac{d}{dt}Q_{t}(\overline{S})=\mathbb{E}_{Q_{t}}[\mathbf{1}_{\overline{S}}(\mathbf{x})\overline{v}(\mathbf{x};Q_{t})]\geq 0

so that Q_t(S̄) ≥ Q_{t₀}(S̄) ≥ Q_{t₀}(S) > 0 for all t ≥ t₀. Yet, Lemma 5.3 implies that Q_t converges weakly to Q̃, so that

0=\tilde{Q}(\overline{S})\geq\limsup_{t\to\infty}Q_{t}(\overline{S})

a contradiction. We conclude that the limit point Q_∞ does not belong to 𝒫_o − {Q′}. By Lemma 5.5, we must have Q_∞ = Q′.

Next, consider the case when 𝒬 has more than one point. Inequality (16) implies that L(Q) = ‖Q − Q′‖²_k is monotonically non-increasing along the flow Q_t. If Q′ ∈ 𝒬, then lim_{t→∞} L(Q_t) = 0 and thus 𝒬 = {Q′}, a contradiction. Hence, 𝒬 ⊆ 𝒫_o − {Q′} ⊆ 𝒫(K) − 𝒫_*. ∎

Proof of Lemma 3.8.

Since V0C(K)V_{0}\in C(K), the initialization Q0Q_{0} has full support over KK and thus Q0𝒫Q_{0}\in\mathcal{P}_{*}. If QtQ_{t} converges weakly to some limit QQ_{\infty}, Lemma 5.3 implies that QtQ_{t} also converges in MMD metric to QQ_{\infty}. Then, Lemma 5.6 implies that the limit QQ_{\infty} must be QQ^{\prime}.

If there is more than one limit point, then Lemma 5.6 implies that all limit points belong to 𝒫(K) − 𝒫_*, and thus none covers the full support of Q′. ∎

Proof of Proposition 3.7.

We simply set Q=Q(n)Q^{\prime}=Q_{*}^{(n)}. Note that since at(n)a_{t}^{(n)} is trained by

ddtat(n)(𝐰)=𝔼(Qt(n)Q(n))(𝐱)[σ(𝐰𝐱~)]\frac{d}{dt}a_{t}^{(n)}(\mathbf{w})=\mathbb{E}_{(Q_{t}^{(n)}-Q_{*}^{(n)})(\mathbf{x})}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]

the training dynamics for the potential Vt(n)V_{t}^{(n)} is the same as in Lemma 3.8

ddtVt(n)(𝐱)=𝔼(Qt(n)Q(n))(𝐱)[k(𝐱,𝐱)]\frac{d}{dt}V^{(n)}_{t}(\mathbf{x})=\mathbb{E}_{(Q_{t}^{(n)}-Q_{*}^{(n)})(\mathbf{x}^{\prime})}[k(\mathbf{x},\mathbf{x}^{\prime})]

with kernel kk defined by

k(𝐱,𝐱)=𝔼ρ0(𝐰)[σ(𝐰𝐱~)σ(𝐰𝐱~)]k(\mathbf{x},\mathbf{x}^{\prime})=\mathbb{E}_{\rho_{0}(\mathbf{w})}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}}^{\prime})]

It is straightforward to check that kk is integrally strictly positive definite: For any m(K)m\in\mathcal{M}(K), if

0=\|m\|_{k}^{2}=\mathbb{E}_{\rho_{0}(\mathbf{w})}\Big[\big(\mathbb{E}_{m(\mathbf{x})}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]\big)^{2}\Big]

then for ρ0\rho_{0}-almost all 𝐰\mathbf{w}, 𝔼m(𝐱)[σ(𝐰𝐱~)]=0\mathbb{E}_{m(\mathbf{x})}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]=0. It follows that for all random feature models ff from (7), we have 𝔼m(𝐱)[f(𝐱)]=0\mathbb{E}_{m(\mathbf{x})}[f(\mathbf{x})]=0. Assuming Remark 1, the random feature models are dense in C(K)C(K) by Proposition 2.1, so this equality holds for all fC(K)f\in C(K). Hence, m=0m=0 and kk is integrally strictly positive definite.

Hence, Lemma 3.8 implies that if Q_t^{(n)} has one limit point, then Q_t^{(n)} converges weakly to Q_*^{(n)}. Else, no limit point can cover the support of Q_*^{(n)}, and thus none has full support over K. Since the true target distribution Q_* is generated by a continuous potential V_*, it has full support; hence Q_* does not belong to 𝒬, and KL(Q_*‖Q) = ∞ for all Q ∈ 𝒬. Similarly, we must have

lim inftVt(n)=\liminf_{t\to\infty}\|V_{t}^{(n)}\|_{\mathcal{H}}=\infty

otherwise some subsequence of Qt(n)Q_{t}^{(n)} would converge to a limit with full support. ∎
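For concreteness, the following sketch (ours) approximates this random feature kernel by Monte Carlo and checks strict positive definiteness on a finite set of points. The specific choices below, ReLU σ, the augmented input 𝐱̃ = (𝐱, 1), and a convenient ρ₀ supported on the ℓ¹ sphere, are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

# Monte Carlo approximation of k(x, x') = E_{rho_0(w)}[sigma(w.xt) sigma(w.xt')]
# with m random features, followed by an eigenvalue check of the Gram matrix
# on N distinct points (positive eigenvalues indicate strict positive
# definiteness on this point set).
rng = np.random.default_rng(0)
d, m, N = 2, 20000, 40

def sample_rho0(size, dim):
    # a convenient distribution on the l^1 sphere {||w||_1 = 1} (assumption)
    return rng.dirichlet(np.ones(dim), size=size) * rng.choice([-1, 1], (size, dim))

def kernel_matrix(X):
    Xt = np.hstack([X, np.ones((len(X), 1))])   # augment x -> (x, 1)
    W = sample_rho0(m, Xt.shape[1])
    feats = np.maximum(W @ Xt.T, 0.0)           # sigma(w . xt), ReLU
    return feats.T @ feats / m                  # average over the m features

X = rng.uniform(-1.0, 1.0, size=(N, d))        # distinct points in [-1, 1]^d
eigs = np.linalg.eigvalsh(kernel_matrix(X))
print("smallest eigenvalue:", eigs.min())       # positive for m >> N
```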

5.6 Proof for the Regularized Model

Lemma 5.7.

For any R0R\geq 0, there exists a minimizer of (11).

Proof.

Since the closed ball BR={aL2(ρ0)R}B_{R}=\{\|a\|_{L^{2}(\rho_{0})}\leq R\} is weakly compact in L2(ρ0)L^{2}(\rho_{0}), it suffices to show that the mapping

L(n)(a)=𝔼ρ0(𝐰)[a(𝐰)𝔼Q(n)(𝐱)[σ(𝐰𝐱~)]]+log𝔼P(𝐱)[e𝔼ρ0(𝐰)[a(𝐰)σ(𝐰𝐱~)]]L^{(n)}(a)=\mathbb{E}_{\rho_{0}(\mathbf{w})}\big{[}a(\mathbf{w})\mathbb{E}_{Q_{*}^{(n)}(\mathbf{x})}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]\big{]}+\log\mathbb{E}_{P(\mathbf{x})}\big{[}e^{-\mathbb{E}_{\rho_{0}(\mathbf{w})}[a(\mathbf{w})\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]}\big{]}

is weakly continuous over B_R (for instance, one can show that the term 𝔼_P[e^{−𝔼_{ρ₀(𝐰)}[a(𝐰)σ(𝐰·𝐱̃)]}] is the uniform limit of a sequence of weakly continuous functions over B_R). Then, every minimizing sequence of L^{(n)} in B_R has a weakly convergent subsequence whose limit is a minimizer of (11). ∎

Proof of Proposition 3.9.

For any aL2(ρ0)a\in L^{2}(\rho_{0}),

|L(a)L(n)(a)|\displaystyle|L(a)-L^{(n)}(a)| 𝔼ρ0(𝐰)[|𝔼QQ(n)[a(𝐰)σ(𝐰𝐱~)]|]\displaystyle\leq\mathbb{E}_{\rho_{0}(\mathbf{w})}\big{[}|\mathbb{E}_{Q_{*}-Q_{*}^{(n)}}[a(\mathbf{w})\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]|\big{]}
\leq\|a\|_{L^{2}(\rho_{0})}\cdot\sup_{\|\mathbf{w}\|_{1}\leq 1}\big|\mathbb{E}_{Q_{*}-Q_{*}^{(n)}}[\sigma(\mathbf{w}\cdot\tilde{\mathbf{x}})]\big|

Thus, Lemma 3.5 implies that with probability 1δ1-\delta over the sampling of Q(n)Q_{*}^{(n)},

|L(a)L(n)(a)|aL2(ρ0)(42log2dn+2log(2/δ)n)|L(a)-L^{(n)}(a)|\leq\|a\|_{L^{2}(\rho_{0})}\cdot\Big{(}4\sqrt{\frac{2\log 2d}{n}}+\sqrt{\frac{2\log(2/\delta)}{n}}\Big{)} (17)

It follows that

L(aR(n))\displaystyle L(a^{(n)}_{R}) L(n)(aR(n))+42log2d+2log(2/δ)nR\displaystyle\leq L^{(n)}(a^{(n)}_{R})+\frac{4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}}{\sqrt{n}}R
L(n)(a)+42log2d+2log(2/δ)nR\displaystyle\leq L^{(n)}(a_{*})+\frac{4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}}{\sqrt{n}}R
\leq L(a_{*})+\frac{4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}}{\sqrt{n}}\big(R+\|a_{*}\|_{L^{2}(\rho_{0})}\big)

where the first and third inequalities follow from (17), and the second follows from the fact that a_R^{(n)} minimizes L^{(n)} over the ball {‖a‖_{L²(ρ₀)} ≤ R}, which contains a_*.

Hence,

KL(Q_{*}\|Q^{(n)}_{R})=L(a^{(n)}_{R})-L(a_{*})\leq 2R\cdot\frac{4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}}{\sqrt{n}}

where we used ‖a_*‖_{L²(ρ₀)} ≤ R. ∎

6 Discussion

Let us summarize some of the insights obtained in this paper:

  • For distribution-learning models, good generalization can be characterized by dimension-independent a priori error estimates for early-stopping solutions. As demonstrated by the proof of Theorem 3.3, such estimates are enabled by two conditions:

    1. 1.

      Fast global convergence is guaranteed for learning distributions that can be represented by the model, with an explicit and dimension-independent rate. For our example, this results from the convexity of the model.

    2. 2.

      The model is insensitive to the sampling error QQ(n)Q_{*}-Q_{*}^{(n)}, so memorization happens very slowly and early-stopping solutions generalize well. For our example, this is enabled by the small Rademacher complexity of the random feature model.

  • Memorization seems inevitable for all sufficiently expressive models (Proposition 3.7), and the generalization error \widetilde{L} will eventually deteriorate to either n^{-O(1/d)} or ∞. Thus, instead of the long-time limit t → ∞, one needs to consider early stopping.

    The basic approach, as suggested by Theorem 3.3, is to choose an appropriate function representation such that, with absolute constants α1,α2>0\alpha_{1},\alpha_{2}>0, there exists an early-stopping interval [Tmin,Tmax][T_{\min},T_{\max}] with Tminnα1TmaxT_{\min}\ll n^{\alpha_{1}}\ll T_{\max} and

    supt[Tmin,Tmax]L~(Q,Qt(n))=O(nα2)\sup_{t\in[T_{\min},T_{\max}]}\widetilde{L}\big{(}Q_{*},Q_{t}^{(n)}\big{)}=O(n^{-\alpha_{2}}) (18)

    Then, with a reasonably large sample size (polynomial in the precision ϵ⁻¹), the early-stopping interval becomes sufficiently wide and hard to miss, and the corresponding generalization error will be satisfactorily small. A minimal numerical sketch of this picture is given after this list.

  • A distribution-learning model can be posed as a calculus of variations problem. Given a training objective L(Q)L(Q) and distribution representation Q(f)Q(f), this problem is entirely determined by the function representation or function space {f<}\{\|f\|<\infty\}. Given a training rule, the choice of the function representation then determines the trainability (Proposition 3.2) and generalization ability (Theorem 3.3) of the model.
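The following minimal 1-d experiment (ours; all constants are hypothetical) illustrates the early-stopping picture (18): it runs the potential dynamics from the proof of Proposition 3.7 with a random-feature kernel on a grid and tracks the true test error KL(Q_*‖Q_t), which typically dips early and then slowly degrades as the model begins to fit the sampling noise in Q_*^{(n)}.

```python
import numpy as np

# Potential dynamics dV/dt = E_{(Q_t - Q_*^(n))}[k(x, .)], Q_t ~ P e^{-V},
# discretized on a grid with P = Unif[-1, 1] and a random ReLU-feature kernel.
# We monitor the true test error KL(Q_* || Q_t), which is unavailable in
# practice but known here since Q_* is synthetic.
rng = np.random.default_rng(0)
G, n, m, lr, steps = 400, 100, 300, 0.5, 3000

xs = np.linspace(-1.0, 1.0, G)
q_star = np.exp(-2.0 * xs ** 2); q_star /= q_star.sum()    # target Q_*

samples = rng.choice(G, size=n, p=q_star)                  # n i.i.d. samples
q_emp = np.bincount(samples, minlength=G) / n              # empirical Q_*^(n)

W = rng.dirichlet(np.ones(2), size=m) * rng.choice([-1, 1], (m, 2))
Phi = np.maximum(np.outer(xs, W[:, 0]) + W[:, 1], 0.0)     # ReLU features
K = Phi @ Phi.T / m                                        # random feature kernel

V, kl = np.zeros(G), []
for t in range(steps):
    q_model = np.exp(-(V - V.min())); q_model /= q_model.sum()   # Q_t on the grid
    kl.append(float(np.sum(q_star * np.log(q_star / q_model))))
    V += lr * (K @ (q_model - q_emp))                      # Euler step of the flow

best = int(np.argmin(kl))
print(f"best test KL {kl[best]:.4f} at step {best}; final test KL {kl[-1]:.4f}")
```

In practice the early-stopping time would of course be located with a held-out validation set, since the true Q_* is unavailable.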

Future work can be developed from the above insights:

  • Generalization error estimates for GANs

    The Rademacher complexity argument should be applicable to GANs to bound the deviation GtGt(n)L2(P)\|G_{t}-G_{t}^{(n)}\|_{L^{2}(P)}, where Gt,Gt(n)G_{t},G_{t}^{(n)} are the generators trained on QQ_{*} and Q(n)Q_{*}^{(n)} respectively. Nevertheless, the difficulty is in the convergence analysis. Unlike bias potential models, the training objective of GAN is non-convex in the generator GG, and the solutions to G#P=QG\#P=Q_{*} are in general not unique.

  • Mode collapse

    If we consider mode collapse as a form of bad local minima, then it can benefit from a study of the critical points of GAN, once we pose GAN as a calculus of variations problem. Unlike the bias potential model, whose parameter function V ranges in the Hilbert space ℋ, GANs are formulated on the Wasserstein manifold, whose tangent space L²(Q;ℝ^d) depends significantly on the current position Q. In particular, the behavior of the gradient flow differs depending on whether Q is absolutely continuous, and we expect that successful GAN models maintain the absolute continuity of the trajectory Q_t.

  • New designs

    The design of distribution-learning models can benefit from a mathematical understanding. For instance, regarding the early-stopping interval (18): can there be training rules better than gradient flow that reduce T_min or postpone T_max, so that early stopping becomes easier to perform?

References

  • [1] Arjovsky, M., and Bottou, L. Towards principled methods for training generative adversarial networks, 2017.
  • [2] Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573 (2017).
  • [3] Bai, Y., Ma, T., and Risteski, A. Approximability of discriminators implies diversity in GANs, 2019.
  • [4] Barron, A. R., and Sheu, C.-H. Approximation of density functions by sequences of exponential families. The Annals of Statistics (1991), 1347–1369.
  • [5] Bau, D., Strobelt, H., Peebles, W., Wulff, J., Zhou, B., Zhu, J.-Y., and Torralba, A. Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics 38, 4 (Jul 2019), 1–11.
  • [6] Bonati, L., Zhang, Y.-Y., and Parrinello, M. Neural networks-based variationally enhanced sampling. In Proceedings of the National Academy of Sciences (2019), vol. 116, pp. 17641–17647.
  • [7] Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018).
  • [8] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.
  • [9] Bubeck, S., Eldan, R., and Lehec, J. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry 59, 4 (2018), 757–783.
  • [10] Canu, S., and Smola, A. Kernel methods and the exponential family. Neurocomputing 69, 7-9 (2006), 714–720.
  • [11] Casert, C., Mills, K., Vieijra, T., Ryckebusch, J., and Tamblyn, I. Optical lattice experiments at unobserved conditions and scales through generative adversarial deep learning, 2020.
  • [12] Che, T., Li, Y., Jacob, A., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136 (2016).
  • [13] Chizat, L., and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems (2018), pp. 3036–3046.
  • [14] Donahue, J., and Simonyan, K. Large scale adversarial representation learning. In Advances in neural information processing systems (2019), pp. 10542–10552.
  • [15] E, W., Ma, C., and Wang, Q. A priori estimates of the population risk for residual networks. arXiv preprint arXiv:1903.02154 1, 7 (2019).
  • [16] E, W., Ma, C., Wojtowytsch, S., and Wu, L. Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t, 2020.
  • [17] E, W., Ma, C., and Wu, L. A priori estimates for two-layer neural networks. arXiv preprint arXiv:1810.06397 (2018).
  • [18] E, W., Ma, C., and Wu, L. Barron spaces and the compositional function spaces for neural network models. arXiv preprint arXiv:1906.08039 (2019).
  • [19] E, W., Ma, C., and Wu, L. Machine learning from a continuous viewpoint. arXiv preprint arXiv:1912.12777 (2019).
  • [20] E, W., Ma, C., and Wu, L. On the generalization properties of minimum-norm solutions for over-parameterized neural network models. arXiv preprint arXiv:1912.06987 (2019).
  • [21] Elgammal, A., Liu, B., Elhoseiny, M., and Mazzone, M. CAN: Creative adversarial networks, generating “art” by learning about styles and deviating from style norms, 2017.
  • [22] Feizi, S., Farnia, F., Ginart, T., and Tse, D. Understanding GANs in the LQG setting: Formulation, generalization and stability. IEEE Journal on Selected Areas in Information Theory 1, 1 (2020), 304–311.
  • [23] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems (2014), pp. 2672–2680.
  • [24] Gulrajani, I., Raffel, C., and Metz, L. Towards GAN benchmarks which require generalization. arXiv preprint arXiv:2001.03653 (2020).
  • [25] Jewell, N. P. Mixtures of exponential distributions. Ann. Statist. 10, 2 (06 1982), 479–484.
  • [26] Kiefer, J., and Wolfowitz, J. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist. 27, 4 (12 1956), 887–906.
  • [27] Lei, Q., Lee, J. D., Dimakis, A. G., and Daskalakis, C. SGD learns one-layer networks in WGANs, 2020.
  • [28] Lovász, L., and Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms 30, 3 (2007), 307–358.
  • [29] Mao, Y., He, Q., and Zhao, X. Designing complex architectured materials with generative adversarial networks. Science Advances 6, 17 (2020).
  • [30] Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences 115, 33 (2018), E7665–E7671.
  • [31] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
  • [32] Pinkus, A. Approximation theory of the MLP model in neural networks. Acta Numerica 8, 1 (1999), 143–195.
  • [33] Prykhodko, O., Johansson, S. V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E. J., Engkvist, O., and Chen, H. A de novo molecular generation method using latent vector based generative adversarial network. Journal of Cheminformatics 11, 74 (Dec 2019), 1–11.
  • [34] Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
  • [35] Rahimi, A., and Recht, B. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing (2008), IEEE, pp. 555–561.
  • [36] Redner, R. A., and Walker, H. F. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26, 2 (1984), 195–239.
  • [37] Roberts, G. O., and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2, 4 (1996), 341–363.
  • [38] Rotskoff, G. M., and Vanden-Eijnden, E. Trainability and accuracy of neural networks: An interacting particle system approach. stat 1050 (2019), 30.
  • [39] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition challenge, 2015.
  • [40] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in neural information processing systems (2016), pp. 2234–2242.
  • [41] Shalev-Shwartz, S., and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [42] Simon-Gabriel, C.-J., Barp, A., and Mackey, L. Metrizing weak convergence with maximum mean discrepancies, 2020.
  • [43] Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research 18 (2017).
  • [44] Sun, Y., Gilbert, A., and Tewari, A. On the approximation properties of random relu features. arXiv preprint arXiv:1810.04374 (2018).
  • [45] Vahdat, A., and Kautz, J. NVAE: A deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898 (2020).
  • [46] Valsson, O., and Parrinello, M. Variational approach to enhanced sampling and free energy calculations. Physical review letters 113 (2014).
  • [47] Wand, M. P., and Jones, M. C. Kernel smoothing. Crc Press, 1994.
  • [48] Weed, J., and Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087 (2017).
  • [49] Wu, S., Dimakis, A. G., and Sanghavi, S. Learning distributions generated by one-layer relu networks. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8107–8117.
  • [50] Yuan, L., Kirshner, S., and Givan, R. Estimating densities with non-parametric exponential families. arXiv preprint arXiv:1206.5036 (2012).
  • [51] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016).
  • [52] Zhang, L., E, W., and Wang, L. Monge-Ampère flow for generative modeling. arXiv preprint arXiv:1809.10188 (2018).
  • [53] Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. On the discrimination-generalization tradeoff in GANs. arXiv preprint arXiv:1711.02771 (2017).