
Learning Manifold Implicitly via Explicit Heat-Kernel Learning

Yufan Zhou, Changyou Chen, Jinhui Xu
Department of Computer Science and Engineering
State University of New York at Buffalo
{yufanzho, changyou, jinhui}@buffalo.edu
Abstract

Manifold learning is a fundamental problem in machine learning with numerous applications. Most existing methods directly learn the low-dimensional embedding of data lying in a high-dimensional space, and usually lack the flexibility of being directly applicable to down-stream applications. In this paper, we propose the concept of implicit manifold learning, where manifold information is implicitly obtained by learning the associated heat kernel. A heat kernel is the solution of the corresponding heat equation, which describes how “heat” transfers on the manifold, and thus contains ample geometric information about the manifold. We provide both a practical algorithm and a theoretical analysis of our framework. The learned heat kernel can be applied to various kernel-based machine learning models, including deep generative models (DGMs) for data generation and Stein Variational Gradient Descent for Bayesian inference. Extensive experiments show that our framework achieves state-of-the-art results compared to existing methods on the two tasks.

1 Introduction

The manifold is an important concept in machine learning, where a typical assumption is that data are sampled from a low-dimensional manifold embedded in some high-dimensional space. There has been extensive research on utilizing the hidden geometric information of data samples [1, 2, 3]. For example, Laplacian eigenmap [1], a popular dimensionality-reduction algorithm, represents the low-dimensional manifold by a graph built from the neighborhood information of the data. Each data point serves as a node in the graph, edges are determined by methods like k-nearest neighbors, and weights are computed using a Gaussian kernel. With this graph, one can then compute essential quantities such as the graph Laplacian and its eigenvalues, which embed the data points into a k-dimensional space (using the eigenvectors corresponding to the k smallest non-zero eigenvalues), following the principle of preserving the proximity of data points in both the original and the embedded spaces. This approach ensures that, as the number of data samples goes to infinity, the graph Laplacian converges to the Laplace-Beltrami operator, a key operator defining the heat equation used in our approach.
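As a concrete illustration (not part of the paper), the following NumPy/SciPy sketch implements this classical pipeline: a k-nearest-neighbor graph with Gaussian weights, the graph Laplacian, and the embedding given by the eigenvectors of the smallest non-trivial eigenvalues. The neighborhood size, bandwidth, and embedding dimension are arbitrary illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors=10, sigma=1.0, embed_dim=2):
    """Embed the rows of X into embed_dim dimensions via Laplacian eigenmaps."""
    n = X.shape[0]
    D = cdist(X, X, metric="sqeuclidean")

    # k-nearest-neighbor graph with Gaussian (heat-kernel) edge weights.
    W = np.exp(-D / (2 * sigma ** 2))
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]       # skip self (distance 0)
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(n), n_neighbors)
    mask[rows, idx.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)                      # symmetrize the kNN graph

    # Graph Laplacian L = Deg - W; generalized eigenproblem L v = lambda Deg v.
    Deg = np.diag(W.sum(axis=1) + 1e-12)
    L = Deg - W
    vals, vecs = eigh(L, Deg)                                # eigenvalues in ascending order

    # Skip the trivial (near-zero) eigenvector; keep the next embed_dim ones.
    return vecs[:, 1:embed_dim + 1]
```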

In deep learning, there are also methods that try to directly learn the Riemannian metric of a manifold rather than its embedding. For example, [4] and [5] approximate the Riemannian metric using the Jacobian matrix of a function that maps latent variables to data samples.

Different from the aforementioned results that learn the embedding or Riemannian metric directly, we propose to learn the manifold implicitly by explicitly learning its associated heat kernel. The heat kernel describes heat diffusion on a manifold and thus encodes a great deal of its geometric information. Note that, unlike Laplacian eigenmap, which relies on graph construction, our proposed method targets the lower-level problem of directly learning the geometry-encoding heat kernel, which can subsequently be used in Laplacian eigenmap or diffusion map [1, 6], where a kernel function is required. Once the heat kernel is learned, it can be directly applied to a large family of kernel-based machine learning models, thus making the geometric information of the manifold more applicable to down-stream applications. A recent work [7] also utilizes the heat kernel in variational inference, but with a very different approach from ours. Specifically, our proposed framework approximates the unknown heat kernel by optimizing a deep neural network, based on the theory of Wasserstein Gradient Flows (WGFs) [8]. In summary, our paper has the following main contributions.

  • We introduce the concept of implicit manifold learning, which learns the geometric information of a manifold through its heat kernel, and propose a theoretically grounded and practically simple algorithm based on the WGF framework.

  • We demonstrate how to apply our framework to different applications. Specifically, we show that DGMs like MMD-GAN [9] are special cases of our proposed framework, thus bringing new insights into Generative Adversarial Networks (GANs). We further show that Stein Variational Gradient Descent (SVGD) [10] can also be improved using our framework.

  • Experiments suggest that our proposed framework achieves the state-of-the-art results for applications including image generation and Bayesian neural network regression with SVGD.

Relation with traditional kernel-based learning

Our proposed method is also related to kernel selection and kernel learning, and thus can be used to improve many kernel-based methods. Compared to pre-defined kernels, our learned kernels can seamlessly integrate the geometric information of the underlying manifold. Compared to some existing kernel-learning methods such as [11, 12], our framework is more theoretically motivated and practically superior. Furthermore, [11, 12] learn kernels by maximizing the Maximum Mean Discrepancy (MMD), which is not suitable when only one distribution is involved, e.g., when learning the parameter manifold in Bayesian inference.

2 Preliminaries

2.1 Riemannian Manifold

We use $\mathcal{M}$ to denote a manifold, and $\dim(\mathcal{M})$ to denote its dimensionality. We only briefly introduce the needed concepts, with formal definitions and details provided in the Appendix. A Riemannian manifold, $(\mathcal{M},g)$, is a real smooth manifold $\mathcal{M}\subset R^{d}$ equipped with an inner product, defined by a positive definite metric tensor $g$ varying smoothly on the tangent spaces of $\mathcal{M}$. Given an oriented Riemannian manifold, there exists a Riemannian volume element $\mathrm{d}m$ [13], which can be expressed in local coordinates as $\mathrm{d}m=\sqrt{|g|}\,\mathrm{d}x^{1}\wedge\cdots\wedge\mathrm{d}x^{d}$, where $|g|$ is the absolute value of the determinant of the metric tensor's matrix representation, and $\wedge$ denotes the exterior product of differential forms. The Riemannian volume element allows us to integrate functions on manifolds: let $f$ be a smooth, compactly supported function on $\mathcal{M}$; the integral of $f$ over $\mathcal{M}$ is defined as $\int_{\mathcal{M}}f\,\mathrm{d}m$. We can now define probability density functions (PDFs) on a manifold [14, 15]. Let ${\bm{\mu}}$ be a probability measure on $\mathcal{M}\subset R^{d}$ such that ${\bm{\mu}}(\mathcal{M})=1$. A PDF $p$ of ${\bm{\mu}}$ on $\mathcal{M}$ is a real, positive and integrable function satisfying ${\bm{\mu}}(S)=\int_{\mathbf{x}\in S}p(\mathbf{x})\,\mathrm{d}m,\ \forall S\subset\mathcal{M}$.
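As a standard worked example (not from the paper), consider the round 2-sphere of radius $r$ with spherical coordinates $(\theta,\varphi)$; its metric, volume element, and total volume are:

```latex
\[
g = \begin{pmatrix} r^{2} & 0 \\ 0 & r^{2}\sin^{2}\theta \end{pmatrix},
\qquad
\mathrm{d}m = \sqrt{|g|}\,\mathrm{d}\theta\wedge\mathrm{d}\varphi
            = r^{2}\sin\theta\,\mathrm{d}\theta\wedge\mathrm{d}\varphi,
\qquad
\int_{\mathcal{M}}\mathrm{d}m
  = \int_{0}^{2\pi}\!\!\int_{0}^{\pi} r^{2}\sin\theta\,\mathrm{d}\theta\,\mathrm{d}\varphi
  = 4\pi r^{2}.
\]
```

A uniform PDF on this sphere is then simply $p(\mathbf{x})=1/(4\pi r^{2})$, since $\int_{\mathcal{M}}p\,\mathrm{d}m=1$.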

The Ricci curvature tensor plays an important role in Riemannian geometry. It describes how a Riemannian manifold differs from Euclidean space, measured as the volume difference between a narrow conical piece of a small geodesic ball in the manifold and the corresponding piece of a ball with the same radius in Euclidean space. In this paper, we focus on Riemannian manifolds with positive Ricci curvature.

2.2 Heat Equation and Heat Kernel

The key ingredient in implicit manifold learning is the heat kernel, which encodes extensive geodesic information of the manifold. Intuitively, the heat kernel $k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})$ describes the process of heat diffusion on the manifold $\mathcal{M}$, given a heat source $\mathbf{x}_{0}$ and time $t$. Throughout the paper, when $\mathbf{x}_{0}$ is assumed to be fixed, we write $k_{\mathcal{M}}^{t}(\mathbf{x})$ for $k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})$ for notational simplicity. The heat equation and heat kernel are defined as follows.

Definition 1 ([16])

Let $(\mathcal{M},g)$ be a connected Riemannian manifold, and $\Delta$ be the Laplace-Beltrami operator on $\mathcal{M}$. The heat kernel $k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})$ is the minimal positive solution of the heat equation
$$\partial k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})/\partial t=\Delta_{\mathbf{x}}k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x}),\qquad \lim_{t\rightarrow 0^{+}}k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})=\delta_{\mathbf{x}_{0}}(\mathbf{x}).$$

Remarkably, the heat kernel encodes a massive amount of geometric information about the manifold and is closely related to the geodesic distance on the manifold.

Lemma 1 ([16])

For an arbitrary Riemannian manifold $\mathcal{M}$, $\log k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})\sim-d_{\mathcal{M}}^{2}(\mathbf{x}_{0},\mathbf{x})/(4t)$ as $t\rightarrow 0$, where $d_{\mathcal{M}}(\mathbf{x}_{0},\mathbf{x})$ is the geodesic distance on the manifold $\mathcal{M}$ and $\mathbf{x}_{0},\mathbf{x}\in\mathcal{M}$.

This relation indicates that learning the heat kernel also implicitly learns the corresponding manifold; for this reason, we call our method “implicit” manifold learning. It is known that the heat kernel is positive definite [17], contains all the information of the intrinsic geometry, and fully characterizes shapes up to isometry [18]. As a result, the heat kernel has been widely used in computer graphics [18, 19].
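For intuition, consider the Euclidean case (also used in the toy experiment of Section 4.1), where the heat kernel is the Gaussian transition density; taking logarithms shows the asymptotic behavior of Lemma 1, with $d_{\mathcal{M}}$ the Euclidean distance:

```latex
\[
k_{R^{d}}(t,\mathbf{x}_{0},\mathbf{x}) = (4\pi t)^{-d/2}\exp\!\Big(-\tfrac{\|\mathbf{x}-\mathbf{x}_{0}\|^{2}}{4t}\Big),
\qquad
\log k_{R^{d}}(t,\mathbf{x}_{0},\mathbf{x}) = -\tfrac{\|\mathbf{x}-\mathbf{x}_{0}\|^{2}}{4t} - \tfrac{d}{2}\log(4\pi t),
\]
```

and the first term dominates as $t\rightarrow 0$.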

2.3 Wasserstein Gradient Flows

Let $\mathcal{P}(\mathcal{M})$ denote the space of probability measures on $\mathcal{M}\subset R^{d}$. Assume that $\mathcal{P}(\mathcal{M})$ is endowed with a Riemannian geometry induced by the 2-Wasserstein distance, i.e., the distance between two probability measures ${\bm{\mu}}_{\mathcal{M}},{\bm{\nu}}_{\mathcal{M}}\in\mathcal{P}(\mathcal{M})$ is defined as $d_{W}^{2}({\bm{\mu}}_{\mathcal{M}},{\bm{\nu}}_{\mathcal{M}})=\inf_{\pi\in\Gamma({\bm{\mu}}_{\mathcal{M}},{\bm{\nu}}_{\mathcal{M}})}\int_{\mathcal{M}\times\mathcal{M}}\|\mathbf{x}-\mathbf{y}\|^{2}\,\mathrm{d}\pi$, where $\Gamma({\bm{\mu}}_{\mathcal{M}},{\bm{\nu}}_{\mathcal{M}})$ is the set of joint distributions on $\mathcal{M}\times\mathcal{M}$ whose two marginals equal ${\bm{\mu}}_{\mathcal{M}}$ and ${\bm{\nu}}_{\mathcal{M}}$, respectively. Let $F:\mathcal{P}(\mathcal{M})\rightarrow R$ be a functional on $\mathcal{P}(\mathcal{M})$, mapping a probability measure to a real value. A Wasserstein gradient flow describes the time evolution of a probability measure ${\bm{\mu}}_{\mathcal{M}}^{t}$, defined by the partial differential equation (PDE) $\partial{\bm{\mu}}_{\mathcal{M}}^{t}/\partial t=-\nabla_{W_{2}}F({\bm{\mu}}_{\mathcal{M}}^{t})$ for $t>0$, where $\nabla_{W_{2}}F({\bm{\mu}}_{\mathcal{M}})\triangleq-\nabla\cdot({\bm{\mu}}_{\mathcal{M}}\nabla(\partial F/\partial{\bm{\mu}}_{\mathcal{M}}))$. Importantly, there is a close relation between WGFs and the heat equation on a manifold.

Theorem 2 ([20])

Let $(\mathcal{M},g)$ be a connected and complete Riemannian manifold with Riemannian volume element $\mathrm{d}m$, and $\mathcal{P}_{2}(\mathcal{M})$ be the Wasserstein space of probability measures on $\mathcal{M}$ equipped with the 2-Wasserstein distance $d_{W}^{2}$. Let $({\bm{\mu}}_{\mathcal{M}}^{t})_{t>0}$ be a continuous curve in $\mathcal{P}_{2}(\mathcal{M})$. Then the following are equivalent:

1. $({\bm{\mu}}_{\mathcal{M}}^{t})_{t>0}$ is a trajectory of the gradient flow of the negative entropy $H({\bm{\mu}}^{t}_{\mathcal{M}})=\int_{\mathbf{x}\in\mathcal{M}}\log k_{\mathcal{M}}^{t}(\mathbf{x})\,\mathrm{d}{\bm{\mu}}_{\mathcal{M}}^{t}\triangleq F({\bm{\mu}}^{t}_{\mathcal{M}})$;

2. ${\bm{\mu}}^{t}_{\mathcal{M}}$ is given by ${\bm{\mu}}^{t}_{\mathcal{M}}(S)=\int_{\mathbf{x}\in S}k^{t}_{\mathcal{M}}(\mathbf{x})\,\mathrm{d}m$ for $t>0$, where $(k^{t}_{\mathcal{M}})_{t>0}$ is a solution to the heat equation $\partial k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})/\partial t=\Delta_{\mathbf{x}}k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})$, $\lim_{t\rightarrow 0^{+}}k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})=\delta_{\mathbf{x}_{0}}(\mathbf{x})$, satisfying $H({\bm{\mu}}_{\mathcal{M}}^{t})<\infty$ and, for all $0<s_{0}<s_{1}$, $\int_{s_{0}}^{s_{1}}\int_{\mathbf{x}\in\mathcal{M}}\|\triangle k^{t}_{\mathcal{M}}(\mathbf{x})\|^{2}/k^{t}_{\mathcal{M}}(\mathbf{x})\,\mathrm{d}m\,\mathrm{d}t<\infty$.

Different from [20], we use the term negative entropy instead of relative entropy for clarity, because relative entropy usually refers to the KL divergence. From the second statement of Theorem 2, one can see that $k^{t}_{\mathcal{M}}$ is the probability density function of ${\bm{\mu}}^{t}_{\mathcal{M}}$ on $\mathcal{M}$ [14, 15].

3 The Proposed Framework

Our approach to learning the heat kernel (and thus learning a manifold implicitly) is inspired by Theorem 2, which indicates that one can learn the probability density function (PDF) on the manifold from the corresponding WGF. To this end, we first define the evolving PDF induced by a WGF.

Definition 2 (Evolving PDF)

Let $(\mathcal{M},g)$ be a connected and complete Riemannian manifold with Riemannian volume element $\mathrm{d}m$, and let $({\bm{\mu}}^{t})_{t>0}$ be the trajectory of a WGF of the negative entropy $H({\bm{\mu}}^{t})=\int_{\mathbf{x}\in\mathcal{M}}\log p^{t}(\mathbf{x})\,\mathrm{d}{\bm{\mu}}^{t}$ with initial point ${\bm{\mu}}^{0}$. We call the evolving function $p^{t}$ satisfying ${\bm{\mu}}^{t}(S)=\int_{\mathbf{x}\in S}p^{t}(\mathbf{x})\,\mathrm{d}m$ $(\forall t\geq 0)$ the evolving PDF of ${\bm{\mu}}^{t}$ induced by the WGF.

In the following, we first present the theoretical foundation of heat-kernel learning, showing that two evolving PDFs induced by the WGF of negative entropy on a given manifold approach each other at an exponential rate, which indicates the learnability of the heat kernel. We then propose an efficient and practical algorithm, where a neural network and gradient descent are applied to learn the heat kernel. Finally, we apply our algorithm to Bayesian inference and DGMs as two down-stream applications. All proofs are provided in the Appendix.

3.1 Theoretical Foundation of Heat-Kernel Learning

Our goal in this section is to establish the feasibility and convergence speed of heat-kernel learning. We start with the following theorem.

Theorem 3

Under the same setting as in Definition 2, suppose the manifold has positive Ricci curvature, and let $p^{t}$, $q^{t}$ be two evolving PDFs induced by the WGF of negative entropy, with corresponding probability measures ${\bm{\mu}}^{t}$ and ${\bm{\nu}}^{t}$, respectively. If $d^{2}_{W}({\bm{\mu}}^{0},{\bm{\nu}}^{0})<\infty$, then $p^{t}(\mathbf{x})=q^{t}(\mathbf{x})$ almost everywhere as $t\rightarrow\infty$. Furthermore, $\int_{\mathbf{x}\in\mathcal{M}}\|p^{t}(\mathbf{x})-q^{t}(\mathbf{x})\|^{2}\,\mathrm{d}m$ converges to 0 exponentially fast.

Theorem 3 is a natural extension of Proposition 4.4 in [20], which states that two trajectories of the WGF of negative entropy approach each other; we extend this result from probability measures to evolving PDFs. Thus, if one can learn the trajectory of an evolving PDF $p^{t}(\mathbf{x})$, it can be used to approximate the true heat kernel $k^{t}_{\mathcal{M}}$ (which corresponds to $q^{t}$ in Theorem 3, by Theorem 2).

One potential issue is that if the heat kernel $k^{t}_{\mathcal{M}}$ itself converges to 0 fast enough as $t\rightarrow\infty$, the convergence result in Theorem 3 might be uninformative, because one would end up with an almost zero-valued function. Fortunately, we can prove that $\int_{\mathbf{x}\in\mathcal{M}}\|p^{t}(\mathbf{x})-k^{t}_{\mathcal{M}}(\mathbf{x})\|^{2}\,\mathrm{d}m$ converges faster than $\int_{\mathbf{x}\in\mathcal{M}}\|k^{t}_{\mathcal{M}}(\mathbf{x})\|^{2}\,\mathrm{d}m$.

Theorem 4

Let $(\mathcal{M},g)$ be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary $\partial\mathcal{M}$. Suppose the manifold has positive Ricci curvature. Then $\int_{\mathbf{x}\in\mathcal{M}}\|k^{t}_{\mathcal{M}}(\mathbf{x})\|^{2}\,\mathrm{d}m$ converges to 0 at most polynomially as $t\rightarrow\infty$.

In addition, we can prove a lower bound on the heat kernel $k^{t}_{\mathcal{M}}$ for the non-asymptotic case $t<\infty$. This plays an important role when developing our practical algorithm: we will incorporate the lower bound into the optimization via a Lagrange multiplier.

Theorem 5

Let $(\mathcal{M},g)$ be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary $\partial\mathcal{M}$. If $\mathcal{M}$ has positive Ricci curvature bounded by $K$ and its dimension satisfies $\dim(\mathcal{M})\geq 1$, we have
$$k^{t}_{\mathcal{M}}(\mathbf{x})\geq\dfrac{\Gamma(\dim(\mathcal{M})/2+1)}{C(\epsilon)(\pi t)^{\dim(\mathcal{M})/2}}\exp\Big(\dfrac{\pi^{2}-\pi^{2}\dim(\mathcal{M})}{(4-\epsilon)Kt}\Big)$$
for all $\mathbf{x}_{0},\mathbf{x}\in\mathcal{M}$ and small $\epsilon>0$, where $C(\epsilon)$ is a constant depending on $\epsilon>0$ and $d$ such that $C(\epsilon)\rightarrow 0$ as $\epsilon\rightarrow 0$, and $\Gamma$ is the gamma function.

Theorem 5 implies that for any finite time $t$, there is a lower bound on the heat kernel which depends on the time and the shape of the manifold, but is independent of the distance between $\mathbf{x}_{0}$ and $\mathbf{x}$. There also exists an upper bound [21], which depends on the geodesic distance between $\mathbf{x}_{0}$ and $\mathbf{x}$; however, as we will show later, the upper bound has little impact on our algorithm and is thus omitted here.

3.2 A Practical Heat-Kernel Learning Algorithm

We now propose a practical framework to solve the heat-kernel learning problem. We decompose the procedure into three steps: 1) constructing a parametric function $p^{t}_{\boldsymbol{\phi}}$ to approximate $p^{t}$ in Theorem 3; 2) bridging $p^{t}_{\boldsymbol{\phi}}$ and the corresponding ${\bm{\mu}}^{t}_{\boldsymbol{\phi}}$; and 3) updating ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$ by solving the WGF of negative entropy, leading to an evolving PDF $p^{t}_{\boldsymbol{\phi}}$. We emphasize that, by learning to evolve as a WGF, the time $t$ is not an explicit parameter to be learned.

Parameterization of $p^{t}_{\boldsymbol{\phi}}$

We use a deep neural network to parameterize the PDF. Because $k^{t}_{\mathcal{M}}(\mathbf{x})$ also depends on $\mathbf{x}_{0}$, we parameterize $p^{t}_{\boldsymbol{\phi}}$ as a function with two inputs, $p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x})$, which is the evolving PDF approximating the heat kernel (with certain initial conditions). To guarantee the positive semi-definiteness of a heat kernel, we utilize existing parameterizations based on deep neural networks [9, 11, 12], where [11] is a special case of [12]. We adopt two ways to construct the kernel. The first way is based on [9], where the parametric kernel is constructed as:

$$p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x})=\exp(-\|h^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0})-h^{t}_{\boldsymbol{\phi}}(\mathbf{x})\|^{2}) \qquad (1)$$

The second way is based on [12], and we construct a parametric kernel as:

$$p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x})=\mathbb{E}\{\cos\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{1}})^{\intercal}[h^{t}_{\boldsymbol{\phi}_{1}}(\mathbf{x}_{0})-h^{t}_{\boldsymbol{\phi}_{1}}(\mathbf{x})]\}\}+\mathbb{E}\{\cos\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\mathbf{x}_{0},\mathbf{x}})^{\intercal}[h^{t}_{\boldsymbol{\phi}_{1}}(\mathbf{x}_{0})-h^{t}_{\boldsymbol{\phi}_{1}}(\mathbf{x})]\}\} \qquad (2)$$

where $\boldsymbol{\phi}\triangleq\{\boldsymbol{\phi}_{1},{\bm{\psi}}_{1},{\bm{\psi}}_{2}\}$ in (2); $h^{t}_{\boldsymbol{\phi}},h^{t}_{\boldsymbol{\phi}_{1}}$ are neural networks; and ${\bm{\omega}}^{t}_{{\bm{\psi}}_{1}},{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\mathbf{x}_{0},\mathbf{x}}$ are samples from implicit distributions constructed using neural networks. Details on implementing (2) can be found in [12].

A potential issue with these two constructions is that they can only approximate functions whose maximum value is 1, i.e., $\max_{\mathbf{x}\in\mathcal{M}}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x})=p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x}_{0})=1,\ \forall t>0$. In practice, this can be handled by scaling with an unknown time-dependent term, $a^{t}_{\mathcal{M}}=\max k_{\mathcal{M}}^{t}$, i.e., using $a^{t}_{\mathcal{M}}p_{\boldsymbol{\phi}}^{t}$. Because $a^{t}_{\mathcal{M}}$ depends only on $t$ and $\mathcal{M}$, it can be seen as a constant for fixed time $t$ and manifold $\mathcal{M}$. As we will show later, the unknown term $a^{t}_{\mathcal{M}}$ cancels out and thus does not affect our algorithm.
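A minimal PyTorch sketch of parameterization (1) is given below (not from the paper); the feature extractor $h^{t}_{\boldsymbol{\phi}}$ is an arbitrary small MLP, and its width and depth are illustrative choices. By construction the output is a positive semi-definite kernel whose maximum value is 1 on the diagonal, as discussed above.

```python
import torch
import torch.nn as nn

class HeatKernelNet(nn.Module):
    """Parametric kernel p_phi(x0, x) = exp(-||h_phi(x0) - h_phi(x)||^2), as in Eq. (1)."""

    def __init__(self, in_dim, feat_dim=32, hidden=64):
        super().__init__()
        self.h = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x0, x):
        # x0: (n, d), x: (m, d)  ->  kernel matrix of shape (n, m)
        f0, f = self.h(x0), self.h(x)
        sq_dist = torch.cdist(f0, f, p=2).pow(2)
        return torch.exp(-sq_dist)
```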

Bridging $p^{t}_{\boldsymbol{\phi}}$ and ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$

We rely on the WGF framework to learn the parameterized PDF. Note that, from Definition 2, $p^{t}_{\boldsymbol{\phi}}$ and ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$ are connected through the Riemannian volume element $\mathrm{d}m$. Thus, given $\mathrm{d}m$, if one is able to solve for ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$ in the WGF, $p^{t}_{\boldsymbol{\phi}}$ is also readily obtained. However, $\mathrm{d}m$ is typically intractable in practice. Furthermore, $p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})$ is a function of two inputs, which means that for $n$ data samples $\{\mathbf{x}_{i}\}_{i=1}^{n}$ there are $n$ evolving PDFs $\{p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{i},\cdot)\}^{n}_{i=1}$ and $n$ corresponding trajectories $\{{\bm{\mu}}^{t}_{\boldsymbol{\phi},i}\}_{i=1}^{n}$ to be solved, which is impractical.

To overcome this challenge, we propose to solve the WGF of $\tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}}\triangleq\sum_{i=1}^{n}{\bm{\mu}}^{t}_{\boldsymbol{\phi},i}/n$, the averaged probability measure of $\{{\bm{\mu}}^{t}_{\boldsymbol{\phi},i}\}_{i=1}^{n}$. We approximate the averaged measure by kernel density estimation (KDE) [22]: given samples $\{\mathbf{x}_{i}\}_{i=1}^{n}$ on a manifold $\mathcal{M}$ and the parametric function $p^{t}_{\boldsymbol{\phi}}$, we calculate the unnormalized average $\bar{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\mathbf{x}_{i})\approx a^{t}_{\mathcal{M}}\sum_{j=1}^{n}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})/n$. Consequently, the normalized average satisfying $\sum_{i=1}^{n}\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\mathbf{x}_{i})=1$ is formulated as:

$$\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\mathbf{x}_{i})=\dfrac{\sum_{j=1}^{n}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})}{\sum_{i=1}^{n}\sum_{j=1}^{n}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})}. \qquad (3)$$

We can see that the scalar $a^{t}_{\mathcal{M}}$ cancels out, so it does not affect our final algorithm.
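With a kernel module like the sketch above, the discrete approximation (3) amounts to averaging the kernel matrix over the heat-source index and normalizing; a hedged sketch:

```python
def averaged_measure(kernel, x):
    """Discrete approximation (3): normalized average of p_phi(x_j, x_i) over a batch x."""
    K = kernel(x, x)                      # (n, n) matrix with entries p_phi(x_j, x_i)
    mu_unnorm = K.mean(dim=0)             # average over heat sources x_j, one value per x_i
    return mu_unnorm / mu_unnorm.sum()    # normalize so the weights sum to one
```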

Updating ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$

Finally, we are left with solving for $\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}$ in the WGF. We follow the celebrated Jordan-Kinderlehrer-Otto (JKO) scheme [23] to solve for $\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}$ with the discrete approximation (3). The JKO scheme is rephrased in the following lemma under our setting.

Lemma 6 ([20])

Consider probability measures in $\mathcal{P}(\mathcal{M})$. Fix a time step $\tau>0$ and an initial value ${\bm{\mu}}^{0}$ with finite second moment. Recursively define a sequence $({\bm{\mu}}_{\tau}^{n})_{n\in\mathbb{N}}$ of local minimizers by ${\bm{\mu}}_{\tau}^{0}\coloneqq{\bm{\mu}}^{0}$, ${\bm{\mu}}_{\tau}^{n}\coloneqq\operatorname*{arg\,min}_{\eta}H(\eta)+d_{W}^{2}({\bm{\mu}}_{\tau}^{n-1},\eta)/(2\tau)$, where $d_{W}^{2}$ denotes the 2-Wasserstein distance. Further define the discrete trajectory $\bar{{\bm{\mu}}}_{\tau}^{0}\coloneqq{\bm{\mu}}^{0}$, $\bar{{\bm{\mu}}}_{\tau}^{t}\coloneqq{\bm{\mu}}_{\tau}^{n}$ if $t\in((n-1)\tau,n\tau]$. Then $\bar{{\bm{\mu}}}^{t}_{\tau}\rightarrow{\bm{\mu}}^{t}$ weakly as $\tau\rightarrow 0$ for all $t>0$, where $({\bm{\mu}}^{t})_{t>0}$ is a trajectory of the gradient flow of the negative entropy $H$.

Algorithm 1 Heat Kernel Learning
  Input: samples $\{\mathbf{x}_{i}\}_{i=1}^{n}$ on the manifold $\mathcal{M}$, a kernel parameterized by (1) or (2), hyper-parameters $\alpha$, $\beta$, $\lambda$, time step $\tau=\alpha/2\beta$.
  Initialize the function $p^{0}_{\boldsymbol{\phi}}$ and compute the corresponding ${\bm{\nu}}=\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{0}$ by (3).
  for $k=1$ to $m$ do
     Solve (4), where $\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{k\tau}$ is computed by (3). Update ${\bm{\nu}}\leftarrow\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{k\tau}$.
  end for

Based on Theorem 5 and Lemma 6, we know that to learn the kernel function at time $t<\infty$, we can use a Lagrange multiplier to define the following optimization problem for time $t$:

$$\min_{\boldsymbol{\phi}}\ \alpha H(\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})+\beta d_{W}^{2}({\bm{\nu}},\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})-\lambda\mathbb{E}_{\mathbf{x}_{i}\neq\mathbf{x}_{j}}\left[p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})\right], \qquad (4)$$

where $\mathbf{x}_{i},\mathbf{x}_{j}\in\mathcal{M}$; $\alpha,\beta,\lambda$ are hyper-parameters; the time step is $\tau=\alpha/2\beta$; and ${\bm{\nu}}$ is a given probability measure corresponding to a previous time. The last term enforces the lower-bound constraint on $p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})$ implied by Theorem 5. The Wasserstein term $d_{W}^{2}({\bm{\nu}},\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})$ can be approximated using the Sinkhorn algorithm [24]. Our final algorithm is described in Algorithm 1; further discussion is provided in the Appendix. Note that in practice, mini-batch training is often used to reduce the computational cost.
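Below is a rough PyTorch sketch of one JKO step of Algorithm 1, i.e., one (inner) solve of objective (4) on a mini-batch, with the Wasserstein term approximated by a hand-rolled Sinkhorn iteration [24]. The hyper-parameters, the entropic regularization `eps`, and the single gradient step per call are illustrative assumptions; in practice one may take several gradient steps per JKO step before updating ${\bm{\nu}}$.

```python
import torch

def sinkhorn_w2(a, b, x, eps=0.05, n_iters=100):
    """Entropy-regularized approximation of d_W^2 between two weighted point
    clouds supported on the same samples x (Sinkhorn iterations, cf. [24])."""
    C = torch.cdist(x, x, p=2).pow(2)              # ground cost ||x_i - x_j||^2
    K = torch.exp(-C / eps)                        # eps should be scaled to the data
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    P = u.unsqueeze(1) * K * v.unsqueeze(0)        # approximate transport plan
    return (P * C).sum()

def heat_kernel_step(kernel, x, nu, alpha, beta, lam, optimizer):
    """One gradient step on objective (4) for a mini-batch x and previous measure nu."""
    K = kernel(x, x)                                          # p_phi(x_j, x_i)
    mu_unnorm = K.mean(dim=0)
    mu = mu_unnorm / mu_unnorm.sum()                          # Eq. (3)
    neg_entropy = (mu * torch.log(mu_unnorm + 1e-12)).sum()   # H(mu), up to the constant in a_M
    w2 = sinkhorn_w2(nu, mu, x)                               # JKO proximal term
    mask = ~torch.eye(x.shape[0], dtype=torch.bool, device=x.device)
    lower_bound_term = K[mask].mean()                         # Lagrange term from Theorem 5
    loss = alpha * neg_entropy + beta * w2 - lam * lower_bound_term
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return mu.detach()                                        # becomes nu at the next JKO step
```

A usage sketch of the outer loop of Algorithm 1 would then be `nu = averaged_measure(kernel, x).detach()` followed by repeated calls `nu = heat_kernel_step(kernel, x, nu, alpha, beta, lam, opt)`.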

3.3 Applications

3.3.1 Learning Kernels in SVGD

SVGD [10] is a particle-based algorithm for approximate Bayesian inference, whose update involves a kernel function $k$. Given a set of particles $\{\mathbf{x}_{i}\}_{i=1}^{n}$, at iteration $l$, particle $\mathbf{x}_{i}$ is updated by

$$\mathbf{x}_{i}^{l+1}\leftarrow\mathbf{x}_{i}^{l}+\epsilon\phi(\mathbf{x}^{l}_{i}),\ \text{ where }\ \phi(\mathbf{x}^{l}_{i})=\dfrac{1}{n}\sum_{j=1}^{n}\left[\nabla\log q(\mathbf{x}_{j}^{l})k(\mathbf{x}_{j}^{l},\mathbf{x}_{i}^{l})+\nabla_{\mathbf{x}_{j}^{l}}k(\mathbf{x}_{j}^{l},\mathbf{x}_{i}^{l})\right]. \qquad (5)$$

Here $q(\cdot)$ is the target distribution to be sampled from. Usually, a pre-defined kernel such as the RBF kernel with the median trick is used in SVGD. Instead of using pre-defined kernels, we propose to improve SVGD with our heat-kernel learning method: we learn the evolving PDF and use it as the kernel function in SVGD. By alternating between learning the kernel with Algorithm 1 and updating particles with (5), manifold information can be conveniently encoded into SVGD.
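A hedged sketch of one SVGD update (5) with a learned kernel of the form above follows; `score_fn` is assumed to return $\nabla\log q$ row-wise, and the per-particle backward passes for the repulsive term are a simple (not efficient) way to differentiate a generic learned kernel.

```python
def svgd_update(x, score_fn, kernel, step_size=1e-2):
    """One SVGD step (5) for particles x of shape (n, d)."""
    n = x.shape[0]
    x = x.detach().requires_grad_(True)
    K = kernel(x, x.detach())                 # K[j, i] = k(x_j, x_i); grads flow through x_j only
    score = score_fn(x.detach())              # (n, d), row j is grad log q(x_j)
    drive = K.t() @ score                     # sum_j k(x_j, x_i) * grad log q(x_j)
    # Repulsive term: for each i, sum_j grad_{x_j} k(x_j, x_i), via one backward pass per column.
    repulse = torch.stack([
        torch.autograd.grad(K[:, i].sum(), x, retain_graph=True)[0].sum(dim=0)
        for i in range(n)
    ])
    phi = (drive + repulse) / n
    return (x + step_size * phi).detach()
```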

3.3.2 Learning Deep Generative Models

Our second application applies our framework to DGMs. Compared to SVGD, the application to DGMs is more involved, because there are two manifolds: the manifold of the training data $\mathcal{M}_{P}$ and the manifold of the generated data $\mathcal{M}_{{\bm{\theta}}}$. Furthermore, $\mathcal{M}_{{\bm{\theta}}}$ depends on the model parameters ${\bm{\theta}}$, and hence varies during training.

Let $g_{{\bm{\theta}}}$ denote a generator, a neural network parameterized by ${\bm{\theta}}$. Let the generated sample be $\mathbf{y}=g_{{\bm{\theta}}}({\bm{\epsilon}})$, with ${\bm{\epsilon}}$ random noise following some distribution such as the standard normal distribution. In our method, we assume that the learning process constitutes a manifold flow $(\mathcal{M}_{{\bm{\theta}}}^{s})_{s\geq 0}$, with $s$ denoting the generator's training step. After each generator update, samples from the generator are assumed to form a manifold. Our goal is to learn a generator such that $\mathcal{M}_{{\bm{\theta}}}^{\infty}$ approaches $\mathcal{M}_{P}$. Our method consists of two steps: learning the generator and learning the kernel (evolving PDF).

Learning the generator

We adopt two popular kernel-based quantities as objective functions for our generator, the Maximum Mean Discrepancy (MMD) [25] and the Scaled MMD (SMMD) [26], computed with our learned heat kernels. MMD and SMMD measure the difference between distributions, so we update the generator by minimizing them. Details of MMD and SMMD are given in the Appendix.
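For concreteness, a standard unbiased estimator of the squared MMD [25] with the learned kernel plugged in might look as follows (a sketch; the SMMD scaling and the paper's exact generator loss are deferred to the Appendix as stated above):

```python
def mmd2(kernel, x_real, x_fake):
    """Unbiased estimate of MMD^2 between real and generated batches, with k = p_phi."""
    Kxx = kernel(x_real, x_real)
    Kyy = kernel(x_fake, x_fake)
    Kxy = kernel(x_real, x_fake)
    n, m = Kxx.shape[0], Kyy.shape[0]
    term_xx = (Kxx.sum() - Kxx.diagonal().sum()) / (n * (n - 1))
    term_yy = (Kyy.sum() - Kyy.diagonal().sum()) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

# Generator step (sketch): minimize the MMD between generated and real batches.
# loss_g = mmd2(kernel, x_real, g_theta(noise)); loss_g.backward(); opt_g.step()
```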

Learning the kernel

Different from the simple single-manifold setting of Algorithm 1, we consider both the training-data manifold and the generated-data manifold when learning DGMs. As a result, instead of learning the heat kernel of $\mathcal{M}_{{\bm{\theta}}}^{s}$ or $\mathcal{M}_{\mathcal{P}}$, we propose to learn the heat kernel of a new connected manifold, $\widetilde{\mathcal{M}}^{s}$, that integrates both $\mathcal{M}_{{\bm{\theta}}}^{s}$ and $\mathcal{M}_{\mathcal{P}}$. We derive a regularized objective based on (4) to achieve this goal.

The idea is to initialize $\widetilde{\mathcal{M}}^{s}$ with one of the two manifolds, $\mathcal{M}_{{\bm{\theta}}}^{s}$ or $\mathcal{M}_{\mathcal{P}}$, and then extend it toward the other. Without loss of generality, we assume $\mathcal{M}^{s}_{{\bm{\theta}}}\subseteq\widetilde{\mathcal{M}}^{s}$ at the beginning. Note that it is unwise to assume $\mathcal{M}^{s}_{{\bm{\theta}}}\cup\mathcal{M}_{\mathcal{P}}\subseteq\widetilde{\mathcal{M}}^{s}$, since $\mathcal{M}^{s}_{{\bm{\theta}}}$ and $\mathcal{M}_{\mathcal{P}}$ could be very different at the beginning; as a result, $\widetilde{\mathcal{M}}^{s}=R^{d}$ might be the only case satisfying $\mathcal{M}^{s}_{{\bm{\theta}}}\cup\mathcal{M}_{\mathcal{P}}\subseteq\widetilde{\mathcal{M}}^{s}$, which contains no useful geometric information. We thus start with $\mathcal{M}^{s}_{{\bm{\theta}}}$ by considering $p^{t}_{\boldsymbol{\phi}}(\mathbf{y}_{i},\mathbf{y}_{j})$, $\mathbf{y}_{i},\mathbf{y}_{j}\in\mathcal{M}^{s}_{{\bm{\theta}}}\subset\widetilde{\mathcal{M}}^{s}$, in (4). Next, to incorporate the information of $\mathcal{M}_{\mathcal{P}}$, we consider $p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})$ in (4) and regularize it with $\|p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})-p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{z})\|$, where $\mathbf{y}\in\mathcal{M}^{s}_{{\bm{\theta}}}$, $\mathbf{x}\in\mathcal{M}_{\mathcal{P}}$, and $\mathbf{z}\in\widetilde{\mathcal{M}}^{s}$ is the closest point to $\mathbf{x}$ on $\widetilde{\mathcal{M}}^{s}$. This regularization constrains $\mathcal{M}_{\mathcal{P}}$ to be close to $\widetilde{\mathcal{M}}^{s}$ (extending $\widetilde{\mathcal{M}}^{s}$ to $\mathcal{M}_{\mathcal{P}}$). Since the norm regularization is infeasible to compute directly, we derive an upper bound below and use it instead. Specifically, for kernels of form (1), by Taylor expansion we have:

$$\|p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})-p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{z})\|\approx\|(\partial p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})/\partial\mathbf{x})(\mathbf{z}-\mathbf{x})\|\leq c\,p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})\|\nabla_{\mathbf{x}}h^{t}_{\boldsymbol{\phi}}(\mathbf{x})\|_{\mathcal{F}}\|h^{t}_{\boldsymbol{\phi}}(\mathbf{x})-h^{t}_{\boldsymbol{\phi}}(\mathbf{y})\| \qquad (6)$$

where $c=\|\mathbf{x}-\mathbf{z}\|$ and $\|\cdot\|_{\mathcal{F}}$ denotes the Frobenius norm. Considering $p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})$ in (4) leads to the same bound by symmetry.

Finally, we consider $p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{i},\mathbf{x}_{j})$ in (4) and regularize $\|p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})-p^{t}_{\boldsymbol{\phi}}(\mathbf{z}_{j},\mathbf{z}_{i})\|$, where $\mathbf{x}_{i},\mathbf{x}_{j}\in\mathcal{M}_{\mathcal{P}}$, and $\mathbf{z}_{i},\mathbf{z}_{j}\in\widetilde{\mathcal{M}}^{s}$ are the closest points to $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ on $\widetilde{\mathcal{M}}^{s}$. A similar bound can be obtained.

Furthermore, instead of directly bounding the product of terms in (6), we find it more stable to bound each component separately. Note that $\|\nabla_{\mathbf{x}}h^{t}_{\boldsymbol{\phi}}(\mathbf{x})\|_{\mathcal{F}}$ can be bounded from above using spectral normalization [27] or incorporated into the objective function as in [26]; we do not explicitly include it in our objective function. As a result, with the base kernel parameterization (1), our optimization problem becomes:

$$\min_{\boldsymbol{\phi}}\ \alpha H(\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})+\beta d_{W}^{2}({\bm{\nu}},\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})-\lambda\mathbb{E}_{\mathbf{y}\neq\mathbf{x}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})\big]+\mathbb{E}_{\mathbf{x}\sim\mathbb{P},\mathbf{y}\sim\mathbb{Q}}\big[\gamma_{1}p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})+\gamma_{2}\|h^{t}_{\boldsymbol{\phi}}(\mathbf{x})-h^{t}_{\boldsymbol{\phi}}(\mathbf{y})\|\big]+\mathbb{E}_{\mathbf{x}_{i},\mathbf{x}_{j}\sim\mathbb{P}}\big[\gamma_{3}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})+\gamma_{4}\|h^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{i})-h^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j})\|\big], \qquad (7)$$

where

$$\mathbb{E}_{\mathbf{y}\neq\mathbf{x}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})\big]=\dfrac{1}{4}\Big\{\mathbb{E}_{\mathbf{y}_{i},\mathbf{y}_{j}\sim\mathbb{Q}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{y}_{j},\mathbf{y}_{i})\big]+2\,\mathbb{E}_{\mathbf{x}\sim\mathbb{P},\mathbf{y}\sim\mathbb{Q}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})\big]+\mathbb{E}_{\mathbf{x}_{i},\mathbf{x}_{j}\sim\mathbb{P}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})\big]\Big\}. \qquad (8)$$

Our algorithm for DGMs with heat-kernel learning is given in the Appendix. When SMMD is used as the objective function for the generator, we also scale (7) by the same factor used to scale the SMMD.

Theoretical property and relation with existing methods:

Following the work in [26], we first study the continuity in weak topology of our kernel when applied in MMD. Continuity in the weak topology is an important property because it means the objective function can provide a useful signal to update the generator [26], without suffering from the sudden jumps of the Jensen-Shannon (JS) divergence or the Kullback-Leibler (KL) divergence [28].

Theorem 7

With (6) bounded, the MMD with our proposed kernel is continuous in the weak topology, i.e., if $\mathbb{Q}_{n}\xrightarrow{D}\mathbb{P}$ then $\textup{MMD}_{k_{\boldsymbol{\phi}}^{t}}(\mathbb{Q}_{n},\mathbb{P})\xrightarrow{}0$, where $\xrightarrow{D}$ denotes convergence in distribution.

The proof of Theorem 7 directly follows Theorem 2 in [26]. Plugging (8) into (7), it is interesting to see connections between existing methods and ours: 1) if one sets $\alpha=\beta=\gamma_{2}=\gamma_{3}=\gamma_{4}=0,\gamma_{1}=4,\lambda=4$, our method reduces to MMD-GAN [9]; furthermore, if the scaled objectives are used, it reduces to SMMD-GAN [26]; 2) if one sets $\alpha=\beta=\gamma_{2}=\gamma_{4}=0,\gamma_{1}+\gamma_{3}=4,\lambda=4$, our method reduces to MMD-GAN with the repulsive loss [29].

In summary, our method interprets the min-max game in MMD-based GANs from a kernel-learning point of view, where the discriminator tries to learn the heat kernel of some underlying manifold. As we will show in the experiments, our model achieves the best performance compared to related methods. Although there are several hyper-parameters in (7), the connection with GANs makes our model easy to tune: one can start from a kernel-based GAN, e.g., setting $\alpha=\beta=\gamma_{2}=\gamma_{3}=\gamma_{4}=0,\gamma_{1}=4,\lambda=4$ as in MMD-GAN, and only tune $\alpha,\beta$.

4 Experiments

4.1 A Toy Experiment

We illustrate the effectiveness of our method by comparing a learned PDF with the true heat kernel on the real line $R$, i.e., the 1-dimensional Euclidean space. In this setting, the heat kernel has the closed form $k(t,x_{0},x)=\exp\{-(x-x_{0})^{2}/4t\}/\sqrt{4\pi t}$, whose maximum value is $a^{t}_{\mathcal{M}}=1/\sqrt{4\pi t}$. We uniformly sample 512 points in $[-10,10]$ as training data, and the kernel is constructed by (1) with a 3-layer neural network. We assume that every gradient-descent update corresponds to a time step of $0.01$. The evolution of $a^{t}_{\mathcal{M}}p^{t}_{\boldsymbol{\phi}}(0,x)$ and $k^{t}_{\mathcal{M}}$ is shown in Figure 1.

4.2 Improved SVGD

Figure 1: Evolution of $a^{t}_{\mathcal{M}}p^{t}_{\boldsymbol{\phi}}$ (blue solid line) and the true heat kernel $k^{t}_{\mathcal{M}}$ (red dashed line) on $R$. Panels: (a) Iteration 1, (b) Iteration 5, (c) Iteration 20, (d) Iteration 50.

We next apply SVGD with the kernel learned by our framework to BNN regression on UCI datasets. For all experiments, a 2-layer BNN with 50 hidden units, 10 weight particles, and ReLU activations is used. We assign an isotropic Gaussian prior to the network weights. Recently, [30] proposed matrix-valued kernels for SVGD (denoted MSVGD-a and MSVGD-m); our method can also be used to improve them. Detailed experimental settings are provided in the Appendix due to space limitations. We denote our improved SVGD as HK-SVGD, and our improved matrix-valued SVGD as HK-MSVGD-a and HK-MSVGD-m. The results are reported in Table 1, which shows that our method improves both SVGD and matrix-valued SVGD. Additional test log-likelihoods are reported in the Appendix. One potential criticism is that 10 particles may not be sufficient to describe the parameter manifold well; we thus conduct extra experiments in which, instead of using particles, we use a 2-layer neural network to generate parameter samples for the BNN. These results are also reported in the Appendix, and consistently show better performance.

Table 1: Average test RMSE ()(\downarrow) for UCI regression.

Method                 Combined         Concrete         Kin8nm           Protein          Wine             Year
SVGD                   4.088 ± 0.033    5.027 ± 0.116    0.093 ± 0.001    4.186 ± 0.017    0.645 ± 0.009    8.686 ± 0.010
HK-SVGD (ours)         4.077 ± 0.035    4.814 ± 0.112    0.091 ± 0.001    4.138 ± 0.019    0.624 ± 0.010    8.656 ± 0.007
MSVGD-a                4.056 ± 0.033    4.869 ± 0.124    0.092 ± 0.001    3.997 ± 0.018    0.637 ± 0.008    8.637 ± 0.005
MSVGD-m                4.029 ± 0.033    4.721 ± 0.111    0.090 ± 0.001    3.852 ± 0.014    0.637 ± 0.009    8.594 ± 0.009
HK-MSVGD-a (ours)      4.020 ± 0.043    4.443 ± 0.138    0.090 ± 0.001    4.001 ± 0.004    0.614 ± 0.007    8.590 ± 0.010
HK-MSVGD-m (ours)      3.998 ± 0.046    4.552 ± 0.146    0.089 ± 0.001    3.762 ± 0.015    0.629 ± 0.008    8.533 ± 0.005

4.3 Deep Generative Models

Table 2: Results on image generation.

CelebA and ImageNet:

Method          CelebA FID (↓)   CelebA IS (↑)   ImageNet FID (↓)   ImageNet IS (↑)
WGAN-GP         29.2 ± 0.2       2.7 ± 0.1       65.7 ± 0.3         7.5 ± 0.1
SN-GAN          22.6 ± 0.1       2.7 ± 0.1       47.5 ± 0.1         11.2 ± 0.1
SMMD-GAN        18.4 ± 0.2       2.7 ± 0.1       38.4 ± 0.3         10.7 ± 0.2
SN-SMMD-GAN     12.4 ± 0.2       2.8 ± 0.1       36.6 ± 0.2         10.9 ± 0.1
Repulsive       10.5 ± 0.1       2.8 ± 0.1       31.0 ± 0.1         11.5 ± 0.1
HK (ours)        9.7 ± 0.1       2.9 ± 0.1       29.6 ± 0.1         11.8 ± 0.1
HK-DK (ours)    --               --              29.2 ± 0.1         11.9 ± 0.1

CIFAR-10 and STL-10:

DC-GAN architecture
Method          CIFAR-10 FID (↓)   CIFAR-10 IS (↑)   STL-10 FID (↓)   STL-10 IS (↑)
WGAN-GP         31.1 ± 0.2         6.9 ± 0.2         55.1             8.4 ± 0.1
SN-GAN          25.5               7.6 ± 0.1         43.2             8.8 ± 0.1
SMMD-GAN        31.5 ± 0.4         7.0 ± 0.1         43.7 ± 0.2       8.4 ± 0.1
SN-SMMD-GAN     25.0 ± 0.3         7.3 ± 0.1         40.6 ± 0.1       8.5 ± 0.1
CR-GAN          18.7               7.9               --               --
Repulsive       16.7               8.0               36.7             9.4
HK (ours)       14.9 ± 0.1         8.2 ± 0.1         31.8 ± 0.1       9.6 ± 0.1
HK-DK (ours)    13.2 ± 0.1         8.4 ± 0.1         30.3 ± 0.1       9.6 ± 0.1

ResNet architecture
SN-GAN          21.7 ± 0.2         8.2 ± 0.1         40.1 ± 0.5       9.1 ± 0.1
CR-GAN          14.6               8.4               --               --
Repulsive       12.2 ± 0.1         8.3 ± 0.1         25.3 ± 0.1       10.2 ± 0.1
Auto-GAN        12.4               8.6 ± 0.1         31.1             9.2 ± 0.1
HK (ours)       11.5 ± 0.1         8.4 ± 0.1         24.3 ± 0.1       10.5 ± 0.1
HK-DK (ours)    10.3 ± 0.1         8.6 ± 0.1         24.0 ± 0.1       10.5 ± 0.1

BigGAN setting
BigGAN          14.7               --                --               --
CR-BigGAN       11.7               --                --               --

Finally, we apply our framework to high-quality image generation. Four datasets are used in this experiment: CIFAR-10, STL-10, ImageNet, and CelebA. Following [26, 29], images are scaled to resolutions of $32\times 32$, $48\times 48$, $64\times 64$ and $160\times 160$, respectively. Following [26, 27], we test two architectures on CIFAR-10 and STL-10, and one architecture on CelebA and ImageNet. We report the standard Fréchet Inception Distance (FID) [31] and Inception Score (IS) [32] for evaluation. Architecture details and further experimental settings can be found in the Appendix.

We compare our method with popular and state-of-the-art GAN models under the same experimental setting, including WGAN-GP [33], SN-GAN [27], SMMD-GAN and SN-SMMD-GAN [26], CR-GAN [34], MMD-GAN with repulsive loss [29], and Auto-GAN [35]. The results are reported in Table 2, where HK and HK-DK denote our model with kernels (1) and (2), respectively. More results are provided in the Appendix. HK-DK exhibits some convergence issues on CelebA, hence no result is reported there. Our models achieve state-of-the-art results under the same experimental setting (i.e., the same or similar architectures). Furthermore, compared to Auto-GAN, which needs 43 hours to train on CIFAR-10 due to its expensive architecture search, our method (HK with the ResNet architecture) needs only 12 hours to obtain better results. Some randomly generated images are also provided in the Appendix.

5 Conclusion

We introduce the concept of implicit manifold learning, which implicitly learns the geometric information of an unknown manifold by learning the corresponding heat kernel. Both a theoretical analysis and a practical algorithm are derived. Our framework is flexible and can be applied to general kernel-based models, including DGMs and Bayesian inference. Extensive experiments suggest that our methods achieve consistently better results on different tasks compared to related methods.

Broader Impact

We propose a fundamentally novel method to implicitly learn the geometric information of a manifold by explicitly learning its associated heat kernel, which is the solution of the heat equation with given initial conditions. Our proposed method is general and can be applied in many down-stream applications. Specifically, it could be used to improve many kernel-related algorithms and applications. It may also inspire researchers in deep learning to borrow ideas from other fields (mathematics, physics, etc.) and apply them to their own research. This can benefit both fields and thus promote interdisciplinary research.

Acknowledgements

The research of the first and third authors was supported in part by NSF through grants CCF-1716400 and IIS-1910492.

References

  • [1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
  • [2] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • [3] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323–2326, 2000.
  • [4] Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent space oddity: on the curvature of deep generative models, 2017.
  • [5] Hang Shao, Abhishek Kumar, and P. Thomas Fletcher. The riemannian geometry of deep generative models, 2017.
  • [6] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
  • [7] Dimitris Kalatzis, David Eklund, Georgios Arvanitidis, and Søren Hauberg. Variational autoencoders with riemannian brownian motion priors. arXiv preprint arXiv:2002.05227, 2020.
  • [8] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [9] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
  • [10] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pages 2378–2386, 2016.
  • [11] Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, and Barnabas Poczos. Implicit kernel learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2007–2016, 2019.
  • [12] Yufan Zhou, Changyou Chen, and Jinhui Xu. Kernelnet: A data-dependent kernel parameterization for deep generative modeling, 2019.
  • [13] John M Lee. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006.
  • [14] Xavier Pennec. Probabilities and statistics on riemannian manifolds: Basic tools for geometric measurements. Citeseer, 1999.
  • [15] Omer Bobrowski and Sayan Mukherjee. The topology of probability distributions on manifolds. Probability theory and related fields, 161(3-4):651–686, 2015.
  • [16] Alexander Grigor’yan, Jiaxin Hu, and Ka-Sing Lau. Heat kernels on metric measure spaces. In Geometry and analysis of fractals, pages 147–207. Springer, 2014.
  • [17] John Lafferty and Guy Lebanon. Diffusion kernels on statistical manifolds. J. Mach. Learn. Res., 6:129–163, December 2005.
  • [18] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Computer graphics forum, volume 28, pages 1383–1392. Wiley Online Library, 2009.
  • [19] Keenan Crane, Clarisse Weischedel, and Max Wardetzky. Geodesics in heat: A new approach to computing distance based on heat flow. ACM Transactions on Graphics (TOG), 32(5):152, 2013.
  • [20] Matthias Erbar. The heat equation on manifolds as a gradient flow in the wasserstein space. In Annales de l’IHP Probabilités et statistiques, volume 46, pages 1–23, 2010.
  • [21] Peter Li and Shing Tung Yau. On the parabolic kernel of the schrödinger operator. Acta Mathematica, 156(1):153–201, 1986.
  • [22] Zdravko I Botev, Joseph F Grotowski, Dirk P Kroese, et al. Kernel density estimation via diffusion. The annals of Statistics, 38(5):2916–2957, 2010.
  • [23] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
  • [24] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.
  • [25] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample-problem. In Advances in neural information processing systems, pages 513–520, 2007.
  • [26] Michael Arbel, Dougal Sutherland, Mikołaj Bińkowski, and Arthur Gretton. On gradient regularizers for mmd gans. In Advances in Neural Information Processing Systems, pages 6700–6710, 2018.
  • [27] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  • [28] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 214–223, 2017.
  • [29] Wei Wang, Yuan Sun, and Saman Halgamuge. Improving mmd-gan training with repulsive loss function. In International Conference on Learning Representations, 2018.
  • [30] Dilin Wang, Ziyang Tang, Chandrajit Bajaj, and Qiang Liu. Stein variational gradient descent with matrix-valued kernels. In Advances in neural information processing systems, pages 7836–7846, 2019.
  • [31] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pages 6626–6637, 2017.
  • [32] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • [33] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • [34] Han Zhang, Zizhao Zhang, Augustus Odena, and Honglak Lee. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020.
  • [35] Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. Autogan: Neural architecture search for generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3224–3234, 2019.
  • [36] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
  • [37] Sigmundur Gudmundsson. An Introduction to Riemannian Geometry - Lecture Notes in Mathematics. Lund University, 2018.
  • [38] Karl-Theodor Sturm et al. On the geometry of metric measure spaces. Acta mathematica, 196(1):65–131, 2006.
  • [39] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [41] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. In International Conference on Learning Representations, 2018.

Appendix A Riemannian Manifold

Definition 3 (Manifold)

[36] Let $\mathcal{M}$ be a set. If there exists a set of coordinate systems $A$ for $\mathcal{M}$ satisfying the conditions below, $(\mathcal{M},A)$ is called an $n$-dimensional $C^{\infty}$ differentiable manifold, or simply a manifold.

  1. Each element $\phi$ of $A$ is a one-to-one mapping from $\mathcal{M}$ to some open subset of $R^{n}$;

  2. For all $\phi\in A$, given any one-to-one mapping $\psi$ from $\mathcal{M}$ to $R^{n}$, the following holds: $\psi\in A\Leftrightarrow\psi\circ\phi^{-1}$ is a $C^{\infty}$ diffeomorphism.

By a $C^{\infty}$ diffeomorphism, we mean that $\psi\circ\phi^{-1}$ and its inverse $\phi\circ\psi^{-1}$ are both $C^{\infty}$ (infinitely differentiable). Infinite differentiability is not actually necessary; one may read this condition as "sufficiently smooth".

We will use T_{\operatorname{\mathbf{x}}}\mathcal{M} to denote the tangent space of \mathcal{M} at a point \operatorname{\mathbf{x}}, and X,Y,Z to denote vector fields.

Definition 4 (Riemannian Metric and Riemannian Manifold)

[37] Let \mathcal{M} be a manifold, C^{\infty}(\mathcal{M}) be the commutative ring of smooth functions on \mathcal{M}, and C^{\infty}(\mathcal{TM}) be the set of smooth vector fields on \mathcal{M}, which forms a module over C^{\infty}(\mathcal{M}). A Riemannian metric g on \mathcal{M} is a tensor field g:C^{\infty}(\mathcal{TM})\otimes C^{\infty}(\mathcal{TM})\rightarrow C^{\infty}(\mathcal{M}) such that for each \operatorname{\mathbf{x}}\in\mathcal{M}, the restriction g_{\operatorname{\mathbf{x}}} of g to the tensor product T_{\operatorname{\mathbf{x}}}\mathcal{M}\otimes T_{\operatorname{\mathbf{x}}}\mathcal{M}, given by

g_{\operatorname{\mathbf{x}}}:(X_{\operatorname{\mathbf{x}}},Y_{\operatorname{\mathbf{x}}})\rightarrow g(X,Y)(\operatorname{\mathbf{x}}),

is a real scalar product on the tangent space T_{\operatorname{\mathbf{x}}}\mathcal{M}. The pair (\mathcal{M},g) is called a Riemannian manifold. The geometric properties of (\mathcal{M},g) that depend only on the metric g are said to be intrinsic or metric properties.

A classical example: the Riemannian manifold E^{m}=(R^{m},\langle\cdot,\cdot\rangle_{R^{m}}) is nothing but the m-dimensional Euclidean space.

The Riemannian curvature tensor of a manifold \mathcal{M} is defined by

R(X,Y)Z=\nabla_{X}\nabla_{Y}Z-\nabla_{Y}\nabla_{X}Z-\nabla_{\left[X,Y\right]}Z

on vector fields X,Y,Z. For any two tangent vectors \mathbf{\xi},\mathbf{\eta}\in T_{\operatorname{\mathbf{x}}}\mathcal{M}, we use Ric_{\operatorname{\mathbf{x}}}(\mathbf{\xi},\mathbf{\eta}) to denote the Ricci tensor evaluated at (\mathbf{\xi},\mathbf{\eta}), which is defined to be the trace of the mapping T_{\operatorname{\mathbf{x}}}\mathcal{M}\rightarrow T_{\operatorname{\mathbf{x}}}\mathcal{M} given by \mathbf{\zeta}\rightarrow R(\mathbf{\zeta},\mathbf{\eta})\mathbf{\xi}.

We use Ric\geq K to denote that the Ricci curvature of a manifold is bounded from below by K, in the sense that Ric_{\operatorname{\mathbf{x}}}(\mathbf{\xi},\mathbf{\xi})\geq K|\mathbf{\xi}|^{2} for all \operatorname{\mathbf{x}}\in\mathcal{M},\mathbf{\xi}\in T_{\operatorname{\mathbf{x}}}\mathcal{M}.
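For instance, the unit sphere S^{n}\subset R^{n+1} (n\geq 2) with the round metric satisfies

Ric_{\operatorname{\mathbf{x}}}(\mathbf{\xi},\mathbf{\xi})=(n-1)|\mathbf{\xi}|^{2}\quad\text{for all }\operatorname{\mathbf{x}}\in S^{n},\ \mathbf{\xi}\in T_{\operatorname{\mathbf{x}}}S^{n},

so Ric\geq n-1 in the above sense.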

Appendix B Proof of Theorem 3

Theorem 3 Let (\mathcal{M},g) be a connected, complete Riemannian manifold with Riemannian volume element \mathrm{d}m and positive Ricci curvature. Assume that the heat kernel is Lipschitz continuous. Let p^{t}, q^{t} be two evolving PDFs induced by the WGF of the negative entropy as defined in Definition 2, with corresponding probability measures {\bm{\mu}}^{t} and {\bm{\nu}}^{t}. If d^{2}_{W}({\bm{\mu}}^{0},{\bm{\nu}}^{0})<\infty, then p^{t}(\operatorname{\mathbf{x}})=q^{t}(\operatorname{\mathbf{x}}) almost everywhere as t\rightarrow\infty; furthermore, \int_{\operatorname{\mathbf{x}}\in\mathcal{M}}\|p^{t}(\operatorname{\mathbf{x}})-q^{t}(\operatorname{\mathbf{x}})\|^{2}\mathrm{d}m converges to 0 exponentially fast.

Proof  We start by introducing the following lemma, which is Proposition 4.4 in [20].

Lemma 8 ([20])

Assume Ric\geq K. Let ({\bm{\mu}}^{t}_{\mathcal{M}})_{t\geq 0} and ({\bm{\nu}}^{t}_{\mathcal{M}})_{t\geq 0} be two trajectories of the gradient flow of the negative entropy functional with initial distributions {\bm{\mu}}^{0}_{\mathcal{M}} and {\bm{\nu}}^{0}_{\mathcal{M}}, respectively. Then

d_{W}^{2}({\bm{\mu}}^{t}_{\mathcal{M}},{\bm{\nu}}^{t}_{\mathcal{M}})\leq e^{-Kt}d_{W}^{2}({\bm{\mu}}^{0}_{\mathcal{M}},{\bm{\nu}}^{0}_{\mathcal{M}}).

In particular, for a given initial value {\bm{\mu}}^{0}_{\mathcal{M}}, there is at most one trajectory of the gradient flow.
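As a quick numerical illustration of the contraction in Lemma 8: if K=1 and d_{W}^{2}({\bm{\mu}}^{0}_{\mathcal{M}},{\bm{\nu}}^{0}_{\mathcal{M}})=1, then

d_{W}^{2}({\bm{\mu}}^{t}_{\mathcal{M}},{\bm{\nu}}^{t}_{\mathcal{M}})\leq e^{-t},\quad\text{e.g., }d_{W}^{2}({\bm{\mu}}^{5}_{\mathcal{M}},{\bm{\nu}}^{5}_{\mathcal{M}})\leq e^{-5}\approx 6.7\times 10^{-3},

so the two trajectories contract toward each other exponentially fast; this is the rate that propagates through the rest of the proof.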

It remains to extend the result of Lemma 8 from probability measures to probability density functions. Following previous work [8], we first define the projection operator.

Definition 5 (Projection Operator)

[8] Let \pi^{i},\pi^{i,j} denote the projection operators defined on the product space \bar{\mathbf{X}}\coloneqq X_{1}\times...\times X_{N} such that

\pi^{i}:\ (\operatorname{\mathbf{x}}_{1},...,\operatorname{\mathbf{x}}_{N})\mapsto\operatorname{\mathbf{x}}_{i}\in X_{i},\ \pi^{i,j}:\ (\operatorname{\mathbf{x}}_{1},...,\operatorname{\mathbf{x}}_{N})\mapsto(\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{j})\in X_{i}\times X_{j}.

If \bar{{\bm{\mu}}}\in\mathcal{P}(\bar{\mathbf{X}}), the marginals of \bar{{\bm{\mu}}} are the probability measures

{\bm{\mu}}^{i}\coloneqq\pi^{i}_{\#}\bar{{\bm{\mu}}}\in\mathcal{P}(X_{i}),\ {\bm{\mu}}^{i,j}\coloneqq\pi^{i,j}_{\#}\bar{{\bm{\mu}}}\in\mathcal{P}(X_{i}\times X_{j}).

We then introduce the following lemma, which utilizes the projection operator.

Lemma 9

[8] Let X_{i}, i\in\mathbb{N}, be a sequence of Radon separable metric spaces, {\bm{\mu}}^{i}\in\mathcal{P}(X_{i}), and \alpha^{i(i+1)}\in\Gamma({\bm{\mu}}^{i},{\bm{\mu}}^{i+1}), \beta^{1i}\in\Gamma({\bm{\mu}}^{1},{\bm{\mu}}^{i}), where \Gamma({\bm{\mu}},{\bm{\nu}}) denotes the set of 2-plans (i.e., transportation plans between two distributions) with given marginals {\bm{\mu}},{\bm{\nu}}. Let \bar{\mathbf{X}}_{\infty}\coloneqq\prod_{i\in\mathbb{N}}X_{i}, equipped with the canonical product topology. Then there exist \bar{{\bm{\nu}}},\bar{{\bm{\mu}}}\in\mathcal{P}(\bar{\mathbf{X}}_{\infty}) such that

\pi^{i,i+1}_{\#}\bar{{\bm{\mu}}}=\alpha^{i(i+1)},\ \pi^{1,i}_{\#}\bar{{\bm{\nu}}}=\beta^{1i},\ \forall i\in\mathbb{N}. (9)

Now we are ready to prove the theorem.

We first construct a sequence of probability measures \{{\bm{\rho}}^{t}\}_{t\in\mathbb{N}} such that {\bm{\rho}}^{2t}={\bm{\mu}}^{t}_{\boldsymbol{\phi}},\ {\bm{\rho}}^{2t+1}={\bm{\mu}}^{t}_{\mathcal{M}},\ t\in\mathbb{N}. Then, we choose \alpha^{t(t+1)}\in\Gamma_{o}({\bm{\rho}}^{t},{\bm{\rho}}^{t+1}), i.e., the optimal 2-plans given the marginals {\bm{\rho}}^{t},{\bm{\rho}}^{t+1}. According to Lemma 9, we can find a probability measure \bar{{\bm{\mu}}}\in\mathcal{P}(\bar{\mathcal{M}}_{\infty}), where \bar{\mathcal{M}}_{\infty}\coloneqq\prod_{t\in\mathbb{N}}\mathcal{M}_{t}, satisfying (9). The 2-Wasserstein distance can then be written as:

d^{2}_{W}({\bm{\mu}}^{t}_{\boldsymbol{\phi}},{\bm{\mu}}^{t}_{\mathcal{M}})=d^{2}_{W}({\bm{\rho}}^{2t},{\bm{\rho}}^{2t+1})=\bm{d}(\pi^{2t},\pi^{2t+1})_{L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty})},

where \bm{d}(\cdot,\cdot)_{L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty})} denotes the distance in the L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty}) space.

According to Lemma 8,

d^{2}_{W}({\bm{\mu}}^{t}_{\boldsymbol{\phi}},{\bm{\mu}}^{t}_{\mathcal{M}})\leq e^{-Kt}d^{2}_{W}({\bm{\mu}}^{0}_{\boldsymbol{\phi}},{\bm{\mu}}^{0}_{\mathcal{M}}),

which means:

\bm{d}(\pi^{2t},\pi^{2t+1})_{L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty})}\leq e^{-Kt}d^{2}_{W}({\bm{\mu}}^{0}_{\boldsymbol{\phi}},{\bm{\mu}}^{0}_{\mathcal{M}}).

Using the marginal property of \bar{{\bm{\mu}}} and the Lipschitz continuity of the heat kernel (with constant L_{sup}), we then have

\displaystyle\int_{\mathcal{M}}\|k_{\boldsymbol{\phi}}^{t}-k_{\mathcal{M}}^{t}\|^{2}\mathrm{d}m =\int_{\mathcal{M}}(k_{\boldsymbol{\phi}}^{t})^{2}\mathrm{d}m+\int_{\mathcal{M}}(k_{\mathcal{M}}^{t})^{2}\mathrm{d}m-2\int_{\mathcal{M}}k_{\mathcal{M}}^{t}k_{\boldsymbol{\phi}}^{t}\mathrm{d}m
\displaystyle\leq\int_{\mathcal{M}}k_{\boldsymbol{\phi}}^{t}\mathrm{d}\mu^{t}_{\boldsymbol{\phi}}+\int_{\mathcal{M}}k_{\mathcal{M}}^{t}\mathrm{d}\mu^{t}_{\mathcal{M}}
\displaystyle=\int_{\bar{\mathcal{M}}_{\infty}}(k_{\boldsymbol{\phi}}^{t}\circ\pi^{2t})\mathrm{d}\bar{{\bm{\mu}}}+\int_{\bar{\mathcal{M}}_{\infty}}(k_{\mathcal{M}}^{t}\circ\pi^{2t+1})\mathrm{d}\bar{{\bm{\mu}}}
\displaystyle\leq\int_{\bar{\mathcal{M}}_{\infty}}\|k_{\boldsymbol{\phi}}^{t}\circ\pi^{2t}+k_{\mathcal{M}}^{t}\circ\pi^{2t+1}\|\mathrm{d}\bar{{\bm{\mu}}}
\displaystyle\leq(\int_{\bar{\mathcal{M}}_{\infty}}\|k_{\boldsymbol{\phi}}^{t}\circ\pi^{2t}+k_{\mathcal{M}}^{t}\circ\pi^{2t+1}\|^{2}\mathrm{d}\bar{{\bm{\mu}}})^{1/2},\ \text{by Jensen's inequality}
\displaystyle\leq(\int_{\bar{\mathcal{M}}_{\infty}}L_{sup}^{2}\|\pi^{2t}-\pi^{2t+1}\|^{2}\mathrm{d}\bar{{\bm{\mu}}})^{1/2}
\displaystyle=C\bm{d}(\pi^{2t},\pi^{2t+1})_{L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty})},\ \text{where $C$ is a constant from the Lipschitz continuity}
\displaystyle\leq Ce^{-Kt}d^{2}_{W}({\bm{\mu}}^{0}_{\boldsymbol{\phi}},{\bm{\mu}}^{0}_{\mathcal{M}}),

which completes the proof. Interested readers may also refer to Section 5.3 in [8] and the proof of Proposition 4.4 in [20] for more details.  

Appendix C Proof of Theorem 4 and Theorem 5

Theorem 5 Let (\mathcal{M},g) be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary \partial\mathcal{M}. Assume that \mathcal{M} has positive Ricci curvature bounded from below by K and dimension dim(\mathcal{M})\geq 1. Let k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}}) be the heat kernel of

\dfrac{\partial k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{\partial t}=\triangle_{\operatorname{\mathbf{x}}}k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}),\quad\dfrac{\partial k^{t}_{\mathcal{M}}}{\partial n}=0\text{ on }\partial\mathcal{M}\text{ if applicable,}

where n denotes the outward-pointing unit normal to the boundary \partial\mathcal{M}. Then

k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})\geq\dfrac{\Gamma(dim(\mathcal{M})/2+1)}{C(\epsilon)(\pi t)^{dim(\mathcal{M})/2}}\exp(\dfrac{\pi^{2}-\pi^{2}dim(\mathcal{M})}{(4-\epsilon)Kt})

for all \operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}\in\mathcal{M} and small \epsilon>0, where C(\epsilon) is a constant depending on \epsilon>0 and dim(\mathcal{M}) such that C(\epsilon)\rightarrow 0 as \epsilon\rightarrow 0, and \Gamma is the gamma function.

Proof  We start by introducing the following lemma:

Lemma 10

[21] Let \mathcal{M} be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary \partial\mathcal{M}. Suppose that the Ricci curvature of \mathcal{M} is non-negative. Let k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})=k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}) be the heat kernel of

\dfrac{\partial k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{\partial t}=\triangle_{\operatorname{\mathbf{x}}}k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}),\quad\dfrac{\partial k^{t}_{\mathcal{M}}}{\partial n}=0\text{ on }\partial\mathcal{M}\text{ if applicable.}

Then, the heat kernel satisfies

k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\geq C^{-1}(\epsilon)V^{-1}[B_{\mathcal{M},\operatorname{\mathbf{x}}}(\sqrt{t})]\exp(\dfrac{-d_{\mathcal{M}}^{2}(\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{(4-\epsilon)t})

for some constant C(\epsilon) depending on \epsilon>0 and n such that C(\epsilon)\rightarrow 0 as \epsilon\rightarrow 0, where B_{\mathcal{M},\operatorname{\mathbf{x}}}(\sqrt{t})\subseteq\mathcal{M} denotes the geodesic ball with radius \sqrt{t} around \operatorname{\mathbf{x}}. Moreover, by symmetrization,

k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\geq C^{-1}(\epsilon)V^{-1/2}[B_{\mathcal{M},\operatorname{\mathbf{x}}_{0}}(\sqrt{t})]V^{-1/2}[B_{\mathcal{M},\operatorname{\mathbf{x}}}(\sqrt{t})]\exp(\dfrac{-d_{\mathcal{M}}^{2}(\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{(4-\epsilon)t}).

Lemma 10 follows from Theorems 4.1 and 4.2 in [21]. Next, we introduce the curvature-dimension (CD) condition and Lemma 11.

Definition 6 (CD condition)

[38] For a Riemannian manifold \mathcal{M}, the curvature-dimension (CD) condition CD(K,d) is satisfied if and only if the dimension of the manifold is less than or equal to d and its Ricci curvature is bounded from below by K.

Lemma 11

[38] For every metric measure space (\mathcal{M},d_{\mathcal{M}},m) which satisfies the curvature-dimension condition CD(K,d) for some real numbers K>0 and d\geq 1, the support of m is compact and has diameter

L\leq\pi\sqrt{\dfrac{d-1}{K}},

where the diameter is defined as

L=\sup_{(\operatorname{\mathbf{x}},\operatorname{\mathbf{y}})\in\mathcal{M}\times\mathcal{M}}d_{\mathcal{M}}(\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}).
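As a sanity check, the unit sphere S^{d} with the round metric has Ricci curvature (d-1)g and dimension d, hence satisfies CD(d-1,d); Lemma 11 then gives

L\leq\pi\sqrt{\dfrac{d-1}{d-1}}=\pi,

which is exactly the geodesic diameter of S^{d}, so the bound is tight in this case.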

Now we are ready to prove the theorem.

Let (\mathcal{M},g) be a complete Riemannian manifold with positive Ricci curvature Ric\geq K and dimension dim(\mathcal{M}). Then

Ric\geq 0=\left[dim(\mathcal{M})-1\right]\cdot 0,

i.e., the Ricci curvature of \mathcal{M} is bounded from below by that of a space of constant sectional curvature 0, since Euclidean space can be seen as a manifold with constant sectional curvature 0. By the Bishop–Gromov inequality, for a manifold with non-negative Ricci curvature we have

V[B_{\mathcal{M},\operatorname{\mathbf{x}}}(r)]\leq V[B_{R^{dim(\mathcal{M})}}(r)],

i.e., the volume of a geodesic ball with radius r around \operatorname{\mathbf{x}} is at most the volume of a ball with the same radius in dim(\mathcal{M})-dimensional Euclidean space. Using the formula for the volume of an n-ball, we have

V[B_{R^{dim(\mathcal{M})}}(r)]=\dfrac{\pi^{dim(\mathcal{M})/2}}{\Gamma(dim(\mathcal{M})/2+1)}r^{dim(\mathcal{M})}.

Thus we have

V^{-1}[B_{\mathcal{M},\operatorname{\mathbf{x}}}(r)]\geq\dfrac{\Gamma(dim(\mathcal{M})/2+1)}{\pi^{dim(\mathcal{M})/2}}r^{-dim(\mathcal{M})}.

Using Lemma 10, we have:

k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\geq C^{-1}(\epsilon)\dfrac{\Gamma(dim(\mathcal{M})/2+1)}{\pi^{dim(\mathcal{M})/2}t^{dim(\mathcal{M})/2}}\exp(\dfrac{-d_{\mathcal{M}}^{2}(\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{(4-\epsilon)t}).

Using Lemma 11 and d_{\mathcal{M}}(\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\leq L\leq\pi\sqrt{\dfrac{dim(\mathcal{M})-1}{K}}, we have:

k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\geq\dfrac{\Gamma(dim(\mathcal{M})/2+1)}{C(\epsilon)\pi^{dim(\mathcal{M})/2}t^{dim(\mathcal{M})/2}}\exp(\dfrac{\pi^{2}-\pi^{2}dim(\mathcal{M})}{(4-\epsilon)Kt}).

We can now conclude Theorem 5. The term \exp(\dfrac{\pi^{2}-\pi^{2}dim(\mathcal{M})}{(4-\epsilon)Kt}) is increasing in t, while \dfrac{\Gamma(dim(\mathcal{M})/2+1)}{C(\epsilon)\pi^{dim(\mathcal{M})/2}t^{dim(\mathcal{M})/2}} decreases polynomially in t. Hence the lower bound on the heat kernel decreases to 0 polynomially in t, so the heat kernel itself decays to 0 at most polynomially in t.  

Theorem 4 Let (\mathcal{M},g) be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary \partial\mathcal{M}. Assume it has positive Ricci curvature, and let \mathrm{d}m denote its Riemannian volume element. Let k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}}) be the heat kernel of

\dfrac{\partial k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{\partial t}=\triangle_{\operatorname{\mathbf{x}}}k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}),\quad\dfrac{\partial k^{t}_{\mathcal{M}}}{\partial n}=0\text{ on }\partial\mathcal{M}\text{ if applicable,}

where n denotes the outward-pointing unit normal to the boundary \partial\mathcal{M}. Then \int_{\operatorname{\mathbf{x}}\in\mathcal{M}}\|k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})\|^{2}\mathrm{d}m converges to 0 at most polynomially as t\rightarrow\infty, which is slower than \int_{\operatorname{\mathbf{x}}\in\mathcal{M}}\|p^{t}(\operatorname{\mathbf{x}})-k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})\|^{2}\mathrm{d}m.

Proof  For a given manifold \mathcal{M}, the Riemannian volume element does not vary with time t. Thus, by Theorem 5, the lower bound on the integral \int_{\operatorname{\mathbf{x}}\in\mathcal{M}}\|k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})\|^{2}\mathrm{d}m, where k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})=k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}), also decreases to 0 polynomially. Theorem 4 then follows.  

Appendix D MMD and SMMD

In our proposed method for deep generative models, MMD is used as the objective function for the generator, which is defined as:

\displaystyle\textsf{MMD}^{2}(\mathbb{P},\mathbb{Q})= \mathbb{E}_{\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{j}\sim\mathbb{P}}(k_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{j}))+\mathbb{E}_{\operatorname{\mathbf{y}}_{i},\operatorname{\mathbf{y}}_{j}\sim\mathbb{Q}}(k_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}}_{i},\operatorname{\mathbf{y}}_{j}))
-2\mathbb{E}_{\operatorname{\mathbf{x}}_{i}\sim\mathbb{P},\operatorname{\mathbf{y}}_{j}\sim\mathbb{Q}}(k_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{y}}_{j})), (10)

where \mathbb{P},\mathbb{Q} are the probability distributions of the training data and the generated data, respectively. MMD measures the discrepancy between the two distributions, so the generator is trained to minimize it.
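As a concrete illustration, here is a minimal NumPy sketch of the empirical (biased) estimate of (10); the Gaussian kernel below is only a placeholder for the learned heat kernel k_{\boldsymbol{\phi}}, and all function names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Placeholder kernel; in our framework the learned heat kernel k_phi would be used.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x, y, kernel=gaussian_kernel):
    """Biased empirical estimate of MMD^2 between samples x ~ P and y ~ Q."""
    k_xx = kernel(x, x).mean()
    k_yy = kernel(y, y).mean()
    k_xy = kernel(x, y).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Example: two Gaussian samples with different means give a clearly positive MMD^2.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(128, 2))
y = rng.normal(1.0, 1.0, size=(128, 2))
print(mmd2(x, y))
```

In practice the unbiased estimator, which drops the diagonal terms k(\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{i}) and k(\operatorname{\mathbf{y}}_{i},\operatorname{\mathbf{y}}_{i}), is often preferred.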

The generator can also use SMMD [26] as its objective function, which is defined as:

\displaystyle\textsf{SMMD}^{2}(\mathbb{P},\mathbb{Q})=\sigma\textsf{MMD}^{2}(\mathbb{P},\mathbb{Q})
\displaystyle\sigma=\{\zeta+\mathbb{E}_{\operatorname{\mathbf{x}}\sim\mathbb{P}}\left[k_{\boldsymbol{\phi}}(t,\operatorname{\mathbf{x}},\operatorname{\mathbf{x}})\right]+\sum_{i=1}^{d}\mathbb{E}_{\operatorname{\mathbf{x}}\sim\mathbb{P}}\left[\dfrac{\partial^{2}k_{\boldsymbol{\phi}}(t,\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})}{\partial\operatorname{\mathbf{y}}_{i}\partial\operatorname{\mathbf{z}}_{i}}|_{(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})=(\operatorname{\mathbf{x}},\operatorname{\mathbf{x}})}\right]\}^{-1}, (11)

where d is the dimensionality of the data, \operatorname{\mathbf{y}}_{i} denotes the i^{th} element of \operatorname{\mathbf{y}}, and \zeta is a hyper-parameter.
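As a sanity check of (11), substitute a Gaussian RBF kernel k(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})=\exp(-\|\operatorname{\mathbf{y}}-\operatorname{\mathbf{z}}\|^{2}/(2h^{2})) for k_{\boldsymbol{\phi}}(t,\cdot,\cdot). Then k(\operatorname{\mathbf{x}},\operatorname{\mathbf{x}})=1 and

\dfrac{\partial^{2}k(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})}{\partial\operatorname{\mathbf{y}}_{i}\partial\operatorname{\mathbf{z}}_{i}}\Big{|}_{(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})=(\operatorname{\mathbf{x}},\operatorname{\mathbf{x}})}=\dfrac{1}{h^{2}},

so \sigma=\{\zeta+1+d/h^{2}\}^{-1}: the scaling shrinks the objective when the kernel or its derivatives are large, and the learned kernel k_{\boldsymbol{\phi}} enters (11) in exactly the same way.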

Appendix E Algorithms

E.1 Algorithms for DGM with heat kernel learning

First, we introduce a similar objective function for kernels of the form (2), which can be used in place of (3.3.2).

Similar to (6), for kernels of the form (2), we have the following bound:

\displaystyle\|p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}},\operatorname{\mathbf{x}})-p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})\|\leq c_{1}\|\mathbb{E}\{\sin\{({\bm{\omega}}_{{\bm{\psi}}_{1}}^{t})^{\intercal}\left[h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\right]\}\}\|\|{\bm{\omega}}_{{\bm{\psi}}_{1}}^{t}\|\|\nabla_{\operatorname{\mathbf{x}}}h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\|_{\mathcal{F}}
\displaystyle+c_{2}\|\mathbb{E}\{\sin\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}})^{\intercal}\left[h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\right]\}\}\|\|{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}\|\|\nabla_{\operatorname{\mathbf{x}}}h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\|_{\mathcal{F}} (12)
\displaystyle\text{where }c_{1}=\|\operatorname{\mathbf{x}}-\operatorname{\mathbf{z}}\|,\ c_{2}=\|\operatorname{\mathbf{x}}-\operatorname{\mathbf{z}}\|\left[\|{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}\|+\|h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\|\big{\|}\dfrac{\partial{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}}{\partial h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})}\big{\|}_{\mathcal{F}}\right].

Incorporating this bound into an objective of the same form as (3.3.2), the optimization problem for learning kernels of the form (2) becomes:

\displaystyle\min_{\boldsymbol{\phi}}\ \alpha H(\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})+\beta d_{W}^{2}({\bm{\nu}},\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})-\lambda\mathbb{E}_{\operatorname{\mathbf{y}}\neq\operatorname{\mathbf{x}}}\left[p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}},\operatorname{\mathbf{x}})\right] (13)
\displaystyle+\mathbb{E}_{\operatorname{\mathbf{x}}\sim\mathbb{P},\operatorname{\mathbf{y}}\sim\mathbb{Q}}\left[\gamma_{1}k_{sin}(t,\operatorname{\mathbf{y}},\operatorname{\mathbf{x}})+\gamma_{2}\|h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})\|\right]
\displaystyle+\mathbb{E}_{\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{j}\sim\mathbb{P}}\left[\gamma_{3}k_{sin}(t,\operatorname{\mathbf{x}}_{j},\operatorname{\mathbf{x}}_{i})+\gamma_{4}\|h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{i})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{j})\|\right]
\displaystyle+\gamma_{5}\mathbb{E}_{\operatorname{\mathbf{x}}\sim\mathbb{P},\operatorname{\mathbf{y}}\neq\operatorname{\mathbf{x}}}\left[\|{\bm{\omega}}^{t}_{{\bm{\psi}}_{1}}\|+\|{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}\|+\|\dfrac{\partial{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}}{\partial h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})}\|_{\mathcal{F}}\right],
\displaystyle\text{where }k_{sin}(t,\operatorname{\mathbf{y}},\operatorname{\mathbf{x}})=\mathbb{E}\{\sin\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{1}})^{\intercal}\left[h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\right]\}\}+\mathbb{E}\{\sin\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}})^{\intercal}\left[h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\right]\}\}.

We now present the algorithm for DGM with heat kernel learning.

Algorithm 2 Deep Generative Model with Heat Kernel Learning
  Input: training data \{\operatorname{\mathbf{x}}_{i}\} on manifold \mathcal{M}_{\mathcal{P}}, generator g_{{\bm{\theta}}} (generated data denoted \{\operatorname{\mathbf{y}}_{i}\}), kernel parameterized by (1) or (2), all hyper-parameters in (3.3.2), time step \tau=\alpha/2\beta.
  for training epochs s do
     for iteration j do
        Sample \{\operatorname{\mathbf{x}}_{i}\}_{i=1}^{n} and \{\operatorname{\mathbf{y}}_{i}\}_{i=1}^{n}.
        Initialize the function p^{0}_{\boldsymbol{\phi}} and compute the corresponding {\bm{\nu}}=\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{0} by (3).
        for k=1 to m do
           Compute \tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{k\tau} by (3) and solve (3.3.2) or (13); update {\bm{\nu}}\leftarrow\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{k\tau}.
        end for
     end for
     Sample \{\operatorname{\mathbf{x}}_{i}\}_{i=1}^{n} and \{\operatorname{\mathbf{y}}_{i}\}_{i=1}^{n}. Update {\bm{\theta}} by minimizing the MMD computed with (1) or (2).
  end for

E.2 Some discussions

Instead of initializing {\bm{\nu}}=\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{0} by Equation (3), we can also simply initialize it to 1/n, i.e., the discrete uniform distribution. In this case, we set {\bm{\mu}}^{t}_{\boldsymbol{\phi}} at time t to be

{\bm{\mu}}^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})=\dfrac{\sum_{j=1}^{n}p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{j},\operatorname{\mathbf{x}})}{n\sum_{j=1}^{n}p^{0}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{j},\operatorname{\mathbf{x}})},

where the unknown constant \alpha^{t}_{\mathcal{M}} is again cancelled.

To approximate H(\tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}}), we may use either

H(\tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}})\approx\sum_{i=1}^{n}\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\operatorname{\mathbf{x}}_{i})\log\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\operatorname{\mathbf{x}}_{i})

or

H(\tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}})\approx\dfrac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{n}\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\operatorname{\mathbf{x}}_{i})\log p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{j},\operatorname{\mathbf{x}}_{i}).

In practice, these implementations may require different hyper-parameter settings and can perform differently. Furthermore, we observed that using the unnormalized density estimate {\bm{\mu}}^{t}_{\boldsymbol{\phi}} instead of \tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}} also leads to competitive results.
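For concreteness, the following NumPy sketch computes the unnormalized density estimate {\bm{\mu}}^{t}_{\boldsymbol{\phi}} above and the two entropy approximations. A simple Gaussian kernel stands in for the learned p^{t}_{\boldsymbol{\phi}}, and \tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}} is taken here to be the batch-normalized version of {\bm{\mu}}^{t}_{\boldsymbol{\phi}}; both are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def p_kernel(t, xs, x, bandwidth=1.0):
    # Placeholder for the learned kernel p_phi^t(x_j, x); any positive kernel
    # works for this illustration.  xs: (n, d) array of x_j, x: (d,) query point.
    d2 = np.sum((xs - x) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * (bandwidth + t)))

def mu_unnormalized(t, samples, x, p=p_kernel):
    """mu_phi^t(x) with the uniform initialization, as in the display above."""
    n = len(samples)
    return p(t, samples, x).sum() / (n * p(0.0, samples, x).sum())

def entropy_estimates(t, samples, p=p_kernel):
    """The two approximations of H(mu_tilde_phi^t) discussed above."""
    n = len(samples)
    mu = np.array([mu_unnormalized(t, samples, x, p) for x in samples])
    mu_tilde = mu / mu.sum()                       # batch-normalized density
    h1 = np.sum(mu_tilde * np.log(mu_tilde))       # first approximation
    log_p = np.log(np.stack([p(t, samples, xj) for xj in samples]))  # (n, n), rows indexed by j
    h2 = np.sum(mu_tilde[None, :] * log_p) / n     # second approximation
    return h1, h2

rng = np.random.default_rng(0)
data = rng.normal(size=(64, 2))
print(entropy_estimates(0.5, data))
```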

Appendix F Experimental Results and Settings on Improved SVGD

We provide some experimental settings here. Our implementation is based on TensorFlow with an Nvidia 2080 Ti GPU. To simplify the setting, the RBF kernel is used in all layers except the last one, which is learned by our method (1) with h_{\boldsymbol{\phi}} a 2-layer neural network. In other words, we only learn the parameter manifold for the last layer. Following [30], we run 20 trials on all datasets except Protein and Year, where 5 trials are used. In each trial, we randomly choose 90% of the dataset as the training set and the remaining 10% as the test set. For large datasets such as Year and Combined, we use the Adam optimizer with a batch size of 1000; a batch size of 100 is used for all other datasets. Before every update of the BNN parameters, we run Algorithm 1 with m=1, and 5 Adam update steps are used to solve (4). For matrix-valued SVGD, we use the same experimental setting, except that the number of update steps for solving (4) is chosen from \{1,2,5,10\} based on hyper-parameter tuning.

We report the average test log-likelihood in Table 3, from which we can also see that our proposed method improves model performance.

Table 3: Average test log-likelihood (\uparrow) for UCI regression.

Method | Combined | Concrete | Kin8nm | Protein | Wine | Year
SVGD | -2.832 ± 0.009 | -3.064 ± 0.034 | 0.964 ± 0.012 | -2.846 ± 0.003 | -0.997 ± 0.019 | -3.577 ± 0.002
HK-SVGD (ours) | -2.827 ± 0.009 | -3.015 ± 0.037 | 0.976 ± 0.007 | -2.838 ± 0.004 | -0.958 ± 0.021 | -3.559 ± 0.001
MSVGD-a | -2.824 ± 0.009 | -3.150 ± 0.054 | 0.956 ± 0.011 | -2.796 ± 0.004 | -0.980 ± 0.016 | -3.569 ± 0.001
MSVGD-m | -2.817 ± 0.009 | -3.207 ± 0.071 | 0.975 ± 0.011 | -2.755 ± 0.003 | -0.988 ± 0.018 | -3.561 ± 0.002
HK-MSVGD-a (ours) | -2.815 ± 0.012 | \mathbf{-3.011 ± 0.076} | 0.982 ± 0.011 | -2.800 ± 0.001 | \mathbf{-0.943 ± 0.016} | -3.549 ± 0.002
HK-MSVGD-m (ours) | \mathbf{-2.814 ± 0.013} | -3.157 ± 0.067 | \mathbf{0.989 ± 0.009} | \mathbf{-2.731 ± 0.004} | -1.013 ± 0.019 | \mathbf{-3.534 ± 0.001}

Instead of using particles, we further improve HK-SVGD by introducing a parameter generator, which takes Gaussian noise as input and outputs samples from the parameter distribution of the BNN. We model this generator with a 2-layer neural network and generate 10 samples at each iteration. We denote the resulting model as HK-ISVGD and compare it with vanilla SVGD and our proposed HK-SVGD. Results on UCI regression are shown in Table 4 and Table 5 (a sketch of the generator follows the tables). We can see that introducing the parameter sample generator leads to performance improvements on most of the datasets.

Table 4: Average test RMSE (\downarrow) for UCI regression with parameter generator.

Method | Combined | Concrete | Kin8nm | Protein | Wine | Year
SVGD | 4.088 ± 0.033 | 5.027 ± 0.116 | 0.093 ± 0.001 | 4.186 ± 0.017 | 0.645 ± 0.009 | 8.686 ± 0.010
HK-SVGD (ours) | 4.077 ± 0.035 | \mathbf{4.814 ± 0.112} | 0.091 ± 0.001 | 4.138 ± 0.019 | 0.624 ± 0.010 | 8.656 ± 0.007
HK-ISVGD (ours) | \mathbf{4.075 ± 0.035} | 4.824 ± 0.113 | \mathbf{0.089 ± 0.001} | \mathbf{4.094 ± 0.014} | \mathbf{0.616 ± 0.009} | \mathbf{8.611 ± 0.007}

Table 5: Average test log-likelihood (\uparrow) for UCI regression with parameter generator.

Method | Combined | Concrete | Kin8nm | Protein | Wine | Year
SVGD | -2.832 ± 0.009 | -3.064 ± 0.034 | 0.964 ± 0.012 | -2.846 ± 0.003 | -0.997 ± 0.019 | -3.577 ± 0.002
HK-SVGD (ours) | -2.827 ± 0.009 | \mathbf{-3.015 ± 0.037} | 0.976 ± 0.007 | -2.838 ± 0.004 | -0.958 ± 0.021 | \mathbf{-3.559 ± 0.001}
HK-ISVGD (ours) | \mathbf{-2.826 ± 0.008} | -3.073 ± 0.052 | \mathbf{0.989 ± 0.008} | \mathbf{-2.823 ± 0.003} | \mathbf{-0.943 ± 0.018} | -3.565 ± 0.001
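For reference, the sketch below shows the kind of 2-layer parameter generator used by HK-ISVGD: Gaussian noise in, BNN parameter samples out. The hidden width, activation, and initialization are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_generator(noise_dim, param_dim, hidden_dim=64):
    """A 2-layer MLP mapping Gaussian noise to BNN parameter samples."""
    w1 = rng.normal(scale=0.1, size=(noise_dim, hidden_dim))
    b1 = np.zeros(hidden_dim)
    w2 = rng.normal(scale=0.1, size=(hidden_dim, param_dim))
    b2 = np.zeros(param_dim)

    def generator(z):
        h = np.tanh(z @ w1 + b1)   # hidden layer
        return h @ w2 + b2         # parameter samples

    return generator

# Draw 10 parameter samples per iteration, as in the experiments above.
gen = make_generator(noise_dim=16, param_dim=100)
z = rng.normal(size=(10, 16))
theta_samples = gen(z)   # (10, 100) parameter samples for the BNN
```

In training, the generator's weights (rather than a fixed set of particles) would be updated so that its output samples follow the SVGD dynamics.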

Appendix G Model Architectures and Some Experimental Settings on DGM

We provide some experimental details of image generation here. Our implementation is based on TensorFlow with an Nvidia 2080 Ti GPU.

For CIFAR-10 and STL-10, we test two architectures: a DC-GAN based architecture [39] and a ResNet based architecture [40, 41]. The DC-GAN based architecture uses a 4-layer convolutional neural network (CNN) as the generator and a 7-layer CNN for h^{t}_{\boldsymbol{\phi}} in (1) and (2). In the ResNet based architecture, the generator and h^{t}_{\boldsymbol{\phi}} are both 10-layer ResNets. For ImageNet, we use the same ResNet based architecture as for CIFAR-10 and STL-10. For CelebA, the generator is a 10-layer ResNet, while h^{t}_{\boldsymbol{\phi}} is a 4-layer CNN.

For CIFAR-10, STL-10 and ImageNet, spectral normalization is used, and we additionally scale the weights after spectral normalization by 2 on CIFAR-10 and STL-10. We set \beta_{1}=0.5, \beta_{2}=0.999 for the Adam optimizer and m=1, n=64 in Algorithm 2. Only one Adam update step is used for solving (3.3.2). The output dimension of h^{t}_{\boldsymbol{\phi}} is set to 16. For all experiments with kernel (2), both f^{t}_{{\bm{\psi}}_{1}} and f^{t}_{{\bm{\psi}}_{2}} are parameterized by 2-layer fully connected neural networks.

For CelebA, we scale the kernel learning objective (3.3.2) by \sigma in (11), as in SMMD. Spectral regularization [26] is used. We set \beta_{1}=0.5, \beta_{2}=0.9 for the Adam optimizer and m=1, j=5, n=64 in Algorithm 2. Only one Adam update step is used for solving (3.3.2). The output dimension of h^{t}_{\boldsymbol{\phi}} is set to 1, because the scaled objective with an h^{t}_{\boldsymbol{\phi}} dimension larger than 1 is time-consuming.

For evaluation, CIFAR-10, STL-10 and ImageNet are evaluated on 100k generated images, while CelebA is evaluated on 50k generated images due to memory limitations.

Appendix H More Results on Image Generation

(a) HK    (b) HK-DK
Figure 2: Generated images on CIFAR-10 (32×32) with DC-GAN architecture.
(a) HK    (b) HK-DK
Figure 3: Generated images on CIFAR-10 (32×32) with ResNet architecture.
(a) HK    (b) HK-DK
Figure 4: Generated images on STL-10 (48×48) with DC-GAN architecture.
(a) HK    (b) HK-DK
Figure 5: Generated images on STL-10 (48×48) with ResNet architecture.
(a) HK    (b) HK-DK
Figure 6: Generated images on ImageNet (64×64).
(a) HK
Figure 7: Generated images on CelebA (160×160).