
Learning Manifold Implicitly via Explicit Heat-Kernel Learning

Yufan Zhou, Changyou Chen, Jinhui Xu
Department of Computer Science and Engineering
State University of New York at Buffalo
{yufanzho, changyou, jinhui}@buffalo.edu
Abstract

Manifold learning is a fundamental problem in machine learning with numerous applications. Most existing methods directly learn the low-dimensional embedding of data lying in a high-dimensional space, and usually lack the flexibility of being directly applicable to down-stream applications. In this paper, we propose the concept of implicit manifold learning, where manifold information is implicitly obtained by learning the associated heat kernel. A heat kernel is the solution of the corresponding heat equation, which describes how “heat” transfers on the manifold, and thus contains ample geometric information about the manifold. We provide both a practical algorithm and a theoretical analysis of our framework. The learned heat kernel can be applied to various kernel-based machine learning models, including deep generative models (DGMs) for data generation and Stein Variational Gradient Descent for Bayesian inference. Extensive experiments show that our framework achieves state-of-the-art results compared to existing methods on the two tasks.

1 Introduction

The manifold is an important concept in machine learning, where a typical assumption is that data are sampled from a low-dimensional manifold embedded in some high-dimensional space. There has been extensive research on utilizing the hidden geometric information of data samples [1, 2, 3]. For example, Laplacian eigenmap [1], a popular dimensionality-reduction algorithm, represents the low-dimensional manifold by a graph built from the neighborhood information of the data. Each data point serves as a node in the graph, edges are determined by methods like k-nearest neighbors, and weights are computed using a Gaussian kernel. With this graph, one can then compute essential quantities such as the graph Laplacian and its eigenvalues, which embed the data points into a k-dimensional space (using the eigenvectors corresponding to the k smallest non-zero eigenvalues), following the principle of preserving the proximity of data points in both the original and the embedded spaces. This approach ensures that, as the number of data samples goes to infinity, the graph Laplacian converges to the Laplace-Beltrami operator, a key operator defining the heat equation used in our approach.
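As a concrete illustration (not part of the paper), the following NumPy/SciPy sketch implements this classical pipeline: a k-nearest-neighbor graph with Gaussian weights, the graph Laplacian, and the embedding given by the eigenvectors of the smallest non-trivial eigenvalues. The neighborhood size, bandwidth, and embedding dimension are arbitrary illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors=10, sigma=1.0, embed_dim=2):
    """Embed the rows of X into embed_dim dimensions via Laplacian eigenmaps."""
    n = X.shape[0]
    D = cdist(X, X, metric="sqeuclidean")

    # k-nearest-neighbor graph with Gaussian (heat-kernel) edge weights.
    W = np.exp(-D / (2 * sigma ** 2))
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]       # skip self (distance 0)
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(n), n_neighbors)
    mask[rows, idx.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)                      # symmetrize the kNN graph

    # Graph Laplacian L = Deg - W; generalized eigenproblem L v = lambda Deg v.
    Deg = np.diag(W.sum(axis=1) + 1e-12)
    L = Deg - W
    vals, vecs = eigh(L, Deg)                                # eigenvalues in ascending order

    # Skip the trivial (near-zero) eigenvector; keep the next embed_dim ones.
    return vecs[:, 1:embed_dim + 1]
```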

In deep learning, there are also methods that try to directly learn the Riemannian metric of a manifold rather than its embedding. For example, [4] and [5] approximate the Riemannian metric using the Jacobian matrix of a function that maps latent variables to data samples.

Different from the aforementioned results that learn the embedding or Riemannian metric directly, we propose to learn the manifold implicitly by explicitly learning its associated heat kernel. The heat kernel describes heat diffusion on a manifold and thus encodes a great deal of its geometric information. Note that, unlike Laplacian eigenmap, which relies on graph construction, our proposed method targets the lower-level problem of directly learning the geometry-encoding heat kernel, which can subsequently be used in Laplacian eigenmap or diffusion map [1, 6], where a kernel function is required. Once the heat kernel is learned, it can be directly applied to a large family of kernel-based machine learning models, thus making the geometric information of the manifold more applicable to down-stream applications. A recent work [7] also utilizes the heat kernel in variational inference, but with a very different approach from ours. Specifically, our proposed framework approximates the unknown heat kernel by optimizing a deep neural network, based on the theory of Wasserstein Gradient Flows (WGFs) [8]. In summary, our paper has the following main contributions.

  • We introduce the concept of implicit manifold learning, which learns the geometric information of a manifold through its heat kernel, and propose a theoretically grounded and practically simple algorithm based on the WGF framework.

  • We demonstrate how to apply our framework to different applications. Specifically, we show that DGMs like MMD-GAN [9] are special cases of our proposed framework, thus bringing new insights into Generative Adversarial Networks (GANs). We further show that Stein Variational Gradient Descent (SVGD) [10] can also be improved using our framework.

  • Experiments suggest that our proposed framework achieves the state-of-the-art results for applications including image generation and Bayesian neural network regression with SVGD.

Relation with traditional kernel-based learning

Our proposed method is also related to kernel selection and kernel learning, and thus can be used to improve many kernel-based methods. Compared to pre-defined kernels, our learned kernels can seamlessly integrate the geometric information of the underlying manifold. Compared to some existing kernel-learning methods such as [11, 12], our framework is more theoretically motivated and practically superior. Furthermore, [11, 12] learn kernels by maximizing the Maximum Mean Discrepancy (MMD), which is not suitable when only one distribution is involved, e.g., when learning the parameter manifold in Bayesian inference.

2 Preliminaries

2.1 Riemannian Manifold

We use $\mathcal{M}$ to denote a manifold, and $\dim(\mathcal{M})$ to denote its dimensionality. We only briefly introduce the needed concepts, with formal definitions and details provided in the Appendix. A Riemannian manifold, $(\mathcal{M},g)$, is a real smooth manifold $\mathcal{M}\subset R^{d}$ equipped with an inner product, defined by a positive definite metric tensor $g$ varying smoothly on the tangent spaces of $\mathcal{M}$. Given an oriented Riemannian manifold, there exists a Riemannian volume element $\mathrm{d}m$ [13], which can be expressed in local coordinates as $\mathrm{d}m=\sqrt{|g|}\,\mathrm{d}x^{1}\wedge\cdots\wedge\mathrm{d}x^{d}$, where $|g|$ is the absolute value of the determinant of the metric tensor's matrix representation, and $\wedge$ denotes the exterior product of differential forms. The Riemannian volume element allows us to integrate functions on manifolds: let $f$ be a smooth, compactly supported function on $\mathcal{M}$; the integral of $f$ over $\mathcal{M}$ is defined as $\int_{\mathcal{M}}f\,\mathrm{d}m$. We can now define probability density functions (PDFs) on a manifold [14, 15]. Let ${\bm{\mu}}$ be a probability measure on $\mathcal{M}\subset R^{d}$ such that ${\bm{\mu}}(\mathcal{M})=1$. A PDF $p$ of ${\bm{\mu}}$ on $\mathcal{M}$ is a real, positive and integrable function satisfying ${\bm{\mu}}(S)=\int_{\mathbf{x}\in S}p(\mathbf{x})\,\mathrm{d}m,\ \forall S\subset\mathcal{M}$.
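As a standard worked example (not from the paper), consider the round 2-sphere of radius $r$ with spherical coordinates $(\theta,\varphi)$; its metric, volume element, and total volume are:

```latex
\[
g = \begin{pmatrix} r^{2} & 0 \\ 0 & r^{2}\sin^{2}\theta \end{pmatrix},
\qquad
\mathrm{d}m = \sqrt{|g|}\,\mathrm{d}\theta\wedge\mathrm{d}\varphi
            = r^{2}\sin\theta\,\mathrm{d}\theta\wedge\mathrm{d}\varphi,
\qquad
\int_{\mathcal{M}}\mathrm{d}m
  = \int_{0}^{2\pi}\!\!\int_{0}^{\pi} r^{2}\sin\theta\,\mathrm{d}\theta\,\mathrm{d}\varphi
  = 4\pi r^{2}.
\]
```

A uniform PDF on this sphere is then simply $p(\mathbf{x})=1/(4\pi r^{2})$, since $\int_{\mathcal{M}}p\,\mathrm{d}m=1$.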

The Ricci curvature tensor plays an important role in Riemannian geometry. It describes how a Riemannian manifold differs from Euclidean space, measured as the volume difference between a narrow conical piece of a small geodesic ball in the manifold and the corresponding piece of a ball with the same radius in Euclidean space. In this paper, we focus on Riemannian manifolds with positive Ricci curvature.

2.2 Heat Equation and Heat Kernel

The key ingredient in implicit manifold learning is the heat kernel, which encodes extensive geodesic information of the manifold. Intuitively, the heat kernel $k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})$ describes the process of heat diffusion on the manifold $\mathcal{M}$, given a heat source $\mathbf{x}_{0}$ and time $t$. Throughout the paper, when $\mathbf{x}_{0}$ is assumed to be fixed, we write $k_{\mathcal{M}}^{t}(\mathbf{x})$ for $k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})$ for notational simplicity. The heat equation and heat kernel are defined as follows.

Definition 1 ([16])

Let $(\mathcal{M},g)$ be a connected Riemannian manifold, and $\Delta$ be the Laplace-Beltrami operator on $\mathcal{M}$. The heat kernel $k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})$ is the minimal positive solution of the heat equation
$$\partial k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})/\partial t=\Delta_{\mathbf{x}}k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x}),\qquad \lim_{t\rightarrow 0^{+}}k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})=\delta_{\mathbf{x}_{0}}(\mathbf{x}).$$

Remarkably, the heat kernel encodes a massive amount of geometric information about the manifold and is closely related to the geodesic distance on the manifold.

Lemma 1 ([16])

For an arbitrary Riemannian manifold $\mathcal{M}$, $\log k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})\sim-d_{\mathcal{M}}^{2}(\mathbf{x}_{0},\mathbf{x})/(4t)$ as $t\rightarrow 0$, where $d_{\mathcal{M}}(\mathbf{x}_{0},\mathbf{x})$ is the geodesic distance on the manifold $\mathcal{M}$ and $\mathbf{x}_{0},\mathbf{x}\in\mathcal{M}$.

This relation indicates that learning the heat kernel also implicitly learns the corresponding manifold; for this reason, we call our method “implicit” manifold learning. It is known that the heat kernel is positive definite [17], contains all the information of the intrinsic geometry, and fully characterizes shapes up to isometry [18]. As a result, the heat kernel has been widely used in computer graphics [18, 19].
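For intuition, consider the Euclidean case (also used in the toy experiment of Section 4.1), where the heat kernel is the Gaussian transition density; taking logarithms shows the asymptotic behavior of Lemma 1, with $d_{\mathcal{M}}$ the Euclidean distance:

```latex
\[
k_{R^{d}}(t,\mathbf{x}_{0},\mathbf{x}) = (4\pi t)^{-d/2}\exp\!\Big(-\tfrac{\|\mathbf{x}-\mathbf{x}_{0}\|^{2}}{4t}\Big),
\qquad
\log k_{R^{d}}(t,\mathbf{x}_{0},\mathbf{x}) = -\tfrac{\|\mathbf{x}-\mathbf{x}_{0}\|^{2}}{4t} - \tfrac{d}{2}\log(4\pi t),
\]
```

and the first term dominates as $t\rightarrow 0$.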

2.3 Wasserstein Gradient Flows

Let $\mathcal{P}(\mathcal{M})$ denote the space of probability measures on $\mathcal{M}\subset R^{d}$. Assume that $\mathcal{P}(\mathcal{M})$ is endowed with a Riemannian geometry induced by the 2-Wasserstein distance, i.e., the distance between two probability measures ${\bm{\mu}}_{\mathcal{M}},{\bm{\nu}}_{\mathcal{M}}\in\mathcal{P}(\mathcal{M})$ is defined as $d_{W}^{2}({\bm{\mu}}_{\mathcal{M}},{\bm{\nu}}_{\mathcal{M}})=\inf_{\pi\in\Gamma({\bm{\mu}}_{\mathcal{M}},{\bm{\nu}}_{\mathcal{M}})}\int_{\mathcal{M}\times\mathcal{M}}\|\mathbf{x}-\mathbf{y}\|^{2}\,\mathrm{d}\pi$, where $\Gamma({\bm{\mu}}_{\mathcal{M}},{\bm{\nu}}_{\mathcal{M}})$ is the set of joint distributions on $\mathcal{M}\times\mathcal{M}$ whose two marginals equal ${\bm{\mu}}_{\mathcal{M}}$ and ${\bm{\nu}}_{\mathcal{M}}$, respectively. Let $F:\mathcal{P}(\mathcal{M})\rightarrow R$ be a functional on $\mathcal{P}(\mathcal{M})$, mapping a probability measure to a real value. A Wasserstein gradient flow describes the time evolution of a probability measure ${\bm{\mu}}_{\mathcal{M}}^{t}$, defined by the partial differential equation (PDE) $\partial{\bm{\mu}}_{\mathcal{M}}^{t}/\partial t=-\nabla_{W_{2}}F({\bm{\mu}}_{\mathcal{M}}^{t})$ for $t>0$, where $\nabla_{W_{2}}F({\bm{\mu}}_{\mathcal{M}})\triangleq-\nabla\cdot({\bm{\mu}}_{\mathcal{M}}\nabla(\partial F/\partial{\bm{\mu}}_{\mathcal{M}}))$. Importantly, there is a close relation between WGFs and the heat equation on a manifold.

Theorem 2 ([20])

Let $(\mathcal{M},g)$ be a connected and complete Riemannian manifold with Riemannian volume element $\mathrm{d}m$, and $\mathcal{P}_{2}(\mathcal{M})$ be the Wasserstein space of probability measures on $\mathcal{M}$ equipped with the 2-Wasserstein distance $d_{W}^{2}$. Let $({\bm{\mu}}_{\mathcal{M}}^{t})_{t>0}$ be a continuous curve in $\mathcal{P}_{2}(\mathcal{M})$. Then the following are equivalent:

1. $({\bm{\mu}}_{\mathcal{M}}^{t})_{t>0}$ is a trajectory of the gradient flow of the negative entropy $H({\bm{\mu}}^{t}_{\mathcal{M}})=\int_{\mathbf{x}\in\mathcal{M}}\log k_{\mathcal{M}}^{t}(\mathbf{x})\,\mathrm{d}{\bm{\mu}}_{\mathcal{M}}^{t}\triangleq F({\bm{\mu}}^{t}_{\mathcal{M}})$;

2. ${\bm{\mu}}^{t}_{\mathcal{M}}$ is given by ${\bm{\mu}}^{t}_{\mathcal{M}}(S)=\int_{\mathbf{x}\in S}k^{t}_{\mathcal{M}}(\mathbf{x})\,\mathrm{d}m$ for $t>0$, where $(k^{t}_{\mathcal{M}})_{t>0}$ is a solution to the heat equation $\partial k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})/\partial t=\Delta_{\mathbf{x}}k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})$, $\lim_{t\rightarrow 0^{+}}k_{\mathcal{M}}(t,\mathbf{x}_{0},\mathbf{x})=\delta_{\mathbf{x}_{0}}(\mathbf{x})$, satisfying $H({\bm{\mu}}_{\mathcal{M}}^{t})<\infty$ and, for all $0<s_{0}<s_{1}$, $\int_{s_{0}}^{s_{1}}\int_{\mathbf{x}\in\mathcal{M}}\|\triangle k^{t}_{\mathcal{M}}(\mathbf{x})\|^{2}/k^{t}_{\mathcal{M}}(\mathbf{x})\,\mathrm{d}m\,\mathrm{d}t<\infty$.

Different from [20], we use the term negative entropy instead of relative entropy for clarity, because relative entropy usually refers to the KL divergence. From the second statement of Theorem 2, one can see that $k^{t}_{\mathcal{M}}$ is the probability density function of ${\bm{\mu}}^{t}_{\mathcal{M}}$ on $\mathcal{M}$ [14, 15].

3 The Proposed Framework

Our approach to learning the heat kernel (and thus learning a manifold implicitly) is inspired by Theorem 2, which indicates that one can learn the probability density function (PDF) on the manifold from the corresponding WGF. To this end, we first define the evolving PDF induced by a WGF.

Definition 2 (Evolving PDF)

Let $(\mathcal{M},g)$ be a connected and complete Riemannian manifold with Riemannian volume element $\mathrm{d}m$, and let $({\bm{\mu}}^{t})_{t>0}$ be the trajectory of a WGF of the negative entropy $H({\bm{\mu}}^{t})=\int_{\mathbf{x}\in\mathcal{M}}\log p^{t}(\mathbf{x})\,\mathrm{d}{\bm{\mu}}^{t}$ with initial point ${\bm{\mu}}^{0}$. We call the evolving function $p^{t}$ satisfying ${\bm{\mu}}^{t}(S)=\int_{\mathbf{x}\in S}p^{t}(\mathbf{x})\,\mathrm{d}m$ $(\forall t\geq 0)$ the evolving PDF of ${\bm{\mu}}^{t}$ induced by the WGF.

In the following, we first present the theoretical foundation of heat-kernel learning, showing that two evolving PDFs induced by the WGF of negative entropy on a given manifold approach each other at an exponential rate, which indicates the learnability of the heat kernel. We then propose an efficient and practical algorithm, where a neural network and gradient descent are applied to learn the heat kernel. Finally, we apply our algorithm to Bayesian inference and DGMs as two down-stream applications. All proofs are provided in the Appendix.

3.1 Theoretical Foundation of Heat-Kernel Learning

Our goal in this section is to establish the feasibility and convergence speed of heat-kernel learning. We start with the following theorem.

Theorem 3

Under the same setting as in Definition 2, suppose the manifold has positive Ricci curvature, and let $p^{t}$, $q^{t}$ be two evolving PDFs induced by the WGF of negative entropy, with corresponding probability measures ${\bm{\mu}}^{t}$ and ${\bm{\nu}}^{t}$, respectively. If $d^{2}_{W}({\bm{\mu}}^{0},{\bm{\nu}}^{0})<\infty$, then $p^{t}(\mathbf{x})=q^{t}(\mathbf{x})$ almost everywhere as $t\rightarrow\infty$. Furthermore, $\int_{\mathbf{x}\in\mathcal{M}}\|p^{t}(\mathbf{x})-q^{t}(\mathbf{x})\|^{2}\,\mathrm{d}m$ converges to 0 exponentially fast.

Theorem 3 is a natural extension of Proposition 4.4 in [20], which states that two trajectories of the WGF of negative entropy approach each other; we extend this result from probability measures to evolving PDFs. Thus, if one can learn the trajectory of an evolving PDF $p^{t}(\mathbf{x})$, it can be used to approximate the true heat kernel $k^{t}_{\mathcal{M}}$ (which corresponds to $q^{t}$ in Theorem 3, by Theorem 2).

One potential issue is that if the heat kernel $k^{t}_{\mathcal{M}}$ itself converges to 0 fast enough as $t\rightarrow\infty$, the convergence result in Theorem 3 might be uninformative, because one would end up with an almost zero-valued function. Fortunately, we can prove that $\int_{\mathbf{x}\in\mathcal{M}}\|p^{t}(\mathbf{x})-k^{t}_{\mathcal{M}}(\mathbf{x})\|^{2}\,\mathrm{d}m$ converges faster than $\int_{\mathbf{x}\in\mathcal{M}}\|k^{t}_{\mathcal{M}}(\mathbf{x})\|^{2}\,\mathrm{d}m$.

Theorem 4

Let $(\mathcal{M},g)$ be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary $\partial\mathcal{M}$. Suppose the manifold has positive Ricci curvature. Then $\int_{\mathbf{x}\in\mathcal{M}}\|k^{t}_{\mathcal{M}}(\mathbf{x})\|^{2}\,\mathrm{d}m$ converges to 0 at most polynomially as $t\rightarrow\infty$.

In addition, we can prove a lower bound on the heat kernel $k^{t}_{\mathcal{M}}$ for the non-asymptotic case $t<\infty$. This plays an important role when developing our practical algorithm: we will incorporate the lower bound into the optimization via a Lagrange multiplier.

Theorem 5

Let $(\mathcal{M},g)$ be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary $\partial\mathcal{M}$. If $\mathcal{M}$ has positive Ricci curvature bounded by $K$ and its dimension satisfies $\dim(\mathcal{M})\geq 1$, we have
$$k^{t}_{\mathcal{M}}(\mathbf{x})\geq\dfrac{\Gamma(\dim(\mathcal{M})/2+1)}{C(\epsilon)(\pi t)^{\dim(\mathcal{M})/2}}\exp\Big(\dfrac{\pi^{2}-\pi^{2}\dim(\mathcal{M})}{(4-\epsilon)Kt}\Big)$$
for all $\mathbf{x}_{0},\mathbf{x}\in\mathcal{M}$ and small $\epsilon>0$, where $C(\epsilon)$ is a constant depending on $\epsilon>0$ and $d$ such that $C(\epsilon)\rightarrow 0$ as $\epsilon\rightarrow 0$, and $\Gamma$ is the gamma function.

Theorem 5 implies that for any finite time $t$, there is a lower bound on the heat kernel which depends on the time and the shape of the manifold, but is independent of the distance between $\mathbf{x}_{0}$ and $\mathbf{x}$. There also exists an upper bound [21], which depends on the geodesic distance between $\mathbf{x}_{0}$ and $\mathbf{x}$; however, as we will show later, the upper bound has little impact on our algorithm and is thus omitted here.

3.2 A Practical Heat-Kernel Learning Algorithm

We now propose a practical framework to solve the heat-kernel learning problem. We decompose the procedure into three steps: 1) constructing a parametric function $p^{t}_{\boldsymbol{\phi}}$ to approximate $p^{t}$ in Theorem 3; 2) bridging $p^{t}_{\boldsymbol{\phi}}$ and the corresponding ${\bm{\mu}}^{t}_{\boldsymbol{\phi}}$; and 3) updating ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$ by solving the WGF of negative entropy, leading to an evolving PDF $p^{t}_{\boldsymbol{\phi}}$. We emphasize that, by learning to evolve as a WGF, the time $t$ is not an explicit parameter to be learned.

Parameterization of $p^{t}_{\boldsymbol{\phi}}$

We use a deep neural network to parameterize the PDF. Because $k^{t}_{\mathcal{M}}(\mathbf{x})$ also depends on $\mathbf{x}_{0}$, we parameterize $p^{t}_{\boldsymbol{\phi}}$ as a function with two inputs, $p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x})$, which is the evolving PDF approximating the heat kernel (with certain initial conditions). To guarantee the positive semi-definiteness of a heat kernel, we utilize existing parameterizations based on deep neural networks [9, 11, 12], where [11] is a special case of [12]. We adopt two ways to construct the kernel. The first way is based on [9], where the parametric kernel is constructed as:

$$p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x})=\exp(-\|h^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0})-h^{t}_{\boldsymbol{\phi}}(\mathbf{x})\|^{2}) \qquad (1)$$

The second way is based on [12], and we construct a parametric kernel as:

$$p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x})=\mathbb{E}\{\cos\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{1}})^{\intercal}[h^{t}_{\boldsymbol{\phi}_{1}}(\mathbf{x}_{0})-h^{t}_{\boldsymbol{\phi}_{1}}(\mathbf{x})]\}\}+\mathbb{E}\{\cos\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\mathbf{x}_{0},\mathbf{x}})^{\intercal}[h^{t}_{\boldsymbol{\phi}_{1}}(\mathbf{x}_{0})-h^{t}_{\boldsymbol{\phi}_{1}}(\mathbf{x})]\}\} \qquad (2)$$

where $\boldsymbol{\phi}\triangleq\{\boldsymbol{\phi}_{1},{\bm{\psi}}_{1},{\bm{\psi}}_{2}\}$ in (2); $h^{t}_{\boldsymbol{\phi}},h^{t}_{\boldsymbol{\phi}_{1}}$ are neural networks; and ${\bm{\omega}}^{t}_{{\bm{\psi}}_{1}},{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\mathbf{x}_{0},\mathbf{x}}$ are samples from implicit distributions constructed using neural networks. Details on implementing (2) can be found in [12].

A potential issue with these two constructions is that they can only approximate functions whose maximum value is 1, i.e., $\max_{\mathbf{x}\in\mathcal{M}}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x})=p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{0},\mathbf{x}_{0})=1,\ \forall t>0$. In practice, this can be handled by scaling with an unknown time-dependent term, $a^{t}_{\mathcal{M}}=\max k_{\mathcal{M}}^{t}$, i.e., using $a^{t}_{\mathcal{M}}p_{\boldsymbol{\phi}}^{t}$. Because $a^{t}_{\mathcal{M}}$ depends only on $t$ and $\mathcal{M}$, it can be seen as a constant for fixed time $t$ and manifold $\mathcal{M}$. As we will show later, the unknown term $a^{t}_{\mathcal{M}}$ cancels out and thus does not affect our algorithm.
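A minimal PyTorch sketch of parameterization (1) is given below (not from the paper); the feature extractor $h^{t}_{\boldsymbol{\phi}}$ is an arbitrary small MLP, and its width and depth are illustrative choices. By construction the output is a positive semi-definite kernel whose maximum value is 1 on the diagonal, as discussed above.

```python
import torch
import torch.nn as nn

class HeatKernelNet(nn.Module):
    """Parametric kernel p_phi(x0, x) = exp(-||h_phi(x0) - h_phi(x)||^2), as in Eq. (1)."""

    def __init__(self, in_dim, feat_dim=32, hidden=64):
        super().__init__()
        self.h = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x0, x):
        # x0: (n, d), x: (m, d)  ->  kernel matrix of shape (n, m)
        f0, f = self.h(x0), self.h(x)
        sq_dist = torch.cdist(f0, f, p=2).pow(2)
        return torch.exp(-sq_dist)
```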

Bridging $p^{t}_{\boldsymbol{\phi}}$ and ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$

We rely on the WGF framework to learn the parameterized PDF. Note that, from Definition 2, $p^{t}_{\boldsymbol{\phi}}$ and ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$ are connected through the Riemannian volume element $\mathrm{d}m$. Thus, given $\mathrm{d}m$, if one is able to solve for ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$ in the WGF, $p^{t}_{\boldsymbol{\phi}}$ is also readily obtained. However, $\mathrm{d}m$ is typically intractable in practice. Furthermore, $p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})$ is a function of two inputs, which means that for $n$ data samples $\{\mathbf{x}_{i}\}_{i=1}^{n}$ there are $n$ evolving PDFs $\{p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{i},\cdot)\}^{n}_{i=1}$ and $n$ corresponding trajectories $\{{\bm{\mu}}^{t}_{\boldsymbol{\phi},i}\}_{i=1}^{n}$ to be solved, which is impractical.

To overcome this challenge, we propose to solve the WGF of $\tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}}\triangleq\sum_{i=1}^{n}{\bm{\mu}}^{t}_{\boldsymbol{\phi},i}/n$, the averaged probability measure of $\{{\bm{\mu}}^{t}_{\boldsymbol{\phi},i}\}_{i=1}^{n}$. We approximate the averaged measure by kernel density estimation (KDE) [22]: given samples $\{\mathbf{x}_{i}\}_{i=1}^{n}$ on a manifold $\mathcal{M}$ and the parametric function $p^{t}_{\boldsymbol{\phi}}$, we calculate the unnormalized average $\bar{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\mathbf{x}_{i})\approx a^{t}_{\mathcal{M}}\sum_{j=1}^{n}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})/n$. Consequently, the normalized average satisfying $\sum_{i=1}^{n}\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\mathbf{x}_{i})=1$ is formulated as:

$$\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\mathbf{x}_{i})=\dfrac{\sum_{j=1}^{n}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})}{\sum_{i=1}^{n}\sum_{j=1}^{n}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})}. \qquad (3)$$

We can see that the scalar $a^{t}_{\mathcal{M}}$ cancels out, so it does not affect our final algorithm.
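With a kernel module like the sketch above, the discrete approximation (3) amounts to averaging the kernel matrix over the heat-source index and normalizing; a hedged sketch:

```python
def averaged_measure(kernel, x):
    """Discrete approximation (3): normalized average of p_phi(x_j, x_i) over a batch x."""
    K = kernel(x, x)                      # (n, n) matrix with entries p_phi(x_j, x_i)
    mu_unnorm = K.mean(dim=0)             # average over heat sources x_j, one value per x_i
    return mu_unnorm / mu_unnorm.sum()    # normalize so the weights sum to one
```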

Updating ${\bm{\mu}}_{\boldsymbol{\phi}}^{t}$

Finally, we are left with solving for $\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}$ in the WGF. We follow the celebrated Jordan-Kinderlehrer-Otto (JKO) scheme [23] to solve for $\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}$ with the discrete approximation (3). The JKO scheme is rephrased in the following lemma under our setting.

Lemma 6 ([20])

Consider probability measures in $\mathcal{P}(\mathcal{M})$. Fix a time step $\tau>0$ and an initial value ${\bm{\mu}}^{0}$ with finite second moment. Recursively define a sequence $({\bm{\mu}}_{\tau}^{n})_{n\in\mathbb{N}}$ of local minimizers by ${\bm{\mu}}_{\tau}^{0}\coloneqq{\bm{\mu}}^{0}$, ${\bm{\mu}}_{\tau}^{n}\coloneqq\operatorname*{arg\,min}_{\eta}H(\eta)+d_{W}^{2}({\bm{\mu}}_{\tau}^{n-1},\eta)/(2\tau)$, where $d_{W}^{2}$ denotes the 2-Wasserstein distance. Further define the discrete trajectory $\bar{{\bm{\mu}}}_{\tau}^{0}\coloneqq{\bm{\mu}}^{0}$, $\bar{{\bm{\mu}}}_{\tau}^{t}\coloneqq{\bm{\mu}}_{\tau}^{n}$ if $t\in((n-1)\tau,n\tau]$. Then $\bar{{\bm{\mu}}}^{t}_{\tau}\rightarrow{\bm{\mu}}^{t}$ weakly as $\tau\rightarrow 0$ for all $t>0$, where $({\bm{\mu}}^{t})_{t>0}$ is a trajectory of the gradient flow of the negative entropy $H$.

Algorithm 1 Heat Kernel Learning
  Input: samples $\{\mathbf{x}_{i}\}_{i=1}^{n}$ on the manifold $\mathcal{M}$, a kernel parameterized by (1) or (2), hyper-parameters $\alpha$, $\beta$, $\lambda$, time step $\tau=\alpha/2\beta$.
  Initialize the function $p^{0}_{\boldsymbol{\phi}}$ and compute the corresponding ${\bm{\nu}}=\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{0}$ by (3).
  for $k=1$ to $m$ do
     Solve (4), where $\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{k\tau}$ is computed by (3). Update ${\bm{\nu}}\leftarrow\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{k\tau}$.
  end for

Based on Theorem 5 and Lemma 6, we know that to learn the kernel function at time $t<\infty$, we can use a Lagrange multiplier to define the following optimization problem for time $t$:

$$\min_{\boldsymbol{\phi}}\ \alpha H(\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})+\beta d_{W}^{2}({\bm{\nu}},\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})-\lambda\mathbb{E}_{\mathbf{x}_{i}\neq\mathbf{x}_{j}}\left[p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})\right], \qquad (4)$$

where $\mathbf{x}_{i},\mathbf{x}_{j}\in\mathcal{M}$; $\alpha,\beta,\lambda$ are hyper-parameters; the time step is $\tau=\alpha/2\beta$; and ${\bm{\nu}}$ is a given probability measure corresponding to a previous time. The last term enforces the lower-bound constraint on $p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})$ implied by Theorem 5. The Wasserstein term $d_{W}^{2}({\bm{\nu}},\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})$ can be approximated using the Sinkhorn algorithm [24]. Our final algorithm is described in Algorithm 1; further discussion is provided in the Appendix. Note that in practice, mini-batch training is often used to reduce the computational cost.
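Below is a rough PyTorch sketch of one JKO step of Algorithm 1, i.e., one (inner) solve of objective (4) on a mini-batch, with the Wasserstein term approximated by a hand-rolled Sinkhorn iteration [24]. The hyper-parameters, the entropic regularization `eps`, and the single gradient step per call are illustrative assumptions; in practice one may take several gradient steps per JKO step before updating ${\bm{\nu}}$.

```python
import torch

def sinkhorn_w2(a, b, x, eps=0.05, n_iters=100):
    """Entropy-regularized approximation of d_W^2 between two weighted point
    clouds supported on the same samples x (Sinkhorn iterations, cf. [24])."""
    C = torch.cdist(x, x, p=2).pow(2)              # ground cost ||x_i - x_j||^2
    K = torch.exp(-C / eps)                        # eps should be scaled to the data
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    P = u.unsqueeze(1) * K * v.unsqueeze(0)        # approximate transport plan
    return (P * C).sum()

def heat_kernel_step(kernel, x, nu, alpha, beta, lam, optimizer):
    """One gradient step on objective (4) for a mini-batch x and previous measure nu."""
    K = kernel(x, x)                                          # p_phi(x_j, x_i)
    mu_unnorm = K.mean(dim=0)
    mu = mu_unnorm / mu_unnorm.sum()                          # Eq. (3)
    neg_entropy = (mu * torch.log(mu_unnorm + 1e-12)).sum()   # H(mu), up to the constant in a_M
    w2 = sinkhorn_w2(nu, mu, x)                               # JKO proximal term
    mask = ~torch.eye(x.shape[0], dtype=torch.bool, device=x.device)
    lower_bound_term = K[mask].mean()                         # Lagrange term from Theorem 5
    loss = alpha * neg_entropy + beta * w2 - lam * lower_bound_term
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return mu.detach()                                        # becomes nu at the next JKO step
```

A usage sketch of the outer loop of Algorithm 1 would then be `nu = averaged_measure(kernel, x).detach()` followed by repeated calls `nu = heat_kernel_step(kernel, x, nu, alpha, beta, lam, opt)`.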

3.3 Applications

3.3.1 Learning Kernels in SVGD

SVGD [10] is a particle-based algorithm for approximate Bayesian inference, whose update involves a kernel function $k$. Given a set of particles $\{\mathbf{x}_{i}\}_{i=1}^{n}$, at iteration $l$, particle $\mathbf{x}_{i}$ is updated by

$$\mathbf{x}_{i}^{l+1}\leftarrow\mathbf{x}_{i}^{l}+\epsilon\phi(\mathbf{x}^{l}_{i}),\ \text{ where }\ \phi(\mathbf{x}^{l}_{i})=\dfrac{1}{n}\sum_{j=1}^{n}\left[\nabla\log q(\mathbf{x}_{j}^{l})k(\mathbf{x}_{j}^{l},\mathbf{x}_{i}^{l})+\nabla_{\mathbf{x}_{j}^{l}}k(\mathbf{x}_{j}^{l},\mathbf{x}_{i}^{l})\right]. \qquad (5)$$

Here $q(\cdot)$ is the target distribution to be sampled from. Usually, a pre-defined kernel such as the RBF kernel with the median trick is used in SVGD. Instead of using pre-defined kernels, we propose to improve SVGD with our heat-kernel learning method: we learn the evolving PDF and use it as the kernel function in SVGD. By alternating between learning the kernel with Algorithm 1 and updating particles with (5), manifold information can be conveniently encoded into SVGD.
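A hedged sketch of one SVGD update (5) with a learned kernel of the form above follows; `score_fn` is assumed to return $\nabla\log q$ row-wise, and the per-particle backward passes for the repulsive term are a simple (not efficient) way to differentiate a generic learned kernel.

```python
def svgd_update(x, score_fn, kernel, step_size=1e-2):
    """One SVGD step (5) for particles x of shape (n, d)."""
    n = x.shape[0]
    x = x.detach().requires_grad_(True)
    K = kernel(x, x.detach())                 # K[j, i] = k(x_j, x_i); grads flow through x_j only
    score = score_fn(x.detach())              # (n, d), row j is grad log q(x_j)
    drive = K.t() @ score                     # sum_j k(x_j, x_i) * grad log q(x_j)
    # Repulsive term: for each i, sum_j grad_{x_j} k(x_j, x_i), via one backward pass per column.
    repulse = torch.stack([
        torch.autograd.grad(K[:, i].sum(), x, retain_graph=True)[0].sum(dim=0)
        for i in range(n)
    ])
    phi = (drive + repulse) / n
    return (x + step_size * phi).detach()
```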

3.3.2 Learning Deep Generative Models

Our second application applies our framework to DGMs. Compared to SVGD, the application to DGMs is more involved, because there are two manifolds: the manifold of the training data $\mathcal{M}_{P}$ and the manifold of the generated data $\mathcal{M}_{{\bm{\theta}}}$. Furthermore, $\mathcal{M}_{{\bm{\theta}}}$ depends on the model parameters ${\bm{\theta}}$, and hence varies during training.

Let $g_{{\bm{\theta}}}$ denote a generator, a neural network parameterized by ${\bm{\theta}}$. Let the generated sample be $\mathbf{y}=g_{{\bm{\theta}}}({\bm{\epsilon}})$, with ${\bm{\epsilon}}$ random noise following some distribution such as the standard normal distribution. In our method, we assume that the learning process constitutes a manifold flow $(\mathcal{M}_{{\bm{\theta}}}^{s})_{s\geq 0}$, with $s$ denoting the generator's training step. After each generator update, samples from the generator are assumed to form a manifold. Our goal is to learn a generator such that $\mathcal{M}_{{\bm{\theta}}}^{\infty}$ approaches $\mathcal{M}_{P}$. Our method consists of two steps: learning the generator and learning the kernel (evolving PDF).

Learning the generator

We adopt two popular kernel-based quantities as objective functions for our generator, the Maximum Mean Discrepancy (MMD) [25] and the Scaled MMD (SMMD) [26], computed with our learned heat kernels. MMD and SMMD measure the difference between distributions, so we update the generator by minimizing them. Details of MMD and SMMD are given in the Appendix.
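For concreteness, a standard unbiased estimator of the squared MMD [25] with the learned kernel plugged in might look as follows (a sketch; the SMMD scaling and the paper's exact generator loss are deferred to the Appendix as stated above):

```python
def mmd2(kernel, x_real, x_fake):
    """Unbiased estimate of MMD^2 between real and generated batches, with k = p_phi."""
    Kxx = kernel(x_real, x_real)
    Kyy = kernel(x_fake, x_fake)
    Kxy = kernel(x_real, x_fake)
    n, m = Kxx.shape[0], Kyy.shape[0]
    term_xx = (Kxx.sum() - Kxx.diagonal().sum()) / (n * (n - 1))
    term_yy = (Kyy.sum() - Kyy.diagonal().sum()) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

# Generator step (sketch): minimize the MMD between generated and real batches.
# loss_g = mmd2(kernel, x_real, g_theta(noise)); loss_g.backward(); opt_g.step()
```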

Learning the kernel

Different from the simple single-manifold setting of Algorithm 1, we consider both the training-data manifold and the generated-data manifold when learning DGMs. As a result, instead of learning the heat kernel of $\mathcal{M}_{{\bm{\theta}}}^{s}$ or $\mathcal{M}_{\mathcal{P}}$, we propose to learn the heat kernel of a new connected manifold, $\widetilde{\mathcal{M}}^{s}$, that integrates both $\mathcal{M}_{{\bm{\theta}}}^{s}$ and $\mathcal{M}_{\mathcal{P}}$. We derive a regularized objective based on (4) to achieve this goal.

The idea is to initialize $\widetilde{\mathcal{M}}^{s}$ with one of the two manifolds, $\mathcal{M}_{{\bm{\theta}}}^{s}$ or $\mathcal{M}_{\mathcal{P}}$, and then extend it toward the other. Without loss of generality, we assume $\mathcal{M}^{s}_{{\bm{\theta}}}\subseteq\widetilde{\mathcal{M}}^{s}$ at the beginning. Note that it is unwise to assume $\mathcal{M}^{s}_{{\bm{\theta}}}\cup\mathcal{M}_{\mathcal{P}}\subseteq\widetilde{\mathcal{M}}^{s}$, since $\mathcal{M}^{s}_{{\bm{\theta}}}$ and $\mathcal{M}_{\mathcal{P}}$ could be very different at the beginning; as a result, $\widetilde{\mathcal{M}}^{s}=R^{d}$ might be the only case satisfying $\mathcal{M}^{s}_{{\bm{\theta}}}\cup\mathcal{M}_{\mathcal{P}}\subseteq\widetilde{\mathcal{M}}^{s}$, which contains no useful geometric information. We thus start with $\mathcal{M}^{s}_{{\bm{\theta}}}$ by considering $p^{t}_{\boldsymbol{\phi}}(\mathbf{y}_{i},\mathbf{y}_{j})$, $\mathbf{y}_{i},\mathbf{y}_{j}\in\mathcal{M}^{s}_{{\bm{\theta}}}\subset\widetilde{\mathcal{M}}^{s}$, in (4). Next, to incorporate the information of $\mathcal{M}_{\mathcal{P}}$, we consider $p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})$ in (4) and regularize it with $\|p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})-p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{z})\|$, where $\mathbf{y}\in\mathcal{M}^{s}_{{\bm{\theta}}}$, $\mathbf{x}\in\mathcal{M}_{\mathcal{P}}$, and $\mathbf{z}\in\widetilde{\mathcal{M}}^{s}$ is the closest point to $\mathbf{x}$ on $\widetilde{\mathcal{M}}^{s}$. This regularization constrains $\mathcal{M}_{\mathcal{P}}$ to be close to $\widetilde{\mathcal{M}}^{s}$ (extending $\widetilde{\mathcal{M}}^{s}$ to $\mathcal{M}_{\mathcal{P}}$). Since the norm regularization is infeasible to compute directly, we derive an upper bound below and use it instead. Specifically, for kernels of form (1), by Taylor expansion we have:

$$\|p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})-p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{z})\|\approx\|(\partial p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})/\partial\mathbf{x})(\mathbf{z}-\mathbf{x})\|\leq c\,p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})\|\nabla_{\mathbf{x}}h^{t}_{\boldsymbol{\phi}}(\mathbf{x})\|_{\mathcal{F}}\|h^{t}_{\boldsymbol{\phi}}(\mathbf{x})-h^{t}_{\boldsymbol{\phi}}(\mathbf{y})\| \qquad (6)$$

where $c=\|\mathbf{x}-\mathbf{z}\|$ and $\|\cdot\|_{\mathcal{F}}$ denotes the Frobenius norm. Considering $p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})$ in (4) leads to the same bound by symmetry.

Finally, we consider $p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{i},\mathbf{x}_{j})$ in (4) and regularize $\|p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})-p^{t}_{\boldsymbol{\phi}}(\mathbf{z}_{j},\mathbf{z}_{i})\|$, where $\mathbf{x}_{i},\mathbf{x}_{j}\in\mathcal{M}_{\mathcal{P}}$, and $\mathbf{z}_{i},\mathbf{z}_{j}\in\widetilde{\mathcal{M}}^{s}$ are the closest points to $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ on $\widetilde{\mathcal{M}}^{s}$. A similar bound can be obtained.

Furthermore, instead of directly bounding the product of terms in (6), we find it more stable to bound each component separately. Note that $\|\nabla_{\mathbf{x}}h^{t}_{\boldsymbol{\phi}}(\mathbf{x})\|_{\mathcal{F}}$ can be bounded from above using spectral normalization [27] or incorporated into the objective function as in [26]; we do not explicitly include it in our objective function. As a result, with the base kernel parameterization (1), our optimization problem becomes:

$$\min_{\boldsymbol{\phi}}\ \alpha H(\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})+\beta d_{W}^{2}({\bm{\nu}},\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})-\lambda\mathbb{E}_{\mathbf{y}\neq\mathbf{x}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})\big]+\mathbb{E}_{\mathbf{x}\sim\mathbb{P},\mathbf{y}\sim\mathbb{Q}}\big[\gamma_{1}p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})+\gamma_{2}\|h^{t}_{\boldsymbol{\phi}}(\mathbf{x})-h^{t}_{\boldsymbol{\phi}}(\mathbf{y})\|\big]+\mathbb{E}_{\mathbf{x}_{i},\mathbf{x}_{j}\sim\mathbb{P}}\big[\gamma_{3}p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})+\gamma_{4}\|h^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{i})-h^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j})\|\big], \qquad (7)$$

where

$$\mathbb{E}_{\mathbf{y}\neq\mathbf{x}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})\big]=\dfrac{1}{4}\Big\{\mathbb{E}_{\mathbf{y}_{i},\mathbf{y}_{j}\sim\mathbb{Q}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{y}_{j},\mathbf{y}_{i})\big]+2\,\mathbb{E}_{\mathbf{x}\sim\mathbb{P},\mathbf{y}\sim\mathbb{Q}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{y},\mathbf{x})\big]+\mathbb{E}_{\mathbf{x}_{i},\mathbf{x}_{j}\sim\mathbb{P}}\big[p^{t}_{\boldsymbol{\phi}}(\mathbf{x}_{j},\mathbf{x}_{i})\big]\Big\}. \qquad (8)$$

Our algorithm for DGMs with heat-kernel learning is given in the Appendix. When SMMD is used as the objective function for the generator, we also scale (7) by the same factor used to scale the SMMD.

Theoretical property and relation with existing methods:

Following the work in [26], we first study the continuity in weak topology of our kernel when applied in MMD. Continuity in the weak topology is an important property because it means the objective function can provide a useful signal to update the generator [26], without suffering from the sudden jumps of the Jensen-Shannon (JS) divergence or the Kullback-Leibler (KL) divergence [28].

Theorem 7

With (6) bounded, the MMD with our proposed kernel is continuous in the weak topology, i.e., if $\mathbb{Q}_{n}\xrightarrow{D}\mathbb{P}$ then $\textup{MMD}_{k_{\boldsymbol{\phi}}^{t}}(\mathbb{Q}_{n},\mathbb{P})\xrightarrow{}0$, where $\xrightarrow{D}$ denotes convergence in distribution.

The proof of Theorem 7 directly follows Theorem 2 in [26]. Plugging (8) into (7), it is interesting to see connections between existing methods and ours: 1) if one sets $\alpha=\beta=\gamma_{2}=\gamma_{3}=\gamma_{4}=0,\gamma_{1}=4,\lambda=4$, our method reduces to MMD-GAN [9]; furthermore, if the scaled objectives are used, it reduces to SMMD-GAN [26]; 2) if one sets $\alpha=\beta=\gamma_{2}=\gamma_{4}=0,\gamma_{1}+\gamma_{3}=4,\lambda=4$, our method reduces to MMD-GAN with the repulsive loss [29].

In summary, our method interprets the min-max game in MMD-based GANs from a kernel-learning point of view, where the discriminator tries to learn the heat kernel of some underlying manifold. As we will show in the experiments, our model achieves the best performance compared to related methods. Although there are several hyper-parameters in (7), the connection with GANs makes our model easy to tune: one can start from a kernel-based GAN, e.g., setting $\alpha=\beta=\gamma_{2}=\gamma_{3}=\gamma_{4}=0,\gamma_{1}=4,\lambda=4$ as in MMD-GAN, and only tune $\alpha,\beta$.

4 Experiments

4.1 A Toy Experiment

We illustrate the effectiveness of our method by comparing a learned PDF with the true heat kernel on the real line $R$, i.e., the 1-dimensional Euclidean space. In this setting, the heat kernel has the closed form $k(t,x_{0},x)=\exp\{-(x-x_{0})^{2}/4t\}/\sqrt{4\pi t}$, whose maximum value is $a^{t}_{\mathcal{M}}=1/\sqrt{4\pi t}$. We uniformly sample 512 points in $[-10,10]$ as training data, and the kernel is constructed by (1) with a 3-layer neural network. We assume that every gradient-descent update corresponds to a time step of $0.01$. The evolution of $a^{t}_{\mathcal{M}}p^{t}_{\boldsymbol{\phi}}(0,x)$ and $k^{t}_{\mathcal{M}}$ is shown in Figure 1.

4.2 Improved SVGD

Figure 1: Evolution of $a^{t}_{\mathcal{M}}p^{t}_{\boldsymbol{\phi}}$ (blue solid line) and the true heat kernel $k^{t}_{\mathcal{M}}$ (red dashed line) on $R$. Panels: (a) Iteration 1, (b) Iteration 5, (c) Iteration 20, (d) Iteration 50.

We next apply SVGD with the kernel learned by our framework to BNN regression on UCI datasets. For all experiments, a 2-layer BNN with 50 hidden units, 10 weight particles, and ReLU activations is used. We assign an isotropic Gaussian prior to the network weights. Recently, [30] proposed matrix-valued kernels for SVGD (denoted MSVGD-a and MSVGD-m); our method can also be used to improve them. Detailed experimental settings are provided in the Appendix due to space limitations. We denote our improved SVGD as HK-SVGD, and our improved matrix-valued SVGD as HK-MSVGD-a and HK-MSVGD-m. The results are reported in Table 1, which shows that our method improves both SVGD and matrix-valued SVGD. Additional test log-likelihoods are reported in the Appendix. One potential criticism is that 10 particles may not be sufficient to describe the parameter manifold well; we thus conduct extra experiments in which, instead of using particles, we use a 2-layer neural network to generate parameter samples for the BNN. These results are also reported in the Appendix, and consistently show better performance.

Table 1: Average test RMSE ()(\downarrow) for UCI regression.

Method                 Combined         Concrete         Kin8nm           Protein          Wine             Year
SVGD                   4.088 ± 0.033    5.027 ± 0.116    0.093 ± 0.001    4.186 ± 0.017    0.645 ± 0.009    8.686 ± 0.010
HK-SVGD (ours)         4.077 ± 0.035    4.814 ± 0.112    0.091 ± 0.001    4.138 ± 0.019    0.624 ± 0.010    8.656 ± 0.007
MSVGD-a                4.056 ± 0.033    4.869 ± 0.124    0.092 ± 0.001    3.997 ± 0.018    0.637 ± 0.008    8.637 ± 0.005
MSVGD-m                4.029 ± 0.033    4.721 ± 0.111    0.090 ± 0.001    3.852 ± 0.014    0.637 ± 0.009    8.594 ± 0.009
HK-MSVGD-a (ours)      4.020 ± 0.043    4.443 ± 0.138    0.090 ± 0.001    4.001 ± 0.004    0.614 ± 0.007    8.590 ± 0.010
HK-MSVGD-m (ours)      3.998 ± 0.046    4.552 ± 0.146    0.089 ± 0.001    3.762 ± 0.015    0.629 ± 0.008    8.533 ± 0.005

4.3 Deep Generative Models

Table 2: Results on image generation.

CelebA and ImageNet:

Method          CelebA FID (↓)   CelebA IS (↑)   ImageNet FID (↓)   ImageNet IS (↑)
WGAN-GP         29.2 ± 0.2       2.7 ± 0.1       65.7 ± 0.3         7.5 ± 0.1
SN-GAN          22.6 ± 0.1       2.7 ± 0.1       47.5 ± 0.1         11.2 ± 0.1
SMMD-GAN        18.4 ± 0.2       2.7 ± 0.1       38.4 ± 0.3         10.7 ± 0.2
SN-SMMD-GAN     12.4 ± 0.2       2.8 ± 0.1       36.6 ± 0.2         10.9 ± 0.1
Repulsive       10.5 ± 0.1       2.8 ± 0.1       31.0 ± 0.1         11.5 ± 0.1
HK (ours)        9.7 ± 0.1       2.9 ± 0.1       29.6 ± 0.1         11.8 ± 0.1
HK-DK (ours)    --               --              29.2 ± 0.1         11.9 ± 0.1

CIFAR-10 and STL-10:

DC-GAN architecture
Method          CIFAR-10 FID (↓)   CIFAR-10 IS (↑)   STL-10 FID (↓)   STL-10 IS (↑)
WGAN-GP         31.1 ± 0.2         6.9 ± 0.2         55.1             8.4 ± 0.1
SN-GAN          25.5               7.6 ± 0.1         43.2             8.8 ± 0.1
SMMD-GAN        31.5 ± 0.4         7.0 ± 0.1         43.7 ± 0.2       8.4 ± 0.1
SN-SMMD-GAN     25.0 ± 0.3         7.3 ± 0.1         40.6 ± 0.1       8.5 ± 0.1
CR-GAN          18.7               7.9               --               --
Repulsive       16.7               8.0               36.7             9.4
HK (ours)       14.9 ± 0.1         8.2 ± 0.1         31.8 ± 0.1       9.6 ± 0.1
HK-DK (ours)    13.2 ± 0.1         8.4 ± 0.1         30.3 ± 0.1       9.6 ± 0.1

ResNet architecture
SN-GAN          21.7 ± 0.2         8.2 ± 0.1         40.1 ± 0.5       9.1 ± 0.1
CR-GAN          14.6               8.4               --               --
Repulsive       12.2 ± 0.1         8.3 ± 0.1         25.3 ± 0.1       10.2 ± 0.1
Auto-GAN        12.4               8.6 ± 0.1         31.1             9.2 ± 0.1
HK (ours)       11.5 ± 0.1         8.4 ± 0.1         24.3 ± 0.1       10.5 ± 0.1
HK-DK (ours)    10.3 ± 0.1         8.6 ± 0.1         24.0 ± 0.1       10.5 ± 0.1

BigGAN setting
BigGAN          14.7               --                --               --
CR-BigGAN       11.7               --                --               --

Finally, we apply our framework to high-quality image generation. Four datasets are used in this experiment: CIFAR-10, STL-10, ImageNet, and CelebA. Following [26, 29], images are scaled to resolutions of $32\times 32$, $48\times 48$, $64\times 64$ and $160\times 160$, respectively. Following [26, 27], we test two architectures on CIFAR-10 and STL-10, and one architecture on CelebA and ImageNet. We report the standard Fréchet Inception Distance (FID) [31] and Inception Score (IS) [32] for evaluation. Architecture details and further experimental settings can be found in the Appendix.

We compare our method with popular and state-of-the-art GAN models under the same experimental setting, including WGAN-GP [33], SN-GAN [27], SMMD-GAN and SN-SMMD-GAN [26], CR-GAN [34], MMD-GAN with repulsive loss [29], and Auto-GAN [35]. The results are reported in Table 2, where HK and HK-DK denote our model with kernels (1) and (2), respectively. More results are provided in the Appendix. HK-DK exhibits some convergence issues on CelebA, hence no result is reported there. Our models achieve state-of-the-art results under the same experimental setting (i.e., the same or similar architectures). Furthermore, compared to Auto-GAN, which needs 43 hours to train on CIFAR-10 due to its expensive architecture search, our method (HK with the ResNet architecture) needs only 12 hours to obtain better results. Some randomly generated images are also provided in the Appendix.

5 Conclusion

We introduce the concept of implicit manifold learning, which implicitly learns the geometric information of an unknown manifold by learning the corresponding heat kernel. Both a theoretical analysis and a practical algorithm are derived. Our framework is flexible and can be applied to general kernel-based models, including DGMs and Bayesian inference. Extensive experiments suggest that our methods achieve consistently better results on different tasks compared to related methods.

Broader Impact

We propose a fundamentally novel method to implicitly learn the geometric information of a manifold by explicitly learning its associated heat kernel, which is the solution of the heat equation with given initial conditions. Our proposed method is general and can be applied in many down-stream applications. Specifically, it could be used to improve many kernel-related algorithms and applications. It may also inspire researchers in deep learning to borrow ideas from other fields (mathematics, physics, etc.) and apply them to their own research. This can benefit both fields and thus promote interdisciplinary research.

Acknowledgements

The research of the first and third authors was supported in part by NSF through grants CCF-1716400 and IIS-1910492.

References

  • [1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
  • [2] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • [3] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323–2326, 2000.
  • [4] Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent space oddity: on the curvature of deep generative models, 2017.
  • [5] Hang Shao, Abhishek Kumar, and P. Thomas Fletcher. The riemannian geometry of deep generative models, 2017.
  • [6] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
  • [7] Dimitris Kalatzis, David Eklund, Georgios Arvanitidis, and Søren Hauberg. Variational autoencoders with riemannian brownian motion priors. arXiv preprint arXiv:2002.05227, 2020.
  • [8] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [9] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
  • [10] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pages 2378–2386, 2016.
  • [11] Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, and Barnabas Poczos. Implicit kernel learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2007–2016, 2019.
  • [12] Yufan Zhou, Changyou Chen, and Jinhui Xu. Kernelnet: A data-dependent kernel parameterization for deep generative modeling, 2019.
  • [13] John M Lee. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006.
  • [14] Xavier Pennec. Probabilities and statistics on riemannian manifolds: Basic tools for geometric measurements. Citeseer, 1999.
  • [15] Omer Bobrowski and Sayan Mukherjee. The topology of probability distributions on manifolds. Probability theory and related fields, 161(3-4):651–686, 2015.
  • [16] Alexander Grigor’yan, Jiaxin Hu, and Ka-Sing Lau. Heat kernels on metric measure spaces. In Geometry and analysis of fractals, pages 147–207. Springer, 2014.
  • [17] John Lafferty and Guy Lebanon. Diffusion kernels on statistical manifolds. J. Mach. Learn. Res., 6:129–163, December 2005.
  • [18] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Computer graphics forum, volume 28, pages 1383–1392. Wiley Online Library, 2009.
  • [19] Keenan Crane, Clarisse Weischedel, and Max Wardetzky. Geodesics in heat: A new approach to computing distance based on heat flow. ACM Transactions on Graphics (TOG), 32(5):152, 2013.
  • [20] Matthias Erbar. The heat equation on manifolds as a gradient flow in the wasserstein space. In Annales de l’IHP Probabilités et statistiques, volume 46, pages 1–23, 2010.
  • [21] Peter Li and Shing Tung Yau. On the parabolic kernel of the schrödinger operator. Acta Mathematica, 156(1):153–201, 1986.
  • [22] Zdravko I Botev, Joseph F Grotowski, Dirk P Kroese, et al. Kernel density estimation via diffusion. The annals of Statistics, 38(5):2916–2957, 2010.
  • [23] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
  • [24] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.
  • [25] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample-problem. In Advances in neural information processing systems, pages 513–520, 2007.
  • [26] Michael Arbel, Dougal Sutherland, Mikołaj Bińkowski, and Arthur Gretton. On gradient regularizers for mmd gans. In Advances in Neural Information Processing Systems, pages 6700–6710, 2018.
  • [27] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  • [28] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 214–223, 2017.
  • [29] Wei Wang, Yuan Sun, and Saman Halgamuge. Improving mmd-gan training with repulsive loss function. In International Conference on Learning Representations, 2018.
  • [30] Dilin Wang, Ziyang Tang, Chandrajit Bajaj, and Qiang Liu. Stein variational gradient descent with matrix-valued kernels. In Advances in neural information processing systems, pages 7836–7846, 2019.
  • [31] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pages 6626–6637, 2017.
  • [32] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • [33] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • [34] Han Zhang, Zizhao Zhang, Augustus Odena, and Honglak Lee. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020.
  • [35] Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. Autogan: Neural architecture search for generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3224–3234, 2019.
  • [36] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
  • [37] Sigmundur Gudmundsson. An Introduction to Riemannian Geometry - Lecture Notes in Mathematics. Lund University, 2018.
  • [38] Karl-Theodor Sturm et al. On the geometry of metric measure spaces. Acta mathematica, 196(1):65–131, 2006.
  • [39] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [41] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. In International Conference on Learning Representations, 2018.

Appendix A Riemannian Manifold

Definition 3 (Manifold)

[36] Let $\mathcal{M}$ be a set. If there exists a set of coordinate systems $A$ for $\mathcal{M}$ satisfying the conditions below, $(\mathcal{M},A)$ is called an $n$-dimensional $C^{\infty}$ differentiable manifold, or simply a manifold.

  1. Each element $\phi$ of $A$ is a one-to-one mapping from $\mathcal{M}$ to some open subset of $R^{n}$;

  2. For all $\phi\in A$, given any one-to-one mapping $\psi$ from $\mathcal{M}$ to $R^{n}$, the following holds: $\psi\in A\Leftrightarrow\psi\circ\phi^{-1}$ is a $C^{\infty}$ diffeomorphism.

By a $C^{\infty}$ diffeomorphism, we mean that $\psi\circ\phi^{-1}$ and its inverse $\phi\circ\psi^{-1}$ are both $C^{\infty}$ (infinitely differentiable). Infinite differentiability is not actually necessary; one may read this condition as "sufficiently smooth".

We will use T_{\operatorname{\mathbf{x}}}\mathcal{M} to denote the tangent space of \mathcal{M} at a point \operatorname{\mathbf{x}}, and X,Y,Z to denote vector fields.

Definition 4 (Riemannian Metric and Riemannian Manifold)

[37] Let \mathcal{M} be a manifold, C^{\infty}(\mathcal{M}) be the commutative ring of smooth functions on \mathcal{M}, and C^{\infty}(\mathcal{TM}) be the set of smooth vector fields on \mathcal{M}, which forms a module over C^{\infty}(\mathcal{M}). A Riemannian metric g on \mathcal{M} is a tensor field g:C^{\infty}(\mathcal{TM})\otimes C^{\infty}(\mathcal{TM})\rightarrow C^{\infty}(\mathcal{M}) such that for each \operatorname{\mathbf{x}}\in\mathcal{M}, the restriction g_{\operatorname{\mathbf{x}}} of g to the tensor product T_{\operatorname{\mathbf{x}}}\mathcal{M}\otimes T_{\operatorname{\mathbf{x}}}\mathcal{M}, given by

g_{\operatorname{\mathbf{x}}}:(X_{\operatorname{\mathbf{x}}},Y_{\operatorname{\mathbf{x}}})\rightarrow g(X,Y)(\operatorname{\mathbf{x}}),

is a real scalar product on the tangent space T_{\operatorname{\mathbf{x}}}\mathcal{M}. The pair (\mathcal{M},g) is called a Riemannian manifold. The geometric properties of (\mathcal{M},g) that depend only on the metric g are said to be intrinsic or metric properties.

A classical example: the Riemannian manifold E^{m}=(R^{m},\langle\cdot,\cdot\rangle_{R^{m}}) is nothing but the m-dimensional Euclidean space.

The Riemannian curvature tensor of a manifold \mathcal{M} is defined by

R(X,Y)Z=\nabla_{X}\nabla_{Y}Z-\nabla_{Y}\nabla_{X}Z-\nabla_{\left[X,Y\right]}Z

on vector fields X,Y,Z. For any two tangent vectors \mathbf{\xi},\mathbf{\eta}\in T_{\operatorname{\mathbf{x}}}\mathcal{M}, we use Ric_{\operatorname{\mathbf{x}}}(\mathbf{\xi},\mathbf{\eta}) to denote the Ricci tensor evaluated at (\mathbf{\xi},\mathbf{\eta}), which is defined to be the trace of the mapping T_{\operatorname{\mathbf{x}}}\mathcal{M}\rightarrow T_{\operatorname{\mathbf{x}}}\mathcal{M} given by \mathbf{\zeta}\rightarrow R(\mathbf{\zeta},\mathbf{\eta})\mathbf{\xi}.

We use Ric\geq K to denote that the Ricci curvature of a manifold is bounded from below by K, in the sense that Ric_{\operatorname{\mathbf{x}}}(\mathbf{\xi},\mathbf{\xi})\geq K|\mathbf{\xi}|^{2} for all \operatorname{\mathbf{x}}\in\mathcal{M},\mathbf{\xi}\in T_{\operatorname{\mathbf{x}}}\mathcal{M}.
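For instance, the unit sphere S^{n}\subset R^{n+1} (n\geq 2) with the round metric satisfies

Ric_{\operatorname{\mathbf{x}}}(\mathbf{\xi},\mathbf{\xi})=(n-1)|\mathbf{\xi}|^{2}\quad\text{for all }\operatorname{\mathbf{x}}\in S^{n},\ \mathbf{\xi}\in T_{\operatorname{\mathbf{x}}}S^{n},

so Ric\geq n-1 in the above sense.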

Appendix B Proof of Theorem 3

Theorem 3 Let (\mathcal{M},g) be a connected, complete Riemannian manifold with Riemannian volume element \mathrm{d}m and positive Ricci curvature. Assume that the heat kernel is Lipschitz continuous. Let p^{t}, q^{t} be two evolving PDFs induced by the WGF of the negative entropy as defined in Definition 2, with corresponding probability measures {\bm{\mu}}^{t} and {\bm{\nu}}^{t}. If d^{2}_{W}({\bm{\mu}}^{0},{\bm{\nu}}^{0})<\infty, then p^{t}(\operatorname{\mathbf{x}})=q^{t}(\operatorname{\mathbf{x}}) almost everywhere as t\rightarrow\infty; furthermore, \int_{\operatorname{\mathbf{x}}\in\mathcal{M}}\|p^{t}(\operatorname{\mathbf{x}})-q^{t}(\operatorname{\mathbf{x}})\|^{2}\mathrm{d}m converges to 0 exponentially fast.

Proof  We start by introducing the following lemma, which is Proposition 4.4 in [20].

Lemma 8 ([20])

Assume Ric\geq K. Let ({\bm{\mu}}^{t}_{\mathcal{M}})_{t\geq 0} and ({\bm{\nu}}^{t}_{\mathcal{M}})_{t\geq 0} be two trajectories of the gradient flow of the negative entropy functional with initial distributions {\bm{\mu}}^{0}_{\mathcal{M}} and {\bm{\nu}}^{0}_{\mathcal{M}}, respectively. Then

d_{W}^{2}({\bm{\mu}}^{t}_{\mathcal{M}},{\bm{\nu}}^{t}_{\mathcal{M}})\leq e^{-Kt}d_{W}^{2}({\bm{\mu}}^{0}_{\mathcal{M}},{\bm{\nu}}^{0}_{\mathcal{M}}).

In particular, for a given initial value {\bm{\mu}}^{0}_{\mathcal{M}}, there is at most one trajectory of the gradient flow.
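As a quick numerical illustration of the contraction in Lemma 8: if K=1 and d_{W}^{2}({\bm{\mu}}^{0}_{\mathcal{M}},{\bm{\nu}}^{0}_{\mathcal{M}})=1, then

d_{W}^{2}({\bm{\mu}}^{t}_{\mathcal{M}},{\bm{\nu}}^{t}_{\mathcal{M}})\leq e^{-t},\quad\text{e.g., }d_{W}^{2}({\bm{\mu}}^{5}_{\mathcal{M}},{\bm{\nu}}^{5}_{\mathcal{M}})\leq e^{-5}\approx 6.7\times 10^{-3},

so the two trajectories contract toward each other exponentially fast; this is the rate that propagates through the rest of the proof.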

It remains to extend the result of Lemma 8 from probability measures to probability density functions. Following previous work [8], we first define the projection operator.

Definition 5 (Projection Operator)

[8] Let \pi^{i},\pi^{i,j} denote the projection operators defined on the product space \bar{\mathbf{X}}\coloneqq X_{1}\times...\times X_{N} such that

\pi^{i}:\ (\operatorname{\mathbf{x}}_{1},...,\operatorname{\mathbf{x}}_{N})\mapsto\operatorname{\mathbf{x}}_{i}\in X_{i},\ \pi^{i,j}:\ (\operatorname{\mathbf{x}}_{1},...,\operatorname{\mathbf{x}}_{N})\mapsto(\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{j})\in X_{i}\times X_{j}.

If \bar{{\bm{\mu}}}\in\mathcal{P}(\bar{\mathbf{X}}), the marginals of \bar{{\bm{\mu}}} are the probability measures

{\bm{\mu}}^{i}\coloneqq\pi^{i}_{\#}\bar{{\bm{\mu}}}\in\mathcal{P}(X_{i}),\ {\bm{\mu}}^{i,j}\coloneqq\pi^{i,j}_{\#}\bar{{\bm{\mu}}}\in\mathcal{P}(X_{i}\times X_{j}).

We then introduce the following lemma, which utilizes the projection operator.

Lemma 9

[8] Let X_{i}, i\in\mathbb{N}, be a sequence of Radon separable metric spaces, {\bm{\mu}}^{i}\in\mathcal{P}(X_{i}), and \alpha^{i(i+1)}\in\Gamma({\bm{\mu}}^{i},{\bm{\mu}}^{i+1}), \beta^{1i}\in\Gamma({\bm{\mu}}^{1},{\bm{\mu}}^{i}), where \Gamma({\bm{\mu}},{\bm{\nu}}) denotes the set of 2-plans (i.e., transportation plans between two distributions) with given marginals {\bm{\mu}},{\bm{\nu}}. Let \bar{\mathbf{X}}_{\infty}\coloneqq\prod_{i\in\mathbb{N}}X_{i}, equipped with the canonical product topology. Then there exist \bar{{\bm{\nu}}},\bar{{\bm{\mu}}}\in\mathcal{P}(\bar{\mathbf{X}}_{\infty}) such that

\pi^{i,i+1}_{\#}\bar{{\bm{\mu}}}=\alpha^{i(i+1)},\ \pi^{1,i}_{\#}\bar{{\bm{\nu}}}=\beta^{1i},\ \forall i\in\mathbb{N}. (9)

Now we are ready to prove the theorem.

We first construct a sequence of probability measures \{{\bm{\rho}}^{t}\}_{t\in\mathbb{N}} such that {\bm{\rho}}^{2t}={\bm{\mu}}^{t}_{\boldsymbol{\phi}},\ {\bm{\rho}}^{2t+1}={\bm{\mu}}^{t}_{\mathcal{M}},\ t\in\mathbb{N}. Then, we choose \alpha^{t(t+1)}\in\Gamma_{o}({\bm{\rho}}^{t},{\bm{\rho}}^{t+1}), i.e., the optimal 2-plans given the marginals {\bm{\rho}}^{t},{\bm{\rho}}^{t+1}. According to Lemma 9, we can find a probability measure \bar{{\bm{\mu}}}\in\mathcal{P}(\bar{\mathcal{M}}_{\infty}), where \bar{\mathcal{M}}_{\infty}\coloneqq\prod_{t\in\mathbb{N}}\mathcal{M}_{t}, satisfying (9). The 2-Wasserstein distance can then be written as:

d^{2}_{W}({\bm{\mu}}^{t}_{\boldsymbol{\phi}},{\bm{\mu}}^{t}_{\mathcal{M}})=d^{2}_{W}({\bm{\rho}}^{2t},{\bm{\rho}}^{2t+1})=\bm{d}(\pi^{2t},\pi^{2t+1})_{L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty})},

where \bm{d}(\cdot,\cdot)_{L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty})} denotes the distance in the L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty}) space.

According to Lemma 8,

d^{2}_{W}({\bm{\mu}}^{t}_{\boldsymbol{\phi}},{\bm{\mu}}^{t}_{\mathcal{M}})\leq e^{-Kt}d^{2}_{W}({\bm{\mu}}^{0}_{\boldsymbol{\phi}},{\bm{\mu}}^{0}_{\mathcal{M}}),

which means:

\bm{d}(\pi^{2t},\pi^{2t+1})_{L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty})}\leq e^{-Kt}d^{2}_{W}({\bm{\mu}}^{0}_{\boldsymbol{\phi}},{\bm{\mu}}^{0}_{\mathcal{M}}).

Using the marginal property of \bar{{\bm{\mu}}} and the Lipschitz continuity of the heat kernel (with constant L_{sup}), we then have

\displaystyle\int_{\mathcal{M}}\|k_{\boldsymbol{\phi}}^{t}-k_{\mathcal{M}}^{t}\|^{2}\mathrm{d}m =\int_{\mathcal{M}}(k_{\boldsymbol{\phi}}^{t})^{2}\mathrm{d}m+\int_{\mathcal{M}}(k_{\mathcal{M}}^{t})^{2}\mathrm{d}m-2\int_{\mathcal{M}}k_{\mathcal{M}}^{t}k_{\boldsymbol{\phi}}^{t}\mathrm{d}m
\displaystyle\leq\int_{\mathcal{M}}k_{\boldsymbol{\phi}}^{t}\mathrm{d}\mu^{t}_{\boldsymbol{\phi}}+\int_{\mathcal{M}}k_{\mathcal{M}}^{t}\mathrm{d}\mu^{t}_{\mathcal{M}}
\displaystyle=\int_{\bar{\mathcal{M}}_{\infty}}(k_{\boldsymbol{\phi}}^{t}\circ\pi^{2t})\mathrm{d}\bar{{\bm{\mu}}}+\int_{\bar{\mathcal{M}}_{\infty}}(k_{\mathcal{M}}^{t}\circ\pi^{2t+1})\mathrm{d}\bar{{\bm{\mu}}}
\displaystyle\leq\int_{\bar{\mathcal{M}}_{\infty}}\|k_{\boldsymbol{\phi}}^{t}\circ\pi^{2t}+k_{\mathcal{M}}^{t}\circ\pi^{2t+1}\|\mathrm{d}\bar{{\bm{\mu}}}
\displaystyle\leq(\int_{\bar{\mathcal{M}}_{\infty}}\|k_{\boldsymbol{\phi}}^{t}\circ\pi^{2t}+k_{\mathcal{M}}^{t}\circ\pi^{2t+1}\|^{2}\mathrm{d}\bar{{\bm{\mu}}})^{1/2},\ \text{by Jensen's inequality}
\displaystyle\leq(\int_{\bar{\mathcal{M}}_{\infty}}L_{sup}^{2}\|\pi^{2t}-\pi^{2t+1}\|^{2}\mathrm{d}\bar{{\bm{\mu}}})^{1/2}
\displaystyle=C\bm{d}(\pi^{2t},\pi^{2t+1})_{L^{2}(\bar{{\bm{\mu}}};\bar{\mathcal{M}}_{\infty})},\ \text{where $C$ is a constant from the Lipschitz continuity}
\displaystyle\leq Ce^{-Kt}d^{2}_{W}({\bm{\mu}}^{0}_{\boldsymbol{\phi}},{\bm{\mu}}^{0}_{\mathcal{M}}),

which completes the proof. Interested readers may also refer to Section 5.3 in [8] and the proof of Proposition 4.4 in [20] for more details.  

Appendix C Proof of Theorem 4 and Theorem 5

Theorem 5 Let (\mathcal{M},g) be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary \partial\mathcal{M}. Assume that \mathcal{M} has positive Ricci curvature bounded from below by K and dimension dim(\mathcal{M})\geq 1. Let k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}}) be the heat kernel of

\dfrac{\partial k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{\partial t}=\triangle_{\operatorname{\mathbf{x}}}k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}),\quad\dfrac{\partial k^{t}_{\mathcal{M}}}{\partial n}=0\text{ on }\partial\mathcal{M}\text{ if applicable,}

where n denotes the outward-pointing unit normal to the boundary \partial\mathcal{M}. Then

k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})\geq\dfrac{\Gamma(dim(\mathcal{M})/2+1)}{C(\epsilon)(\pi t)^{dim(\mathcal{M})/2}}\exp(\dfrac{\pi^{2}-\pi^{2}dim(\mathcal{M})}{(4-\epsilon)Kt})

for all \operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}\in\mathcal{M} and small \epsilon>0, where C(\epsilon) is a constant depending on \epsilon>0 and dim(\mathcal{M}) such that C(\epsilon)\rightarrow 0 as \epsilon\rightarrow 0, and \Gamma is the gamma function.

Proof  We start by introducing the following lemma:

Lemma 10

[21] Let \mathcal{M} be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary \partial\mathcal{M}. Suppose that the Ricci curvature of \mathcal{M} is non-negative. Let k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})=k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}) be the heat kernel of

\dfrac{\partial k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{\partial t}=\triangle_{\operatorname{\mathbf{x}}}k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}),\quad\dfrac{\partial k^{t}_{\mathcal{M}}}{\partial n}=0\text{ on }\partial\mathcal{M}\text{ if applicable.}

Then, the heat kernel satisfies

k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\geq C^{-1}(\epsilon)V^{-1}[B_{\mathcal{M},\operatorname{\mathbf{x}}}(\sqrt{t})]\exp(\dfrac{-d_{\mathcal{M}}^{2}(\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{(4-\epsilon)t})

for some constant C(\epsilon) depending on \epsilon>0 and n such that C(\epsilon)\rightarrow 0 as \epsilon\rightarrow 0, where B_{\mathcal{M},\operatorname{\mathbf{x}}}(\sqrt{t})\subseteq\mathcal{M} denotes the geodesic ball with radius \sqrt{t} around \operatorname{\mathbf{x}}. Moreover, by symmetrization,

k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\geq C^{-1}(\epsilon)V^{-1/2}[B_{\mathcal{M},\operatorname{\mathbf{x}}_{0}}(\sqrt{t})]V^{-1/2}[B_{\mathcal{M},\operatorname{\mathbf{x}}}(\sqrt{t})]\exp(\dfrac{-d_{\mathcal{M}}^{2}(\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{(4-\epsilon)t}).

Lemma 10 follows from Theorems 4.1 and 4.2 in [21]. Next, we introduce the curvature-dimension (CD) condition and Lemma 11.

Definition 6 (CD condition)

[38] For a Riemannian manifold \mathcal{M}, the curvature-dimension (CD) condition CD(K,d) is satisfied if and only if the dimension of the manifold is less than or equal to d and its Ricci curvature is bounded from below by K.

Lemma 11

[38] For every metric measure space (\mathcal{M},d_{\mathcal{M}},m) which satisfies the curvature-dimension condition CD(K,d) for some real numbers K>0 and d\geq 1, the support of m is compact and has diameter

L\leq\pi\sqrt{\dfrac{d-1}{K}},

where the diameter is defined as

L=\sup_{(\operatorname{\mathbf{x}},\operatorname{\mathbf{y}})\in\mathcal{M}\times\mathcal{M}}d_{\mathcal{M}}(\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}).
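As a sanity check, the unit sphere S^{d} with the round metric has Ricci curvature (d-1)g and dimension d, hence satisfies CD(d-1,d); Lemma 11 then gives

L\leq\pi\sqrt{\dfrac{d-1}{d-1}}=\pi,

which is exactly the geodesic diameter of S^{d}, so the bound is tight in this case.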

Now we are ready to prove the theorem.

Let (\mathcal{M},g) be a complete Riemannian manifold with positive Ricci curvature Ric\geq K and dimension dim(\mathcal{M}). Then

Ric\geq 0=\left[dim(\mathcal{M})-1\right]\cdot 0,

i.e., the Ricci curvature of \mathcal{M} is bounded from below by that of a space of constant sectional curvature 0, since Euclidean space can be seen as a manifold with constant sectional curvature 0. By the Bishop–Gromov inequality, for a manifold with non-negative Ricci curvature we have

V[B_{\mathcal{M},\operatorname{\mathbf{x}}}(r)]\leq V[B_{R^{dim(\mathcal{M})}}(r)],

i.e., the volume of a geodesic ball with radius r around \operatorname{\mathbf{x}} is at most the volume of a ball with the same radius in dim(\mathcal{M})-dimensional Euclidean space. Using the formula for the volume of an n-ball, we have

V[B_{R^{dim(\mathcal{M})}}(r)]=\dfrac{\pi^{dim(\mathcal{M})/2}}{\Gamma(dim(\mathcal{M})/2+1)}r^{dim(\mathcal{M})}.

Thus we have

V^{-1}[B_{\mathcal{M},\operatorname{\mathbf{x}}}(r)]\geq\dfrac{\Gamma(dim(\mathcal{M})/2+1)}{\pi^{dim(\mathcal{M})/2}}r^{-dim(\mathcal{M})}.

Using Lemma 10, we have:

k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\geq C^{-1}(\epsilon)\dfrac{\Gamma(dim(\mathcal{M})/2+1)}{\pi^{dim(\mathcal{M})/2}t^{dim(\mathcal{M})/2}}\exp(\dfrac{-d_{\mathcal{M}}^{2}(\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{(4-\epsilon)t}).

Using Lemma 11 and d_{\mathcal{M}}(\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\leq L\leq\pi\sqrt{\dfrac{dim(\mathcal{M})-1}{K}}, we have:

k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})\geq\dfrac{\Gamma(dim(\mathcal{M})/2+1)}{C(\epsilon)\pi^{dim(\mathcal{M})/2}t^{dim(\mathcal{M})/2}}\exp(\dfrac{\pi^{2}-\pi^{2}dim(\mathcal{M})}{(4-\epsilon)Kt}).

We can now conclude Theorem 5. The term \exp(\dfrac{\pi^{2}-\pi^{2}dim(\mathcal{M})}{(4-\epsilon)Kt}) is increasing in t, while \dfrac{\Gamma(dim(\mathcal{M})/2+1)}{C(\epsilon)\pi^{dim(\mathcal{M})/2}t^{dim(\mathcal{M})/2}} decreases polynomially in t. Hence the lower bound on the heat kernel decreases to 0 polynomially in t, so the heat kernel itself decays to 0 at most polynomially in t.  

Theorem 4 Let (\mathcal{M},g) be a complete Riemannian manifold without boundary, or a compact Riemannian manifold with convex boundary \partial\mathcal{M}. Assume it has positive Ricci curvature, and let \mathrm{d}m denote its Riemannian volume element. Let k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}}) be the heat kernel of

\dfrac{\partial k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}})}{\partial t}=\triangle_{\operatorname{\mathbf{x}}}k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}),\quad\dfrac{\partial k^{t}_{\mathcal{M}}}{\partial n}=0\text{ on }\partial\mathcal{M}\text{ if applicable,}

where n denotes the outward-pointing unit normal to the boundary \partial\mathcal{M}. Then \int_{\operatorname{\mathbf{x}}\in\mathcal{M}}\|k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})\|^{2}\mathrm{d}m converges to 0 at most polynomially as t\rightarrow\infty, which is slower than \int_{\operatorname{\mathbf{x}}\in\mathcal{M}}\|p^{t}(\operatorname{\mathbf{x}})-k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})\|^{2}\mathrm{d}m.

Proof  For a given manifold \mathcal{M}, the Riemannian volume element does not vary with time t. Thus, by Theorem 5, the lower bound on the integral \int_{\operatorname{\mathbf{x}}\in\mathcal{M}}\|k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})\|^{2}\mathrm{d}m, where k^{t}_{\mathcal{M}}(\operatorname{\mathbf{x}})=k_{\mathcal{M}}(t,\operatorname{\mathbf{x}}_{0},\operatorname{\mathbf{x}}), also decreases to 0 polynomially. Theorem 4 then follows.  

Appendix D MMD and SMMD

In our proposed method for deep generative models, MMD is used as the objective function for the generator, which is defined as:

\displaystyle\textsf{MMD}^{2}(\mathbb{P},\mathbb{Q})= \mathbb{E}_{\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{j}\sim\mathbb{P}}(k_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{j}))+\mathbb{E}_{\operatorname{\mathbf{y}}_{i},\operatorname{\mathbf{y}}_{j}\sim\mathbb{Q}}(k_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}}_{i},\operatorname{\mathbf{y}}_{j}))
-2\mathbb{E}_{\operatorname{\mathbf{x}}_{i}\sim\mathbb{P},\operatorname{\mathbf{y}}_{j}\sim\mathbb{Q}}(k_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{y}}_{j})), (10)

where \mathbb{P},\mathbb{Q} are the probability distributions of the training data and the generated data, respectively. MMD measures the discrepancy between the two distributions, so the generator is trained to minimize it.
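As a concrete illustration, here is a minimal NumPy sketch of the empirical (biased) estimate of (10); the Gaussian kernel below is only a placeholder for the learned heat kernel k_{\boldsymbol{\phi}}, and all function names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Placeholder kernel; in our framework the learned heat kernel k_phi would be used.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x, y, kernel=gaussian_kernel):
    """Biased empirical estimate of MMD^2 between samples x ~ P and y ~ Q."""
    k_xx = kernel(x, x).mean()
    k_yy = kernel(y, y).mean()
    k_xy = kernel(x, y).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Example: two Gaussian samples with different means give a clearly positive MMD^2.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(128, 2))
y = rng.normal(1.0, 1.0, size=(128, 2))
print(mmd2(x, y))
```

In practice the unbiased estimator, which drops the diagonal terms k(\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{i}) and k(\operatorname{\mathbf{y}}_{i},\operatorname{\mathbf{y}}_{i}), is often preferred.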

The generator can also use SMMD [26] as its objective function, which is defined as:

\displaystyle\textsf{SMMD}^{2}(\mathbb{P},\mathbb{Q})=\sigma\textsf{MMD}^{2}(\mathbb{P},\mathbb{Q})
\displaystyle\sigma=\{\zeta+\mathbb{E}_{\operatorname{\mathbf{x}}\sim\mathbb{P}}\left[k_{\boldsymbol{\phi}}(t,\operatorname{\mathbf{x}},\operatorname{\mathbf{x}})\right]+\sum_{i=1}^{d}\mathbb{E}_{\operatorname{\mathbf{x}}\sim\mathbb{P}}\left[\dfrac{\partial^{2}k_{\boldsymbol{\phi}}(t,\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})}{\partial\operatorname{\mathbf{y}}_{i}\partial\operatorname{\mathbf{z}}_{i}}|_{(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})=(\operatorname{\mathbf{x}},\operatorname{\mathbf{x}})}\right]\}^{-1}, (11)

where d is the dimensionality of the data, \operatorname{\mathbf{y}}_{i} denotes the i^{th} element of \operatorname{\mathbf{y}}, and \zeta is a hyper-parameter.
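As a sanity check of (11), substitute a Gaussian RBF kernel k(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})=\exp(-\|\operatorname{\mathbf{y}}-\operatorname{\mathbf{z}}\|^{2}/(2h^{2})) for k_{\boldsymbol{\phi}}(t,\cdot,\cdot). Then k(\operatorname{\mathbf{x}},\operatorname{\mathbf{x}})=1 and

\dfrac{\partial^{2}k(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})}{\partial\operatorname{\mathbf{y}}_{i}\partial\operatorname{\mathbf{z}}_{i}}\Big{|}_{(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})=(\operatorname{\mathbf{x}},\operatorname{\mathbf{x}})}=\dfrac{1}{h^{2}},

so \sigma=\{\zeta+1+d/h^{2}\}^{-1}: the scaling shrinks the objective when the kernel or its derivatives are large, and the learned kernel k_{\boldsymbol{\phi}} enters (11) in exactly the same way.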

Appendix E Algorithms

E.1 Algorithms for DGM with heat kernel learning

First, we introduce a similar objective function for kernels of the form (2), which can be used in place of (3.3.2).

Similar to (6), for kernels of the form (2), we have the following bound:

\displaystyle\|p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}},\operatorname{\mathbf{x}})-p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}},\operatorname{\mathbf{z}})\|\leq c_{1}\|\mathbb{E}\{\sin\{({\bm{\omega}}_{{\bm{\psi}}_{1}}^{t})^{\intercal}\left[h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\right]\}\}\|\|{\bm{\omega}}_{{\bm{\psi}}_{1}}^{t}\|\|\nabla_{\operatorname{\mathbf{x}}}h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\|_{\mathcal{F}}
\displaystyle+c_{2}\|\mathbb{E}\{\sin\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}})^{\intercal}\left[h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\right]\}\}\|\|{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}\|\|\nabla_{\operatorname{\mathbf{x}}}h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\|_{\mathcal{F}} (12)
\displaystyle\text{where }c_{1}=\|\operatorname{\mathbf{x}}-\operatorname{\mathbf{z}}\|,\ c_{2}=\|\operatorname{\mathbf{x}}-\operatorname{\mathbf{z}}\|\left[\|{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}\|+\|h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\|\big{\|}\dfrac{\partial{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}}{\partial h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})}\big{\|}_{\mathcal{F}}\right].

Incorporating this bound into an objective of the same form as (3.3.2), the optimization problem for learning kernels of the form (2) becomes:

\displaystyle\min_{\boldsymbol{\phi}}\ \alpha H(\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})+\beta d_{W}^{2}({\bm{\nu}},\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t})-\lambda\mathbb{E}_{\operatorname{\mathbf{y}}\neq\operatorname{\mathbf{x}}}\left[p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}},\operatorname{\mathbf{x}})\right] (13)
\displaystyle+\mathbb{E}_{\operatorname{\mathbf{x}}\sim\mathbb{P},\operatorname{\mathbf{y}}\sim\mathbb{Q}}\left[\gamma_{1}k_{sin}(t,\operatorname{\mathbf{y}},\operatorname{\mathbf{x}})+\gamma_{2}\|h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})\|\right]
\displaystyle+\mathbb{E}_{\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{x}}_{j}\sim\mathbb{P}}\left[\gamma_{3}k_{sin}(t,\operatorname{\mathbf{x}}_{j},\operatorname{\mathbf{x}}_{i})+\gamma_{4}\|h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{i})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{j})\|\right]
\displaystyle+\gamma_{5}\mathbb{E}_{\operatorname{\mathbf{x}}\sim\mathbb{P},\operatorname{\mathbf{y}}\neq\operatorname{\mathbf{x}}}\left[\|{\bm{\omega}}^{t}_{{\bm{\psi}}_{1}}\|+\|{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}\|+\|\dfrac{\partial{\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}}}{\partial h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})}\|_{\mathcal{F}}\right],
\displaystyle\text{where }k_{sin}(t,\operatorname{\mathbf{y}},\operatorname{\mathbf{x}})=\mathbb{E}\{\sin\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{1}})^{\intercal}\left[h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\right]\}\}+\mathbb{E}\{\sin\{({\bm{\omega}}^{t}_{{\bm{\psi}}_{2},\operatorname{\mathbf{x}},\operatorname{\mathbf{y}}})^{\intercal}\left[h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{y}})-h^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})\right]\}\}.

We now present the algorithm for DGM with heat kernel learning.

Algorithm 2 Deep Generative Model with Heat Kernel Learning
  Input: training data \{\operatorname{\mathbf{x}}_{i}\} on manifold \mathcal{M}_{\mathcal{P}}, generator g_{{\bm{\theta}}} (generated data denoted \{\operatorname{\mathbf{y}}_{i}\}), kernel parameterized by (1) or (2), all hyper-parameters in (3.3.2), time step \tau=\alpha/2\beta.
  for training epochs s do
     for iteration j do
        Sample \{\operatorname{\mathbf{x}}_{i}\}_{i=1}^{n} and \{\operatorname{\mathbf{y}}_{i}\}_{i=1}^{n}.
        Initialize the function p^{0}_{\boldsymbol{\phi}} and compute the corresponding {\bm{\nu}}=\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{0} by (3).
        for k=1 to m do
           Compute \tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{k\tau} by (3) and solve (3.3.2) or (13); update {\bm{\nu}}\leftarrow\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{k\tau}.
        end for
     end for
     Sample \{\operatorname{\mathbf{x}}_{i}\}_{i=1}^{n} and \{\operatorname{\mathbf{y}}_{i}\}_{i=1}^{n}. Update {\bm{\theta}} by minimizing the MMD computed with (1) or (2).
  end for

E.2 Some discussions

Instead of initializing {\bm{\nu}}=\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{0} by Equation (3), we can also simply initialize it to 1/n, i.e., the discrete uniform distribution. In this case, we set {\bm{\mu}}^{t}_{\boldsymbol{\phi}} at time t to be

{\bm{\mu}}^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}})=\dfrac{\sum_{j=1}^{n}p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{j},\operatorname{\mathbf{x}})}{n\sum_{j=1}^{n}p^{0}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{j},\operatorname{\mathbf{x}})},

where the unknown constant \alpha^{t}_{\mathcal{M}} is again cancelled.

To approximate H(\tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}}), we may use either

H(\tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}})\approx\sum_{i=1}^{n}\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\operatorname{\mathbf{x}}_{i})\log\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\operatorname{\mathbf{x}}_{i})

or

H(\tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}})\approx\dfrac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{n}\tilde{{\bm{\mu}}}_{\boldsymbol{\phi}}^{t}(\operatorname{\mathbf{x}}_{i})\log p^{t}_{\boldsymbol{\phi}}(\operatorname{\mathbf{x}}_{j},\operatorname{\mathbf{x}}_{i}).

In practice, these implementations may require different hyper-parameter settings and can perform differently. Furthermore, we observed that using the unnormalized density estimate {\bm{\mu}}^{t}_{\boldsymbol{\phi}} instead of \tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}} also leads to competitive results.
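For concreteness, the following NumPy sketch computes the unnormalized density estimate {\bm{\mu}}^{t}_{\boldsymbol{\phi}} above and the two entropy approximations. A simple Gaussian kernel stands in for the learned p^{t}_{\boldsymbol{\phi}}, and \tilde{{\bm{\mu}}}^{t}_{\boldsymbol{\phi}} is taken here to be the batch-normalized version of {\bm{\mu}}^{t}_{\boldsymbol{\phi}}; both are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def p_kernel(t, xs, x, bandwidth=1.0):
    # Placeholder for the learned kernel p_phi^t(x_j, x); any positive kernel
    # works for this illustration.  xs: (n, d) array of x_j, x: (d,) query point.
    d2 = np.sum((xs - x) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * (bandwidth + t)))

def mu_unnormalized(t, samples, x, p=p_kernel):
    """mu_phi^t(x) with the uniform initialization, as in the display above."""
    n = len(samples)
    return p(t, samples, x).sum() / (n * p(0.0, samples, x).sum())

def entropy_estimates(t, samples, p=p_kernel):
    """The two approximations of H(mu_tilde_phi^t) discussed above."""
    n = len(samples)
    mu = np.array([mu_unnormalized(t, samples, x, p) for x in samples])
    mu_tilde = mu / mu.sum()                       # batch-normalized density
    h1 = np.sum(mu_tilde * np.log(mu_tilde))       # first approximation
    log_p = np.log(np.stack([p(t, samples, xj) for xj in samples]))  # (n, n), rows indexed by j
    h2 = np.sum(mu_tilde[None, :] * log_p) / n     # second approximation
    return h1, h2

rng = np.random.default_rng(0)
data = rng.normal(size=(64, 2))
print(entropy_estimates(0.5, data))
```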

Appendix F Experimental Results and Settings on Improved SVGD

We provide some experimental settings here. Our implementation is based on TensorFlow with an Nvidia 2080 Ti GPU. To simplify the setting, the RBF kernel is used in all layers except the last one, which is learned by our method (1) with h_{\boldsymbol{\phi}} a 2-layer neural network. In other words, we only learn the parameter manifold for the last layer. Following [30], we run 20 trials on all datasets except Protein and Year, where 5 trials are used. In each trial, we randomly choose 90% of the dataset as the training set and the remaining 10% as the test set. For large datasets such as Year and Combined, we use the Adam optimizer with a batch size of 1000; a batch size of 100 is used for all other datasets. Before every update of the BNN parameters, we run Algorithm 1 with m=1, and 5 Adam update steps are used to solve (4). For matrix-valued SVGD, we use the same experimental setting, except that the number of update steps for solving (4) is chosen from \{1,2,5,10\} based on hyper-parameter tuning.

We report the average test log-likelihood in Table 3, from which we can also see that our proposed method improves model performance.

Table 3: Average test log-likelihood (\uparrow) for UCI regression.

Method | Combined | Concrete | Kin8nm | Protein | Wine | Year
SVGD | -2.832 ± 0.009 | -3.064 ± 0.034 | 0.964 ± 0.012 | -2.846 ± 0.003 | -0.997 ± 0.019 | -3.577 ± 0.002
HK-SVGD (ours) | -2.827 ± 0.009 | -3.015 ± 0.037 | 0.976 ± 0.007 | -2.838 ± 0.004 | -0.958 ± 0.021 | -3.559 ± 0.001
MSVGD-a | -2.824 ± 0.009 | -3.150 ± 0.054 | 0.956 ± 0.011 | -2.796 ± 0.004 | -0.980 ± 0.016 | -3.569 ± 0.001
MSVGD-m | -2.817 ± 0.009 | -3.207 ± 0.071 | 0.975 ± 0.011 | -2.755 ± 0.003 | -0.988 ± 0.018 | -3.561 ± 0.002
HK-MSVGD-a (ours) | -2.815 ± 0.012 | \mathbf{-3.011 ± 0.076} | 0.982 ± 0.011 | -2.800 ± 0.001 | \mathbf{-0.943 ± 0.016} | -3.549 ± 0.002
HK-MSVGD-m (ours) | \mathbf{-2.814 ± 0.013} | -3.157 ± 0.067 | \mathbf{0.989 ± 0.009} | \mathbf{-2.731 ± 0.004} | -1.013 ± 0.019 | \mathbf{-3.534 ± 0.001}

Instead of using particles, we further improve HK-SVGD by introducing a parameter generator, which takes Gaussian noise as input and outputs samples from the parameter distribution of the BNN. We model this generator with a 2-layer neural network and generate 10 samples at each iteration. We denote the resulting model as HK-ISVGD and compare it with vanilla SVGD and our proposed HK-SVGD. Results on UCI regression are shown in Table 4 and Table 5 (a sketch of the generator follows the tables). We can see that introducing the parameter sample generator leads to performance improvements on most of the datasets.

Table 4: Average test RMSE (\downarrow) for UCI regression with parameter generator.

Method | Combined | Concrete | Kin8nm | Protein | Wine | Year
SVGD | 4.088 ± 0.033 | 5.027 ± 0.116 | 0.093 ± 0.001 | 4.186 ± 0.017 | 0.645 ± 0.009 | 8.686 ± 0.010
HK-SVGD (ours) | 4.077 ± 0.035 | \mathbf{4.814 ± 0.112} | 0.091 ± 0.001 | 4.138 ± 0.019 | 0.624 ± 0.010 | 8.656 ± 0.007
HK-ISVGD (ours) | \mathbf{4.075 ± 0.035} | 4.824 ± 0.113 | \mathbf{0.089 ± 0.001} | \mathbf{4.094 ± 0.014} | \mathbf{0.616 ± 0.009} | \mathbf{8.611 ± 0.007}

Table 5: Average test log-likelihood (\uparrow) for UCI regression with parameter generator.

Method | Combined | Concrete | Kin8nm | Protein | Wine | Year
SVGD | -2.832 ± 0.009 | -3.064 ± 0.034 | 0.964 ± 0.012 | -2.846 ± 0.003 | -0.997 ± 0.019 | -3.577 ± 0.002
HK-SVGD (ours) | -2.827 ± 0.009 | \mathbf{-3.015 ± 0.037} | 0.976 ± 0.007 | -2.838 ± 0.004 | -0.958 ± 0.021 | \mathbf{-3.559 ± 0.001}
HK-ISVGD (ours) | \mathbf{-2.826 ± 0.008} | -3.073 ± 0.052 | \mathbf{0.989 ± 0.008} | \mathbf{-2.823 ± 0.003} | \mathbf{-0.943 ± 0.018} | -3.565 ± 0.001
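For reference, the sketch below shows the kind of 2-layer parameter generator used by HK-ISVGD: Gaussian noise in, BNN parameter samples out. The hidden width, activation, and initialization are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_generator(noise_dim, param_dim, hidden_dim=64):
    """A 2-layer MLP mapping Gaussian noise to BNN parameter samples."""
    w1 = rng.normal(scale=0.1, size=(noise_dim, hidden_dim))
    b1 = np.zeros(hidden_dim)
    w2 = rng.normal(scale=0.1, size=(hidden_dim, param_dim))
    b2 = np.zeros(param_dim)

    def generator(z):
        h = np.tanh(z @ w1 + b1)   # hidden layer
        return h @ w2 + b2         # parameter samples

    return generator

# Draw 10 parameter samples per iteration, as in the experiments above.
gen = make_generator(noise_dim=16, param_dim=100)
z = rng.normal(size=(10, 16))
theta_samples = gen(z)   # (10, 100) parameter samples for the BNN
```

In training, the generator's weights (rather than a fixed set of particles) would be updated so that its output samples follow the SVGD dynamics.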

Appendix G Model Architectures and Some Experimental Settings on DGM

We provide some experimental details of image generation here. Our implementation is based on TensorFlow with an Nvidia 2080 Ti GPU.

For CIFAR-10 and STL-10, we test two architectures: a DC-GAN based architecture [39] and a ResNet based architecture [40, 41]. The DC-GAN based architecture uses a 4-layer convolutional neural network (CNN) as the generator and a 7-layer CNN for h^{t}_{\boldsymbol{\phi}} in (1) and (2). In the ResNet based architecture, the generator and h^{t}_{\boldsymbol{\phi}} are both 10-layer ResNets. For ImageNet, we use the same ResNet based architecture as for CIFAR-10 and STL-10. For CelebA, the generator is a 10-layer ResNet, while h^{t}_{\boldsymbol{\phi}} is a 4-layer CNN.

For CIFAR-10, STL-10 and ImageNet, spectral normalization is used, and we additionally scale the weights after spectral normalization by 2 on CIFAR-10 and STL-10. We set \beta_{1}=0.5, \beta_{2}=0.999 for the Adam optimizer and m=1, n=64 in Algorithm 2. Only one Adam update step is used for solving (3.3.2). The output dimension of h^{t}_{\boldsymbol{\phi}} is set to 16. For all experiments with kernel (2), both f^{t}_{{\bm{\psi}}_{1}} and f^{t}_{{\bm{\psi}}_{2}} are parameterized by 2-layer fully connected neural networks.

For CelebA, we scale the kernel learning objective (3.3.2) by \sigma in (11), as in SMMD. Spectral regularization [26] is used. We set \beta_{1}=0.5, \beta_{2}=0.9 for the Adam optimizer and m=1, j=5, n=64 in Algorithm 2. Only one Adam update step is used for solving (3.3.2). The output dimension of h^{t}_{\boldsymbol{\phi}} is set to 1, because the scaled objective with an h^{t}_{\boldsymbol{\phi}} dimension larger than 1 is time-consuming.

For evaluation, CIFAR-10, STL-10 and ImageNet are evaluated on 100k generated images, while CelebA is evaluated on 50k generated images due to memory limitations.

Appendix H More Results on Image Generation

(a) HK    (b) HK-DK
Figure 2: Generated images on CIFAR-10 (32×32) with DC-GAN architecture.
(a) HK    (b) HK-DK
Figure 3: Generated images on CIFAR-10 (32×32) with ResNet architecture.
(a) HK    (b) HK-DK
Figure 4: Generated images on STL-10 (48×48) with DC-GAN architecture.
(a) HK    (b) HK-DK
Figure 5: Generated images on STL-10 (48×48) with ResNet architecture.
(a) HK    (b) HK-DK
Figure 6: Generated images on ImageNet (64×64).
(a) HK
Figure 7: Generated images on CelebA (160×160).