
Learning Equivariant Energy Based Models
with Equivariant Stein Variational Gradient Descent

Priyank Jaini
Bosch-Delta Lab
University of Amsterdam
Lars Holdijk
Bosch-Delta Lab
University of Amsterdam
Max Welling
Bosch-Delta Lab
University of Amsterdam
Abstract

We focus on the problem of efficient sampling and learning of probability densities by incorporating symmetries in probabilistic models. We first introduce the Equivariant Stein Variational Gradient Descent algorithm, an equivariant sampling method based on Stein's identity for sampling from densities with symmetries. Equivariant SVGD explicitly incorporates the symmetry information of a density through equivariant kernels, which makes the resulting sampler efficient both in terms of sample complexity and the quality of generated samples. Subsequently, we define equivariant energy based models for invariant densities that are learned using contrastive divergence. By utilizing our equivariant SVGD for training equivariant EBMs, we propose new ways of improving and scaling up the training of energy based models. We apply these equivariant energy models to modelling joint densities in regression and classification tasks for image datasets, many-body particle systems and molecular structure generation.

1 Introduction

Many real-world observations exhibit symmetries and admit probabilistic models that are invariant to such symmetry transformations. Overlooking these inductive biases when modelling such domains naturally leads to models with inferior performance. In this paper, we focus on the problem of efficient sampling and learning of invariant probability densities by incorporating symmetries in probabilistic models.

We accomplish this by first proposing the equivariant Stein variational gradient descent algorithm in §3 for sampling from invariant densities. Stein Variational Gradient Descent (SVGD) is a kernel-based inference method that iteratively moves a set of particles along an optimal gradient path in an RKHS to approximate and sample from a target distribution. We extend SVGD to invariant densities by considering equivariant kernel functions that evolve the set of particles such that the density at each time step is invariant to the same symmetry transformations as encoded in the kernel. We demonstrate that equivariant SVGD is more sample efficient, produces a more diverse set of samples, and is more robust than regular SVGD when sampling from invariant densities.

Subsequently, in §4, we build equivariant Energy Based Models (EBMs) for learning invariant densities given access to i.i.d. data by leveraging the tremendous recent advances in geometric deep learning, parameterizing the energy function with an equivariant neural network. We train these equivariant EBMs through contrastive divergence, generating negative samples with equivariant SVGD. We show that incorporating the symmetries present in the data into both the energy model and the sampler provides an efficient learning paradigm to train equivariant EBMs that generalize well beyond the training data.

We empirically demonstrate the performance of equivariant EBMs trained with equivariant SVGD in §5. We consider real-world applications comprising problems from many-body particle systems, molecular structure generation, and classification and generation for image datasets.

2 Preliminaries and Setup

In this section, we set up our main problem, introduce key definitions and notation, and formulate an approach to incorporate symmetries into particle-based variational inference through Stein variational gradient descent. Along the way, we also discuss directly related work and relegate a detailed discussion of previous work to Appendix B.

Let $\mathcal{G}$ be a group acting on $\mathds{R}^{d}$ through a representation $\mathrm{R}:\mathcal{G}\to\mathrm{GL}(d)$, where $\mathrm{GL}(d)$ is the general linear group on $\mathds{R}^{d}$, such that $\forall g\in\mathcal{G}$, $g\to\mathrm{R}_{g}$. Given a target random variable $\mathsf{X}$ taking values in $\mathds{R}^{d}$ with density $\pi$, we say that $\pi$ is $\mathcal{G}$-invariant if $\forall g\in\mathcal{G}$ and $\boldsymbol{x}\in\mathds{R}^{d}$, $\pi(\mathrm{R}_{g}\boldsymbol{x})=\pi(\boldsymbol{x})$. Additionally, a function $f(\cdot)$ is $\mathcal{G}$-equivariant if $\forall g\in\mathcal{G}$ and $\boldsymbol{x}\in\mathds{R}^{d}$, $f(\mathrm{R}_{g}\boldsymbol{x})=\mathrm{R}_{g}f(\boldsymbol{x})$. We denote by $\mathcal{O}(\boldsymbol{x})$ the orbit of an element $\boldsymbol{x}\in\mathsf{X}$, defined as $\mathcal{O}(\boldsymbol{x}):=\{\boldsymbol{x}^{\prime}:\boldsymbol{x}^{\prime}=\mathrm{R}_{g}\boldsymbol{x},\ g\in\mathcal{G}\}$. We call $\pi_{|\mathcal{G}}$ the factorized density of a $\mathcal{G}$-invariant density $\pi$, where $\pi_{|\mathcal{G}}$ has support on the set $\mathsf{X}_{|\mathcal{G}}$ whose elements index the orbits, i.e. if $\boldsymbol{x},\tilde{\boldsymbol{x}}\in\mathsf{X}_{|\mathcal{G}}$ then $\boldsymbol{x}\neq\mathrm{R}_{g}\tilde{\boldsymbol{x}},\ \forall g\in\mathcal{G}$. In this paper, we are interested in incorporating inductive biases given by symmetry groups to develop efficient sampling and learning paradigms for generative modelling. Precisely, we consider the following problems:

(i) Equivariant Learning:

Given access to i.i.d. samples $\lbag\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\rbag\sim\pi$ from a $\mathcal{G}$-invariant density $\pi$, we want to approximate $\pi$. Rezende et al. (2019) and Köhler et al. (2020) addressed this by learning an equivariant normalizing flow (Tabak and Vanden-Eijnden, 2010; Tabak and Turner, 2013; Rezende and Mohamed, 2015) that transforms a simple latent $\mathcal{G}$-invariant density $q_{0}$ to the target density $\pi$ through a series of $\mathcal{G}$-equivariant diffeomorphic transformations $\mathbf{T}=(\mathbf{T}_{1},\mathbf{T}_{2},\cdots,\mathbf{T}_{k})$, i.e. $\pi:=\mathbf{T}_{\#}q_{0}$. They achieved this by proving (cf. Köhler et al., 2020, Theorem 1; Rezende et al., 2019, Lemma 1) that if $q_{0}$ is a $\mathcal{G}$-invariant density on $\mathds{R}^{d}$, $\mathcal{F}$ is a proper subgroup of $\mathcal{G}$, i.e. $\mathcal{F}<\mathcal{G}$, and $\mathbf{T}$ is an $\mathcal{F}$-equivariant diffeomorphic transformation, then $\pi:=\mathbf{T}_{\#}q_{0}$ is $\mathcal{F}$-invariant. However, a major drawback of this formalism is that it requires $\mathbf{T}$ not only to be an $\mathcal{F}$-equivariant diffeomorphism, but also to have a cheaply computable inverse and Jacobian. This is problematic in practice.

Köhler et al. (2020) overcame this issue by using continuous normalizing flows (Grathwohl et al., 2018) that define a dynamical system through a time-dependent Lipschitz velocity field $\Psi:\mathds{R}^{d}\times\mathds{R}_{+}\to\mathds{R}^{d}$ with the following system of ordinary differential equations (ODEs):

$$\frac{\mathop{}\!\mathrm{d}\boldsymbol{x}(t)}{\mathop{}\!\mathrm{d}t}=\Psi(\boldsymbol{x}(t),t),\qquad\boldsymbol{x}(0)=\boldsymbol{z} \qquad (1)$$

This allows one to define a bijective function $\mathbf{T}_{\Psi,t}(\boldsymbol{z}):=\boldsymbol{x}(0)+\int_{0}^{t}\Psi(\boldsymbol{x}(s),s)\mathop{}\!\mathrm{d}s$, which leads to a push-forward density $q_{t}$ at each time step $t$ satisfying $\frac{\mathop{}\!\mathrm{d}\log q_{t}}{\mathop{}\!\mathrm{d}t}=-\mathsf{div}\big(\Psi(\boldsymbol{x}(t),t)\big)$. This implies the following important result:

Lemma 1 (Köhler et al., 2020, Theorem 2).

Let $\Psi$ be an $\mathcal{F}$-equivariant vector field on $\mathds{R}^{d}$. Then the transformation $\mathbf{T}_{\Psi,t}(\boldsymbol{z}):=\boldsymbol{x}(0)+\int_{0}^{t}\Psi(\boldsymbol{x}(s),s)\mathop{}\!\mathrm{d}s$ is $\mathcal{F}$-equivariant $\forall t\in\mathds{R}_{+}$. Furthermore, the push-forward $q_{t}:=\mathbf{T}_{\Psi,t,\#}q_{0}$ is $\mathcal{F}$-invariant $\forall t$ if $q_{0}$ is $\mathcal{G}$-invariant and $\mathcal{F}<\mathcal{G}$.

Lemma 1 conveniently provides a framework to transform any $\mathcal{G}$-invariant density into an $\mathcal{F}$-invariant density along a path in which each intermediate density is also $\mathcal{F}$-invariant. However, equivariant normalizing flows cannot be used directly to generate samples when given access only to an invariant density $\pi$, since they require i.i.d. samples from $\pi$ to train the flow. (In Appendix A, we discuss a way to use equivariant normalizing flows for direct sampling given access to $\pi$.)
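To make Lemma 1 concrete, the short sketch below (our own illustrative code, not part of the cited works) integrates an $\mathsf{SO}(2)$-equivariant vector field with a simple Euler scheme and checks numerically that the resulting flow map commutes with rotations; the particular field, step count and tolerances are arbitrary choices.

```python
import torch

def flow_map(psi, z, t=1.0, n_steps=100):
    """Euler integration of dx/dt = psi(x, t); a crude sketch of T_{psi, t}(z)."""
    x, dt = z.clone(), t / n_steps
    for i in range(n_steps):
        x = x + dt * psi(x, i * dt)
    return x

# An SO(2)-equivariant vector field: it depends only on the radius and points radially.
psi = lambda x, t: x * (1.0 - x.norm(dim=-1, keepdim=True))

R = torch.tensor([[0.0, -1.0], [1.0, 0.0]])   # rotation by 90 degrees
z = torch.randn(5, 2)
# Equivariance of the flow map (Lemma 1): T(R z) should equal R T(z) up to integration error.
print(torch.allclose(flow_map(psi, z @ R.T), flow_map(psi, z) @ R.T, atol=1e-5))
```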

(ii) Equivariant Sampling:

In this paper, we are also interested in solving the inference problem, i.e. evaluating $\mathbb{E}_{\pi}[f]$, the expectation of $f$ under a $\mathcal{G}$-invariant density $\pi$, which typically involves generating samples $\lbag\boldsymbol{x}_{1},\boldsymbol{x}_{2},\cdots,\boldsymbol{x}_{n}\rbag\sim\pi$. Intuitively, sampling from a $\mathcal{G}$-invariant density can be reduced to sampling from its corresponding factorized distribution $\pi_{|\mathcal{G}}$: any set of samples $\{\tilde{\boldsymbol{x}}_{i}\}_{i=1}^{n}\sim\pi_{|\mathcal{G}}$ can be turned into samples representing $\pi$ by applying group actions from $\mathcal{G}$ to $\{\tilde{\boldsymbol{x}}_{i}\}_{i=1}^{n}$. Indeed, sampling methods like Markov Chain Monte Carlo (MCMC) (Brooks et al., 2011) or Hybrid Monte Carlo (HMC) (Neal et al., 2011) and their variants can, in principle, use this paradigm to sample from an invariant density $\pi$. However, MCMC methods for approximate posterior sampling are often slow, and it remains challenging to scale them to big-data settings. An alternative to MCMC methods for approximate posterior sampling is Stein Variational Gradient Descent (SVGD) (Liu and Wang, 2016), a particle-optimization variational inference method that combines the paradigms of sampling and variational inference for Bayesian inference problems.

In SVGD, a set of $n$ particles $\{\boldsymbol{x}_{i}\}_{i=1}^{n}\in\mathsf{X}\subseteq\mathds{R}^{d}$ is evolved through a dynamical system to approximate the target (posterior) density $\pi(\boldsymbol{x})\propto\mathsf{exp}\big(-\mathrm{E}(\boldsymbol{x})\big)$, where $\mathrm{E}(\cdot)$ is the energy function. This is achieved in a series of $T$ discrete steps that transform the set of particles $\{\boldsymbol{x}_{i}^{0}\}_{i=1}^{n}\sim q_{0}(\boldsymbol{x})$, sampled from a base distribution $q_{0}$ (e.g. a Gaussian) at $t=0$, using the map $\boldsymbol{x}^{t}=\mathbf{T}(\boldsymbol{x}):=\boldsymbol{x}^{t-1}+\varepsilon\cdot\Psi(\boldsymbol{x}^{t-1})$, where $\varepsilon$ is the step size and $\Psi(\cdot)$ is a vector field. $\Psi(\cdot)$ is chosen such that it maximally decreases the KL divergence between the push-forward density $q_{t}(\boldsymbol{x})=\mathbf{T}_{\#}q_{t-1}(\boldsymbol{x})$ and the target $\pi(\boldsymbol{x})$.

If $\Psi$ is restricted to the unit ball of an RKHS $\mathcal{H}^{d}_{k}$ with positive definite kernel $k:\mathds{R}^{d}\times\mathds{R}^{d}\to\mathds{R}$, the direction of steepest descent that maximizes the negative gradient of the KL divergence is given by:

$$\Psi^{*}_{q,\pi}(\boldsymbol{x}):=\operatorname*{arg\,max}_{\Psi\in\mathcal{H}^{d}_{k}}-\nabla_{\varepsilon}\mathsf{KL}\big(q\,||\,\pi\big)\big|_{\varepsilon\to 0}=\mathbb{E}_{\boldsymbol{x}\sim q}\big[\mathsf{trace}\big(\mathcal{A}_{\pi}\Psi(\boldsymbol{x})\big)\big], \qquad (2)$$

where $\mathcal{A}_{\pi}\Psi(\boldsymbol{x})=\nabla_{\boldsymbol{x}}\log\pi(\boldsymbol{x})\Psi(\boldsymbol{x})^{\top}+\nabla_{\boldsymbol{x}}\Psi(\boldsymbol{x})$ is the Stein operator. Thus, an iterative paradigm can easily be implemented wherein a set of particles $\{\boldsymbol{x}^{0}_{1},\boldsymbol{x}^{0}_{2},\cdots,\boldsymbol{x}^{0}_{n}\}\sim q_{0}$ is transformed to approximate the target density $\pi(\cdot)$ using the optimal update $\Psi^{*}_{q,\pi}(\boldsymbol{x})\propto\mathbb{E}_{\boldsymbol{x}^{\prime}\sim q}[\mathcal{A}_{\pi}k(\boldsymbol{x}^{\prime},\boldsymbol{x})]$. Since $\mathcal{A}_{\pi}\Psi(\boldsymbol{x})=\nabla_{\boldsymbol{x}}[\pi(\boldsymbol{x})\Psi(\boldsymbol{x})]/\pi(\boldsymbol{x})$, we have $\mathbb{E}_{\boldsymbol{x}\sim\pi}[\mathcal{A}_{\pi}\Psi(\boldsymbol{x})]=0$ for any $\Psi$, implying convergence when $q=\pi$. Replacing the expectation in the update with a Monte Carlo sum over the current set of particles representing $q$, we get:

$$\boldsymbol{x}^{t+1}_{i}\leftarrow\boldsymbol{x}^{t}_{i}+\varepsilon\tilde{\Psi}^{*}(\boldsymbol{x}^{t}_{i}),\quad\text{where}\quad\tilde{\Psi}^{*}(\boldsymbol{x}^{t}_{i}):=\frac{1}{n}\sum_{j=1}^{n}\Big(\underbrace{\nabla_{\boldsymbol{x}^{t}_{j}}k(\boldsymbol{x}_{j}^{t},\boldsymbol{x}_{i})}_{\text{repulsive force}}-\underbrace{k(\boldsymbol{x}_{j}^{t},\boldsymbol{x}_{i})\cdot\nabla_{\boldsymbol{x}_{j}^{t}}\mathrm{E}(\boldsymbol{x}^{t}_{j})}_{\text{attractive force}}\Big) \qquad (3)$$

Intuitively, Stein variational gradient descent encourages diversity among particles and explores different modes of the target distribution $\pi$ through a combination of the second term in Equation 3, which attracts the particles to high-density regions using the score function, and the repulsive force (first term), which ensures the particles do not collapse together. In the continuous-time limit, as $\varepsilon\to 0$, Equation 3 yields a system of ordinary differential equations describing the evolution of the particles $\{\boldsymbol{x}^{0}_{1},\boldsymbol{x}^{0}_{2},\cdots,\boldsymbol{x}^{0}_{n}\}$ according to $\frac{\mathop{}\!\mathrm{d}\boldsymbol{x}}{\mathop{}\!\mathrm{d}t}=\tilde{\Psi}^{*}(\boldsymbol{x})$.
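As an illustration, a minimal PyTorch sketch of one vanilla SVGD update with a scalar RBF kernel might look as follows; the function names, bandwidth and step size are illustrative assumptions, not the authors' implementation.

```python
import torch

def svgd_step(x, energy_fn, step_size=1e-2, bandwidth=0.2):
    """One vanilla SVGD update (Equation 3) with an RBF kernel.

    x         : (n, d) tensor of particles
    energy_fn : callable mapping (n, d) -> (n,) energies E(x)
    """
    x = x.detach().requires_grad_(True)
    score = -torch.autograd.grad(energy_fn(x).sum(), x)[0]   # grad log pi(x) = -grad E(x)

    with torch.no_grad():
        diff = x.unsqueeze(1) - x.unsqueeze(0)                # diff[i, j] = x_i - x_j
        sq_dist = (diff ** 2).sum(-1)
        K = torch.exp(-sq_dist / (2 * bandwidth ** 2))        # k(x_i, x_j)
        attract = K @ score                                   # sum_j k(x_j, x_i) * score(x_j)
        repulse = (diff * K.unsqueeze(-1)).sum(1) / bandwidth ** 2   # sum_j grad_{x_j} k(x_j, x_i)
        phi = (attract + repulse) / x.shape[0]
        return (x + step_size * phi).detach()
```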

Furthermore, as shown in Wang et al. (2019), geometric information in the form of pre-conditioning matrices can be incorporated into Equation 3 by using matrix-valued kernels (cf. Definition 2.3 of Reisert and Burkhardt, 2007), leading to the following generalized form of SVGD (Wang et al., 2019):

$$\boldsymbol{x}^{t+1}_{i}\leftarrow\boldsymbol{x}^{t}_{i}+\frac{\varepsilon}{n}\sum_{j=1}^{n}\Big(\nabla_{\boldsymbol{x}^{t}_{j}}\boldsymbol{K}(\boldsymbol{x}_{j}^{t},\boldsymbol{x}_{i})-\boldsymbol{K}(\boldsymbol{x}_{j}^{t},\boldsymbol{x}_{i})\cdot\nabla_{\boldsymbol{x}_{j}^{t}}\mathrm{E}(\boldsymbol{x}^{t}_{j})\Big), \qquad (4)$$

where $\boldsymbol{K}(\boldsymbol{x},\boldsymbol{x}^{\prime})$ is a matrix-valued kernel. Matrix-valued SVGD flexibly incorporates preconditioning matrices, which accelerates exploration of the given probability landscape.

SVGD has gained a lot of attention over the past few years as a flexible and scalable alternative to MCMC methods for approximate Bayesian posterior sampling. Furthermore, it is more particle-efficient since it generates diverse particles through the deterministic repulsive force induced by the kernel rather than Monte Carlo randomness. A natural question to ask is: can we incorporate symmetry information into SVGD for more efficient sampling from invariant densities? We answer this in the affirmative in the next section by proposing the equivariant Stein variational gradient descent algorithm for sampling from invariant densities.

3 Equivariant Stein Variational Gradient Descent

We begin by presenting the main result of this section, which introduces equivariant Stein variational gradient descent (E-SVGD) by utilizing Lemma 1 and Equations (3) and (4).

Proposition 1.

Let $\pi$ be a $\mathcal{G}$-invariant density and $\lbag\boldsymbol{x}^{0}_{1},\boldsymbol{x}^{0}_{2},\cdots,\boldsymbol{x}^{0}_{n}\rbag\sim q_{0}$ be a set of particles at $t=0$, with $q_{0}$ being $\mathcal{F}$-invariant where $\mathcal{F}>\mathcal{G}$. Then the iterative update given by Equation 3 is $\mathcal{G}$-equivariant, and the density $q_{t+1}$ it defines at time $t+1$ is $\mathcal{G}$-invariant, if the positive definite kernel $k(\cdot,\cdot)$ is $\mathcal{G}$-invariant. The same holds for Equation 4 if $\boldsymbol{K}(\cdot,\cdot)$ is $\mathcal{G}$-equivariant.

Proof.

Since the initial distribution $q_{0}$ is $\mathcal{F}$-invariant, following Lemma 1 the update in Equation 3 is $\mathcal{G}$-equivariant if $\Psi$ is $\mathcal{G}$-equivariant. If $k(\cdot,\cdot)$ is $\mathcal{G}$-invariant then $\nabla_{\boldsymbol{x}}k(\cdot,\boldsymbol{x})$ is $\mathcal{G}$-equivariant. Furthermore, since $\pi\propto\mathsf{exp}\big(-\mathrm{E}(\boldsymbol{x})\big)$ is $\mathcal{G}$-invariant, $\nabla_{\boldsymbol{x}}\mathrm{E}(\boldsymbol{x})$ is also $\mathcal{G}$-equivariant. Thus, both terms of $\Psi$ are $\mathcal{G}$-equivariant if $k(\cdot,\cdot)$ is $\mathcal{G}$-invariant, making the update in Equation 3 $\mathcal{G}$-equivariant. The result follows similarly for Equation 4 when $\boldsymbol{K}(\cdot,\cdot)$ is $\mathcal{G}$-equivariant. ∎

Following Proposition 1, we call the updates in Equations (3) and (4) equivariant Stein variational gradient descent when the kernel $k(\cdot,\cdot)$ is invariant (respectively, $\boldsymbol{K}(\cdot,\cdot)$ is equivariant) and the initial set of particles $\lbag\boldsymbol{x}^{0}_{1},\cdots,\boldsymbol{x}^{0}_{n}\rbag$ is sampled from an invariant density $q_{0}$. Thus, all that is required to sample from a $\mathcal{G}$-invariant density $\pi$ using equivariant SVGD is to construct a positive definite kernel that is $\mathcal{G}$-invariant (or $\mathcal{G}$-equivariant in the matrix-valued case). Let us next give a few examples of constructing such invariant and equivariant positive definite kernels.

Example 1 (Invariant scalar kernel).

Let $\mathcal{G}$ be a finite group acting on $\mathds{R}^{d}$ with representation $\mathrm{R}$ such that $\forall g\in\mathcal{G}$, $g\to\mathrm{R}_{g}$. Then,

$$k_{\mathcal{G}}(\boldsymbol{x},\boldsymbol{x}^{\prime})=\sum_{\boldsymbol{y}\in\mathcal{O}(\boldsymbol{x})}\sum_{\boldsymbol{y}^{\prime}\in\mathcal{O}(\boldsymbol{x}^{\prime})}k(\boldsymbol{y},\boldsymbol{y}^{\prime})$$

is $\mathcal{G}$-invariant, where $k(\cdot,\cdot)$ is some positive-definite kernel. While this provides a general method to construct invariant kernels for finite groups, the double summation can be computationally expensive. In practice, simple kernels like the RBF kernel (for rotation symmetries) or a uniform kernel usually suffice as more practical alternatives.
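For a concrete (hypothetical) instance of Example 1, the sketch below builds the group-averaged kernel for the finite group $C_{4}$ acting on $\mathds{R}^{2}$ by explicitly summing an RBF kernel over both orbits; the helper names and bandwidth are our own illustrative choices.

```python
import torch

def c4_orbit(x):
    """Orbit of 2-D points under the C4 rotation group: x, R90 x, R180 x, R270 x."""
    rot90 = torch.tensor([[0.0, -1.0], [1.0, 0.0]])
    orbit = [x]
    for _ in range(3):
        orbit.append(orbit[-1] @ rot90.T)
    return torch.stack(orbit)                      # (4, n, 2)

def rbf(a, b, bandwidth=0.2):
    sq = ((a.unsqueeze(-2) - b.unsqueeze(-3)) ** 2).sum(-1)
    return torch.exp(-sq / (2 * bandwidth ** 2))

def c4_invariant_kernel(x, y, bandwidth=0.2):
    """Group-averaged kernel of Example 1 for |G| = 4: sum of k over both orbits."""
    ox, oy = c4_orbit(x), c4_orbit(y)              # (4, n, 2), (4, m, 2)
    K = 0.0
    for a in ox:
        for b in oy:
            K = K + rbf(a, b, bandwidth)
    return K                                       # (n, m), C4-invariant in either argument
```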

Example 1 is restricted to finite groups and does not directly apply to continuous symmetry groups. We can construct kernels for continuous groups following Example 1 either by using a Monte Carlo approximation or by using a transformation that performs computations in the factorized space $\mathsf{X}_{|\mathcal{G}}$, as we show in the next example.

Example 2 (Continuous Symmetry Groups).

Let $\pi(\boldsymbol{x})$ be $\mathsf{SO}(2)$-invariant (cf. Figure 3(b) for an example), where $\boldsymbol{x}\in\mathds{R}^{2}$, i.e. $\mathcal{O}(\boldsymbol{x}):=\{\boldsymbol{x}^{\prime}:\|\boldsymbol{x}\|=\|\boldsymbol{x}^{\prime}\|\}$. We can construct an invariant kernel for sampling from $\pi$ using a Monte Carlo approximation by sampling random rotations, i.e.

$$k_{\mathcal{G}}(\boldsymbol{x},\boldsymbol{x}^{\prime})=\sum_{i,j=1}^{n}k(g_{j}\boldsymbol{x},g_{i}\boldsymbol{x}^{\prime}),\qquad g_{i},g_{j}\in\mathcal{G},~\forall(i,j)\in[n]\times[n]$$

Alternatively, we can consider the function $\Phi_{\mathcal{G}}:\mathds{R}^{2}\to\mathds{R}$ such that $\Phi_{\mathcal{G}}(\boldsymbol{x})=\|\boldsymbol{x}\|$. Then $\Phi_{\mathcal{G}}(\boldsymbol{x})$ is $\mathsf{SO}(2)$-invariant since $\Phi_{\mathcal{G}}(g\boldsymbol{x})=\Phi_{\mathcal{G}}(\boldsymbol{x}),\ \forall g\in\mathcal{G}$. Thus, we can use the following kernel:

$$k_{\mathcal{G}}(\boldsymbol{x},\boldsymbol{x}^{\prime})=k\big(\Phi_{\mathcal{G}}(\boldsymbol{x}),\Phi_{\mathcal{G}}(\boldsymbol{x}^{\prime})\big)$$
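The second construction of Example 2 is particularly cheap: the kernel is a scalar kernel evaluated on the invariant feature $\Phi_{\mathcal{G}}(\boldsymbol{x})=\|\boldsymbol{x}\|$. A minimal sketch, assuming an RBF base kernel, could read:

```python
import torch

def so2_invariant_kernel(x, y, bandwidth=0.2):
    """SO(2)-invariant kernel built on the invariant feature Phi(x) = ||x||."""
    rx = x.norm(dim=-1, keepdim=True)      # (n, 1), Phi(x)
    ry = y.norm(dim=-1, keepdim=True)      # (m, 1), Phi(x')
    sq = (rx - ry.T) ** 2                  # (n, m) squared radial differences
    return torch.exp(-sq / (2 * bandwidth ** 2))
```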

Examples 1 and 2 both define invariant scalar kernels. Let us next give an example of an equivariant matrix-valued kernel for matrix-valued SVGD (cf. Equation 4).

Figure 1: Sample efficiency
Example 3 (Equivariant Matrix-Valued Kernels; Reisert and Burkhardt, 2007).

Following Reisert and Burkhardt (2007), we can also construct a $\mathcal{G}$-equivariant matrix-valued kernel for the generalized update in Equation 4 by defining:

$$\boldsymbol{K}(\boldsymbol{x},\boldsymbol{x}^{\prime})=\int_{\mathcal{G}}k(\boldsymbol{x},g\boldsymbol{x}^{\prime})\,\mathrm{R}_{g}\,\mathop{}\!\mathrm{d}g$$

where $\mathrm{R}_{g}$ is a group representation and $k(\cdot,\cdot)$ is a scalar, symmetric, $\mathcal{G}$-invariant function. It is easy to check that $\boldsymbol{K}(\boldsymbol{x},\boldsymbol{x}^{\prime})$ is equivariant in the first argument and anti-equivariant in the second argument, leading to an equivariant $\boldsymbol{K}(\boldsymbol{x},\boldsymbol{x}^{\prime})$ (cf. Proposition 2.2 of Reisert and Burkhardt, 2007).
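For compact groups, the integral over $\mathcal{G}$ can be approximated by Monte Carlo. The sketch below illustrates this for $\mathsf{SO}(2)$ acting on $\mathds{R}^{2}$ with an RBF base kernel; the sample size and helper names are assumptions for illustration only.

```python
import math
import torch

def random_rotations_2d(n):
    """n rotation matrices R_g sampled uniformly from SO(2)."""
    theta = torch.rand(n) * 2 * math.pi
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([torch.stack([c, -s], -1), torch.stack([s, c], -1)], -2)   # (n, 2, 2)

def equivariant_matrix_kernel(x, y, n_samples=64, bandwidth=0.2):
    """Monte Carlo estimate of K(x, x') = int_G k(x, g x') R_g dg for two points in R^2."""
    R = random_rotations_2d(n_samples)                                   # (n_samples, 2, 2)
    gy = torch.einsum('sij,j->si', R, y)                                 # rotated copies R_g x'
    w = torch.exp(-((x - gy) ** 2).sum(-1) / (2 * bandwidth ** 2))       # k(x, R_g x')
    return torch.einsum('s,sij->ij', w, R) / n_samples                   # (2, 2) matrix kernel
```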

Advantages of Equivariant Sampling:

As we discussed briefly in Section 2, SVGD evolves a set of particles through a dynamical system governed by a combination of attractive and repulsive forces between the particles, whose strength depends on the inter-particle distance. Thus, a particle exerts these forces only in a restricted neighbourhood around it. Equivariant SVGD, on the other hand, is able to model long-range interactions among particles due to the equivariant kernel. Intuitively, in equivariant SVGD any point $\boldsymbol{x}$ can exert these forces on any other point $\boldsymbol{x}^{\prime}$ if $\boldsymbol{x}^{\prime}$ is in the neighbourhood of any point in the orbit $\mathcal{O}(\boldsymbol{x})$ of $\boldsymbol{x}$, because the repulsive and attractive force terms in Equations (3) and (4) are identical for all points in the orbit $\mathcal{O}(\boldsymbol{x})$. This ability to capture long-range interactions makes equivariant Stein variational gradient descent more efficient in sample complexity and running time, improves sample quality, and makes it more robust to different initial configurations of the particles compared to vanilla SVGD. We illustrate these properties with the following examples:

(i) $C_{4}$-Gaussians (cf. Figures 3(a) and 3(c)): This example consists of four Gaussians invariant to the $C_{4}$ symmetry group. In this case, the group-factorized distribution $\pi_{|C_{4}}$ is a Gaussian, and the original $C_{4}$-invariant density is obtained by rotating $\pi_{|C_{4}}$ through the set $\{0^{\circ},90^{\circ},180^{\circ},270^{\circ}\}$. In Figure 3(a), the first column shows the samples generated by equivariant SVGD, the second column is the projection of these samples onto the group-factorized space $\mathsf{X}_{|C_{4}}$, and the third column shows the samples obtained by rotating the original samples through the $C_{4}$ symmetry group. Figure 3(c) shows the same setup for vanilla SVGD.

Figure 2: Robustness

(ii) Concentric Circles (cf. Figures 3(b) and 3(d)): This example comprises two concentric circles invariant to the $\mathsf{SO}(2)$ symmetry group. In this case, the group-factorized space is a union of two disconnected line segments with length equal to the thickness of the circles. In Figure 3(b), the first column shows the samples generated by equivariant SVGD and the second column is the projection of these samples onto the group-factorized space $\mathsf{X}_{|\mathsf{SO}(2)}$. Figure 3(d) shows the same setup for vanilla SVGD.

We keep the experimental setup, i.e. the number of particles and the number of iterations, exactly the same for both vanilla SVGD and equivariant SVGD. For both examples, it may seem that the original samples from vanilla SVGD capture the target distribution better than those of the equivariant counterpart (first column of Figs. 3(a)-3(d)). However, projecting the samples onto the factorized space (second column of the aforementioned figures) shows that equivariant SVGD captures the target density more faithfully than vanilla SVGD. Furthermore, due to its ability to model long-range interactions, we see in both examples that in the projected space of the invariant sampler the samples are not clumped together, whereas for vanilla SVGD most samples end up in configurations where they reside in the same orbit. This phenomenon is most evident in the concentric circles example, where samples from vanilla SVGD spread over the high-density region along the two circles, so that all samples end up on top of each other in the factorized space, demonstrating the sampler's inability to capture the distribution. Invariant SVGD, on the other hand, prevents any sample from residing on the same orbit as another sample due to the long-range repulsive force of the equivariant kernel, allowing it to sample more faithfully from invariant densities.

Secondly, we study the effect of increasing the number of particles used by vanilla SVGD on the two concentric circles example. In Figure 1, we plot the average log-likelihood of the particles from vanilla SVGD and from invariant SVGD as a function of the number of iterations and compare it to the ground-truth average log-likelihood. We run vanilla SVGD with up to 32 times more particles than invariant SVGD. As is evident from the plot, invariant SVGD converges to its final configuration within the first 100 iterations, with an average log-likelihood closely matching the ground truth. Vanilla SVGD, on the other hand, is unable to converge to the ground truth even with 32 times more particles and 5000 iterations, due to its inability to let particles interact at longer distances.

Finally, we study the effect of different initial particle configurations on the performance of vanilla and invariant SVGD in Figure 2 for the $C_{4}$-Gaussians example. As shown by Zhuo et al. (2018), Zhang et al. (2020) and D'Angelo and Fortuin (2021), the particles in vanilla SVGD have a tendency to collapse to the few local modes closest to the initial distribution of the particles. We test the robustness of invariant SVGD to initial particle distributions localized in different regions of the space. We plot the average log-likelihood of the converged samples for both invariant and vanilla SVGD over all random initializations in Figure 2 and compare it to the ground-truth average log-likelihood. The plot illustrates that equivariant SVGD is more robust to the initial distribution of particles than vanilla SVGD. Nevertheless, if the group-factorized space is multi-modal, equivariant SVGD might still exhibit a tendency to favour one of the modes. However, this can easily be alleviated by either adding noise to the SVGD update, as proposed by Zhang et al. (2020) in the spirit of SGLD (Welling and Teh, 2011), or using an annealing strategy (D'Angelo and Fortuin, 2021).

(a) $C_{4}$ Gaussians: Invariant SVGD sampling. (b) Two Circles: Invariant SVGD sampling. (c) $C_{4}$ Gaussians: Vanilla SVGD sampling. (d) Two Circles: Vanilla SVGD sampling.
Figure 3: Recommended to view in color. 3(a) (Left to Right): Original samples from E-SVGD, samples projected onto the group-factorized space, and samples obtained after applying group actions to the original samples. Yellow, green and blue samples represent the original samples rotated by $90^{\circ}$, $180^{\circ}$ and $270^{\circ}$ respectively. 3(b) (Left to Right): Original samples from E-SVGD and samples projected onto the group-factorized space. 3(c)-3(d): Same as 3(a)-3(b) but for vanilla SVGD.

4 Equivariant Joint Energy Model

In Section 3, we developed the equivariant Stein variational gradient descent algorithm for sampling from invariant densities. In this section, we leverage the tremendous recent advances in geometric deep learning (Cohen and Welling, 2016; Dieleman et al., 2016; Bronstein et al., 2021) to propose equivariant energy based models that are trained contrastively, using our proposed equivariant Stein variational gradient descent algorithm, to learn invariant (unnormalized) densities $\pi$ given access to i.i.d. samples $\lbag\boldsymbol{x}_{1},\boldsymbol{x}_{2},\cdots,\boldsymbol{x}_{n}\rbag\sim\pi$.

Given a set of samples $\lbag\boldsymbol{x}_{1},\boldsymbol{x}_{2},\cdots,\boldsymbol{x}_{n}\rbag\subseteq\mathds{R}^{d}$, energy-based models (LeCun et al., 2006) learn an energy function $\mathrm{E}_{\boldsymbol{\theta}}:\mathds{R}^{d}\to\mathds{R}$ that defines a probability distribution $\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x})=\mathsf{exp}\big(-\mathrm{E}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)/Z_{\boldsymbol{\theta}}$, where $Z_{\boldsymbol{\theta}}=\int\mathsf{exp}\big(-\mathrm{E}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)\mathop{}\!\mathrm{d}\boldsymbol{x}$ is the partition function. Unlike other popular tractable density models such as normalizing flows, EBMs are less restrictive in the parameterization of the functional form of $\tilde{\pi}_{\boldsymbol{\theta}}(\cdot)$: since the energy function does not need to integrate to one, it can be parameterized using any nonlinear function. Conveniently, if $\pi$ is $\mathcal{G}$-invariant, we can use existing equivariant deep network architectures to parameterize $\mathrm{E}_{\boldsymbol{\theta}}(\cdot)$ and thereby encode the symmetries into the energy network. Such an equivariant energy network defines an equivariant energy based model. EBMs are usually trained by maximizing the log-likelihood of the data under the given model:

$$\boldsymbol{\theta}^{*}:=\operatorname*{arg\,min}_{\boldsymbol{\theta}}\mathcal{L}_{\mathsf{ML}}(\boldsymbol{\theta}),\qquad\text{where}\qquad\mathcal{L}_{\mathsf{ML}}(\boldsymbol{\theta})=\mathbb{E}_{\boldsymbol{x}\sim\pi}\big[-\log\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x})\big] \qquad (5)$$

However, evaluating $Z_{\boldsymbol{\theta}}$ is intractable for most (useful) choices of $\mathrm{E}_{\boldsymbol{\theta}}(\cdot)$, which makes learning EBMs via maximum likelihood estimation problematic. Contrastive divergence (Hinton et al., 2006) provides a paradigm to learn EBMs via maximum likelihood without computing $Z_{\boldsymbol{\theta}}$, by approximating the gradient $\nabla_{\boldsymbol{\theta}}\mathcal{L}_{\mathsf{ML}}(\boldsymbol{\theta})$ of Equation 5 as follows:

$$\nabla_{\boldsymbol{\theta}}\mathcal{L}_{\mathsf{ML}}(\boldsymbol{\theta})\approx\mathbb{E}_{\boldsymbol{x}^{+}\sim\pi}\big[\nabla_{\boldsymbol{\theta}}\mathrm{E}_{\boldsymbol{\theta}}(\boldsymbol{x}^{+})\big]-\mathbb{E}_{\boldsymbol{x}^{-}\sim\tilde{\pi}_{\boldsymbol{\theta}}}\big[\nabla_{\boldsymbol{\theta}}\mathrm{E}_{\boldsymbol{\theta}}(\boldsymbol{x}^{-})\big] \qquad (6)$$

Intuitively, the gradient in Equation 6 drives the model to assign higher energy to the negative samples $\boldsymbol{x}^{-}$ sampled from the current model and lower energy to the positive samples $\boldsymbol{x}^{+}$, which are data points from the target distribution. Since training an EBM using MLE requires sampling from the current model $\tilde{\pi}_{\boldsymbol{\theta}}$, successful training of EBMs relies heavily on sampling strategies that mix quickly. Fortuitously, since $\mathrm{E}_{\boldsymbol{\theta}}(\cdot)$ in our present setting is $\mathcal{G}$-invariant, we propose to use our equivariant sampler for more efficient training of the equivariant energy based model (compared to using a regular sampler with no encoded symmetries).

Input: $\lbag\boldsymbol{x}^{+}_{1},\boldsymbol{x}^{+}_{2},\cdots,\boldsymbol{x}^{+}_{m}\rbag\sim\pi(\boldsymbol{x})$
while not converged do
       $\triangleright$ Generate samples from the current equivariant model $\mathrm{E}_{\boldsymbol{\theta}}$: $\lbag\boldsymbol{x}^{-}_{1},\boldsymbol{x}^{-}_{2},\cdots,\boldsymbol{x}^{-}_{m}\rbag=\mathsf{EquivariantSVGD}(\mathrm{E}_{\boldsymbol{\theta}})$ ;
       $\triangleright$ Optimize the objective $\mathcal{L}_{\mathsf{ML}}(\boldsymbol{\theta})$: $\Delta\boldsymbol{\theta}\leftarrow\sum_{i=1}^{m}\nabla_{\boldsymbol{\theta}}\mathrm{E}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}^{+})-\nabla_{\boldsymbol{\theta}}\mathrm{E}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}^{-})$ ;
       $\triangleright$ Update $\boldsymbol{\theta}$ using $\Delta\boldsymbol{\theta}$ and the Adam optimizer
end while
Algorithm 1: Equivariant EBM training
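A PyTorch-style sketch of Algorithm 1 is given below; the function signatures and optimizer settings are illustrative assumptions, and `svgd_sampler` stands in for an equivariant SVGD routine such as the one described in Section 3.

```python
import torch

def train_equivariant_ebm(energy_net, data_loader, svgd_sampler, n_epochs=100, lr=1e-3):
    """Contrastive-divergence training loop following Algorithm 1 (illustrative sketch).

    energy_net   : equivariant network mapping (batch, d) -> (batch,) energies
    svgd_sampler : callable producing negative samples from the current energy model
    """
    optimizer = torch.optim.Adam(energy_net.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x_pos in data_loader:                                        # positive samples
            x_neg = svgd_sampler(energy_net, x_pos.shape[0]).detach()    # negative samples
            # Equation 6: push down the energy of data, push up the energy of model samples
            loss = energy_net(x_pos).mean() - energy_net(x_neg).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return energy_net
```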

Additionally, following Grathwohl et al. (2019), we can extend equivariant energy based models to equivariant joint energy models. Let $\{(\boldsymbol{x}_{1},y_{1}),(\boldsymbol{x}_{2},y_{2}),\cdots,(\boldsymbol{x}_{n},y_{n})\}\subseteq\mathds{R}^{d}\times[K]$ be a set of samples with observations $\boldsymbol{x}_{i}$ and labels $y_{i}$. Given a parametric function $f_{\boldsymbol{\theta}}:\mathds{R}^{d}\to\mathds{R}^{K}$, a classifier uses the conditional distribution $\tilde{\pi}_{\boldsymbol{\theta}}(y|\boldsymbol{x})\propto\mathsf{exp}(f_{\boldsymbol{\theta}}(\boldsymbol{x})[y])$, where $f_{\boldsymbol{\theta}}(\boldsymbol{x})[y]$ is the logit corresponding to the $y^{\text{th}}$ class label. As shown by Grathwohl et al. (2019), these logits can be used to define the joint density $\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x},y)$ and marginal density $\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x})$ as follows:

$$\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x},y)=\frac{\mathsf{exp}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x})[y]\big)}{Z_{\boldsymbol{\theta}}},\quad\text{and}\quad\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x})=\frac{\sum_{y}\mathsf{exp}\big(f_{\boldsymbol{\theta}}(\boldsymbol{x})[y]\big)}{Z_{\boldsymbol{\theta}}} \qquad (7)$$

Hence, the energy function at a point $\boldsymbol{x}$ is given by $\mathrm{E}_{\boldsymbol{\theta}}(\boldsymbol{x})=-\log\sum_{y}\mathsf{exp}(f_{\boldsymbol{\theta}}(\boldsymbol{x})[y])$, with joint energy $\mathrm{E}_{\boldsymbol{\theta}}(\boldsymbol{x},y)=-f_{\boldsymbol{\theta}}(\boldsymbol{x})[y]$. In our setting, the joint distribution $\pi(\boldsymbol{x},y)$ is $\mathcal{G}$-invariant in the first argument, i.e. $\pi(\mathrm{R}_{g}\boldsymbol{x},y)=\pi(\boldsymbol{x},y),\ \forall g\in\mathcal{G}$. An example of such a setting is an image dataset in which the class label does not change when the image is rotated. Using Equation 7, it suffices for the function $f_{\boldsymbol{\theta}}$ to be $\mathcal{G}$-equivariant to model a $\mathcal{G}$-invariant density $\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x},y)$. Furthermore, a $\mathcal{G}$-equivariant $f_{\boldsymbol{\theta}}$ also makes the marginal density $\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x})$ and conditional density $\tilde{\pi}_{\boldsymbol{\theta}}(y|\boldsymbol{x})$ $\mathcal{G}$-invariant in the input $\boldsymbol{x}$. We call such an energy model, in which $f_{\boldsymbol{\theta}}$ is equivariant to a symmetry transformation, an equivariant joint energy model.

We can train this model by maximizing the log-likelihood of the joint distribution as follows:

$$\mathcal{L}(\boldsymbol{\theta}):=\mathcal{L}_{\mathsf{ML}}(\boldsymbol{\theta})+\mathcal{L}_{\mathsf{SL}}(\boldsymbol{\theta})=\log\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x})+\log\tilde{\pi}_{\boldsymbol{\theta}}(y|\boldsymbol{x}) \qquad (8)$$

where $\mathcal{L}_{\mathsf{SL}}(\boldsymbol{\theta})$ is the supervised loss, which is the cross-entropy loss in the case of classification. Thus, an equivariant joint energy model can be trained by applying the gradient estimator in Equation 6 to $\log\tilde{\pi}_{\boldsymbol{\theta}}(\boldsymbol{x})$ and evaluating the gradient of $\log\tilde{\pi}_{\boldsymbol{\theta}}(y|\boldsymbol{x})$ through back-propagation. Conveniently, Equation 8 can also be used for semi-supervised learning, with $\mathcal{L}_{\mathsf{SL}}(\boldsymbol{\theta})$ substituted by the appropriate supervised loss, e.g. MSE for regression.
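In code, the joint energy model only needs the classifier logits: the marginal energy is a negative log-sum-exp over the classes, and the training objective of Equation 8 adds a cross-entropy term. The following sketch uses our own illustrative helper names and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def marginal_energy(logits):
    """E_theta(x) = -log sum_y exp(f_theta(x)[y]), computed from classifier logits."""
    return -torch.logsumexp(logits, dim=-1)

def jem_loss(f_theta, x_pos, y_pos, x_neg):
    """Joint objective of Equation 8: generative term (Equation 6) plus cross-entropy."""
    logits_pos, logits_neg = f_theta(x_pos), f_theta(x_neg)
    gen_loss = marginal_energy(logits_pos).mean() - marginal_energy(logits_neg).mean()
    cls_loss = F.cross_entropy(logits_pos, y_pos)       # supervised loss L_SL
    return gen_loss + cls_loss
```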

Let us end this section with an empirical example of learning a mixture of $C_{4}$-Gaussians, shown in row two of the leftmost column of Figure 4. The inner $C_{4}$-Gaussian defines the class-conditional probability $\pi(\boldsymbol{x}|y=0)$ (row 3) and the outer $C_{4}$-Gaussian defines $\pi(\boldsymbol{x}|y=1)$ (row 4). We learn a non-equivariant joint EBM using vanilla SVGD (center column of Figure 4) and an equivariant joint EBM using equivariant SVGD (right column of Figure 4), keeping the number of iterations and particles the same for training. In the top row of Figure 4, we plot the decision boundaries learned by each model. The star-marked samples in the figure are samples generated by the underlying model. We plot the joint distribution and the class-conditional distributions in rows two to four, respectively. The figure clearly demonstrates the superior performance of an equivariant joint energy model trained using equivariant SVGD over its non-equivariant counterpart. A more detailed figure, with comparisons to an equivariant joint energy model trained using vanilla SVGD, is presented in Section D.1.

Figure 4: $C_{4}$-Gaussian mixture model. Row 1: Decision boundary. Row 2: Samples and energy of the joint distribution $\pi(\boldsymbol{x},y)$. Row 3: Samples and energy of the conditional distribution $\pi(\boldsymbol{x}|y=0)$. Row 4: Samples and energy of the conditional distribution $\pi(\boldsymbol{x}|y=1)$. Left: Target distribution. Middle: Non-equivariant EBM trained with vanilla SVGD. Right: E-EBM trained with E-SVGD.

5 Experiments

In this section, we present an empirical analysis of equivariant EBMs and E-SVGD through experiments that (i) reconstruct the potential function describing a many-body particle system (DW-4) trained using a limited number of meta-stable states, (ii) model a generative distribution of molecules (QM9) and generate novel samples, and (iii) learn a hybrid (generative and discriminative) model invariant to rotations for FashionMNIST, trained using a dataset with no rotations. Due to space constraints, details of all experiments as well as detailed figures are relegated to Appendix E.

DW-4: In this many-body particle system, a double-well potential describes the configuration of four particles and is invariant to rotations, translations and permutations of the particles. The system comprises five distinct metastable states, characterized as the minima of the potential function. In our experiment, we show that, given access to only a single example of each metastable state configuration, an equivariant EBM trained with E-SVGD can recover other states with energies similar to those in the training set. In Figure 7, the first column shows the metastable states present in the training set. The second column shows the states recovered by an EBM trained with vanilla SVGD, which results in configurations that exactly copy the training set. The third column shows configurations generated by the equivariant model, which are distinct from the training set but mimic the energies of the corresponding metastable states in the training set. Our setup differs from that of Köhler et al. (2020); we discuss this in detail in Section E.1 and also reproduce results similar to Köhler et al. (2020) for our model.

Figure 5: (a) Molecules sampled from an EBM parameterized by an E-GNN trained using E-SVGD. (b) Distribution of distances and angles between atom pairs and triplets.

QM9: QM9 is a molecular dataset containing over 145,000 molecules, commonly used for molecular property prediction. Here, however, we use it for molecular structure generation of constitutional isomers of $\mathrm{C}_{5}\mathrm{H}_{8}\mathrm{O}_{1}$. Similar to DW-4, the molecules are invariant to rotations, translations and permutations of identical atoms. We encode these symmetries using E-GNN (Satorras et al., 2021), an equivariant graph neural network, to represent the energy. We trained our model via E-SVGD on the $\mathrm{C}_{5}\mathrm{H}_{8}\mathrm{O}_{1}$ molecules present in the QM9 dataset and used the trained energy model to generate novel samples that are isomers of $\mathrm{C}_{5}\mathrm{H}_{8}\mathrm{O}_{1}$. We show these generated molecules in Figure 5, where we used the relative distance between atoms as a proxy for determining the covalent bonds. The generated molecules demonstrate the correct 3D arrangement of bonds while containing complex structures like aromatic rings. This is further supported by plots comparing the radial distribution functions of the two most common heavy-atom pairs to quantify our model's fit to QM9 (Figure 5). While the generated molecules have a larger distributional spread, the range of values and the modes, for both angles and distances, resemble the true distribution. We provide more details in Section E.2.

Figure 6: Panels (left to right): joint distribution, conditional distribution, classification accuracy. Left & Center: Samples generated from the joint and class-conditional distributions using the equivariant EBM. Right: Classification accuracy vs. training iterations for equivariant and regular EBMs trained using vanilla SVGD and E-SVGD.
Figure 7: Col. 1: Samples from the true potential energy. Col. 2: Samples from an EBM trained with SVGD. Col. 3: Samples from an equivariant EBM trained with E-SVGD.

FashionMNIST: (Details in Section E.3.) In this experiment, we take the FashionMNIST dataset, where the training set consists of regular images while the test set is processed to contain images randomly rotated with the $C_{4}$ symmetry group. We train an equivariant energy model, where the energy function is a $C_{4}$ steerable CNN (Weiler and Cesa, 2019), with both E-SVGD and vanilla SVGD. Furthermore, we compare to an energy model with no rotation symmetries and report the performance of all three models in terms of classification accuracy on the held-out images as a function of the number of training iterations. The plot in Figure 6 shows, albeit unsurprisingly, that an equivariant energy model performs better than a regular model. The results also illustrate that an equivariant model trained with E-SVGD converges faster than one trained with vanilla SVGD, highlighting the benefit of using E-SVGD for training equivariant EBMs. Finally, in Figure 6, we show samples generated by E-SVGD from the joint and class-conditional distributions using the trained equivariant EBM.

6 Discussion and Conclusion

In this paper, we focused on incorporating inductive biases in the form of symmetry transformations, using equivariant functions, for sampling and learning invariant densities. We first proposed the equivariant Stein variational gradient descent algorithm for sampling from invariant densities using equivariant kernels, which affords many benefits in terms of efficiency due to its ability to model long-range interactions between particles. However, a major limitation of Stein variational gradient descent in general is its sensitivity to the kernel hyper-parameters. An interesting direction for future work is to develop strategies that either adapt or learn these hyper-parameters while running the SVGD dynamics.

Subsequently, we proposed equivariant energy based models, wherein the energy function is parameterized by an equivariant network. In our experiments, we leveraged recent advances in geometric deep learning to model EBMs using steerable CNNs (Weiler and Cesa, 2019) for images, equivariant graph networks (Satorras et al., 2021) for molecules, and group equivariant networks (Cohen and Welling, 2016) for many-body particle systems. We used equivariant SVGD to train these equivariant energy based models for modelling invariant densities and demonstrated that incorporating symmetries in both the energy model and the sampler leads to efficient training. However, as discussed in previous works (Grathwohl et al., 2019), training EBMs using contrastive divergence and short sampling chains is often unstable and challenging. These issues remain with equivariant samplers and must be addressed in order to train large-scale energy based models.

References

Appendix A Sampling using Equivariant Flows

Neural transport augmented sampling, first introduced by Parno and Marzouk (2018), is a general method for using normalizing flows to sample from a given density $\pi$. Informally, the method proceeds by learning a diffeomorphic map $\mathbf{T}:\mathsf{Z}\to\Theta$ such that $\tilde{p}(\boldsymbol{z})=\pi(\boldsymbol{\theta})\cdot|\mathbf{T}^{\prime}(\boldsymbol{z})|$ with $\boldsymbol{z}=\mathbf{T}^{-1}(\boldsymbol{\theta})$, where $\tilde{p}(\boldsymbol{z})$ has a simple geometry amenable to efficient MCMC sampling. Thus, samples can be generated from $\pi(\boldsymbol{\theta})$ by running an MCMC chain in the $\mathsf{Z}$-space and pushing these samples onto the $\Theta$-space using $\mathbf{T}$. The transformation $\mathbf{T}$ can be learned by minimizing the KL divergence between a fixed distribution with simple geometry in the $\mathsf{Z}$-space, e.g. a standard Gaussian, and $\tilde{p}(\boldsymbol{z})$ above. The learning phase attempts to ensure that the distribution $\tilde{p}(\boldsymbol{z})$ is approximately close to the fixed distribution with easy geometry so that MCMC sampling is efficient.

Neural transport augmented samplers have subsequently been extended by Hoffman et al. (2019), who use more powerful flow architectures, and by Jaini et al. (2021), who extend the idea to sampling from discrete probability densities using flows with surjective transformations (Nielsen et al., 2020). We believe these ideas and extensions of neural transport augmented samplers can also be used to sample from an invariant density $\pi$ by defining the flow transformations to be equivariant, à la Köhler et al. (2020).

Appendix B Related Work

In this paper, we proposed the equivariant Stein variational gradient descent algorithm for sampling from densities that are invariant to symmetry transformations. Another contribution of our work is using this equivariant sampling method to efficiently train equivariant energy based models for probabilistic modeling and inference. Perhaps the closest work to that presented in this manuscript is that of Liu and Wang (2017), who first (as far as the authors are aware) used SVGD for training energy based models. However, their work does not consider incorporating symmetries into either the sampler or the energy model itself.

Separately, a major contribution of our paper is extending SVGD to incorporate symmetries present in the underlying target density. Since it was introduced by Liu and Wang (2016), Stein variational gradient descent has garnered a lot of attention as an alternative to Monte Carlo methods for sampling in Bayesian inference problems, courtesy of the flexibility and accuracy obtained by combining the variational inference and sampling paradigms. Stein variational gradient descent has subsequently been extended by Wang et al. (2019) to incorporate geometric information using matrix-valued kernels, and to discrete spaces by Han et al. (2020). While Duncan et al. (2019) have studied the convergence properties of Stein variational gradient descent under a mean-field analysis, a thorough theoretical understanding in the finite-particle regime is still lacking. Several works have, however, empirically probed the limitations of Stein variational gradient descent. In particular, Stein variational gradient descent is susceptible to collapsing to a few modes depending on the initial configuration of the particles (Zhang et al., 2020; D'Angelo and Fortuin, 2021). As we discussed towards the end of Section 3, incorporating symmetries partially alleviates this problem when the group-factorized distribution is unimodal. Furthermore, the problem of mode collapse in Stein variational gradient descent can be addressed by either adding noise to the SVGD update (cf. Equation 3) or using an annealing strategy, as proposed in D'Angelo and Fortuin (2021).

Another contribution of this paper is learning equivariant energy based models using equivariant Stein variational gradient descent. Energy based models have witnessed a revival recently. The primary difficulty in training energy based models is the need to evaluate the partition function, which is often intractable. Thus, training energy based models requires methods that can approximate this partition function. One line of work trains an auxiliary sampling network that generates samples to approximate the partition function (Kumar et al., 2019; Xie et al., 2018), making it similar in essence to Generative Adversarial Networks (GANs) (Finn et al., 2016). However, as discussed in Du and Mordatch (2019), such strategies are prone to mode collapse since the sampling network is often trained without an entropy term.

An alternative is to use Markov Chain Monte Carlo methods to directly estimate the partition function, which provides the benefits afforded by MCMC sampling. This idea was first proposed by Hinton et al. (2006) as the Contrastive Divergence algorithm, which used gradient-free MCMC chains initialized from training data to estimate the partition function. This was subsequently extended by Tieleman (2008), who introduced persistence into contrastive divergence, where a single MCMC chain with a persistent state is employed to sample from the energy model. However, problems remain when training EBMs using contrastive divergence. Specifically, EBMs trained with contrastive divergence may not capture the target distribution faithfully, since the MCMC chains used during training are truncated, leading to biased gradient updates that hurt the learning dynamics (Nijkamp et al., 2019; Schulz et al., 2010). Towards this end, a sampling procedure that converges quickly to samples from the energy function will potentially help to train energy based models. Thus, modelling invariant densities with equivariant energy functions trained using samplers that incorporate the symmetries of the energy function should ease the training of such models.

Equivariant Stein variational gradient descent provides an efficient sampling procedure to train equivariant energy based models. However, modelling the equivariant energy function itself is possible mostly thanks to the tremendous advances in geometric deep learning (Bronstein et al., 2021). In particular, we can leverage the various proposed architectures that incorporate symmetries to model an equivariant energy function. In our experiments, we utilized these advances to model rotations using steerable CNNs (Weiler and Cesa, 2019) and molecules using E(n)-equivariant graph neural networks (Satorras et al., 2021).

Appendix C Equivariant Stein Variational Gradient Descent

In this section, we provide the details for the toy experiments presented in Section 3 as well as additional experiments and plots.

C.1 Additional SO(3) concentric spheres

(a) Regular SVGD sampling. (b) Invariant SVGD sampling.
Figure 8: Recommended to view in color. For both regular (8(a)) and invariant (8(b)) SVGD, from left to right the original samples and the samples projected onto the group-factorized space are shown. Translucent yellow dots represent the distribution.

In Figure 8, we extend the experiments presented in Section 3 to a distribution invariant to the $\mathsf{SO}(3)$ symmetry group. The results further support the observations of Section 3: for continuous symmetry groups, equivariant SVGD captures the target density more faithfully due to its ability to model long-range interactions, whereas vanilla SVGD collapses particles onto a single orbit representing the high-density region of the probability landscape.

C.2 Experimental setup

$C_{4}$-Gaussians

The target $C_{4}$-Gaussians distribution from which we draw samples with equivariant SVGD is defined as a mixture of four Gaussians with uniform mixing coefficients. The mean of each Gaussian is located at a radius of 3 and each has covariance $[1,\frac{1}{5}]$. The 50 starting samples are drawn from a two-dimensional normal distribution with mean at the origin and covariance $[2,2]$. The samples are transformed over 25000 SVGD steps with a step size of 0.02. SVGD uses a scalar RBF kernel with a bandwidth of 0.2 for both regular and equivariant sampling.
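For reference, the target and initial distributions described above could be constructed as in the following sketch; since the text leaves this detail implicit, we assume here that the covariance is rotated with each mode so that the mixture is exactly $C_{4}$-invariant.

```python
import math
import torch
from torch.distributions import Categorical, MixtureSameFamily, MultivariateNormal

# Sketch of the C4-Gaussians target: four modes at radius 3 with covariance diag(1, 1/5).
# Assumption: the covariance is rotated with each mode so the mixture is exactly C4-invariant.
angles = torch.tensor([0.0, 0.5, 1.0, 1.5]) * math.pi
c, s = torch.cos(angles), torch.sin(angles)
R = torch.stack([torch.stack([c, -s], -1), torch.stack([s, c], -1)], -2)      # (4, 2, 2)
means = 3.0 * torch.stack([c, s], dim=-1)                                     # (4, 2)
base_cov = torch.diag(torch.tensor([1.0, 0.2]))
covs = R @ base_cov @ R.transpose(-1, -2)                                     # rotated per mode
target = MixtureSameFamily(Categorical(torch.ones(4)),
                           MultivariateNormal(means, covariance_matrix=covs))

# 50 starting particles from a zero-mean Gaussian with covariance diag(2, 2).
x0 = MultivariateNormal(torch.zeros(2), torch.diag(torch.tensor([2.0, 2.0]))).sample((50,))
```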

Concentric circles

The inner and outer circles of the concentric circles example are located at distances of 4 and 8 from the origin, respectively. A normal distribution with variance 0.5 describes the width of the circles. The starting distribution is the two-dimensional uniform distribution on $[-8,8)$ in both dimensions. The RBF kernel used for SVGD has a bandwidth of 0.005. All other SVGD settings are kept consistent with the $C_{4}$-Gaussians example.

Concentric spheres

The concentric spheres toy example is set up very similarly to the concentric circles experiment. Specifically, the target distribution is parameterized by the radii of the two spheres and the variance of the Gaussian distribution describing the width of the spheres. In this instance, the two spheres have radii of 4 and 9 and the two Gaussian distributions have a variance of 0.3. Again, the starting distribution is the uniform distribution on $[-8,8)$, this time in three dimensions. The RBF bandwidth is set to 0.001 and a total of 100 samples are drawn.

Appendix D Equivariant Joint Energy Models

Here, we present experimental details for the toy example presented in Section 4, additional plots for Figure 4, as well as an additional toy experiment for training equivariant joint energy model using equivariant Stein variational gradient descent.

D.1 Equivariant JEM trained with vanilla SVGD

(a) Ground Truth. (b) Regular EBM trained with vanilla SVGD. (c) Equivariant EBM trained with vanilla SVGD. (d) Equivariant EBM trained with equivariant SVGD.
Figure 9: Extended version of Figure 4 with an equivariant model and regular SVGD. From left to right: decision boundary, samples and energy of the joint distribution $\pi(\boldsymbol{x},y)$, samples and energy of the conditional distribution $\pi(\boldsymbol{x}|y=0)$, samples and energy of the conditional distribution $\pi(\boldsymbol{x}|y=1)$. Row 1: Target distribution. Row 2: Non-equivariant EBM trained with vanilla SVGD. Row 3: E-EBM trained with E-SVGD. Row 4: E-EBM trained with vanilla SVGD.

In Figure 9, we continue the experiment presented in Section 4. In this figure, we provide results for three models: a regular energy based model trained with vanilla SVGD, an equivariant energy based model trained with vanilla SVGD, and an equivariant energy based model trained with equivariant SVGD. We find that, in comparison with the non-equivariant EBM, an equivariant EBM (irrespective of the sampler used) better approximates the target distribution, allowing the model to capture all four modes of the outer distribution while the non-equivariant EBM is unable to spread to those regions. However, compared to the equivariant EBM trained with equivariant SVGD, we find that an equivariant EBM trained with vanilla SVGD requires more training steps to reconstruct areas of low probability. This is because equivariant SVGD explores a wider area of the landscape when producing negative samples for the contrastive divergence algorithm, whereas vanilla SVGD generates multiple samples in the same orbit since it only captures local interactions.

D.2 Additional JEM concentric circles

Figure 10: Visualization of the (learned) distribution by EBMs trained on concentric circles. Red dots represent the samples drawn during the last contrastive divergence step. Blue dots are the true samples used for training. Row 1: Two-dimensional visualization. Row 2: Samples projected on group-factorized spaces. From left to right: Target distribution, non-equivariant EBM trained with vanilla SVGD, equivariant EBM trained with vanilla SVGD, equivariant EBM trained with equivariant SVGD.

In Figure 10, we present the results of training an EBM on samples drawn from the concentric circles toy distribution. This is done for the same three combinations of EBM and SVGD sampler as before: a non-equivariant EBM trained with regular SVGD, an equivariant EBM trained with regular SVGD, and an equivariant EBM trained with equivariant SVGD. We find that the trained EBMs show the same behaviour as observed for the JEM trained on the C_4-Gaussians. The non-equivariant EBM is by far the worst of the three combinations; in particular, it has a hard time reconstructing the outer ring. Stepping up to an equivariant EBM improves on this aspect, as a slight uptick can be seen at the location of the outer ring. However, to also fully capture the areas of low probability, the regular SVGD has to be replaced by our equivariant version.

D.3 Experimental setup

The experimental setup for the experiments with regular EBMs trained using vanilla SVGD consists of three parts: 1) defining the target distribution and constructing the training dataset, 2) defining the energy model and the training parameters, and 3) defining the SVGD kernel and the sampling parameters. Note that for both the C_4-Gaussians and the concentric rings experiments the setup is kept consistent across the three combinations of EBM and SVGD method. Constructing the equivariant representations for the equivariant EBM models does not add additional parameters to the models.

C_4 Mixture of Gaussians

The target distribution is defined as two sets of C_4-Gaussians, one at a distance of 7 from the origin and the other at a distance of 15. The inner set of four Gaussians represents the distribution of the first class (i.e. π(𝒙 | y=0)) while the outer set represents the second class (i.e. π(𝒙 | y=1)). Under this definition, both the class-conditional probabilities and the joint distribution are invariant to the C_4 symmetry group. The dataset used for training the energy model consists of 128 samples equally divided amongst the two classes.

The regular EBM is defined as a 6-layer MLP with ReLU activation functions and an input dimension of 2. The layers have 32, 64, 64, 64, 32, and 2 output nodes respectively. The energy model is trained over 500 epochs with a fixed learning rate of 0.001 using the Adam optimizer (Kingma and Ba, 2014) with a batch size of 32.
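For concreteness, a minimal PyTorch sketch of such a joint energy model is given below. The helper names, and the way joint and marginal energies are read off from the two class logits, follow the standard joint-energy-model construction and are our assumptions rather than the authors' exact code.

import torch
import torch.nn as nn

# Sketch of the non-equivariant joint energy model described above:
# a 6-layer ReLU MLP mapping a 2-d input to one logit per class.
energy_net = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),
)

def joint_energy(x, y):
    # E(x, y) = -f(x)[y], one energy per labelled sample
    return -energy_net(x).gather(1, y[:, None]).squeeze(1)

def marginal_energy(x):
    # E(x) = -logsumexp_y f(x)[y]
    return -torch.logsumexp(energy_net(x), dim=1)

optimizer = torch.optim.Adam(energy_net.parameters(), lr=1e-3)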

The SVGD kernel used for sampling the 32 negative samples for the contrastive divergence step is an RBF kernel with a bandwidth of 0.1. Each sampling phase runs 10,000 steps of SVGD with a step size of 0.9. We consider the SVGD to have converged if the norm of the update of the samples between two consecutive SVGD steps is less than 10^{-4}. Additionally, the sampling uses persistence (Tieleman, 2008) with a 0.05 probability of resetting. When reset, new SVGD starting samples are drawn from the positive samples in the dataset. Furthermore, the positive samples in the dataset are used as additional repulsive forces by concatenating them to the batch of negative samples when calculating the update step in Equation 3.
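The persistent contrastive divergence step described above could be organised roughly as in the following sketch. The svgd_sample helper, the buffer handling, and the loss bookkeeping are hypothetical placeholders for the procedure described in the text, not the authors' implementation.

import torch

def cd_step(energy_fn, pos_batch, neg_buffer, optimizer, svgd_sample, reset_prob=0.05):
    # With probability reset_prob, re-initialise a persistent chain from a positive sample.
    reset = torch.rand(neg_buffer.shape[0]) < reset_prob
    idx = torch.randint(0, pos_batch.shape[0], (int(reset.sum()),))
    neg_buffer[reset] = pos_batch[idx]

    # Evolve the negative particles with SVGD; the positive samples are passed in
    # so that they contribute additional repulsive forces, as described above.
    neg_buffer = svgd_sample(energy_fn, neg_buffer, repulsors=pos_batch)

    # Contrastive divergence: lower the energy of data, raise the energy of model samples.
    loss = energy_fn(pos_batch).mean() - energy_fn(neg_buffer.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return neg_buffer.detach(), loss.item()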

Concentric circles

The concentric circles of the target distribution are located at distances of 3 and 10 from the origin, with equal mixing coefficients. This target distribution is used to sample a dataset of 128 training samples.

Both the energy function and most of the training setup are kept the same as for the C_4-Gaussians experiment. However, for this experiment a learning rate scheduler is used that reduces the learning rate at epochs 150 and 400 to 0.0005 and 0.0001 respectively. Additionally, the mean squared error (MSE) loss is used as an extra supervision signal during training to demonstrate the use of the loss given in Equation 8. The MSE loss is weighted by a factor of 0.5.

Negative samples are drawn using 10,000 SVGD steps with a bandwidth of 0.05. Additionally, we use a scheduler for the step size of the Stein variational gradient descent algorithm which reduces the step size at epochs 250 and 400 to 0.5 and 0.1 respectively. The use of persistence, its reset, and the additional supervised SVGD repulsive forces are consistent with the C_4-Gaussians experiment.

Appendix E Experiments

In this section, we present the details for our experiments presented in Section 5.

E.1 Many-body Particle System (DW-4)

Experimental Setup: The dataset for the Double-Well with 4 particles (DW-4) experiment is constructed by sampling a single example from each of the five meta-stable state configurations of the DW-4 potential. Each of these samples is then duplicated 200 times for a total of 1,000 training samples. A small amount of Gaussian noise is added to each sample to make them unique.

The EBM used to reconstruct the potential is parameterized by a 3-layer MLP with 64, 64, and 1 output nodes respectively. The input of the MLP is 8-dimensional (one for each coordinate of the 4 particles). The ReLU activation function is used after each layer except the final one, where we instead use the activation function log(1 + x^2), with x the output of the final layer. The EBM is trained over 50 epochs using the Adam optimizer and a fixed learning rate of 0.01. The batch size is 64.
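A minimal sketch of this energy network, assuming the 2-d coordinates of the 4 particles are flattened into an 8-d input (the class name and the flattening convention are ours):

import torch
import torch.nn as nn

class DW4Energy(nn.Module):
    def __init__(self):
        super().__init__()
        # Two hidden layers of 64 units with ReLU, followed by a scalar output.
        self.net = nn.Sequential(
            nn.Linear(8, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, coords):               # coords: (batch, 8) flattened particle positions
        x = self.net(coords).squeeze(-1)
        return torch.log(1 + x ** 2)         # final activation log(1 + x^2) as described above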

We use scalar RBF kernels for all SVGD variants, but depending on the type of SVGD used we use a different kernel bandwidth for the DW-4 experiment. Specifically, for regular SVGD we use a bandwidth of 0.1 and for equivariant SVGD we use 0.001. The influence of the RBF kernel bandwidth on the final results is discussed further in the next section. For each batch of negative samples, we evolve SVGD for 5,000 time-steps with a step size of 0.1, using the dataset samples as repulsive forces. Persistence is used with a reset probability of 0.10. When reset, the starting coordinates of the DW-4 particles are independently sampled from the uniform distribution over the range [-5, 5).

Differences with Köhler et al. (2020): As mentioned in the main paper, the experiment performed using the DW-4 potential is a slight deviation from an earlier experiment using the same potential proposed in Köhler et al. (2020). In this earlier work, the authors investigate the capacity of their proposed density estimation method (equivariant flows) to recover unseen meta-stable states of the potential when given access to only a single meta-stable state. To clarify, we refer to all local minima of the potential that are equivalent under rotation, translation and permutation symmetry as a single distinct meta-stable state. Under this definition there are a total of 5 distinct meta-stable states (see Figure 7) for the DW-4 potential.

While the results presented in the original paper show great success, it is our understanding that these results cannot be due to the proposed equivariance constraint on the normalizing flow. Precisely, the equivariant flow density proposed by the authors is invariant with respect to permutation of particles, rotation of the system of particles around its center of mass, and translation of the entire system. Thus, given access to only one of the 5 distinct meta-stable states, the equivariant flow only learns to assign a high probability to particle configurations that belong to this same meta-stable state. The 4 other distinct meta-stable states cannot be recovered from the single presented meta-stable state using the symmetry transformations that the flow is invariant to. As the density estimated by the EBM trained using the equivariant SVGD proposed in this work is only invariant to these same symmetries, it would face the same restrictions. Instead, we believe that the method recovers the other 4 meta-stable states primarily due to the additional spread we observed in the equivariant sampling process (see Section 3) and the addition of Gaussian noise to the training example. By adding this small amount of noise, the density estimator cannot collapse its density onto the single training example. As a result, every possible configuration of the four particles ultimately has a small non-zero probability. Given enough samples, there is therefore a non-trivial chance of sampling the other meta-stable states as well. The equivariance constraint on the normalizing flow further amplifies this, as the spread of non-zero probabilities occurs not only around the single training example but also around all its symmetry transformations.

The hypothesis of the previous paragraph is substantiated by the experimental results presented in Figure 12. We find that if we train an equivariant EBM using equivariant SVGD on the dataset supplied in Köhler et al. (2020) (see Figure 11) instead of the one we constructed ourselves, the sampling procedure can be forced to replicate the results presented by Köhler et al. (2020). If we use equivariant SVGD to sample from the trained EBM with the RBF kernel bandwidth set too high, the variance in the samples becomes too large. Thus, with a sufficiently large number of samples, searching through these samples reveals that all distinct meta-stable states have incidentally been recovered. When we significantly reduce the bandwidth, the same number of samples can be drawn but the variance is low enough to only recover symmetry transformations of the original meta-stable state in the dataset. The latter is the expected result given the set of symmetries the estimated densities are invariant to.

Figure 11: Training samples from the original dataset provided in Köhler et al., (2020). All samples represent the same meta-stable state and only differ by the addition of some Gaussian noise to the particle locations.
[Figure 12 panels: (a) High bandwidth; (b) Low bandwidth]
Figure 12: Results of sampling using equivariant SVGD from an equivariant EBM trained with the dataset provided in Köhler et al., (2020). In fig. 12(b) a bandwidth of 0.0001 for the RBF-kernel is used while in fig. 12(a) a bandwidth of 2.0 is used. Note, sampling is done from the same EBM trained using equivariant SVGD with a bandwidth of 0.001.

E.2 Molecular Generation using QM9

For the molecular generation experiments we limit the large QM9 dataset to constitutional isomers of C5H8O1. This molecule was selected due to the relatively high number of constitutional isomers (35) in the dataset in combination with its low total atomic charge (46). This allows for sufficient variation within the dataset while keeping the molecules small enough to easily visualize and interpret.

The equivariant EBM is created using an Equivariant Graph Neural Network (E-GNN; Satorras et al., 2021) with 4 Equivariant Graph Convolutional Layers, each with 64 units. All other model configurations are kept consistent with those deployed in Satorras et al. (2021) for the same dataset. The model is trained over 2,500 epochs with a step-wise learning rate scheduler: we start with a learning rate of 0.01 and reduce it to 0.005 and 0.001 at epochs 250 and 1,000 respectively. We use an RBF kernel with a bandwidth of 1 for equivariant SVGD, which is evolved for 2,500 time-steps with a step size of 0.5. The persistent samples are reset with a 0.2 probability.
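As a reference for what one Equivariant Graph Convolutional Layer computes, below is a simplified, self-contained sketch of the layer in the spirit of Satorras et al. (2021). Edge attributes, normalisation details, and the widths of the internal MLPs are omitted or assumed; this is an illustrative sketch, not the authors' implementation.

import torch
import torch.nn as nn

class EGCLSketch(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(2 * hidden + 1, hidden), nn.SiLU(),
                                   nn.Linear(hidden, hidden), nn.SiLU())
        self.phi_x = nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(),
                                   nn.Linear(hidden, 1))
        self.phi_h = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.SiLU(),
                                   nn.Linear(hidden, hidden))

    def forward(self, h, x):
        # h: (n, hidden) node features, x: (n, 3) atom coordinates
        diff = x[:, None, :] - x[None, :, :]                     # pairwise coordinate differences
        dist2 = (diff ** 2).sum(-1, keepdim=True)                # squared distances (rotation-invariant)
        hi = h[:, None, :].expand(-1, h.shape[0], -1)
        hj = h[None, :, :].expand(h.shape[0], -1, -1)
        m = self.phi_e(torch.cat([hi, hj, dist2], dim=-1))       # messages m_ij
        x = x + (diff * self.phi_x(m)).sum(dim=1)                # equivariant coordinate update
        h = self.phi_h(torch.cat([h, m.sum(dim=1)], dim=-1))     # invariant feature update
        return h, x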

Bond estimation

Before visualizing the generated structures, we first post-process them to infer the bonds between the atoms. We do this by using the distance between atoms as a proxy for their probability of being bonded. In the following, we describe this process in detail, relying on graph terminology where atoms are represented by nodes and bonds by undirected edges. The three steps of this process can be roughly described as: 1) bond all heavy atoms together such that they form a connected graph, 2) bond each hydrogen atom to one of the heavy atoms and, 3) create new bonds between atoms, or double up bonds between already connected atoms, until each heavy atom has the required number of bonds. A code sketch of the first step is given after the detailed description below.

As stated, the goal of the first step is to form a connected graph containing all the heavy atoms. We use the atom closest to the origin as the starting graph consisting of only this single node. From there on, we continuously add the atom that is closest to any of the atoms already in the connected graph. The atom newly added to the connected graph and the atom it was closest to are then connected by an edge/bond such that the graph maintains its connectivity. Note that when determining the distance to the connected graph for each atom, we only consider the distance to atoms that still have bonds available.

In the second step, we connect all hydrogen atoms to the connected graph of heavy atoms. We order this process based on the distance of each hydrogen atom to its closest heavy atom in the connected graph. In other words, we first calculate the distance from each hydrogen atom to all heavy atoms in the connected graph. Given these distances, we iteratively take the hydrogen atom that is furthest away from its closest heavy atom with bonds still available and bond it to that heavy atom. After each newly connected atom pair, we update the number of bonds available for each heavy atom before continuing.

In the last step, we spend all remaining bonds available to the heavy atoms. To do this we repeat the following process until all bonds are spent. First, find the closest pair of heavy atoms that both still have bonds left. Second, find the closest pair of atoms that have bonds left and are not yet connected. If the distance between the second pair is no more than 1.1 times the distance between the first pair, bond the second pair. Otherwise, connect the first pair (doubling up the bond if the pair is already connected).
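As referenced above, here is a sketch of the first step (connecting the heavy atoms into a single graph) under our own naming and data-layout assumptions; steps 2 and 3 follow the same greedy, distance-based pattern.

import numpy as np

def connect_heavy_atoms(positions, valences):
    # positions: (n, 3) heavy-atom coordinates; valences: number of bonds each atom may form.
    n = len(positions)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    remaining = np.array(valences, dtype=int)
    bonds = []
    # Start the connected graph from the atom closest to the origin.
    in_graph = {int(np.argmin(np.linalg.norm(positions, axis=-1)))}
    while len(in_graph) < n:
        best = None
        for i in in_graph:
            if remaining[i] == 0:          # only atoms with bonds left can accept a neighbour
                continue
            for j in range(n):
                if j in in_graph:
                    continue
                if best is None or dist[i, j] < dist[best]:
                    best = (i, j)
        i, j = best                        # closest outside atom to the available part of the graph
        bonds.append((i, j))
        remaining[i] -= 1
        remaining[j] -= 1
        in_graph.add(j)
    return bonds, remaining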

Figure 13: Full batch of 64 molecules sampled from the trained EBM. Note that this number is larger than the total number of molecules trained on.

In Figure 13 we show a full batch of sampled molecules. We find that, in addition to the ones presented in the main body of the paper, most other molecules are also anecdotally correct. However, we also find some faulty structures, which can be roughly categorized into two classes. The first class contains faulty molecules that result from the sampling procedure; if we label the rows with the letters A through H and the columns with the numbers 1 through 8, the molecule in C2 is one such example. The second class contains molecules that show implausible bonds; molecules D7 and E3 are clear examples of this.

E.3 Joint and Conditional Generation for FashionMNIST

We use the FashionMNIST dataset (Xiao et al., 2017) for this experiment with no data augmentation, i.e. all models are trained with images in their natural orientation. The test set, however, is preprocessed to contain images that have been randomly rotated using the C_4 symmetry group, i.e. with rotation angles from the set {0°, 90°, 180°, 270°}. The equivariant energy based model is created using C_4-steerable CNNs (Weiler and Cesa, 2019), consisting of eight C_4-steerable convolutional layers followed by a group pooling layer and a fully connected layer. The regular EBM uses the same architecture, but the individual layers are not equivariant, i.e. the steerable convolutional layers are replaced by normal CNN layers. We train all models using the joint-energy model loss described in Equation 8. We train the models using both equivariant SVGD and vanilla SVGD, resulting in three combinations of models: equivariant EBM trained with equivariant SVGD, equivariant EBM trained with vanilla SVGD, and regular EBM trained with vanilla SVGD.
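To make the architecture concrete, the following is an illustrative sketch of a small C_4-steerable network written with the e2cnn library of Weiler and Cesa (2019). Using e2cnn here, as well as the layer widths and the two-layer depth, are our assumptions for illustration; the actual model uses eight steerable layers as described above.

import torch
from e2cnn import gspaces, nn as enn

r2_act = gspaces.Rot2dOnR2(N=4)                                  # the C4 rotation group acting on the plane
in_type = enn.FieldType(r2_act, [r2_act.trivial_repr])           # single-channel (grayscale) input
hid_type = enn.FieldType(r2_act, 16 * [r2_act.regular_repr])     # 16 regular fields

body = enn.SequentialModule(
    enn.R2Conv(in_type, hid_type, kernel_size=3, padding=1),
    enn.ReLU(hid_type),
    enn.R2Conv(hid_type, hid_type, kernel_size=3, padding=1),
    enn.ReLU(hid_type),
    enn.GroupPooling(hid_type),                                  # pool over C4 -> invariant features
)

x = torch.randn(8, 1, 28, 28)                                    # a batch of FashionMNIST-sized images
feats = body(enn.GeometricTensor(x, in_type)).tensor             # (8, 16, 28, 28), C4-invariant
logits = torch.nn.Linear(16, 10)(feats.mean(dim=(2, 3)))         # spatial average + fully connected head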

We trained each model for 300 epochs using the Adam optimizer (Kingma and Ba, 2014) with a batch size of 64. For equivariant SVGD, we evolve the dynamics for 1,000 time-steps per mini-batch, but due to slow convergence we had to increase this to 3,000 time-steps for vanilla SVGD. We used an RBF kernel for SVGD with a bandwidth of 0.1 for vanilla SVGD and 0.005 for equivariant SVGD, with a step size of 0.08. We also used persistence with a reset probability of 0.1.