
Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling

Denis Blessing    Xiaogang Jia    Johannes Esslinger    Francisco Vargas    Gerhard Neumann
Abstract

Monte Carlo methods, Variational Inference, and their combinations play a pivotal role in sampling from intractable probability distributions. However, current studies lack a unified evaluation framework, relying on disparate performance measures and limited method comparisons across diverse tasks, complicating the assessment of progress and hindering the decision-making of practitioners. In response to these challenges, our work introduces a benchmark that evaluates sampling methods using a standardized task suite and a broad range of performance criteria. Moreover, we study existing metrics for quantifying mode collapse and introduce novel metrics for this purpose. Our findings provide insights into strengths and weaknesses of existing sampling methods, serving as a valuable reference for future developments. The code is publicly available here.


1 Introduction

Sampling methods are designed to address the challenge of generating approximate samples or estimating the intractable normalization constant $Z$ for a probability density $\pi$ on $\mathbb{R}^{d}$ of the form

$\pi(\mathbf{x})=\frac{\gamma(\mathbf{x})}{Z},\quad Z=\int_{\mathbb{R}^{d}}\gamma(\mathbf{x})\,\mathrm{d}\mathbf{x},$  (1)

where $\gamma:\mathbb{R}^{d}\rightarrow\mathbb{R}^{+}$ can be pointwise evaluated. This formulation has broad applications in fields such as Bayesian statistics and the natural sciences (Liu & Liu, 2001; Stoltz et al., 2010; Frenkel & Smit, 2023; Mittal et al., 2023).

Monte Carlo (MC) methods (Hammersley, 2013), including Annealed Importance Sampling (AIS) (Neal, 2001) and its Sequential Monte Carlo (SMC) extensions (Del Moral et al., 2006), have traditionally been considered the gold standard for addressing the sampling problem. An alternative approach is Variational Inference (VI) (Blei et al., 2017), where a tractable family of distributions is parameterized, and optimization tools are employed to maximize similarity to the intractable target distribution $\pi$.

In recent years, there has been a surge of interest in the development of sampling methods that merge MC with VI techniques to approximate complex, potentially multimodal distributions (Wu et al., 2020a; Zhang & Chen, 2021; Arbel et al., 2021; Matthews et al., 2022; Jankowiak & Phan, 2022; Midgley et al., 2022; Berner et al., 2022; Richter et al., 2023; Vargas et al., 2023a, b; Akhound-Sadegh et al., 2024).

However, the evaluation of these methods faces significant challenges, including the absence of a standardized set of tasks and diverse performance criteria. This diversity complicates meaningful comparisons between methods. Existing evaluation protocols, such as the evidence lower bound (ELBO), often rely on samples from the model, restricting their evaluation capabilities to the model’s support. This limitation becomes especially problematic when assessing the ability to mitigate mode collapse on target densities with well-separated modes. To overcome this challenge, others propose the use of integral probability metrics (IPMs), like maximum mean discrepancy (Arenz et al., 2018) or Wasserstein distance (Richter et al., 2023; Vargas et al., 2023a), leveraging samples from the target density to assess performance beyond the model’s support. However, these metrics often involve subjective design choices such as kernel selection or cost function determination, potentially leading to biased evaluation protocols.

To address these challenges, our work introduces a comprehensive set of tasks for evaluating variational methods for sampling. We explore existing evaluation criteria and propose a novel metric specifically tailored to quantify mode collapse. Through this evaluation, we aim to provide valuable insights into the strengths and weaknesses of current sampling methods, contributing to the future design of more effective techniques and the establishment of standardized evaluation protocols.

2 Preliminaries

We provide an overview of Monte Carlo methods, Variational Inference, and combinations. The notation introduced in this section is used throughout the remainder of this work.

Monte Carlo Methods. A variety of Monte Carlo (MC) techniques have been developed to tackle the sampling problem and the estimation of $Z$. In particular, sequential importance sampling methods such as Annealed Importance Sampling (AIS) (Neal, 2001) and its Sequential Monte Carlo (SMC) extensions (Del Moral et al., 2006) are often regarded as a gold standard for computing $Z$. These approaches construct a sequence of distributions $(\pi_{t})_{t=1}^{T}$ that ‘anneal’ smoothly from a tractable proposal distribution $\pi_{0}$ to the target distribution $\pi_{T}=\pi$. One typically uses the geometric average, that is, $\gamma_{t}(\mathbf{x})=\pi_{0}(\mathbf{x})^{1-\beta_{t}}\gamma(\mathbf{x})^{\beta_{t}}$, with $\pi_{t}\propto\gamma_{t}$ for $0=\beta_{0}<\beta_{1}<\cdots<\beta_{T}=1$. Approximate samples from $\pi$ are then obtained by starting from $\mathbf{x}_{0}\sim\pi_{0}(\cdot)$ and running a sequence of Markov chain Monte Carlo (MCMC) transitions that target $(\pi_{t})_{t=1}^{T}$.
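To make the annealing construction concrete, the following minimal sketch estimates $\log Z$ with AIS on a one-dimensional toy target; the bimodal target, random-walk Metropolis kernel, and linear $\beta$-schedule are illustrative assumptions, not choices made by any of the benchmarked methods.

```python
import numpy as np

def log_gamma(x):
    # Unnormalized toy target: mixture of two Gaussian bumps (illustrative).
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def log_pi0(x):
    # Tractable proposal: standard Gaussian.
    return -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)

def ais_log_z(n_particles=1000, T=100, mcmc_steps=5, step=0.5, rng=None):
    rng = np.random.default_rng(rng)
    betas = np.linspace(0.0, 1.0, T + 1)          # linear annealing schedule
    x = rng.normal(size=n_particles)               # x_0 ~ pi_0
    log_w = np.zeros(n_particles)
    for t in range(1, T + 1):
        # Incremental IS weight: log gamma_t(x) - log gamma_{t-1}(x)
        # for the geometric average gamma_t = pi_0^{1-beta_t} gamma^{beta_t}.
        log_w += (betas[t] - betas[t - 1]) * (log_gamma(x) - log_pi0(x))
        # A few random-walk Metropolis steps targeting pi_t.
        log_pt = lambda y: (1 - betas[t]) * log_pi0(y) + betas[t] * log_gamma(y)
        for _ in range(mcmc_steps):
            prop = x + step * rng.normal(size=n_particles)
            accept = np.log(rng.uniform(size=n_particles)) < log_pt(prop) - log_pt(x)
            x = np.where(accept, prop, x)
    # log Z estimate via log-mean-exp of the accumulated weights.
    return np.logaddexp.reduce(log_w) - np.log(n_particles)

print(ais_log_z(rng=0))
```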

Variational Inference. Variational inference (VI) (Blei et al., 2017) is a popular alternative to MCMC and SMC where one considers a flexible family of easy-to-sample distributions $q^{\theta}$ whose parameters $\theta$ are optimized by minimizing the reverse Kullback–Leibler (KL) divergence, i.e.,

$D_{\text{KL}}(q^{\theta}(\mathbf{x})\,\|\,\pi(\mathbf{x}))=-\underbrace{\mathbb{E}_{q^{\theta}(\mathbf{x})}\Big[\log\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big]}_{\text{ELBO}}+\log Z.$  (2)

It is well known that minimizing the reverse KL is equivalent to maximizing the evidence lower bound (ELBO) and that $\text{ELBO}\leq\log Z$ with equality if and only if $q^{\theta}=\pi$. Later, VI was extended to other variational objectives such as $\alpha$-divergences (Li & Turner, 2016; Midgley et al., 2022), the log-variance loss (Richter et al., 2020), trajectory balance (Malkin et al., 2022a), or general $f$-divergences (Wan et al., 2020). Typical choices for $q^{\theta}$ include mean-field approximations (Wainwright & Jordan, 2008), mixture models (Arenz et al., 2022), or normalizing flows (Papamakarios et al., 2021). To construct more flexible variational distributions, Agakov & Barber (2004) modeled $q^{\theta}(\mathbf{x})$ as the marginal of a latent variable model, i.e., $q^{\theta}(\mathbf{x})=\int q^{\theta}(\mathbf{x},\mathbf{z})\,\mathrm{d}\mathbf{z}$ (Agakov & Barber (2004) coined the term ‘augmentation’ for $\mathbf{z}$; we adopt the more established terminology and refer to $\mathbf{z}$ as a latent variable). As this marginal is typically intractable, $\theta$ is then learned by minimizing a discrepancy measure between $q^{\theta}(\mathbf{x},\mathbf{z})$ and an extended target $p^{\theta}(\mathbf{x},\mathbf{z})=\pi(\mathbf{x})p^{\theta}(\mathbf{z}|\mathbf{x})$, where $p^{\theta}(\mathbf{z}|\mathbf{x})$ is an auxiliary conditional distribution. Using the chain rule for the KL divergence (Cover, 1999), one obtains an extended version of the ELBO, that is,

$D_{\text{KL}}(q^{\theta}(\mathbf{x})\,\|\,\pi(\mathbf{x}))\leq-\underbrace{\mathbb{E}_{q^{\theta}(\mathbf{x},\mathbf{z})}\Big[\log\frac{\gamma(\mathbf{x})p^{\theta}(\mathbf{z}|\mathbf{x})}{q^{\theta}(\mathbf{x},\mathbf{z})}\Big]}_{\overline{\text{ELBO}}}+\log Z.$  (3)

Although the extended ELBO is often referred to as the ELBO, latent variables $\mathbf{z}$ introduce additional looseness, i.e., $\overline{\text{ELBO}}\leq\text{ELBO}$ with equality if $p^{\theta}(\mathbf{z}|\mathbf{x})=q^{\theta}(\mathbf{x},\mathbf{z})/q^{\theta}(\mathbf{x})$. To compute expectations with respect to $q^{\theta}(\mathbf{x},\mathbf{z})$, one typically chooses tractable distributions $q^{\theta}(\mathbf{x}|\mathbf{z})$ and $q^{\theta}(\mathbf{z})$ and performs a Monte Carlo estimate using ancestral sampling.
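As a minimal illustration of Eq. (2), the following sketch forms a Monte Carlo estimate of the ELBO for a diagonal (mean-field) Gaussian $q^{\theta}$; the target density and the variational parameters are toy assumptions.

```python
import numpy as np

def log_gamma(x):
    # Unnormalized toy target density (illustrative).
    return -0.5 * np.sum((x - 1.0) ** 2, axis=-1)

def elbo_estimate(mu, log_std, n_samples=10_000, rng=None):
    """Monte Carlo ELBO = E_q[log gamma(x) - log q(x)] for a diagonal Gaussian q."""
    rng = np.random.default_rng(rng)
    d = mu.shape[0]
    eps = rng.normal(size=(n_samples, d))
    x = mu + np.exp(log_std) * eps                 # reparameterized samples x ~ q
    log_q = (-0.5 * np.sum(eps ** 2, axis=-1)
             - 0.5 * d * np.log(2 * np.pi) - np.sum(log_std))
    return np.mean(log_gamma(x) - log_q)

mu, log_std = np.zeros(2), np.zeros(2)
print(elbo_estimate(mu, log_std, rng=0))           # <= log Z by Eq. (2)
```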

Variational Monte Carlo Methods. Over recent years, the idea of using extended distributions has been further explored (Wu et al., 2020b; Geffner & Domke, 2021; Thin et al., 2021; Zhang et al., 2021; Doucet et al., 2022b; Geffner & Domke, 2022). In particular, these ideas marry Monte Carlo with variational techniques by constructing the variational distribution and extended target as Markov chains, i.e., $q^{\theta}(\mathbf{x}_{0:T})=\pi_{0}(\mathbf{x}_{0})\prod_{t=1}^{T}F^{\theta}_{t}(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ and $p^{\theta}(\mathbf{x}_{0:T})=\pi(\mathbf{x}_{T})\prod_{t=0}^{T-1}B^{\theta}_{t}(\mathbf{x}_{t}|\mathbf{x}_{t+1})$ with $\mathbf{x}=\mathbf{x}_{T}$, $\mathbf{z}=(\mathbf{x}_{0},\ldots,\mathbf{x}_{T-1})$, and tractable $\pi_{0}$. Common choices of transition kernels $F^{\theta}_{t},B^{\theta}_{t}$ include Gaussian distributions (Doucet et al., 2022b; Geffner & Domke, 2022) or normalizing flow maps (Wu et al., 2020a; Arbel et al., 2021; Matthews et al., 2022) and can be optimized, e.g., by maximizing the extended ELBO via stochastic gradient ascent. Recently, Vargas et al. (2023a); Zhang & Chen (2021); Vargas et al. (2023b, 2024); Richter et al. (2023); Berner et al. (2022) explored the limit $T\rightarrow\infty$, in which case the Markov chains converge to forward and backward time stochastic differential equations (SDEs) (Anderson, 1982; Song et al., 2020), inducing the path distributions $\mathbb{Q}^{\theta}$ and $\mathbb{P}^{\theta}$, which can be thought of as continuous-time analogues of $q^{\theta}$ and $p^{\theta}$, respectively. Zhang & Chen (2021); Berner et al. (2022); Richter et al. (2023); Vargas et al. (2024) leveraged the continuous-time perspective to establish connections with Schrödinger bridges (Léonard, 2013) and stochastic optimal control (Dai Pra, 1991), resulting in the development of novel sampling algorithms.

Figure 1: Illustration of the evidence upper (EUBO) and lower bound (ELBO). The mode-seeking nature of the reverse KL results in $\text{ELBO}\ll\log Z$ if the model density $q^{\theta}$ (indicated by the samples $\times$) averages over the target $\pi$ (indicated by the level plot) ($t_{1}$), and $\text{ELBO}\approx\log Z$ if $\pi\geq 0$ whenever $q^{\theta}\geq 0$ ($t_{2}$–$t_{4}$). As a result, the ELBO is not sensitive to mode collapse. In contrast, the mass-covering nature of the forward KL ensures that $\text{EUBO}\gg\log Z$ if $q^{\theta}\approx 0$ whenever $\pi>0$ ($t_{2}$) and $\text{EUBO}\approx\log Z$ if $q^{\theta}\geq 0$ whenever $\pi\geq 0$ ($t_{1}$). Consequently, the EUBO is well suited to quantify mode collapse.

Performance Criteria. Several performance criteria have been proposed for evaluating sampling methods, notably, those comparing the density ratio between the target and model density and integral probability metrics (IPMs).

Density ratio-based criteria make use of the ratio between the (unnormalized) target density $\gamma(\mathbf{x})$ and the model $q^{\theta}(\mathbf{x})$. Due to the intractability of $q^{\theta}(\mathbf{x})$ for methods that work with latent variables, the density ratio between the joint distributions of $\mathbf{x}$ and $\mathbf{z}$ is considered, i.e.,

$w=\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})},\quad\text{and}\quad\overline{w}=\frac{\gamma(\mathbf{x})p^{\theta}(\mathbf{z}|\mathbf{x})}{q^{\theta}(\mathbf{x},\mathbf{z})},$  (4)

respectively. Note that $w$ and $\overline{w}$ are also referred to as (unnormalized) importance weights. Using this notation, we can recover commonly used metrics such as the reverse effective sample size ($\text{ESS}_{r}$) or the ELBO, that is,

$\text{ESS}_{r}=\frac{(\mathbb{E}_{q^{\theta}}[w])^{2}}{\mathbb{E}_{q^{\theta}}[w^{2}]}\quad\text{and}\quad\text{ELBO}=\mathbb{E}_{q^{\theta}}[\log w],$  (5)

respectively. Here, ‘reverse’ denotes that expectations are computed with respect to $q^{\theta}$. In addition, if the true normalization constant is known, an importance-weighted reverse estimate of $\log Z$ is often employed to report the estimation bias, i.e., $\Delta\log Z_{r}=|\log Z-\log\hat{Z}_{r}|$ with

$\log\hat{Z}_{r}=\log\mathbb{E}_{q^{\theta}}[w].$  (6)

Please note that extended versions of these criteria are obtainable by replacing $w$ with the extended version $\overline{w}$ and taking expectations under the joint distribution $q^{\theta}(\mathbf{x},\mathbf{z})$.
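In practice, these reverse criteria are computed from log-importance weights using the log-sum-exp trick for numerical stability; a sketch, assuming an array of values $\log w$ obtained from model samples (e.g., from the ELBO snippet above):

```python
import numpy as np

def reverse_criteria(log_w):
    """ELBO, reverse log Z estimate, and reverse ESS from log-weights
    log w = log gamma(x) - log q(x) with x ~ q. ESS lies in (0, 1]."""
    n = log_w.shape[0]
    elbo = np.mean(log_w)                                    # Eq. (5), right
    log_z_r = np.logaddexp.reduce(log_w) - np.log(n)         # Eq. (6)
    # ESS_r = (E[w])^2 / E[w^2], computed in log space for stability.
    log_num = 2 * (np.logaddexp.reduce(log_w) - np.log(n))
    log_den = np.logaddexp.reduce(2 * log_w) - np.log(n)
    ess_r = np.exp(log_num - log_den)
    return elbo, log_z_r, ess_r
```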

3 Quantifying Mode-Collapse

Quantifying the ability to avoid mode collapse is difficult, as identifying all modes of the target density $\pi$ and determining whether a model captures them accurately is inherently challenging. In particular, methods that are optimized using the reverse KL divergence are forced to assign high probability only to regions with non-negligible probability under the target distribution $\pi$. This is referred to as mode-seeking behavior and can result in an overemphasis on a limited set of modes, leading to mode collapse. Consequently, performance criteria that use expectations under the model $q^{\theta}$, such as the ELBO, (reverse) ESS, or $\Delta\log Z_{r}$, are influenced by the mode-seeking nature of the reverse KL divergence, making them less sensitive to mode collapse.

Here, we aim to explore criteria that are sensitive to mode collapse, such as density ratio-based ‘forward’ criteria that leverage expectations under $\pi$, and integral probability metrics (IPMs). Furthermore, we introduce entropic mode coverage, a novel criterion that leverages prior knowledge about the target to heuristically quantify mode coverage.

Forward Criteria. We discuss the ‘forward’ versions of the criteria discussed in Section 2. First, evidence upper bounds (EUBOs) are based on the forward KL divergence and have already been leveraged as learning objectives in VI (Ji & Shen, 2019). Here, we explore them as performance criteria that are sensitive to mode collapse. Formally, the EUBO is the sum of the forward KL and $\log Z$, that is,

$D_{\text{KL}}(\pi(\mathbf{x})\,\|\,q^{\theta}(\mathbf{x}))=\underbrace{\mathbb{E}_{\pi(\mathbf{x})}[\log w]}_{\text{EUBO}}-\log Z,$  (7)

with importance weights $w=\gamma(\mathbf{x})/q^{\theta}(\mathbf{x})$. Due to the non-negativity of the KL divergence, it is easy to see that $\text{EUBO}\geq\log Z$ with equality if and only if $q^{\theta}=\pi$. Hence, a lower EUBO means that $q^{\theta}$ is closer to $\pi$ in a $D_{\text{KL}}$ sense. The mass-covering nature of the forward KL leads to high EUBO values if the model fails to cover regions of non-negligible probability in the target distribution $\pi$, making the EUBO well suited to quantify mode collapse. This is further illustrated in Figure 1. We can again leverage the chain rule for the KL divergence (Cover, 1999) to obtain an extended version of the EUBO, i.e., $\mathbb{E}_{\pi(\mathbf{x},\mathbf{z})}[\log\overline{w}]$, which satisfies $\overline{\text{EUBO}}\geq\text{EUBO}$, where the introduction of latent variables again adds looseness. The extended EUBO requires computing the importance weights $\overline{w}$ and expectations under $\pi(\mathbf{x},\mathbf{z})$. The former depends on the specific choice of sampling algorithm and is further discussed in Section 4 when introducing the methods considered for evaluation. The latter can be approximated by propagating target samples $\mathbf{x}$ back to $\mathbf{z}$ using $\pi(\mathbf{z}|\mathbf{x})$. Additionally, having access to samples from $\pi$ allows for computing forward versions of $Z$ and ESS, which have already been used to quantify mode collapse by, e.g., Midgley et al. (2023). Formally, they are defined as

$Z_{f}=1/\mathbb{E}_{\pi}[w^{-1}],\quad\text{and}\quad\text{ESS}_{f}=Z_{f}/\mathbb{E}_{\pi}[w],$  (8)

where expectations are taken with respect to the target $\pi$. For a detailed discussion see Appendix A.1.
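Assuming log-importance weights evaluated at target samples, the forward criteria in Eqs. (7) and (8) can be estimated analogously; a minimal sketch:

```python
import numpy as np

def forward_criteria(log_w_pi, log_z=None):
    """EUBO, forward log Z estimate, and forward ESS from log-weights
    log w = log gamma(x) - log q(x) evaluated at samples x ~ pi."""
    n = log_w_pi.shape[0]
    eubo = np.mean(log_w_pi)                                  # Eq. (7)
    # Z_f = 1 / E_pi[w^{-1}]  (Eq. (8), left), in log space.
    log_z_f = -(np.logaddexp.reduce(-log_w_pi) - np.log(n))
    # ESS_f = Z_f / E_pi[w]   (Eq. (8), right).
    ess_f = np.exp(log_z_f - (np.logaddexp.reduce(log_w_pi) - np.log(n)))
    delta_log_z_f = None if log_z is None else abs(log_z - log_z_f)
    return eubo, log_z_f, ess_f, delta_log_z_f
```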

Integral Probability Metrics. Alternatively, IPMs are often employed if samples from the target distribution $\pi$ are available (Arenz et al., 2018; Richter et al., 2023; Vargas et al., 2023a, 2024). Common IPMs for assessing sample quality are the 2-Wasserstein distance ($\mathcal{W}_{2}$) (Peyré et al., 2019) and the maximum mean discrepancy (MMD) (Gretton et al., 2012). The former uses a cost function to calculate the minimum cost required to transport probability mass from one distribution to another, while the latter assesses distribution dissimilarity by examining the differences in their mean embeddings within a reproducing kernel Hilbert space (Aronszajn, 1950). For further details see Appendix A.2.
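For reference, a minimal (biased, V-statistic) estimator of the squared MMD with an RBF kernel is sketched below; the median-heuristic bandwidth is a common convention and an assumption here, not necessarily the kernel choice used in our evaluation.

```python
import numpy as np

def mmd_squared(x, y, bandwidth=None):
    """Biased estimate of MMD^2 between samples x (n, d) and y (m, d)
    using an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 h^2))."""
    z = np.concatenate([x, y], axis=0)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:                      # median heuristic (assumption)
        bandwidth = np.sqrt(np.median(sq_dists) / 2) + 1e-12
    k = np.exp(-sq_dists / (2 * bandwidth ** 2))
    n = len(x)
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    return k[:n, :n].mean() + k[n:, n:].mean() - 2 * k[:n, n:].mean()

rng = np.random.default_rng(0)
print(mmd_squared(rng.normal(size=(200, 2)), rng.normal(1.0, 1.0, size=(200, 2))))
```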

Entropic Mode Coverage. Inspired by inception scores and distances from generative modelling (Salimans et al., 2016; Heusel et al., 2017), we propose a heuristic approach for detecting mode collapse by introducing the entropic mode coverage (EMC). To compute EMC, we partition $\mathbb{R}^{d}$ into disjoint subsets $\xi_{i}$, $i\in\{1,\ldots,M\}$, that describe different modes of the target density $\pi$. Moreover, we introduce an auxiliary distribution that measures the probability of a sample $\mathbf{x}$ being an element of a mode descriptor, i.e., $p(\xi_{i}|\mathbf{x})=p(\mathbf{x}\in\xi_{i})$. We then compute the expected entropy of the auxiliary distribution, that is,

$\text{EMC}\coloneqq\mathbb{E}_{q^{\theta}(\mathbf{x})}\mathcal{H}\left(p(\xi|\mathbf{x})\right)\approx-\frac{1}{N}\sum_{\mathbf{x}\sim q^{\theta}}\sum_{i=1}^{M}p(\xi_{i}|\mathbf{x})\log_{M}p(\xi_{i}|\mathbf{x}),$  (9)

where the expectation is approximated using a Monte Carlo estimate. Here, $N$ denotes the number of samples drawn from $q^{\theta}$. Please note that we employ the logarithm with base $M$ to ensure that $\text{EMC}\in[0,1]$. This choice of base facilitates a straightforward interpretation: a value of $0$ signifies a model that consistently produces samples that are elements of the same mode descriptor. In contrast, a value of $1$ represents a model that can produce samples from all mode descriptors with equal probability.

Clearly, EMC is limited to targets where mode descriptors are available, which is further discussed in Section 5. Moreover, a suitable criterion for cases where mode descriptors are not equally probable is discussed in Appendix A.3.
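A direct transcription of Eq. (9), assuming an array of membership probabilities $p(\xi_{i}|\mathbf{x})$ for $N$ model samples, might look as follows:

```python
import numpy as np

def emc(p_xi):
    """Entropic mode coverage, Eq. (9). p_xi has shape (N, M): row j holds
    p(xi_i | x_j) for model sample x_j over M mode descriptors."""
    n, m = p_xi.shape
    p = np.clip(p_xi, 1e-12, 1.0)               # avoid log(0)
    # Base-M logarithm ensures the result lies in [0, 1].
    entropy = -np.sum(p * np.log(p) / np.log(m), axis=-1)
    return np.mean(entropy)

# Samples confidently assigned to a single descriptor give EMC ~ 0;
# uniform assignment probabilities over all M descriptors give EMC = 1.
print(emc(np.eye(4)[np.zeros(100, dtype=int)]))   # -> ~0.0
print(emc(np.full((100, 4), 0.25)))               # -> 1.0
```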

4 Benchmarking Methods

Acronym Method Reference
MFVI Gaussian Mean-Field VI (Bishop, 2006)
GMMVI Gaussian Mixture Model VI (Arenz et al., 2022)
NFVI Normalizing Flow VI (Rezende & Mohamed, 2015)
SMC Sequential Monte Carlo (Del Moral et al., 2006)
AFT Annealed Flow Transport (Arbel et al., 2021)
CRAFT Continual Repeated AFT (Matthews et al., 2022)
FAB Flow Annealed IS Bootstrap (Midgley et al., 2022)
ULA Uncorrected Langevin Annealing (Thin et al., 2021)
MCD Monte Carlo Diffusion (Doucet et al., 2022a)
UHA Uncorrected Hamiltonian Annealing (Geffner & Domke, 2021)
LDVI Langevin Diffusion VI (Geffner & Domke, 2022)
CMCD Controlled MCD (Vargas et al., 2024)
PIS Path Integral Sampler (Zhang & Chen, 2021)
DIS Time-Reversed Diffusion Sampler (Berner et al., 2022)
DDS Denoising Diffusion Sampler (Vargas et al., 2023a)
GFN Generative Flow Networks (Lahlou et al., 2023)
GBS General Bridge Sampler (Richter et al., 2023)
Table 1: Sampling Methods. For methods marked with ‘†’, an implementation is available, but the results are either not included or only partially presented in this work.
Table 2: Target densities $\pi(\mathbf{x})=\gamma(\mathbf{x})/Z$ considered in this work, together with the availability of the true $\log Z$, samples from $\pi$, and mode descriptors $\xi$, and the dimensionality $D$: Funnel (10), Credit (25), Seeds (26), Cancer (31), Brownian (32), Ionosphere (35), Sonar (61), Digits (196), Fashion (784), LGCP (1600), MoG ($D\in\mathbb{N}_{+}$), MoS ($D\in\mathbb{N}_{+}$).

In this section, we elaborate on the methods included in this benchmark, categorizing them into three distinct groups based on the computation of importance weights. Please refer to Table 1 for an overview of these methods and to Appendix B for further details.

Tractable Density Models. Tractable density models allow for computing the model likelihood $q^{\theta}(\mathbf{x})$. It is therefore straightforward to compute performance criteria associated with importance weights $w=\gamma(\mathbf{x})/q^{\theta}(\mathbf{x})$. Notable works include factorized (‘mean-field’) Gaussian distributions (MFVI), Normalizing Flows (NFVI) (Rezende & Mohamed, 2015), and full-rank Gaussian mixture models (GMMVI) (Arenz et al., 2022).

Sequential Importance Sampling Methods. Sequential importance sampling (SIS) methods define $\overline{w}$ in terms of incremental importance sampling (IS) weights, that is, $\overline{w}=\prod_{t=1}^{T}G_{t}(\mathbf{x}_{t-1},\mathbf{x}_{t})$ with

$G_{t}(\mathbf{x}_{t-1},\mathbf{x}_{t})=\frac{\gamma_{t}(\mathbf{x}_{t})B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\gamma_{t-1}(\mathbf{x}_{t-1})F^{\theta}_{t}(\mathbf{x}_{t}|\mathbf{x}_{t-1})},$  (10)

with annealed versions $\gamma_{t}$ of $\gamma$. For example, choosing $B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{t})=\pi_{t}(\mathbf{x}_{t-1})F^{\theta}_{t}(\mathbf{x}_{t}|\mathbf{x}_{t-1})/\pi_{t}(\mathbf{x}_{t})$ recovers AIS (Neal, 2001). Midgley et al. (2022) proposed to parameterize the proposal distribution $\pi_{0}$ with normalizing flows and, in combination with AIS, to minimize the $\alpha$-divergence, resulting in the Flow Annealed Importance Sampling Bootstrap (FAB) algorithm. Additionally, when AIS is coupled with resampling, it gives rise to Sequential Monte Carlo (SMC) as originally proposed by Del Moral et al. (2006).

Recent advancements include the development of Stochastic Normalizing Flows (Wu et al., 2020a), Annealed Flow Transport (AFT) (Arbel et al., 2021), and Continual Repeated AFT (CRAFT) (Matthews et al., 2022). These methods extend Sequential Monte Carlo by employing sets of normalizing flows that define deterministic transport maps between neighboring distributions $\gamma_{t}$. For further details on $F^{\theta}_{t},B^{\theta}_{t-1}$ and the corresponding $G_{t}$, see Table 7. For an in-depth exploration of the commonalities and distinctions among these methods, please refer to Matthews et al. (2022).

Diffusion-Based Methods. Diffusion-based methods typically build on stochastic differential equations (SDEs) with parameterized drift terms (Tzen & Raginsky, 2019), i.e.,

$\mathrm{d}\mathbf{x}_{t}=f_{t}^{\theta}(\mathbf{x}_{t})\,\mathrm{d}t+\sigma_{t}\,\mathrm{d}\mathbf{w}_{t},\qquad\mathbf{x}_{0}\sim\pi_{0},$
$\mathrm{d}\mathbf{x}_{t}=b_{t}^{\theta}(\mathbf{x}_{t})\,\mathrm{d}t+\sigma_{t}\,\mathrm{d}\bar{\mathbf{w}}_{t},\qquad\mathbf{x}_{T}\sim\pi_{T},$  (11)

with diffusion coefficient $\sigma_{t}$ and standard Brownian motions $\mathbf{w}_{t},\bar{\mathbf{w}}_{t}$. Using the Euler–Maruyama method (Särkkä & Solin, 2019), their discretized counterparts can be characterized by Gaussian forward-backward transition kernels

$F^{\theta}_{t+1}(\mathbf{x}_{t+1}|\mathbf{x}_{t})=\mathcal{N}(\mathbf{x}_{t+1}|\mathbf{x}_{t}+f_{t}^{\theta}(\mathbf{x}_{t})\Delta_{t},\,\sigma_{t}^{2}\Delta_{t})\quad\text{and}\quad B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{t})=\mathcal{N}(\mathbf{x}_{t-1}|\mathbf{x}_{t}+b_{t}^{\theta}(\mathbf{x}_{t})\Delta_{t},\,\sigma_{t}^{2}\Delta_{t}),$  (12)

with discretization step size $\Delta_{t}$. The extended (unnormalized) importance weights $\overline{w}$ can then be constructed as

$\frac{p^{\theta}(\mathbf{x}_{0:T})}{q^{\theta}(\mathbf{x}_{0:T})}\propto\overline{w}=\frac{\gamma(\mathbf{x}_{T})\prod_{t=1}^{T}B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\pi_{0}(\mathbf{x}_{0})\prod_{t=0}^{T-1}F^{\theta}_{t+1}(\mathbf{x}_{t+1}|\mathbf{x}_{t})}.$  (13)

One line of work considers annealed Langevin dynamics to model Eq. (11). Works include Unadjusted Langevin Annealing (ULA) (Thin et al., 2021), Monte Carlo Diffusions (MCD) (Doucet et al., 2022a), Controlled Monte Carlo Diffusion (CMCD) (Vargas et al., 2024), Uncorrected Hamiltonian Annealing (UHA) (Geffner & Domke, 2021), and Langevin Diffusion Variational Inference (LDVI) (Geffner & Domke, 2022). A second line of work describes diffusion-based sampling from a stochastic optimal control perspective (Dai Pra, 1991). Works include methods such as the Path Integral Sampler (PIS) (Zhang & Chen, 2021; Vargas et al., 2023b), Denoising Diffusion Sampler (DDS) (Vargas et al., 2023a), Time-Reversed Diffusion Sampler (DIS) (Berner et al., 2022), and Generative Flow Networks (GFN) (Lahlou et al., 2023; Malkin et al., 2022b; Zhang et al., 2023). Furthermore, Richter et al. (2023) identify several of these methods as special cases of a General Bridge Sampler (GBS) where both processes in Eq. (11) are freely parameterized. Specific choices of $\pi_{0},F^{\theta}_{t+1},B^{\theta}_{t-1}$ are detailed in Table 6. Lastly, we refer the interested reader to Sendera et al. (2024), who concurrently benchmarked diffusion-based sampling methods.
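To make Eqs. (12) and (13) concrete, the sketch below simulates the discretized forward process and accumulates the extended log-importance weight $\log\overline{w}$; the stand-in drifts, toy target, and hyperparameters are illustrative assumptions, whereas in the benchmarked methods the drifts are learned networks.

```python
import numpy as np

def log_normal(x, mean, var):
    # Diagonal Gaussian log-density with shared scalar variance.
    return (-0.5 * np.sum((x - mean) ** 2, axis=-1) / var
            - 0.5 * x.shape[-1] * np.log(2 * np.pi * var))

def log_gamma(x):
    # Unnormalized toy target (illustrative).
    return -0.5 * np.sum((x - 1.0) ** 2, axis=-1)

def diffusion_log_weights(f, b, sigma=1.0, T=128, dt=1.0 / 128,
                          n=1000, d=2, rng=None):
    """Simulate x_{0:T} under the forward kernels of Eq. (12) and return
    log w_bar of Eq. (13). f and b are the forward/backward drifts."""
    rng = np.random.default_rng(rng)
    x = rng.normal(size=(n, d))                               # x_0 ~ pi_0 = N(0, I)
    log_w = -log_normal(x, np.zeros(d), 1.0)                  # -log pi_0(x_0)
    for t in range(T):
        mean_f = x + f(x, t) * dt
        x_next = mean_f + sigma * np.sqrt(dt) * rng.normal(size=(n, d))
        log_w -= log_normal(x_next, mean_f, sigma ** 2 * dt)  # -log F(x_{t+1}|x_t)
        log_w += log_normal(x, x_next + b(x_next, t) * dt,    # +log B(x_t|x_{t+1})
                            sigma ** 2 * dt)
        x = x_next
    return log_w + log_gamma(x)                               # +log gamma(x_T)

# Stand-in drifts (in practice these are learned networks).
f = lambda x, t: -0.5 * x
b = lambda x, t: 0.5 * x
log_w = diffusion_log_weights(f, b, rng=0)
print(np.logaddexp.reduce(log_w) - np.log(len(log_w)))        # crude log Z estimate
```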

Figure 2: Mean and standard deviation of EMC values for MoG and MoS across varying dimensions $d$.

5 Benchmarking Target Densities

Figure 3: Synthetic target densities. Left: first two dimensions of the Funnel density. Middle: mixture of Student-t distributions with 15 components (MoS). Right: mixture of 40 isotropic Gaussian distributions (MoG).

Here, we briefly summarize the target densities $\pi$ considered in this work. Table 2 lists the dimensionality of each problem and indicates whether we have access to the log normalizer $\log Z$, target samples, or mode descriptors for computing the entropic mode coverage. Further details and formal definitions of the target densities can be found in Appendix C.

Bayesian Logistic Regression. We consider four experiments where we perform inference over the parameters of a Bayesian logistic regression model for binary classification. The datasets Credit and Cancer were taken from Nishihara et al. (2014). The former distinguishes individuals as either good or bad credit risks, while the latter deals with the classification of recurrence events in breast cancer. The Ionosphere dataset (Sigillito et al., 1989) involves classifying radar signals passing through the ionosphere as either good or bad. Similarly, the Sonar dataset (Gorman & Sejnowski, 1988) tackles the classification of sonar signals bounced off a metal cylinder versus those bounced off a roughly cylindrical rock.

Random Effect Regression. The Seeds data was collected by Crowder (1978). The goal is to perform inference over the variables of a random effect regression model that models the germination proportion of seeds arranged in a factorial layout by seed and type of root.

Time Series Models. We consider the Brownian time series model obtained by discretizing a stochastic differential equation modeling a Brownian motion with a Gaussian observation model, developed by Sountsov et al. (2020).

Spatial Statistics. The log Gaussian Cox process (LGCP) (Møller et al., 1998) is a probabilistic model commonly used in statistics to model spatial point patterns. In this work, the log Gaussian Cox process is applied to modeling the positions of pine saplings in Finland.

Synthetic Targets. We additionally consider synthetic target densities as they commonly give access to the true normalization constant $Z$, target samples, and mode descriptors. The Funnel target was introduced by Neal (2003) and provides a complex ‘funnel’-shaped distribution. Moreover, we consider two different types of mixture models: a mixture of isotropic Gaussians (MoG) as proposed by Midgley et al. (2022), and a mixture of Student-t distributions (MoS). To obtain mode descriptors for a mixture model with $K$ components, i.e., $\pi(\mathbf{x})=\sum_{k}\pi_{k}(\mathbf{x})/K$, we compute the density per component $\pi_{k}(\mathbf{x})$ and say that $\mathbf{x}\in\xi_{i}$ if $i=\text{argmax}_{k}\{\pi_{k}(\mathbf{x})\}_{k=1}^{K}$. Lastly, we follow Doucet et al. (2022a) and train NICE (Dinh et al., 2014) on a down-sampled $14\times 14$ variant of MNIST (Digits) (LeCun et al., 1998) and a $28\times 28$ variant of Fashion MNIST (Fashion) and use the trained model as target density. Here, we obtain the mode descriptors by training a classifier $p(\mathbf{c}|\mathbf{x})$ on samples from $\pi$, where the classes $\mathbf{c}$ are represented by the ten different digits. If $i=\text{argmax}_{\mathbf{c}}\,p(\mathbf{c}|\mathbf{x})$, we conclude $\mathbf{x}\in\xi_{i}$.
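As a small illustration of the mode-descriptor assignment for mixture targets, the following sketch assigns samples to the component with the highest per-component density; equal-variance isotropic Gaussian components are an illustrative assumption under which the argmax reduces to a nearest-mean rule.

```python
import numpy as np

def assign_modes(x, means):
    """Assign each sample to the mixture component k maximizing pi_k(x),
    here equal-variance isotropic Gaussians N(mean_k, I) (illustrative)."""
    # log pi_k(x) up to a shared constant is -||x - mean_k||^2 / 2,
    # so argmax_k pi_k(x) = argmin_k squared distance.
    sq = np.sum((x[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    return np.argmin(sq, axis=-1)

means = np.array([[-5.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
rng = np.random.default_rng(0)
x = means[rng.integers(3, size=500)] + rng.normal(size=(500, 2))
counts = np.bincount(assign_modes(x, means), minlength=3)
print(counts)   # roughly equal counts indicate good mode coverage
```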

6 Hyperparameters and Tuning

In this section, we provide details on hyperparameter tuning. For further information, please refer to Appendix D.

Tractable Density Methods. For MFVI, we used a batch size of 2000 and performed 100k gradient steps, tuning the learning rate via grid search. For targets with known support, we adjusted the initial model variance accordingly. For GMMVI, we adhered to the default settings of Arenz et al. (2022), utilizing 100 samples per mixture component. We initialized with 10 components and employed an adaptive scheme to add and remove components heuristically. The initial variance of the components was set based on the target support, and we conducted 3000 training iterations.

Sequential Importance Sampling Methods. In SIS methods, we employed 2000 particles for training. All methods except FAB used 128 annealing steps; FAB followed the original 12 steps as proposed by its authors. The choice and parameters of the MCMC transition kernel significantly impacted performance. Hamiltonian Monte Carlo (Duane et al., 1987) generally outperformed Metropolis–Hastings (Chib & Greenberg, 1995) (see Appendix F.3). Step sizes for $\beta_{t}\geq 0.5$ and $\beta_{t}<0.5$ were tuned using grid search. For AFT and CRAFT, we used diagonal affine flows (Papamakarios et al., 2021), which yielded more robust results than complex flows like inverse autoregressive flows (Kingma et al., 2016) or neural spline flows (Durkan et al., 2019) (see Appendix F.6). FAB employed RealNVP (Dinh et al., 2016) for the proposal distribution $\pi_{0}$. Learning rates for these flows were also tuned via grid search. For targets with known support, the variance of $\pi_{0}(\mathbf{x})=\mathcal{N}(0,\sigma^{2}_{0}I)$ was set accordingly; otherwise, a grid search was performed. We used multinomial resampling with a threshold of 0.3 (Douc & Cappé, 2005).

Diffusion-based Methods. Training involved a batch size of 2000 and 40k gradient steps. SDEs were discretized with 128 steps, $T=1$, and a fixed $\Delta t$. The diffusion coefficient was chosen as $\sigma_{t}=\sigma_{\text{max}}\cos^{2}(\pi(T-t)/2T)$, following Vargas et al. (2023a), as it performed better than linear or constant schedules. We used the architecture from Zhang & Chen (2021) with 2 layers of 64 hidden units each. For targets with known prior support, the initial model support was set accordingly. For all methods except PIS, this involved setting the variance of the prior distribution $\pi_{0}(\mathbf{x})=\mathcal{N}(0,\sigma^{2}_{0}I)$. For PIS, $\sigma_{\text{max}}$ was carefully chosen. In MCD and LDVI, we learned the annealing schedule $\beta_{t}$ and $\sigma_{\text{max}}$ end-to-end by maximizing the ELBO.
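For completeness, a sketch of the cosine schedule described above; the values of $\sigma_{\text{max}}$ and $T$ are illustrative.

```python
import numpy as np

def cosine_sigma(t, sigma_max=1.0, T=1.0):
    """sigma_t = sigma_max * cos^2(pi * (T - t) / (2 T)): near 0 at t = 0,
    rising smoothly to sigma_max at t = T."""
    return sigma_max * np.cos(np.pi * (T - t) / (2 * T)) ** 2

ts = np.linspace(0.0, 1.0, 5)
print(cosine_sigma(ts))   # increasing from 0.0 to 1.0
```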

Columns (left to right): Funnel | MoG ($d=50$) | MoS ($d=50$) | $14\times 14$ Digits | $28\times 28$ Fashion. Entries are mean±standard deviation; the best result per column is marked with *.

$\mathcal{W}_{2}$ ↓ / MMD ↓
MFVI   178.264±0.271 / 0.303±0.002 | 39360.196±12.49 / 0.209±0.000 | 2462.260±1.009 / 0.215±0.000 | 254.179±0.025 / 0.351±0.000 | 1327.517±0.845 / 0.285±0.000
GMMVI  *105.620±3.472 / *0.031±0.000 | 32004.968±1069. / 0.203±0.013 | 1255.216±296.9 / 0.135±0.017 | 207.163±14.60 / 0.373±0.042 | 1343.495±136.9 / 0.462±0.033
SMC    149.353±2.973 / 0.162±0.015 | 46351.236±4.795 / 0.631±0.000 | 3297.640±1372. / 0.431±0.161 | 159.255±1.877 / 1.168±0.008 | 6696.287±250.4 / 1.556±0.008
AFT    145.138±6.061 / 0.159±0.010 | 44914.194±1154. / 0.622±0.009 | 2648.410±301.3 / 0.395±0.082 | 172.685±3.661 / 1.180±0.004 | 6413.147±548.6 / 1.538±0.010
CRAFT  134.335±0.663 / 0.115±0.003 | 43412.038±420.9 / 0.604±0.002 | 1893.926±117.3 / 0.257±0.024 | 151.791±11.02 / 0.139±0.032 | 1413.303±11.20 / 0.562±0.002
FAB    153.894±3.916 / *0.032±0.000 | 9567.319±626.1 / 0.073±0.005 | *1204.160±147.7 / *0.093±0.014 | *126.863±0.581 / *0.129±0.003 | 1186.967±263.4 / 0.347±0.007
MCD    163.317±0.101 / 0.228±0.001 | 5026.147±40.03 / 0.043±0.000 | 6418.981±22.15 / 0.256±0.000 | 220.710±5.547 / 0.252±0.007 | 1898.472±3.783 / 0.327±0.002
LDVI   N/A / N/A | 5038.420±73.77 / 0.043±0.000 | 2919.688±103.4 / 0.182±0.003 | 154.167±0.816 / 0.133±0.000 | 3432.724±406.2 / 0.284±0.016
PIS    N/A / N/A | 10495.164±83.20 / 0.083±0.000 | 2113.172±31.17 / 0.218±0.007 | 186.007±0.466 / 0.193±0.001 | 1484.598±5.125 / 0.240±0.000
DIS    118.947±12.81 / 0.159±0.036 | *3044.733±464.7 / *0.034±0.003 | 2200.590±18.73 / 0.155±0.001 | 220.392±11.69 / 0.194±0.011 | 3927.754±858.9 / 0.282±0.019
DDS    142.890±9.552 / 0.172±0.031 | 5551.107±116.4 / 0.046±0.001 | 2154.884±3.861 / 0.131±0.001 | 188.789±2.297 / 0.173±0.003 | 1811.685±24.47 / *0.208±0.006
GBS    178.075±0.103 / 0.305±0.002 | 5080.413±125.8 / 0.043±0.001 | 5722.074±22.71 / 0.232±0.000 | 186.436±1.834 / 0.176±0.005 | *1137.399±1.819 / 0.246±0.003

$\Delta\log Z_{r}$ ↓ / $\Delta\log Z_{f}$ ↓
MFVI   0.612±0.101 / 0.036±0.001 | 3.658±0.040 / 0.185±0.002 | 3.009±0.291 / *0.048±0.002 | 7.388±0.107 / 5.866±0.016 | 34.389±0.757 / 108.379±0.438
GMMVI  *0.001±0.000 / *0.001±0.000 | *1.715±0.119 / *0.048±0.007 | 1.282±0.221 / 0.084±0.055 | 3.098±0.140 / *0.124±0.079 | *8.099±1.919 / *11.676±4.041
SMC    0.187±0.054 / 2.676±0.000 | 690.721±11.21 / 161.796±0.000 | 3.880±1.105 / 80.992±0.000 | 80.184±0.162 / 375.676±0.000 | 11742.014±139.2 / 1530.824±0.000
AFT    0.181±0.106 / 414.619±141.5 | 765.624±108.0 / 110.955±18.37 | 4.081±1.579 / 205.297±23.91 | 16.726±2.511 / 163.871±6.557 | 11653.343±1628. / 1071.777±9.475
CRAFT  0.091±0.018 / 255.046±7.478 | 337.094±9.296 / 100.987±0.065 | *0.822±0.087 / 210.245±6.098 | 1.458±0.406 / 63.792±3.329 | 445.101±8.273 / 1156.718±7.810
FAB    *0.001±0.000 / 0.019±0.003 | 2.952±0.247 / 126.363±1.789 | 3.358±1.062 / 84.592±22.64 | *0.847±0.076 / 63.910±1.565 | 350.544±599.0 / 3721.720±4646.
MCD    0.207±0.039 / N/A | 31.319±1.793 / 21.148±1.478 | 28.607±1.275 / 24.757±0.841 | 884.610±9.674 / 258.840±3.047 | 15122.090±996.7 / 1125.475±5.198
LDVI   N/A / N/A | 8.159±0.775 / 15.477±0.815 | 4.360±0.741 / 5.472±0.938 | 537.763±25.07 / 265.674±1.181 | 12237.989±381.8 / 1087.592±4.844
PIS    0.918±0.598 / 0.436±0.002 | 7.122±0.630 / 3113.492±1.978 | 12.248±0.326 / 54.090±0.151 | 104.002±0.847 / 2149.224±19.39 | 1884.013±10.20 / 8785.873±9.880
DIS    0.113±0.083 / 25.544±8.267 | 87.709±8.942 / 369.352±16.29 | 10.448±0.607 / 87.897±5.255 | 569.837±35.40 / 1354.472±181.1 | 8807.430±337.6 / 17566.520±256.6
DDS    0.190±0.077 / 0.321±0.052 | 1.739±0.442 / 207.545±1.163 | 7.952±0.299 / 53.411±0.024 | 82.460±5.480 / 659.497±9.786 | 1579.602±41.65 / 2910.345±71.25
GBS    0.553±0.273 / 0.127±0.008 | 8.103±1.696 / 9.321±0.776 | 53.767±0.732 / 47.441±0.098 | 75.160±2.321 / 62.733±1.168 | 1495.194±42.03 / 527.580±9.426

ELBO ↑ / EUBO ↓
MFVI   -1.834±0.009 / 105.694±0.002 | -3.690±0.000 / 164.114±0.000 | -5.957±0.007 / 72.663±0.005 | -14.004±0.005 / 210.713±0.024 | -58.082±0.009 / 938.632±0.055
GMMVI  *-0.011±0.001 / *0.012±0.001 | *-1.715±0.119 / 240.459±51.13 | -3.890±0.122 / 57.746±1.928 | *-7.135±0.148 / 142.636±9.701 | *-18.478±4.104 / 595.239±120.4
SMC    -0.242±0.047 / 4.690±0.000 | -877.034±10.23 / 161.921±0.000 | -4.634±1.088 / 81.325±0.000 | -185.057±0.257 / 376.093±0.000 | -12187.873±134.6 / 1532.904±0.000
AFT    -0.293±0.088 / 431.329±143.1 | -927.160±103.8 / 117.630±22.16 | -4.923±1.546 / 207.625±24.14 | -64.442±4.464 / 214.486±4.870 | -11828.529±1608. / 1448.335±11.08
CRAFT  -0.027±0.060 / 263.474±7.864 | -451.399±7.561 / 103.674±0.069 | *-0.295±0.256 / 212.210±6.160 | -11.154±0.307 / 89.518±1.904 | -520.475±5.531 / 1578.114±2.360
FAB    *-0.014±0.003 / *0.012±0.002 | -299.916±253.4 / 93.560±5.086 | -26.496±1.875 / *18.088±2.503 | -11.396±0.153 / *12.084±0.171 | -892.971±1518. / *394.346±263.6
MCD    -0.611±0.005 / N/A | -185.021±0.743 / *43.670±0.457 | -69.358±0.633 / 47.834±0.820 | -1457.646±13.80 / 293.191±0.208 | -21196.583±472.8 / 1276.456±1.033
LDVI   N/A / N/A | -29.034±0.591 / 51.137±0.177 | -28.471±1.018 / 20.887±1.042 | -875.104±43.59 / 323.158±0.142 | -16227.975±738.0 / 1185.331±6.660
PIS    -3.198±0.042 / 104.975±0.002 | -16.881±0.026 / 3626.120±1.360 | -29.261±1.743 / 88.192±0.005 | -172.988±0.630 / 2748.938±19.19 | -2988.210±14.13 / 11179.374±11.72
DIS    -1.021±0.436 / 40.892±38.48 | -181.348±15.47 / 546.335±30.86 | -36.704±0.629 / 193.270±3.293 | -840.122±18.66 / 1745.719±205.7 | -15337.229±154.0 / 20347.781±318.7
DDS    -0.597±0.142 / 148.841±7.347 | -13.284±0.460 / 291.867±0.047 | -31.681±0.363 / 86.014±0.001 | -156.145±6.063 / 881.476±22.83 | -2617.761±46.78 / 3925.231±106.8
GBS    -2.600±0.078 / 110.167±0.000 | -35.771±1.105 / 67.819±2.157 | -99.369±0.158 / 73.545±0.107 | -154.186±1.387 / 106.777±0.113 | -2198.997±36.56 / 705.996±11.66

Table 3: Results for various sampling methods. Evaluation criteria include 2-Wasserstein distance ($\mathcal{W}_{2}$), maximum mean discrepancy (MMD), reverse and forward partition function error ($\Delta\log Z_{r}$, $\Delta\log Z_{f}$), and lower and upper evidence bounds (ELBO, EUBO). The best results are marked with *. Arrows (↑, ↓) indicate whether higher or lower values are preferable, respectively. N/A denotes cases where reasonable results could not be obtained due to numerical issues.

7 Experiments

Here, we offer an overview of the evaluation protocol. Next, we present the results obtained for synthetic target densities, followed by those for real targets. We provide further results in Appendix E and ablation studies in Appendix F.

Evaluation Protocol. We compute all performance criteria 100 times during training, applying a running average of length 5 over these evaluations to obtain robust results within a single run. To ensure robustness across runs, we use four different random seeds and average the best results from each run. We use 2000 samples to compute the performance criteria and tune key hyperparameters such as the learning rate and the variance of the initial proposal distribution $\pi_{0}$. We report the EMC values corresponding to the method's highest ELBO value to avoid high EMC values caused by a large initial support of the model.

[Sample visualizations for Digits: $\mathbf{x}_{i}\sim\pi$, MFVI, GMMVI, CRAFT, FAB, LDVI; and for Fashion: $\mathbf{x}_{i}\sim\pi$, MFVI, GMMVI, GBS, FAB, LDVI.]

EMC ↑ | 14×14 Digits | 28×28 Fashion
MFVI  | 0.000±0.000 | 0.000±0.000
GMMVI | 0.164±0.081 | 0.217±0.167
SMC   | 0.873±0.000 | 0.000±0.000
AFT   | 0.727±0.000 | 0.011±0.000
CRAFT | 0.772±0.070 | 0.016±0.027
FAB   | 0.915±0.007 | 0.349±0.137
MCD   | 0.851±0.010 | 0.619±0.001
LDVI  | *0.951±0.002 | 0.608±0.005
PIS   | 0.816±0.011 | 0.620±0.004
DIS   | 0.818±0.009 | 0.612±0.008
DDS   | 0.816±0.012 | 0.621±0.008
GBS   | 0.796±0.005 | *0.621±0.006

Table 4: Sample visualizations for Digits (left) and Fashion (middle) using various methods, as indicated by the subcaptions. ‘$\mathbf{x}_{i}\sim\pi$’ refers to samples from the target density. Visualizations for the remaining methods are provided in Figure 5. Corresponding EMC values are reported on the right; the best results are marked with *.

7.1 Evaluation on Synthetic Target Densities

Funnel. We utilize the funnel distribution as a testing ground to assess whether sampling methods capture high curvatures in the target density. Our findings indicate that while most methods successfully capture the funnel-like structure, they struggle to generate samples at the neck and opening of the funnel, except for FAB and GMMVI (cf. Figure 4). This observation is further supported by quantitative analysis, revealing that both FAB and GMMVI achieve the best performance in terms of reverse and forward estimation of $\log Z$ and evidence bounds, as shown in Table 3.

Digits and Fashion. For a comprehensive assessment of sampling methods, we conduct both qualitative and quantitative analyses on two high-dimensional target densities. For the qualitative analysis, model samples are interpreted as images and shown in Table 4. For the quantitative analysis, we report various performance criteria values, with results presented in Table 3. Additionally, we report EMC values in Table 4 to quantify mode collapse.

For Digits, most methods are able to find the majority of modes and produce high-quality samples, as is visually evident from the sample visualizations and EMC values in Table 4. However, many methods, particularly diffusion-based ones, struggle to obtain reasonable estimates of $\log Z$. They also perform poorly in terms of lower and upper evidence bounds, as shown in Table 3. For Fashion, we observe that methods either suffer from mode collapse or produce low-quality samples. Interestingly, the methods experiencing mode collapse achieve the lowest estimation error of $\log Z$ in both reverse and forward estimation.

Mixture Models. We employ MoG and MoS to investigate mode collapse across different dimensions, specifically considering $d\in\{2,50,200\}$. For $d=2$, all methods except MFVI demonstrate the capability to generate samples from all modes, as indicated by $\text{EMC}\approx 1$. This is further supported by visualizations in Figure 2. According to EMC, all methods except diffusion-based ones exhibit mode collapse for $d=50$ and $d=200$.

We also report additional evaluation criteria for MoG and MoS, including 2-Wasserstein distance, maximum mean discrepancy, reverse and forward partition function error, lower and upper evidence bounds, and reverse and forward effective sample size in Appendix E Table 9.

7.2 Evaluation on Real Target Densities

For real-world target densities, we do not have access to the ground-truth normalizer $Z$ or samples from $\pi$. Consequently, we present the ELBO values in Table 5. Surprisingly, we find that GMMVI performs well across all tasks, often outperforming more complex variational Monte Carlo methods. However, it is noteworthy that GMMVI encounters scalability challenges in very high-dimensional problems, such as LGCP. Another method, FAB, consistently performs well across the majority of tasks.

ELBO ↑ | Credit | Seeds | Cancer | Brownian | Ionosphere | Sonar | LGCP
MFVI  | -524.859±0.035 | -76.733±0.012 | -29.407±0.557 | -3.872±0.012 | -123.419±0.040 | -137.672±0.043 | 383.18±0.059
GMMVI | *-504.487±0.001 | *-73.415±0.002 | *121.442±5.591 | *1.092±0.006 | -111.832±0.007 | -108.726±0.007 | OOM
SMC   | -580.936±15.915 | -74.699±0.100 | -67.959±4.345 | -1.874±0.622 | -114.751±0.238 | -111.355±1.177 | 393.907±5.727
AFT   | -584.766±13.979 | -74.269±0.090 | -15.515±5.100 | N/A | -113.272±0.647 | -110.671±1.240 | 394.271±6.432
CRAFT | -573.387±17.59 | -73.793±0.015 | 19.283±0.523 | 0.886±0.053 | -112.386±0.182 | -115.618±1.316 | *495.291±0.384
FAB   | *-504.496±0.001 | *-73.418±0.002 | 39.922±8.200 | 1.031±0.010 | *-111.678±0.003 | *-108.593±0.008 | 402.212±0.941
MCD   | N/A | -73.652±0.003 | N/A | 0.643±0.012 | -111.942±0.006 | -109.534±0.055 | 444.313±0.452
LDVI  | N/A | -73.530±0.003 | N/A | 0.772±0.016 | -111.788±0.003 | -108.841±0.006 | 161.839±1.436
PIS   | -846.568±2.417 | -88.919±2.051 | 39.542±5.302 | N/A | -125.030±0.688 | -142.868±3.289 | 479.542±0.403
DDS   | -514.736±1.223 | -75.206±0.209 | 19.997±0.690 | 0.561±0.228 | -114.191±0.105 | -121.222±5.985 | N/A
GBS   | -508.108±0.145 | -88.778±0.109 | -23.495±0.737 | N/A | -133.777±0.152 | -153.094±0.500 | N/A

Table 5: ELBO values for various target densities. The best results are marked with *. N/A denotes cases where reasonable results could not be obtained due to numerical issues. OOM refers to problems caused by memory constraints.

8 Discussion

Here, we list several general observations O1)–O6) and method-specific observations M1)–M6), based on the experiments from Section 7 and Appendix E and on the ablation studies in Appendix F.

O1) Mode collapse gets worse in high dimensions. We observe that several methods that do not suffer from mode collapse on low-dimensional problems encounter significant mode collapse when applied to higher-dimensional ones (cf. Figure 2).

O2) The ELBO and reverse log Z estimates are not well suited for evaluating a model's ability to avoid mode collapse. This observation is evident, for instance, in Table 4, where MFVI achieves relatively good ELBO and log Z estimates despite suffering from mode collapse.

O3) While the EUBO helps to quantify mode collapse, comparing different method categories is challenging due to the additional looseness introduced by latent variables in the extended EUBO. This is evident on the Fashion dataset, where MFVI and GMMVI achieve a lower EUBO compared to most other methods, despite suffering from mode collapse.

O4) Despite being influenced by subjective design choices such as the kernel or cost function, the 2-Wasserstein distance and maximum mean discrepancy (MMD) rank the sampling methods consistently, as demonstrated in Table 3. Additionally, the quantitative results frequently align with the qualitative outcomes; this alignment is evident, for instance, for the GMMVI samples on Funnel or the GBS samples on Fashion.

O5) For multimodal target distributions, both forward and reverse ESS tend to exhibit a 'binary' pattern, frequently taking values close to 0 or 1. The forward ESS, in particular, is predominantly zero on higher-dimensional problems, further complicating the assessment of mode-collapse severity. In contrast, the EUBO and ELBO offer a more continuous and informative perspective for assessing model performance (cf. Appendix E, Table 9).

O6) No single method exhibits superiority across all situations. Generally, GMMVI and FAB demonstrate good ELBO values across a diverse set of tasks, although both tend to suffer from mode collapse in high dimensions. In contrast, diffusion-based methods such as MCD and LDVI exhibit resilience against mode collapse but frequently fall short of achieving satisfactory ELBO values.

M1) Resampling causes mode collapse in high dimensions (cf. Ablation F.3). Sequential importance sampling (SIS) methods, in particular, experience severe mode collapse in high dimensions, as illustrated in Figure 2. Notably, eliminating the resampling step in SMC proves effective in mitigating this issue but results in worse ELBO values.

M2) There exists an exploration-exploitation trade-off when setting the support of the proposal distribution π0 in variational Monte Carlo methods (cf. Ablation F.4). Opting for a small initial support of π0 results in tight ELBO values but can limit coverage to only a few modes. Conversely, a sufficiently large initial support helps prevent mode collapse but introduces additional looseness in the ELBO.

M3) Learning the proposal distribution π0 in variational Monte Carlo methods often leads to mode collapse, especially in high dimensions. Training the base distribution end-to-end by maximizing the extended ELBO, or pre-training it, for example with MFVI, results in mode collapse, as indicated in Ablations F.5 and F.8. Despite the mode collapse, these strategies yield higher ELBOs, emphasizing the inherent exploration-exploitation trade-off discussed in M2).

M4) Variational Monte Carlo methods benefit heavily from a large number of steps T. This is shown in Ablation F.2, where increasing the number of annealing steps for SIS methods and of discretization steps for diffusion-based methods leads to tighter evidence bounds. However, increasing T prolongs computational runtimes and demands substantial memory resources.

M5) GMMVI exhibits high sample efficiency (cf. Table 10). Arenz et al. (2022) employ a replay buffer to enhance the sample efficiency of GMMVI, leading to orders of magnitude fewer target evaluations required for convergence. Consequently, GMMVI may be the preferable choice when target evaluations are time-consuming.

M6) Langevin diffusion methods demonstrate low sample efficiency, as highlighted in Table 10. These methods require evaluating the target at each intermediate discretization step due to the score function being part of the SDE, and they typically need many iterations to converge. Other diffusion-based methods that do not require target evaluations at every step, such as DDS, often perform poorly and suffer from mode collapse (cf. Ablation F.7). To address this, Zhang & Chen (2021) proposed incorporating the score function into the network architecture, resulting again in poor sample efficiency.

9 Conclusion

In this work, we assessed the latest sampling methods using a standardized set of tasks. Our exploration encompassed various performance criteria, with a specific focus on quantifying mode collapse. Through a comprehensive evaluation, we illuminated the strengths and weaknesses of state-of-the-art sampling methods, thereby offering a valuable reference for future developments.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgments and Disclosure of Funding

We thank Julius Berner, Lorenz Richter, and Vincent Stimper for many useful discussions. D.B. acknowledges support by funding from the pilot program Core Informatics of the Helmholtz Association (HGF) and the state of Baden-Württemberg through bwHPC, as well as the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the German Federal Ministry of Education and Research.

References

  • Agakov & Barber (2004) Agakov, F. V. and Barber, D. An auxiliary variational method. In Advances in Neural Information Processing Systems, 2004.
  • Akhound-Sadegh et al. (2024) Akhound-Sadegh, T., Rector-Brooks, J., Bose, A. J., Mittal, S., Lemos, P., Liu, C.-H., Sendera, M., Ravanbakhsh, S., Gidel, G., Bengio, Y., et al. Iterated denoising energy matching for sampling from boltzmann densities. arXiv preprint arXiv:2402.06121, 2024.
  • Anderson (1982) Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  • Arbel et al. (2021) Arbel, M., Matthews, A., and Doucet, A. Annealed flow transport monte carlo. In International Conference on Machine Learning, pp.  318–330. PMLR, 2021.
  • Arenz et al. (2018) Arenz, O., Neumann, G., and Zhong, M. Efficient gradient-free variational inference using policy search. In International conference on machine learning, pp.  234–243. PMLR, 2018.
  • Arenz et al. (2022) Arenz, O., Dahlinger, P., Ye, Z., Volpp, M., and Neumann, G. A unified perspective on natural gradient variational inference with gaussian mixture models. arXiv preprint arXiv:2209.11533, 2022.
  • Aronszajn (1950) Aronszajn, N. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.
  • Berner et al. (2022) Berner, J., Richter, L., and Ullrich, K. An optimal control perspective on diffusion-based generative modeling. arXiv preprint arXiv:2211.01364, 2022.
  • Bishop (2006) Bishop, C. Pattern recognition and machine learning. Springer, 2006.
  • Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
  • Chib & Greenberg (1995) Chib, S. and Greenberg, E. Understanding the metropolis-hastings algorithm. The american statistician, 49(4):327–335, 1995.
  • Cover (1999) Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.
  • Crowder (1978) Crowder, M. J. Beta-binomial anova for proportions. Applied statistics, pp.  34–37, 1978.
  • Dai Pra (1991) Dai Pra, P. A stochastic control approach to reciprocal diffusion processes. Applied mathematics and Optimization, 23(1):313–329, 1991.
  • Del Moral et al. (2006) Del Moral, P., Doucet, A., and Jasra, A. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
  • Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
  • Douc & Cappé (2005) Douc, R. and Cappé, O. Comparison of resampling schemes for particle filtering. In ISPA 2005. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pp.  64–69. IEEE, 2005.
  • Doucet et al. (2022a) Doucet, A., Grathwohl, W., Matthews, A. G., and Strathmann, H. Score-based diffusion meets annealed importance sampling. Advances in Neural Information Processing Systems, 35:21482–21494, 2022a.
  • Doucet et al. (2022b) Doucet, A., Grathwohl, W., Matthews, A. G. d. G., and Strathmann, H. Score-based diffusion meets annealed importance sampling. In Advances in Neural Information Processing Systems, 2022b.
  • Duane et al. (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid monte carlo. Physics letters B, 195(2):216–222, 1987.
  • Durkan et al. (2019) Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. Advances in neural information processing systems, 32, 2019.
  • Frenkel & Smit (2023) Frenkel, D. and Smit, B. Understanding molecular simulation: from algorithms to applications. Elsevier, 2023.
  • Geffner & Domke (2021) Geffner, T. and Domke, J. MCMC variational inference via uncorrected Hamiltonian annealing. In Advances in Neural Information Processing Systems, 2021.
  • Geffner & Domke (2022) Geffner, T. and Domke, J. Langevin diffusion variational inference. arXiv preprint arXiv:2208.07743, 2022.
  • Gorman & Sejnowski (1988) Gorman, R. P. and Sejnowski, T. J. Analysis of hidden units in a layered network trained to classify sonar targets. Neural networks, 1(1):75–89, 1988.
  • Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • Hammersley (2013) Hammersley, J. Monte carlo methods. Springer Science & Business Media, 2013.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Jankowiak & Phan (2022) Jankowiak, M. and Phan, D. Surrogate likelihoods for variational annealed importance sampling. In International Conference on Machine Learning, pp.  9881–9901. PMLR, 2022.
  • Ji & Shen (2019) Ji, C. and Shen, H. Stochastic variational inference via upper bound. arXiv preprint arXiv:1912.00650, 2019.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems, 29, 2016.
  • Lahlou et al. (2023) Lahlou, S., Deleu, T., Lemos, P., Zhang, D., Volokhova, A., Hernández-Garcıa, A., Ezzine, L. N., Bengio, Y., and Malkin, N. A theory of continuous generative flow networks. In International Conference on Machine Learning, pp.  18269–18300. PMLR, 2023.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pp.  2278–2324, 1998. doi: 10.1109/5.726791.
  • Léonard (2013) Léonard, C. A survey of the Schrödinger problem and some of its connections with optimal transport. arXiv preprint arXiv:1308.0215, 2013.
  • Li & Turner (2016) Li, Y. and Turner, R. E. Rényi divergence variational inference. Advances in neural information processing systems, 29, 2016.
  • Liu & Liu (2001) Liu, J. S. and Liu, J. S. Monte Carlo strategies in scientific computing, volume 75. Springer, 2001.
  • Malkin et al. (2022a) Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in gflownets. Advances in Neural Information Processing Systems, 35:5955–5967, 2022a.
  • Malkin et al. (2022b) Malkin, N., Lahlou, S., Deleu, T., Ji, X., Hu, E., Everett, K., Zhang, D., and Bengio, Y. Gflownets and variational inference. arXiv preprint arXiv:2210.00580, 2022b.
  • Matthews et al. (2022) Matthews, A., Arbel, M., Rezende, D. J., and Doucet, A. Continual repeated annealed flow transport monte carlo. In International Conference on Machine Learning, pp.  15196–15219. PMLR, 2022.
  • Midgley et al. (2022) Midgley, L. I., Stimper, V., Simm, G. N., Schölkopf, B., and Hernández-Lobato, J. M. Flow annealed importance sampling bootstrap. arXiv preprint arXiv:2208.01893, 2022.
  • Midgley et al. (2023) Midgley, L. I., Stimper, V., Antorán, J., Mathieu, E., Schölkopf, B., and Hernández-Lobato, J. M. Se (3) equivariant augmented coupling flows. arXiv preprint arXiv:2308.10364, 2023.
  • Mittal et al. (2023) Mittal, S., Bracher, N. L., Lajoie, G., Jaini, P., and Brubaker, M. A. Exploring exchangeable dataset amortization for bayesian posterior inference. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
  • Møller et al. (1998) Møller, J., Syversveen, A. R., and Waagepetersen, R. P. Log gaussian cox processes. Scandinavian journal of statistics, 25(3):451–482, 1998.
  • Neal (2001) Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
  • Neal (2003) Neal, R. M. Slice sampling. The annals of statistics, 31(3):705–767, 2003.
  • Nishihara et al. (2014) Nishihara, R., Murray, I., and Adams, R. P. Parallel mcmc with generalized elliptical slice sampling. The Journal of Machine Learning Research, 15(1):2087–2112, 2014.
  • Papamakarios et al. (2021) Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research, 22(1):2617–2680, 2021.
  • Peyré et al. (2019) Peyré, G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International conference on machine learning, pp.  1530–1538. PMLR, 2015.
  • Richter et al. (2020) Richter, L., Boustati, A., Nüsken, N., Ruiz, F., and Akyildiz, O. D. Vargrad: a low-variance gradient estimator for variational inference. Advances in Neural Information Processing Systems, 33:13481–13492, 2020.
  • Richter et al. (2023) Richter, L., Berner, J., and Liu, G.-H. Improved sampling via learned diffusions. arXiv preprint arXiv:2307.01198, 2023.
  • Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  • Särkkä & Solin (2019) Särkkä, S. and Solin, A. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019.
  • Sendera et al. (2024) Sendera, M., Kim, M., Mittal, S., Lemos, P., Scimeca, L., Rector-Brooks, J., Adam, A., Bengio, Y., and Malkin, N. On diffusion models for amortized inference: Benchmarking and improving stochastic control and sampling. arXiv preprint arXiv:2402.05098, 2024.
  • Shapiro (2003) Shapiro, A. Monte carlo sampling methods. Handbooks in operations research and management science, 10:353–425, 2003.
  • Sigillito et al. (1989) Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10(3):262–266, 1989.
  • Song et al. (2020) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • Sountsov et al. (2020) Sountsov, P., Radul, A., and contributors. Inference gym, 2020. URL https://pypi.org/project/inference_gym.
  • Stoltz et al. (2010) Stoltz, G., Rousset, M., et al. Free energy computations: A mathematical perspective. World Scientific, 2010.
  • Thin et al. (2021) Thin, A., Kotelevskii, N., Durmus, A., Moulines, E., Panov, M., and Doucet, A. Monte Carlo variational auto-encoders. In International Conference on Machine Learning, 2021.
  • Tzen & Raginsky (2019) Tzen, B. and Raginsky, M. Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
  • Vargas et al. (2023a) Vargas, F., Grathwohl, W., and Doucet, A. Denoising diffusion samplers. arXiv preprint arXiv:2302.13834, 2023a.
  • Vargas et al. (2023b) Vargas, F., Ovsianas, A., Fernandes, D., Girolami, M., Lawrence, N. D., and Nüsken, N. Bayesian learning via neural schrödinger–föllmer flows. Statistics and Computing, 33(1):3, 2023b.
  • Vargas et al. (2024) Vargas, F., Padhy, S., Blessing, D., and Nüsken, N. Transport meets variational inference: Controlled monte carlo diffusions. In The Twelfth International Conference on Learning Representations, 2024.
  • Wainwright & Jordan (2008) Wainwright, M. J. and Jordan, M. I. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, November 2008.
  • Wan et al. (2020) Wan, N., Li, D., and Hovakimyan, N. F-divergence variational inference. Advances in neural information processing systems, 33:17370–17379, 2020.
  • Wu et al. (2020a) Wu, H., Köhler, J., and Noé, F. Stochastic normalizing flows. Advances in Neural Information Processing Systems, 33:5933–5944, 2020a.
  • Wu et al. (2020b) Wu, H., Köhler, J., and Noé, F. Stochastic normalizing flows. In Advances in Neural Information Processing Systems, 2020b.
  • Zhang et al. (2023) Zhang, D., Chen, R. T. Q., Liu, C.-H., Courville, A., and Bengio, Y. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. arXiv preprint arXiv:2310.02679, 2023.
  • Zhang et al. (2021) Zhang, G., Hsu, K., Li, J., Finn, C., and Grosse, R. Differentiable annealed importance sampling and the perils of gradient noise. In Advances in Neural Information Processing Systems, 2021.
  • Zhang & Chen (2021) Zhang, Q. and Chen, Y. Path integral sampler: a stochastic control approach for sampling. arXiv preprint arXiv:2111.15141, 2021.

Appendix A Performance Criteria Details

Here, we provide further details on the computation of the various performance criteria introduced in the main manuscript.

A.1 Density-Ratio-Based Criteria

Forward and Reverse Importance-Weighted Estimation of Z. Using the definition of the normalization constant, the importance-weighted reverse estimate of Z is given by

Z_{r}\coloneqq\int\gamma(\mathbf{x})\,\text{d}\mathbf{x}=\int\frac{q^{\theta}(\mathbf{x})}{q^{\theta}(\mathbf{x})}\gamma(\mathbf{x})\,\text{d}\mathbf{x}=\mathbb{E}_{q^{\theta}}\Big[\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big]\approx\frac{1}{N_{q^{\theta}}}\sum_{\mathbf{x}_{i}\sim q^{\theta}}\frac{\gamma(\mathbf{x}_{i})}{q^{\theta}(\mathbf{x}_{i})}, (14)

where N_{q^{\theta}} denotes the number of samples from q^{\theta} used for the Monte Carlo estimate of the expectation. Using the identity Z^{-1}=\pi(\mathbf{x})/\gamma(\mathbf{x}), we obtain the forward estimate of Z as

Z^{-1}=\int Z^{-1}q^{\theta}(\mathbf{x})\,\text{d}\mathbf{x}=\mathbb{E}_{\pi}\Big[\frac{q^{\theta}(\mathbf{x})}{\gamma(\mathbf{x})}\Big],\quad\text{and thus,}\quad Z_{f}\coloneqq 1/\mathbb{E}_{\pi}\Big[\frac{q^{\theta}(\mathbf{x})}{\gamma(\mathbf{x})}\Big]\approx 1/\Big(\frac{1}{N_{\pi}}\sum_{\mathbf{x}_{i}\sim\pi}\frac{q^{\theta}(\mathbf{x}_{i})}{\gamma(\mathbf{x}_{i})}\Big), (15)

where N_{\pi} denotes the number of samples from π used for the Monte Carlo estimate of the expectation.

Forward and Reverse Effective Sample Size. The reverse effective sample size (ESS) (Shapiro, 2003) is defined as

\text{ESS}_{r}\coloneqq 1/\mathbb{E}_{q^{\theta}}\Big[\Big(\frac{\pi(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big)^{2}\Big]=Z_{r}^{2}/\mathbb{E}_{q^{\theta}}\Big[\Big(\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big)^{2}\Big]=\Big(\mathbb{E}_{q^{\theta}}\Big[\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big]\Big)^{2}/\mathbb{E}_{q^{\theta}}\Big[\Big(\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big)^{2}\Big], (16)

where Z is approximated using the reverse estimate defined in Eq. 14. Using the definition of the ESS, it is straightforward to see that

\text{ESS}_{f}\coloneqq 1/\mathbb{E}_{q^{\theta}}\Big[\Big(\frac{\pi(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big)^{2}\Big]=1/\mathbb{E}_{\pi}\Big[\frac{\pi(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big]=Z_{f}/\mathbb{E}_{\pi}\Big[\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big]=\mathbb{E}_{\pi}\Big[\frac{q^{\theta}(\mathbf{x})}{\gamma(\mathbf{x})}\Big]^{-1}/\,\mathbb{E}_{\pi}\Big[\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})}\Big], (17)

where Z is approximated using the forward estimate defined in Eq. 15.

A.2 Integral Probability Metrics

Maximum Mean Discrepancy. The maximum mean discrepancy (MMD) (Gretton et al., 2012) is a kernel-based measure of distance between two distributions. The MMD quantifies the dissimilarity between the distributions by comparing their mean embeddings in a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950) with kernel k. In our setting, we are interested in computing the MMD between a model q^{\theta} and the target distribution π. Formally, if \mathcal{H}_{k} is the RKHS associated with the kernel function k, the MMD between q^{\theta} and π is the integral probability metric defined by

\text{MMD}_{k}(q^{\theta},\pi)=\sup_{f\in\mathcal{H}_{k}:\|f\|_{\mathcal{H}_{k}}\leq 1}\big(\mathbb{E}_{\mathbf{x}\sim q^{\theta}}[f(\mathbf{x})]-\mathbb{E}_{\mathbf{y}\sim\pi}[f(\mathbf{y})]\big), (18)

with \text{MMD}_{k}(q^{\theta},\pi)\geq 0 and \text{MMD}_{k}(q^{\theta},\pi)=0 if and only if q^{\theta}=\pi. An estimate of \text{MMD}_{k} is obtained from the minimum-variance unbiased estimate of \text{MMD}_{k}^{2} between two sample sets \mathbf{X}\sim q^{\theta} and \mathbf{Y}\sim\pi with sizes n and m, respectively:

\text{MMD}_{k}(q^{\theta},\pi)\approx\sqrt{\frac{1}{n(n-1)}\sum_{i\neq j}^{n}k(\mathbf{x}_{i},\mathbf{x}_{j})+\frac{1}{m(m-1)}\sum_{i\neq j}^{m}k(\mathbf{y}_{i},\mathbf{y}_{j})-\frac{2}{nm}\sum_{i}^{n}\sum_{j}^{m}k(\mathbf{x}_{i},\mathbf{y}_{j})}. (19)

In our experiments, we use a squared exponential kernel k(\mathbf{x},\mathbf{y})=\exp\big(-\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/\alpha\big), where the bandwidth α is determined using the median heuristic (Gretton et al., 2012). Our code for computing the MMD builds upon https://github.com/antoninschrab/mmdfuse-paper.

Entropic Optimal Transport Distance. The 2-Wasserstein distance is given by

W_{2}(q^{\theta},\pi)=\inf\left\{\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}c(\mathbf{x},\mathbf{y})\xi(\mathbf{x},\mathbf{y})\,\text{d}\mathbf{x}\,\text{d}\mathbf{y}:\int_{\mathbb{R}^{d}}\xi(\mathbf{x},\mathbf{y})\,\text{d}\mathbf{y}=q^{\theta}(\mathbf{x}),\ \int_{\mathbb{R}^{d}}\xi(\mathbf{x},\mathbf{y})\,\text{d}\mathbf{x}=\pi(\mathbf{y})\right\}^{1/2}, (20)

with cost c, chosen as c(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|^{2} in our experiments. To obtain a tractable objective, an entropy-regularized version has been proposed (Peyré et al., 2019), that is,

W_{2,\varepsilon}(q^{\theta},\pi)=\inf\left\{\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}c(\mathbf{x},\mathbf{y})\xi(\mathbf{x},\mathbf{y})\,\text{d}\mathbf{x}\,\text{d}\mathbf{y}-\varepsilon\mathcal{H}(\xi):\int_{\mathbb{R}^{d}}\xi(\mathbf{x},\mathbf{y})\,\text{d}\mathbf{y}=q^{\theta}(\mathbf{x}),\ \int_{\mathbb{R}^{d}}\xi(\mathbf{x},\mathbf{y})\,\text{d}\mathbf{x}=\pi(\mathbf{y})\right\}^{1/2}, (21)

with entropy \mathcal{H}(\xi)=-\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\xi(\mathbf{x},\mathbf{y})\log\xi(\mathbf{x},\mathbf{y})\,\text{d}\mathbf{x}\,\text{d}\mathbf{y}. We chose ε = 10^{-3} for all experiments. The code was taken from https://github.com/ott-jax/ott.

A.3 Extending the Entropic Mode Coverage

If the true mode probabilities p^{*}(\xi|\mathbf{x}) are not uniformly distributed, EMC = 1 does not correspond to the optimal value. In that case, we propose the expected Jensen–Shannon divergence, that is,

\text{EJS}\coloneqq\mathbb{E}_{q^{\theta}(\mathbf{x})}D_{\text{JS}}(p(\xi|\mathbf{x})\|p^{*}(\xi|\mathbf{x})), (22)

with

D_{\text{JS}}(p(\xi|\mathbf{x})\|p^{*}(\xi|\mathbf{x}))=\frac{1}{2}D_{\text{KL}}\left(p(\xi|\mathbf{x})\,\Big\|\,\frac{p^{*}(\xi|\mathbf{x})+p(\xi|\mathbf{x})}{2}\right)+\frac{1}{2}D_{\text{KL}}\left(p^{*}(\xi|\mathbf{x})\,\Big\|\,\frac{p^{*}(\xi|\mathbf{x})+p(\xi|\mathbf{x})}{2}\right), (23)

as an alternative heuristic for quantifying mode collapse. Similar to EMC, EJS is straightforward to interpret: when employing the base-2 logarithm, it is bounded, i.e., 0 ≤ EJS ≤ 1. Moreover, EJS = 0 implies that the model matches the potentially unbalanced true mode probabilities, while EJS = 1 indicates that p and p^{*} possess no overlapping probability mass.

Appendix B Details on Unnormalized Importance Weights / Density Ratios

Here, we provide further details on how the unnormalized importance weights / density ratios are computed for different methods.

Tractable Density Methods. For models with a tractable density q^{\theta}(\mathbf{x}), the marginal (unnormalized) importance weights can trivially be computed as

w=\frac{\gamma(\mathbf{x})}{q^{\theta}(\mathbf{x})}.

Diffusion-based Methods. For diffusion-based methods, the extended importance weights over trajectories \mathbf{x}_{0:T} are constructed as

\frac{p^{\theta}(\mathbf{x}_{0:T})}{q^{\theta}(\mathbf{x}_{0:T})}=\frac{\pi(\mathbf{x}_{T})\prod_{t=1}^{T}B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\pi_{0}(\mathbf{x}_{0})\prod_{t=0}^{T-1}F^{\theta}_{t+1}(\mathbf{x}_{t+1}|\mathbf{x}_{t})}. (24)

The different choices of forward and backward transition kernels F^{\theta}_{t+1}, B^{\theta}_{t-1} are listed in Table 6. Some methods, such as DDS (Vargas et al., 2023a), PIS (Zhang & Chen, 2021), and GFN (Zhang et al., 2023), introduce a reference process p^{\text{ref}} with

p^{\text{ref}}(\mathbf{x}_{0:T})=p^{\text{ref}}_{0}(\mathbf{x}_{0})\prod_{t=0}^{T-1}F^{\text{ref}}_{t+1}(\mathbf{x}_{t+1}|\mathbf{x}_{t})=p^{\text{ref}}_{T}(\mathbf{x}_{T})\prod_{t=1}^{T}B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{t}). (25)

This allows for rewriting Eq. 24 as

\frac{p^{\theta}(\mathbf{x}_{0:T})}{q^{\theta}(\mathbf{x}_{0:T})}=\frac{p^{\theta}(\mathbf{x}_{0:T})}{p^{\text{ref}}(\mathbf{x}_{0:T})}\cdot\frac{p^{\text{ref}}(\mathbf{x}_{0:T})}{q^{\theta}(\mathbf{x}_{0:T})}=\frac{\pi(\mathbf{x}_{T})}{p^{\text{ref}}_{T}(\mathbf{x}_{T})}\cdot\frac{p^{\text{ref}}(\mathbf{x}_{0:T})}{q^{\theta}(\mathbf{x}_{0:T})}, (26)

potentially resulting in more tractable density ratios compared to Eq. 24; for concrete examples, see, e.g., Zhang et al. (2023). A continuous-time analogue of the reference process is detailed in Vargas et al. (2024). Moreover, in continuous time, the importance weights correspond to a Radon–Nikodym derivative. For the sake of simplicity, we only consider the discrete-time setting in this work and refer the reader to Vargas et al. (2024) and Richter et al. (2023) for further details.

Method π0(𝐱0)\pi_{0}(\mathbf{x}_{0}) Ft+1θ(𝐱t+1|𝐱t)F^{\theta}_{t+1}(\mathbf{x}_{t+1}|\mathbf{x}_{t}) Bt1θ(𝐱t1|𝐱t)B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{t})
DDS 𝒩(𝐱0|0,σ02I)\mathcal{N}(\mathbf{x}_{0}|0,\sigma_{0}^{2}\text{I}) 𝒩(𝐱t+1|(1βt𝐱t+𝐬θ(𝐱t,t))Δt,βtσ02Δt)\mathcal{N}(\mathbf{x}_{t+1}|(\sqrt{1-\beta_{t}}\mathbf{x}_{t}+\mathbf{s}^{\theta}(\mathbf{x}_{t},t))\Delta_{t},\beta_{t}\sigma_{0}^{2}\Delta_{t}) 𝒩(𝐱t1|1βt𝐱tΔt,βtσ02Δt)\mathcal{N}(\mathbf{x}_{t-1}|\sqrt{1-\beta_{t}}\mathbf{x}_{t}\Delta_{t},\beta_{t}\sigma_{0}^{2}\Delta_{t})
DIS 𝒩(𝐱0|0,σ02I)\mathcal{N}(\mathbf{x}_{0}|0,\sigma_{0}^{2}\text{I}) 𝒩(𝐱t+1|𝐱t+(βt𝐱t+𝐬θ(𝐱t,t))Δt,2βtσ02Δt)\mathcal{N}(\mathbf{x}_{t+1}|\mathbf{x}_{t}+(\beta_{t}\mathbf{x}_{t}+\mathbf{s}^{\theta}(\mathbf{x}_{t},t))\Delta_{t},2\beta_{t}\sigma_{0}^{2}\Delta_{t}) 𝒩(𝐱t1|(𝐱tβt𝐱t)Δt,2βtσ02Δt)\mathcal{N}(\mathbf{x}_{t-1}|(\mathbf{x}_{t}-\beta_{t}\mathbf{x}_{t})\Delta_{t},2\beta_{t}\sigma_{0}^{2}\Delta_{t})
PIS/GFN δ0\delta_{0} 𝒩(𝐱t+1|𝐱t+𝐬θ(𝐱t,t)Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t+1}|\mathbf{x}_{t}+\mathbf{s}^{\theta}(\mathbf{x}_{t},t)\Delta_{t},\sigma_{t}^{2}\Delta_{t}) 𝒩(𝐱t1|tΔtt𝐱t,tΔttσt2Δt)\mathcal{N}(\mathbf{x}_{t-1}|\frac{t-\Delta_{t}}{t}\mathbf{x}_{t},\frac{t-\Delta_{t}}{t}\sigma_{t}^{2}\Delta_{t})
ULA arbitrary 𝒩(𝐱t+1|𝐱t+𝐱tσt2logπt(𝐱t)Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t+1}|\mathbf{x}_{t}+\nabla_{\mathbf{x}_{t}}\sigma_{t}^{2}\log\pi_{t}(\mathbf{x}_{t})\Delta_{t},\sigma_{t}^{2}\Delta_{t}) 𝒩(𝐱t1|𝐱t+𝐱tσt2logπt(𝐱t)Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t-1}|\mathbf{x}_{t}+\nabla_{\mathbf{x}_{t}}\sigma_{t}^{2}\log\pi_{t}(\mathbf{x}_{t})\Delta_{t},\sigma_{t}^{2}\Delta_{t})
MCD arbitrary 𝒩(𝐱t+1|𝐱t+𝐱tσt2logπt(𝐱t)Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t+1}|\mathbf{x}_{t}+\nabla_{\mathbf{x}_{t}}\sigma_{t}^{2}\log\pi_{t}(\mathbf{x}_{t})\Delta_{t},\sigma_{t}^{2}\Delta_{t}) 𝒩(𝐱t1|𝐱t+(𝐱tσt2logπt(𝐱t)+𝐬θ(𝐱t,t))Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t-1}|\mathbf{x}_{t}+(\nabla_{\mathbf{x}_{t}}\sigma_{t}^{2}\log\pi_{t}(\mathbf{x}_{t})+\mathbf{s}^{\theta}(\mathbf{x}_{t},t))\Delta_{t},\sigma_{t}^{2}\Delta_{t})
CMCD arbitrary 𝒩(𝐱t+1|𝐱t+(𝐱tσt2logπt(𝐱t)+𝐬θ(𝐱t,t))Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t+1}|\mathbf{x}_{t}+(\nabla_{\mathbf{x}_{t}}\sigma_{t}^{2}\log\pi_{t}(\mathbf{x}_{t})+\mathbf{s}^{\theta}(\mathbf{x}_{t},t))\Delta_{t},\sigma_{t}^{2}\Delta_{t}) 𝒩(𝐱t1|𝐱t+(𝐱tσt2logπt(𝐱t)𝐬θ(𝐱t,t))Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t-1}|\mathbf{x}_{t}+(\nabla_{\mathbf{x}_{t}}\sigma_{t}^{2}\log\pi_{t}(\mathbf{x}_{t})-\mathbf{s}^{\theta}(\mathbf{x}_{t},t))\Delta_{t},\sigma_{t}^{2}\Delta_{t})
GBS arbitrary 𝒩(𝐱t+1|𝐱t+(𝐱tσt2𝐟θ(𝐱t,t))Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t+1}|\mathbf{x}_{t}+(\nabla_{\mathbf{x}_{t}}\sigma_{t}^{2}\mathbf{f}^{\theta}(\mathbf{x}_{t},t))\Delta_{t},\sigma_{t}^{2}\Delta_{t}) 𝒩(𝐱t1|𝐱t+(𝐱tσt2𝐛θ(𝐱t,t))Δt,σt2Δt)\mathcal{N}(\mathbf{x}_{t-1}|\mathbf{x}_{t}+(\nabla_{\mathbf{x}_{t}}\sigma_{t}^{2}\mathbf{b}^{\theta}(\mathbf{x}_{t},t))\Delta_{t},\sigma_{t}^{2}\Delta_{t})
Table 6: Characterization of diffusion-based sampling methods. Here, \mathbf{s}^{\theta},\mathbf{f}^{\theta},\mathbf{b}^{\theta}:\mathbb{R}^{d}\times[0,T]\rightarrow\mathbb{R}^{d} denote parameterized function approximators. In our experiments, we choose \pi_{0}(\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{0}|0,\sigma_{0}^{2}\Delta_{t}).

Sequential Importance Sampling Methods. Sequential importance sampling methods express the importance weights in terms of incremental importance sampling weights, i.e.,

\,\overline{\!{w}}=\prod_{t=1}^{T}G_{t}(\mathbf{x}_{t-1},\mathbf{x}_{t})\qquad\text{with}\qquad G_{t}(\mathbf{x}_{t-1},\mathbf{x}_{t})=\frac{\gamma_{t}(\mathbf{x}_{t})B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\gamma_{t-1}(\mathbf{x}_{t-1})F^{\theta}_{t}(\mathbf{x}_{t}|\mathbf{x}_{t-1})}.

For given forward transitions F_{t}^{\theta}, the optimal backward transitions B^{\theta}_{t-1} ensure that \,\overline{\!{w}}=w. As the optimal transitions are typically not available, SMC uses the AIS approximation (Neal, 2001). Moreover, flow transport methods (Wu et al., 2020a; Arbel et al., 2021; Matthews et al., 2022) use a normalizing flow as a deterministic map T^{\theta} to approximate the incremental IS weights. In Table 7, we list different choices of F^{\theta}_{t}, B^{\theta}_{t-1} and their corresponding incremental importance sampling weights.

Method Bt1θ(𝐱t1|𝐱t)B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{{t}}) Gt(𝐱t1,𝐱t)G_{t}(\mathbf{x}_{t-1},\mathbf{x}_{t}) Ftθ(𝐱t|𝐱t1)F^{\theta}_{t}(\mathbf{x}_{{t}}|\mathbf{x}_{{t-1}}) Gt(𝐱t1,𝐱t)G_{t}(\mathbf{x}_{t-1},\mathbf{x}_{t})
Optimal πt1(𝐱t1)Ftθ(𝐱t|𝐱t1)/πt(𝐱t){\pi_{t-1}(\mathbf{x}_{t-1})}{F^{\theta}_{t}(\mathbf{x}_{t}|\mathbf{x}_{t-1})}/\pi_{t}(\mathbf{x}_{t}) Zt/Zt1Z_{t}/Z_{t-1} πt(𝐱t)Bt1θ(𝐱t1|𝐱t)/πt1(𝐱t1)\pi_{t}(\mathbf{x}_{t}){B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{{t}})}/{\pi_{t-1}(\mathbf{x}_{t-1})} Zt/Zt1Z_{t}/Z_{t-1}
AIS/SMC/FAB πt(𝐱t1)Ftθ(𝐱t|𝐱t1)/πt(𝐱t){\pi_{t}(\mathbf{x}_{t-1})}{F^{\theta}_{t}(\mathbf{x}_{t}|\mathbf{x}_{t-1})}/\pi_{t}(\mathbf{x}_{t}) γt(𝐱t1)/γt1(𝐱t1)\gamma_{t}(\mathbf{x}_{t-1})/\gamma_{t-1}(\mathbf{x}_{t-1}) πt1(𝐱t)Bt1θ(𝐱t1|𝐱t)/πt1(𝐱t1)\pi_{t-1}(\mathbf{x}_{t}){B^{\theta}_{t-1}(\mathbf{x}_{t-1}|\mathbf{x}_{{t}})}/{\pi_{t-1}(\mathbf{x}_{t-1})} γt(𝐱t)/γt1(𝐱t)\gamma_{t}(\mathbf{x}_{t})/\gamma_{t-1}(\mathbf{x}_{t})
AFT/CRAFT δTtθ(xt)(xt1)\delta_{{T}^{\theta}_{t}(x_{t})}(x_{t-1}) γt(Ttθ(𝐱t1))|detTtθ(𝐱t1)|/γt1(𝐱t1){\gamma_{t}({T}^{\theta}_{t}(\mathbf{x}_{t-1}))}|\det\nabla{T}^{\theta}_{t}(\mathbf{x}_{t-1})|/\gamma_{t-1}(\mathbf{x}_{t-1}) δ(Tt1θ)1(xt1)(xt)\delta_{({T}^{\theta}_{t-1})^{-1}(x_{t-1})}(x_{t}) γt(𝐱t)/γt1((Tt1θ)1(𝐱t))|det(Tt1θ)1(𝐱t)|{\gamma_{t}(\mathbf{x}_{t})}/\gamma_{t-1}(({T}^{\theta}_{t-1})^{-1}(\mathbf{x}_{t}))|\det\nabla({T}^{\theta}_{t-1})^{-1}(\mathbf{x}_{t})|
Table 7: Characterization of sequential importance sampling methods. The middle column shows the backward kernels B^{\theta}_{t-1} and the corresponding G_{t} when transporting samples from the prior π_0 to the target π_T to compute reverse criteria. The right-most column shows the forward kernels F^{\theta}_{t} and the corresponding G_{t} when transporting samples from the target π_T back to the prior π_0 to compute forward criteria.

Appendix C Benchmark Target Details

Here, we introduce the target densities considered in this benchmark more formally.

C.1 Bayesian Logistic Regression

We used four binary classification problems in our benchmark, which have also been used in various prior works to compare state-of-the-art methods in variational inference and Markov chain Monte Carlo. We assess the performance of a Bayesian logistic regression model with

\mathbf{x}\sim\mathcal{N}(0,\sigma^{2}_{w}I),\qquad y_{i}\sim\operatorname{Bernoulli}(\operatorname{sigmoid}(\mathbf{x}^{\top}u_{i})),

on two standardized datasets \{(u_{i},y_{i})\}_{i}, namely Ionosphere (d=35) with 351 data points and Sonar (d=61) with 208 data points.

The German Credit dataset consists of d=25 features and 1000 data points, while the Breast Cancer dataset has d=31 dimensions and 569 data points; we standardize both datasets before applying the logistic regression model.

C.2 Random Effect Regression

The Seeds (d=26) target is a random effect regression model trained on the seeds dataset:

\tau\sim\operatorname{Gamma}(0.01,0.01)
a_{0},a_{1},a_{2},a_{12}\sim\mathcal{N}(0,10)
b_{i}\sim\mathcal{N}\left(0,\frac{1}{\sqrt{\tau}}\right),\quad i=1,\ldots,21
\operatorname{logits}_{i}=a_{0}+a_{1}x_{i}+a_{2}y_{i}+a_{12}x_{i}y_{i}+b_{i},\quad i=1,\ldots,21
r_{i}\sim\operatorname{Binomial}\left(\operatorname{logits}_{i},N_{i}\right),\quad i=1,\ldots,21.

The goal is to infer the variables τ, a_{0}, a_{1}, a_{2}, a_{12}, and b_{i} for i=1,\ldots,21, given observed values for x_{i}, y_{i}, and N_{i}.

C.3 Time Series Models

The Brownian (d=32) model corresponds to the time discretization of a Brownian motion:

\alpha_{\mathrm{inn}}\sim\mathrm{LogNormal}(0,2),
\alpha_{\mathrm{obs}}\sim\mathrm{LogNormal}(0,2),
x_{1}\sim\mathcal{N}(0,\alpha_{\mathrm{inn}}),
x_{i}\sim\mathcal{N}(x_{i-1},\alpha_{\mathrm{inn}}),\quad i=2,\ldots,30,
y_{i}\sim\mathcal{N}(x_{i},\alpha_{\mathrm{obs}}),\quad i=1,\ldots,30.

Inference is performed over the variables \alpha_{\mathrm{inn}}, \alpha_{\mathrm{obs}}, and \{x_{i}\}_{i=1}^{30}, given the observations \{y_{i}\}_{i=1}^{10}\cup\{y_{i}\}_{i=20}^{30}.

C.4 Spatial Statistics

The Log Gaussian Cox process (d=1600) is a popular high-dimensional task in spatial statistics (Møller et al., 1998) which models the positions of pine saplings. Using a d = M × M = 1600 grid, i.e., M = 40, we obtain the unnormalized target density

\mathcal{N}(\mathbf{x};\mu,K)\prod_{i\in[1:M]^{2}}\exp\left(x_{i}y_{i}-a\exp\left(x_{i}\right)\right).

C.5 Synthetic Targets

We evaluate on three different mixture models, all of which share the structure

\pi(\mathbf{x})=\sum_{k=1}^{K}w_{k}\pi_{k}(\mathbf{x}),\qquad\sum_{k=1}^{K}w_{k}=1,

where K denotes the number of components.

The MoG (d ∈ ℕ) distribution, taken from Midgley et al. (2022), consists of K=40 mixture components with

\pi_{k}(\mathbf{x})=\mathcal{N}(\mu_{k},I),
\mu_{k}\sim\mathcal{U}(-40,40),
w_{k}=1/K,

where \mathcal{U}(l,u) refers to a uniform distribution on [l,u].

The MoS (d ∈ ℕ) target comprises K=10 Student's t-distributions t_{2}, where the subscript 2 refers to the degrees of freedom. Student's t-distributions have heavier tails than Gaussians, which makes the mixture more challenging to approximate:

\pi_{k}(\mathbf{x})=t_{2}+\mu_{k},
\mu_{k}\sim\mathcal{U}(-10,10),
w_{k}=1/K,

where \mu_{k} refers to the translation of the individual components.

The Funnel (d=10) target, introduced in Neal (2003), is a challenging funnel-shaped distribution given by

\pi(\mathbf{x})=\mathcal{N}(x_{1};0,\sigma_{f}^{2})\,\mathcal{N}(x_{2:10};0,\exp(x_{1})I),

with \sigma_{f}^{2}=9.

Lastly, we follow Doucet et al. (2022a) and use NICE (Dinh et al., 2014) to train normalizing flows on 14×14 and 28×28 variants of MNIST (Digits) and on the 28×28 Fashion MNIST dataset (Fashion).

Appendix D Algorithms and Parameter Choices

Here, we discuss the parameter choices of all methods. Most of these choices are based on recommendations of the authors. For some choices, we run ablation studies to find suitable values.

Gaussian Mean Field Variational Inference (MFVI). We updated the mean and the diagonal covariance matrix using the Adam optimizer (Kingma & Ba, 2014) for 100k iterations with a batch size of 2000. We ensured non-negativity of the variance by using a log transformation. The mean is initialized at 0 for all experiments; the initial covariance/scale and the learning rate are set according to Table 8.

Gaussian Mixture Model Variational Inference (GMMVI). For GMMVI, we ported the TensorFlow implementation of https://github.com/OlegArenz/gmmvi to Jax and integrated it into our framework. We use the specification described as SAMTRUX (Arenz et al., 2022). We make use of their adaptive component initializer and start with ten components. The initial variance of the components is set according to Table 8.

Sequential Monte Carlo (SMC). For the SMC approach, we leveraged the codebase available at https://github.com/google-deepmind/annealed_flow_transport. We used 2000 particles and T = 128 annealing steps (temperatures), with resampling at a threshold of 0.3 and one Hamiltonian Monte Carlo (HMC) step per temperature with 10 leapfrog steps. We tuned the stepsize of HMC according to Table 8, using different stepsizes depending on the annealing parameter β_t. We additionally tuned the scale of the initial proposal distribution π0 as shown in Table 8.

Continual Repeated Annealed Flow Transport (CRAFT/AFT). As AFT and CRAFT build on SMC, we employed the same SMC specifications detailed above. Notably, we found that employing simpler flows in conjunction with a greater number of temperatures yielded superior and more robust performance compared to more sophisticated flows such as RealNVP or neural spline flows. Consequently, we opted for 128 temperatures, utilizing diagonal affine flows as the transport maps. Specifically for AFT, we determined that 400 iterations per temperature were sufficient to achieve converged training results. For CRAFT and SNF, we found that a total of 3000 iterations provided satisfactory convergence during training. For all methods, we use 2000 particles for training and testing and tune the learning rate and the scale of the initial proposal distribution π0 as shown in Table 8.

Flow Annealed Importance Sampling Bootstrap (FAB). We built our implementation on https://github.com/lollcat/fab-jax. We adjusted the parameters of FAB in accordance with the authors' most important hyperparameter suggestions and ensured that the inner SMC performs reasonably well. To achieve this, we set the number of temperatures to 128 and used HMC as the MCMC kernel. For the flow architecture we use RealNVP (Dinh et al., 2016), where the conditioner is given by an 8-layer MLP. Furthermore, we utilized FAB's replay buffer to speed up computations. The learning rate and the scale of the base distribution are adjusted per target, following the specifications outlined in Table 8. We used a batch size of 2048 and trained FAB for 3000 iterations, which proved sufficient for satisfactory convergence.

Denoising Diffusion Sampler (DDS) and Path Integral Sampler (PIS). We use the implementation of https://github.com/franciscovargas/denoising_diffusion_samplers to integrate DDS and PIS into our framework. We set the parameters of the SDEs according to the authors, i.e., Zhang & Chen (2021) and Vargas et al. (2023a), and use 128 timesteps and a batch size of 2000 if not stated otherwise. Both methods use the network proposed by Zhang & Chen (2021), which uses a sinusoidal position embedding for the timestep and the gradient of the log target density as an additional input term. As proposed, we use a two-layer neural network with 64 hidden units. For DDS we use a cosine scheduler (Vargas et al., 2023a) and for PIS a uniform time scheduler (Zhang & Chen, 2021). Both methods were trained for 40k iterations.

Monte Carlo Diffusions (MCD) and Langevin Diffusion Variational Inference (LDVI). We build our implementation of Langevin diffusion methods on https://github.com/tomsons22/LDVI. For experiments where performance is solely measured in terms of the ELBO, due to the lack of samples from π or access to Z, we train all parameters of the SDE by maximizing the extended ELBO as suggested by Geffner & Domke (2021). For multimodal target densities, we fix the proposal distribution and the magnitude of the timestep; we found that this stabilizes training and yields better results (cf. Ablation 15). We use the network architecture proposed by Zhang & Chen (2021) with two hidden layers of 64 units each. We discretize the SDEs using 128 timesteps and a batch size of 2000 if not stated otherwise. All methods were trained for 40k iterations.

Time-Reversed Diffusion Sampler (DIS) and General Bridge Sampler (GBS). We base our implementations of DIS and GBS on https://github.com/juliusberner/sde_sampler and reimplemented them in Jax. The remaining parameters follow the descriptions of DDS and PIS above.

Methods / Parameters Grid MoG MoS Funnel Digits/Fashion Credit Cancer Brownian Sonar Seeds Ionosphere LGCP
MFVI
Initial Scale {0.1, 1, 10} - - 1 0.1 0.1 10 1 0.1 1 1 0.1
Learning Rate {102,103,104,105}\{10^{-2},10^{-3},10^{-4},10^{-5}\} 10210^{-2} 10310^{-3} 10210^{-2} 10410^{-4} 10310^{-3} 10210^{-2} 10310^{-3} 10310^{-3} 10410^{-4} 10410^{-4} 10310^{-3}
GMMVI
Initial Scale {0.1, 1, 10} - - 0.1 10 0.1 1 0.1 1 0.1 10 NA
SMC
Initial Scale {0.1, 1, 10} - - 1 1 0.1 1 1 1 1 1 1
HMC stepsize (β0.5\beta\leq 0.5) {0.001, 0.01, 0.05, 0.1, 0.2} 0.2 0.2 0.001 0.2 0.1 0.05 0.001 0.05 0.2 0.2 0.01
HMC stepsize (β>0.5\beta>0.5) {0.001, 0.01, 0.05, 0.1, 0.2} 0.001 0.2 0.1 0.2 0.1 0.01 0.05 0.001 0.05 0.2 0.2
AFT
Initial Scale {0.1, 1, 10} - - 1 - 0.1 0.1 NA 1 1 1 1
Learning Rate {103,104,105}\{10^{-3},10^{-4},10^{-5}\} 10310^{-3} 10410^{-4} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} NA 10310^{-3} 10410^{-4} 10310^{-3} 10310^{-3}
CRAFT
Initial Scale {0.1, 1, 10} - - 1 - 0.1 1 1 1 0.1 0.1 1
Learning Rate {103,104,105}\{10^{-3},10^{-4},10^{-5}\} 10310^{-3} 10410^{-4} 10310^{-3} 10410^{-4} 10410^{-4} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3}
FAB
Initial Scale {0.1, 1, 10} - - 1 - 0.1 0.1 1 0.1 0.1 1 0.1
Learning Rate {103,104,105}\{10^{-3},10^{-4},10^{-5}\} 10510^{-5} - 10410^{-4} 10310^{-3} 10410^{-4} 10310^{-3} 10310^{-3} 10310^{-3} 10410^{-4} 10310^{-3} 10310^{-3}
DDS/DIS
Initial Scale {0.1, 1, 10} - - 1 - 1 1 0.1 1 0.1 1 0.1
Learning Rate {103,104,105}\{10^{-3},10^{-4},10^{-5}\} 10410^{-4} 10410^{-4} 10310^{-3} 10310^{-3} 10310^{-3} 10410^{-4} 10410^{-4} 10310^{-3} 10410^{-4} 10410^{-4} 10410^{-4}
PIS
Learning Rate {103,104,105}\{10^{-3},10^{-4},10^{-5}\} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10410^{-4} 10410^{-4} 10410^{-4} 10310^{-3} 10310^{-3} 10410^{-4} NA
LDVI
Initial Scale {0.1, 1, 10} - - 1 - 0.1 0.1 1 1 0.1 0.1 0.1
Learning Rate {103,104,105}\{10^{-3},10^{-4},10^{-5}\} 10310^{-3} 10310^{-3} 10410^{-4} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3}
MCD
Initial Scale {0.1, 1, 10} - - 1 - 0.1 0.1 1 0.1 0.1 1 1
Learning Rate {103,104,105}\{10^{-3},10^{-4},10^{-5}\} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10410^{-4} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3} 10310^{-3}
Table 8: Hyperparameter selection for all sampling algorithms. The 'Grid' column indicates the values over which we performed a grid search; the task columns indicate which value was chosen for the reported results.

Appendix E Further Experimental Results

We additionally provide sample visualizations for Funnel and MoG in Figure 4, and for Digits and Fashion in Figure 5. We also report additional evaluation criteria for MoG and MoS, including 2-Wasserstein distance, maximum mean discrepancy, reverse and forward partition function error, lower and upper evidence bounds, and reverse and forward effective sample size in Table 9. Lastly, we provide insights into the efficiency of the models by reporting the number of target queries and the wallclock time needed to obtain the best ELBO value. These results are shown in Table 10.

Figure 4: Visualization of samples drawn from different sampling methods for Funnel (top) and MoG (bottom). Panels show, in order, MFVI, GMMVI, SMC, AFT, CRAFT, FAB, MCD, LDVI, PIS, DIS, DDS, and GBS.
Figure 5: Visualization of samples drawn from different sampling methods for Digits (top) and Fashion (bottom). Panels show, in order, MFVI, GMMVI, SMC, AFT, CRAFT, FAB, MCD, LDVI, PIS, DIS, DDS, and GBS.
MoG MoS
d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200
𝒲2\mathcal{W}_{2}\downarrow MMD \downarrow 𝒲2\mathcal{W}_{2}\downarrow MMD \downarrow
MFVI 506.967±7.385506.967\scriptstyle\pm 7.385 36158.898±8.76536158.898\scriptstyle\pm 8.765 148945.539±14.42148945.539\scriptstyle\pm 14.42 0.251±0.0020.251\scriptstyle\pm 0.002 0.209±0.0000.209\scriptstyle\pm 0.000 0.211±0.0000.211\scriptstyle\pm 0.000 24.688±0.22524.688\scriptstyle\pm 0.225 2282.540±1.9592282.540\scriptstyle\pm 1.959 12956.415±6.53012956.415\scriptstyle\pm 6.530 0.162±0.0010.162\scriptstyle\pm 0.001 0.187±0.0010.187\scriptstyle\pm 0.001 0.195±0.0000.195\scriptstyle\pm 0.000
GMMVI 76.474±20.6076.474\scriptstyle\pm 20.60 31983.344±1065.31983.344\scriptstyle\pm 1065. 140166.746±3020.140166.746\scriptstyle\pm 3020. 0.052±0.0100.052\scriptstyle\pm 0.010 0.202±0.0130.202\scriptstyle\pm 0.013 0.214±0.0120.214\scriptstyle\pm 0.012 2.851±0.1282.851\scriptstyle\pm 0.128 1249.010±297.31249.010\scriptstyle\pm 297.3 10402.243±870.910402.243\scriptstyle\pm 870.9 0.036±0.0000.036\scriptstyle\pm 0.000 0.133±0.0180.133\scriptstyle\pm 0.018 0.211±0.0260.211\scriptstyle\pm 0.026
SMC 32.387±9.21932.387\scriptstyle\pm 9.219 46351.236±4.79546351.236\scriptstyle\pm 4.795 176586.789±3.638176586.789\scriptstyle\pm 3.638 0.047±0.0040.047\scriptstyle\pm 0.004 0.631±0.0000.631\scriptstyle\pm 0.000 0.611±0.0000.611\scriptstyle\pm 0.000 34.963±2.83334.963\scriptstyle\pm 2.833 3297.640±1372.3297.640\scriptstyle\pm 1372. 17612.889±2423.17612.889\scriptstyle\pm 2423. 0.069±0.0030.069\scriptstyle\pm 0.003 0.431±0.1610.431\scriptstyle\pm 0.161 0.509±0.1130.509\scriptstyle\pm 0.113
AFT 21.571±6.37421.571\scriptstyle\pm 6.374 44914.194±1154.44914.194\scriptstyle\pm 1154. 184075.172±4347.184075.172\scriptstyle\pm 4347. 0.040±0.0030.040\scriptstyle\pm 0.003 0.622±0.0090.622\scriptstyle\pm 0.009 0.622±0.0080.622\scriptstyle\pm 0.008 41.299±11.2741.299\scriptstyle\pm 11.27 2648.410±301.32648.410\scriptstyle\pm 301.3 20207.756±998.620207.756\scriptstyle\pm 998.6 0.077±0.0110.077\scriptstyle\pm 0.011 0.395±0.0820.395\scriptstyle\pm 0.082 0.611±0.0190.611\scriptstyle\pm 0.019
CRAFT 24.554±4.21624.554\scriptstyle\pm 4.216 42953.544±389.942953.544\scriptstyle\pm 389.9 177039.500±329.3177039.500\scriptstyle\pm 329.3 0.041±0.0030.041\scriptstyle\pm 0.003 0.600±0.0030.600\scriptstyle\pm 0.003 0.609±0.0020.609\scriptstyle\pm 0.002 10.108±0.18610.108\scriptstyle\pm 0.186 1806.321±117.41806.321\scriptstyle\pm 117.4 14411.712±305.914411.712\scriptstyle\pm 305.9 0.048±0.0000.048\scriptstyle\pm 0.000 0.233±0.0210.233\scriptstyle\pm 0.021 0.425±0.0240.425\scriptstyle\pm 0.024
FAB 57.111±24.5357.111\scriptstyle\pm 24.53 9567.319±626.19567.319\scriptstyle\pm 626.1 58832.370±1092.58832.370\scriptstyle\pm 1092. 0.047±0.0070.047\scriptstyle\pm 0.007 0.073±0.0050.073\scriptstyle\pm 0.005 0.099±0.0010.099\scriptstyle\pm 0.001 8.868±1.6738.868\scriptstyle\pm 1.673 1193.455±152.3\mathbf{1193.455\scriptstyle\pm 152.3} 7490.803±433.9\mathbf{7490.803\scriptstyle\pm 433.9} 0.035±0.003\mathbf{0.035\scriptstyle\pm 0.003} 0.093±0.014\mathbf{0.093\scriptstyle\pm 0.014} 0.102±0.012\mathbf{0.102\scriptstyle\pm 0.012}
MCD 211.657±3.504211.657\scriptstyle\pm 3.504 4892.591±71.264892.591\scriptstyle\pm 71.26 30977.775±276.6\mathbf{30977.775\scriptstyle\pm 276.6} 0.136±0.0010.136\scriptstyle\pm 0.001 0.043±0.0000.043\scriptstyle\pm 0.000 0.054±0.000\mathbf{0.054\scriptstyle\pm 0.000} 102.002±0.338102.002\scriptstyle\pm 0.338 6406.902±20.876406.902\scriptstyle\pm 20.87 32034.058±40.8632034.058\scriptstyle\pm 40.86 0.215±0.0010.215\scriptstyle\pm 0.001 0.256±0.0000.256\scriptstyle\pm 0.000 0.257±0.0000.257\scriptstyle\pm 0.000
LDVI 178.241±3.129178.241\scriptstyle\pm 3.129 4931.898±87.434931.898\scriptstyle\pm 87.43 31019.831±278.631019.831\scriptstyle\pm 278.6 0.118±0.0030.118\scriptstyle\pm 0.003 0.043±0.0000.043\scriptstyle\pm 0.000 0.054±0.000\mathbf{0.054\scriptstyle\pm 0.000} 38.758±4.94038.758\scriptstyle\pm 4.940 2899.472±102.92899.472\scriptstyle\pm 102.9 17435.914±299.817435.914\scriptstyle\pm 299.8 0.084±0.0080.084\scriptstyle\pm 0.008 0.181±0.0030.181\scriptstyle\pm 0.003 0.183±0.0020.183\scriptstyle\pm 0.002
PIS 10.398±1.599\mathbf{10.398\scriptstyle\pm 1.599} 10405.749±69.4110405.749\scriptstyle\pm 69.41 92623.455±1219.92623.455\scriptstyle\pm 1219. 0.031±0.001\mathbf{0.031\scriptstyle\pm 0.001} 0.082±0.0000.082\scriptstyle\pm 0.000 0.168±0.0030.168\scriptstyle\pm 0.003 2.476±0.236\mathbf{2.476\scriptstyle\pm 0.236} 2078.751±41.512078.751\scriptstyle\pm 41.51 32415.244±63.1132415.244\scriptstyle\pm 63.11 0.033±0.001\mathbf{0.033\scriptstyle\pm 0.001} 0.205±0.0080.205\scriptstyle\pm 0.008 0.258±0.0010.258\scriptstyle\pm 0.001
DIS 65.162±35.7265.162\scriptstyle\pm 35.72 3044.733±464.7\mathbf{3044.733\scriptstyle\pm 464.7} 31573.015±702.431573.015\scriptstyle\pm 702.4 0.071±0.0170.071\scriptstyle\pm 0.017 0.034±0.003\mathbf{0.034\scriptstyle\pm 0.003} 0.055±0.001\mathbf{0.055\scriptstyle\pm 0.001} 3.486±0.2143.486\scriptstyle\pm 0.214 2200.590±18.732200.590\scriptstyle\pm 18.73 13059.766±72.1213059.766\scriptstyle\pm 72.12 0.037±0.0020.037\scriptstyle\pm 0.002 0.155±0.0010.155\scriptstyle\pm 0.001 0.152±0.0010.152\scriptstyle\pm 0.001
DDS 16.217±3.20216.217\scriptstyle\pm 3.202 5435.177±172.25435.177\scriptstyle\pm 172.2 38576.259±392.938576.259\scriptstyle\pm 392.9 0.035±0.0020.035\scriptstyle\pm 0.002 0.045±0.0010.045\scriptstyle\pm 0.001 0.065±0.0010.065\scriptstyle\pm 0.001 3.641±0.2243.641\scriptstyle\pm 0.224 2145.188±3.9602145.188\scriptstyle\pm 3.960 24187.186±256.424187.186\scriptstyle\pm 256.4 0.034±0.003\mathbf{0.034\scriptstyle\pm 0.003} 0.124±0.0010.124\scriptstyle\pm 0.001 0.219±0.0030.219\scriptstyle\pm 0.003
GBS 140.138±39.76140.138\scriptstyle\pm 39.76 5027.819±103.75027.819\scriptstyle\pm 103.7 31970.248±1177.31970.248\scriptstyle\pm 1177. 0.108±0.0210.108\scriptstyle\pm 0.021 0.043±0.0000.043\scriptstyle\pm 0.000 0.055±0.001\mathbf{0.055\scriptstyle\pm 0.001} 2.572±0.0992.572\scriptstyle\pm 0.099 5708.871±20.915708.871\scriptstyle\pm 20.91 22914.911±300.422914.911\scriptstyle\pm 300.4 0.034±0.001\mathbf{0.034\scriptstyle\pm 0.001} 0.232±0.0000.232\scriptstyle\pm 0.000 0.203±0.0010.203\scriptstyle\pm 0.001
𝚫log𝐙r\mathbf{\Delta\log Z}_{r}\downarrow 𝚫log𝐙f\mathbf{\Delta\log Z}_{f}\downarrow 𝚫log𝐙r\mathbf{\Delta\log Z}_{r}\downarrow 𝚫log𝐙f\mathbf{\Delta\log Z}_{f}\downarrow
MFVI 0.084±0.0660.084\scriptstyle\pm 0.066 3.658±0.0403.658\scriptstyle\pm 0.040 3.676±0.01303.676\scriptstyle\pm 0.0130 0.150±0.0020.150\scriptstyle\pm 0.002 0.185±0.0020.185\scriptstyle\pm 0.002 0.176±0.00500.176\scriptstyle\pm 0.0050 0.018±0.0030.018\scriptstyle\pm 0.003 3.009±0.2913.009\scriptstyle\pm 0.291 8.048±0.7588.048\scriptstyle\pm 0.758 0.114±0.0000.114\scriptstyle\pm 0.000 0.048±0.002\mathbf{0.048\scriptstyle\pm 0.002} 5.982±0.0195.982\scriptstyle\pm 0.019
GMMVI 0.044±0.0110.044\scriptstyle\pm 0.011 1.715±0.119\mathbf{1.715\scriptstyle\pm 0.119} 1.709±0.0580\mathbf{1.709\scriptstyle\pm 0.0580} 0.003±0.002\mathbf{0.003\scriptstyle\pm 0.002} 0.048±0.007\mathbf{0.048\scriptstyle\pm 0.007} 0.028±0.0270\mathbf{0.028\scriptstyle\pm 0.0270} 0.000±0.000\mathbf{0.000\scriptstyle\pm 0.000} 1.282±0.2211.282\scriptstyle\pm 0.221 7.126±0.377\mathbf{7.126\scriptstyle\pm 0.377} 0.000±0.000\mathbf{0.000\scriptstyle\pm 0.000} 0.084±0.0550.084\scriptstyle\pm 0.055 5.708±0.478\mathbf{5.708\scriptstyle\pm 0.478}
SMC 0.069±0.0100.069\scriptstyle\pm 0.010 690.721±11.21690.721\scriptstyle\pm 11.21 6326.621±51.4286326.621\scriptstyle\pm 51.428 2.728±0.0002.728\scriptstyle\pm 0.000 161.796±0.000161.796\scriptstyle\pm 0.000 661.945±0.0000661.945\scriptstyle\pm 0.0000 0.016±0.0090.016\scriptstyle\pm 0.009 3.880±1.1053.880\scriptstyle\pm 1.105 49.846±7.63849.846\scriptstyle\pm 7.638 1.262±0.0001.262\scriptstyle\pm 0.000 80.992±0.00080.992\scriptstyle\pm 0.000 338.745±0.000338.745\scriptstyle\pm 0.000
AFT 0.023±0.0150.023\scriptstyle\pm 0.015 765.624±108.0765.624\scriptstyle\pm 108.0 5567.272±277.525567.272\scriptstyle\pm 277.52 1.157±0.0381.157\scriptstyle\pm 0.038 110.955±18.37110.955\scriptstyle\pm 18.37 420.932±12.987420.932\scriptstyle\pm 12.987 0.024±0.0140.024\scriptstyle\pm 0.014 4.081±1.5794.081\scriptstyle\pm 1.579 47.121±6.69347.121\scriptstyle\pm 6.693 0.639±0.0950.639\scriptstyle\pm 0.095 205.297±23.91205.297\scriptstyle\pm 23.91 12765.117±2877.12765.117\scriptstyle\pm 2877.
CRAFT 0.008±0.0010.008\scriptstyle\pm 0.001 337.094±9.296337.094\scriptstyle\pm 9.296 2504.363±64.9702504.363\scriptstyle\pm 64.970 0.901±0.0070.901\scriptstyle\pm 0.007 100.987±0.065100.987\scriptstyle\pm 0.065 415.277±0.4550415.277\scriptstyle\pm 0.4550 0.004±0.0010.004\scriptstyle\pm 0.001 0.822±0.087\mathbf{0.822\scriptstyle\pm 0.087} 19.738±0.34219.738\scriptstyle\pm 0.342 0.333±0.0130.333\scriptstyle\pm 0.013 210.245±6.098210.245\scriptstyle\pm 6.098 12516.502±631.812516.502\scriptstyle\pm 631.8
FAB 0.007±0.003\mathbf{0.007\scriptstyle\pm 0.003} 2.952±0.2472.952\scriptstyle\pm 0.247 3.331±0.22903.331\scriptstyle\pm 0.2290 1.193±0.1251.193\scriptstyle\pm 0.125 126.363±1.789126.363\scriptstyle\pm 1.789 545.226±5.6200545.226\scriptstyle\pm 5.6200 0.005±0.0010.005\scriptstyle\pm 0.001 3.358±1.0623.358\scriptstyle\pm 1.062 43.419±4.69043.419\scriptstyle\pm 4.690 0.268±0.0930.268\scriptstyle\pm 0.093 84.592±22.6484.592\scriptstyle\pm 22.64 13514.417±101.913514.417\scriptstyle\pm 101.9
MCD 0.010±0.0020.010\scriptstyle\pm 0.002 31.319±1.79331.319\scriptstyle\pm 1.793 2354.020±60.8552354.020\scriptstyle\pm 60.855 0.009±0.0050.009\scriptstyle\pm 0.005 21.148±1.47821.148\scriptstyle\pm 1.478 305.656±2.6620305.656\scriptstyle\pm 2.6620 0.010±0.0020.010\scriptstyle\pm 0.002 28.607±1.27528.607\scriptstyle\pm 1.275 210.536±1.393210.536\scriptstyle\pm 1.393 0.068±0.0100.068\scriptstyle\pm 0.010 24.757±0.84124.757\scriptstyle\pm 0.841 147.321±1.272147.321\scriptstyle\pm 1.272
LDVI 0.038±0.0150.038\scriptstyle\pm 0.015 8.159±0.7758.159\scriptstyle\pm 0.775 647.953±7.2120647.953\scriptstyle\pm 7.2120 0.031±0.0080.031\scriptstyle\pm 0.008 15.477±0.81515.477\scriptstyle\pm 0.815 282.699±4.1050282.699\scriptstyle\pm 4.1050 0.004±0.0010.004\scriptstyle\pm 0.001 4.360±0.7414.360\scriptstyle\pm 0.741 103.224±2.118103.224\scriptstyle\pm 2.118 0.017±0.0030.017\scriptstyle\pm 0.003 5.472±0.9385.472\scriptstyle\pm 0.938 83.029±0.81983.029\scriptstyle\pm 0.819
CMCD 0.026±0.0110.026\scriptstyle\pm 0.011 51.218±2.80951.218\scriptstyle\pm 2.809 306.127±24.673306.127\scriptstyle\pm 24.673 0.030±0.0080.030\scriptstyle\pm 0.008 79.227±3.75879.227\scriptstyle\pm 3.758 440.341±4.2520440.341\scriptstyle\pm 4.2520 0.004±0.0010.004\scriptstyle\pm 0.001 10.533±0.40410.533\scriptstyle\pm 0.404 167.654±1.564167.654\scriptstyle\pm 1.564 0.005±0.0020.005\scriptstyle\pm 0.002 12.835±0.27512.835\scriptstyle\pm 0.275 148.676±2.851148.676\scriptstyle\pm 2.851
PIS 0.267±0.0060.267\scriptstyle\pm 0.006 7.122±0.6307.122\scriptstyle\pm 0.630 40.699±0.543040.699\scriptstyle\pm 0.5430 0.094±0.0380.094\scriptstyle\pm 0.038 3113.492±1.9783113.492\scriptstyle\pm 1.978 16071.743±3.046016071.743\scriptstyle\pm 3.0460 0.275±0.0160.275\scriptstyle\pm 0.016 12.248±0.32612.248\scriptstyle\pm 0.326 209.981±2.573209.981\scriptstyle\pm 2.573 0.342±0.0010.342\scriptstyle\pm 0.001 54.090±0.15154.090\scriptstyle\pm 0.151 304.178±0.329304.178\scriptstyle\pm 0.329
DIS 0.058±0.0300.058\scriptstyle\pm 0.030 87.709±8.94287.709\scriptstyle\pm 8.942 11646.394±15938.11646.394\scriptstyle\pm 15938. 1.390±0.4581.390\scriptstyle\pm 0.458 369.352±16.29369.352\scriptstyle\pm 16.29 14376.906±17877.14376.906\scriptstyle\pm 17877. 0.049±0.0050.049\scriptstyle\pm 0.005 10.448±0.60710.448\scriptstyle\pm 0.607 658.634±4.952658.634\scriptstyle\pm 4.952 3.212±0.0283.212\scriptstyle\pm 0.028 87.897±5.25587.897\scriptstyle\pm 5.255 433.741±10.78433.741\scriptstyle\pm 10.78
DDS 0.012±0.0050.012\scriptstyle\pm 0.005 1.739±0.4421.739\scriptstyle\pm 0.442 27.506±2.584027.506\scriptstyle\pm 2.5840 1.698±0.0291.698\scriptstyle\pm 0.029 207.545±1.163207.545\scriptstyle\pm 1.163 1052.805±1.73201052.805\scriptstyle\pm 1.7320 0.005±0.0010.005\scriptstyle\pm 0.001 7.952±0.2997.952\scriptstyle\pm 0.299 155.502±3.594155.502\scriptstyle\pm 3.594 0.315±0.0000.315\scriptstyle\pm 0.000 53.411±0.02453.411\scriptstyle\pm 0.024 291.566±0.102291.566\scriptstyle\pm 0.102
GBS 0.007±0.001\mathbf{0.007\scriptstyle\pm 0.001} 8.103±1.6968.103\scriptstyle\pm 1.696 87.971±14.65687.971\scriptstyle\pm 14.656 0.008±0.0010.008\scriptstyle\pm 0.001 9.321±0.7769.321\scriptstyle\pm 0.776 72.634±12.30172.634\scriptstyle\pm 12.301 0.002±0.0000.002\scriptstyle\pm 0.000 53.767±0.73253.767\scriptstyle\pm 0.732 157.791±2.947157.791\scriptstyle\pm 2.947 0.010±0.0020.010\scriptstyle\pm 0.002 47.441±0.09847.441\scriptstyle\pm 0.098 101.874±2.214101.874\scriptstyle\pm 2.214
ELBO \uparrow EUBO \downarrow ELBO \uparrow EUBO \downarrow
MFVI 3.011±0.002-3.011\scriptstyle\pm 0.002 3.690±0.000-3.690\scriptstyle\pm 0.000 3.695±0.0010-3.695\scriptstyle\pm 0.0010 3.089±0.0003.089\scriptstyle\pm 0.000 164.114±0.000164.114\scriptstyle\pm 0.000 666.954±0.0000666.954\scriptstyle\pm 0.0000 1.038±0.007-1.038\scriptstyle\pm 0.007 5.957±0.007-5.957\scriptstyle\pm 0.007 16.969±0.011-16.969\scriptstyle\pm 0.011 1.218±0.0011.218\scriptstyle\pm 0.001 72.663±0.00572.663\scriptstyle\pm 0.005 324.202±0.044324.202\scriptstyle\pm 0.044
GMMVI 0.045±0.011\mathbf{-0.045\scriptstyle\pm 0.011} 1.715±0.119\mathbf{-1.715\scriptstyle\pm 0.119} 1.709±0.0580\mathbf{-1.709\scriptstyle\pm 0.0580} 3.619±1.3083.619\scriptstyle\pm 1.308 240.459±51.13240.459\scriptstyle\pm 51.13 645.405±6.3090645.405\scriptstyle\pm 6.3090 0.001±0.000\mathbf{-0.001\scriptstyle\pm 0.000} 3.890±0.122-3.890\scriptstyle\pm 0.122 15.649±0.173\mathbf{-15.649\scriptstyle\pm 0.173} 0.002±0.001\mathbf{0.002\scriptstyle\pm 0.001} 57.746±1.92857.746\scriptstyle\pm 1.928 268.513±17.65268.513\scriptstyle\pm 17.65
SMC 2.095±0.009-2.095\scriptstyle\pm 0.009 877.034±10.23-877.034\scriptstyle\pm 10.23 6816.697±44.195-6816.697\scriptstyle\pm 44.195 2.734±0.0002.734\scriptstyle\pm 0.000 161.921±0.000161.921\scriptstyle\pm 0.000 662.404±0.0000662.404\scriptstyle\pm 0.0000 0.010±0.016-0.010\scriptstyle\pm 0.016 4.634±1.088-4.634\scriptstyle\pm 1.088 52.535±7.564-52.535\scriptstyle\pm 7.564 1.272±0.0001.272\scriptstyle\pm 0.000 81.325±0.00081.325\scriptstyle\pm 0.000 340.984±0.000340.984\scriptstyle\pm 0.000
AFT 1.778±0.090-1.778\scriptstyle\pm 0.090 927.16±103.8-927.16\scriptstyle\pm 103.8 6053.823±260.99-6053.823\scriptstyle\pm 260.99 1.248±0.0451.248\scriptstyle\pm 0.045 117.63±22.16117.63\scriptstyle\pm 22.16 439.434±16.788439.434\scriptstyle\pm 16.788 0.041±0.031-0.041\scriptstyle\pm 0.031 4.923±1.546-4.923\scriptstyle\pm 1.546 50.328±6.627-50.328\scriptstyle\pm 6.627 0.67±0.1000.67\scriptstyle\pm 0.100 207.625±24.14207.625\scriptstyle\pm 24.14 12801.561±2892.12801.561\scriptstyle\pm 2892.
CRAFT 0.666±0.026-0.666\scriptstyle\pm 0.026 451.399±7.561-451.399\scriptstyle\pm 7.561 2836.471±57.695-2836.471\scriptstyle\pm 57.695 0.976±0.0070.976\scriptstyle\pm 0.007 103.674±0.069103.674\scriptstyle\pm 0.069 425.500±0.7070425.500\scriptstyle\pm 0.7070 0.002±0.002\mathbf{-0.002\scriptstyle\pm 0.002} 0.339±0.180\mathbf{-0.339\scriptstyle\pm 0.180} 22.687±0.358-22.687\scriptstyle\pm 0.358 0.346±0.0130.346\scriptstyle\pm 0.013 212.210±6.160212.210\scriptstyle\pm 6.160 12553.883±645.712553.883\scriptstyle\pm 645.7
FAB 19.932±12.80-19.932\scriptstyle\pm 12.80 299.916±253.4-299.916\scriptstyle\pm 253.4 63.212±56.191-63.212\scriptstyle\pm 56.191 0.865±0.1130.865\scriptstyle\pm 0.113 93.560±5.08693.560\scriptstyle\pm 5.086 386.884±12.161386.884\scriptstyle\pm 12.161 0.257±0.075-0.257\scriptstyle\pm 0.075 75.735±175.875.735\scriptstyle\pm 175.8 98.558±7.688-98.558\scriptstyle\pm 7.688 0.162±0.0550.162\scriptstyle\pm 0.055 18.088±2.503\mathbf{18.088\scriptstyle\pm 2.503} 227.514±0.320227.514\scriptstyle\pm 0.320
MCD 0.651±0.014-0.651\scriptstyle\pm 0.014 185.021±0.743-185.021\scriptstyle\pm 0.743 4017.832±20.356-4017.832\scriptstyle\pm 20.356 0.652±0.0080.652\scriptstyle\pm 0.008 43.670±0.457\mathbf{43.670\scriptstyle\pm 0.457} 358.687±2.1120358.687\scriptstyle\pm 2.1120 1.215±0.005-1.215\scriptstyle\pm 0.005 69.358±0.633-69.358\scriptstyle\pm 0.633 308.728±0.450-308.728\scriptstyle\pm 0.450 0.734±0.0020.734\scriptstyle\pm 0.002 47.834±0.82047.834\scriptstyle\pm 0.820 208.626±0.525208.626\scriptstyle\pm 0.525
LDVI 0.986±0.136-0.986\scriptstyle\pm 0.136 29.034±0.591-29.034\scriptstyle\pm 0.591 956.576±6.2700-956.576\scriptstyle\pm 6.2700 1.072±0.2421.072\scriptstyle\pm 0.242 51.137±0.17751.137\scriptstyle\pm 0.177 375.527±3.1100375.527\scriptstyle\pm 3.1100 0.311±0.034-0.311\scriptstyle\pm 0.034 28.471±1.018-28.471\scriptstyle\pm 1.018 173.716±2.629-173.716\scriptstyle\pm 2.629 0.198±0.0080.198\scriptstyle\pm 0.008 20.887±1.04220.887\scriptstyle\pm 1.042 132.711±1.817\mathbf{132.711\scriptstyle\pm 1.817}
PIS 0.585±0.016-0.585\scriptstyle\pm 0.016 16.881±0.026-16.881\scriptstyle\pm 0.026 65.700±0.2010-65.700\scriptstyle\pm 0.2010 7.344±0.0047.344\scriptstyle\pm 0.004 3626.120±1.3603626.120\scriptstyle\pm 1.360 16979.347±4.470016979.347\scriptstyle\pm 4.4700 0.387±0.009-0.387\scriptstyle\pm 0.009 29.261±1.743-29.261\scriptstyle\pm 1.743 306.678±0.548-306.678\scriptstyle\pm 0.548 1.868±0.0001.868\scriptstyle\pm 0.000 88.192±0.00588.192\scriptstyle\pm 0.005 363.435±0.030363.435\scriptstyle\pm 0.030
DIS 1.850±0.359-1.850\scriptstyle\pm 0.359 181.348±15.47-181.348\scriptstyle\pm 15.47 14142.693±17807.-14142.693\scriptstyle\pm 17807. 6.653±0.3576.653\scriptstyle\pm 0.357 546.335±30.86546.335\scriptstyle\pm 30.86 15792.004±18966.15792.004\scriptstyle\pm 18966. 0.157±0.023-0.157\scriptstyle\pm 0.023 36.704±0.629-36.704\scriptstyle\pm 0.629 819.959±6.264-819.959\scriptstyle\pm 6.264 4.778±0.0384.778\scriptstyle\pm 0.038 193.270±3.293193.270\scriptstyle\pm 3.293 658.575±7.820658.575\scriptstyle\pm 7.820
DDS 0.527±0.022-0.527\scriptstyle\pm 0.022 13.284±0.460-13.284\scriptstyle\pm 0.460 60.642±2.3330-60.642\scriptstyle\pm 2.3330 4.176±0.0004.176\scriptstyle\pm 0.000 291.867±0.047291.867\scriptstyle\pm 0.047 1224.926±2.48501224.926\scriptstyle\pm 2.4850 0.110±0.007-0.110\scriptstyle\pm 0.007 31.681±0.363-31.681\scriptstyle\pm 0.363 244.188±3.504-244.188\scriptstyle\pm 3.504 1.783±0.0001.783\scriptstyle\pm 0.000 86.014±0.00186.014\scriptstyle\pm 0.001 351.204±0.005351.204\scriptstyle\pm 0.005
GBS 0.473±0.061-0.473\scriptstyle\pm 0.061 35.771±1.105-35.771\scriptstyle\pm 1.105 161.259±20.704-161.259\scriptstyle\pm 20.704 0.485±0.047\mathbf{0.485\scriptstyle\pm 0.047} 67.819±2.15767.819\scriptstyle\pm 2.157 204.498±48.539\mathbf{204.498\scriptstyle\pm 48.539} 0.064±0.004-0.064\scriptstyle\pm 0.004 99.369±0.158-99.369\scriptstyle\pm 0.158 258.263±2.639-258.263\scriptstyle\pm 2.639 0.064±0.0040.064\scriptstyle\pm 0.004 73.545±0.10773.545\scriptstyle\pm 0.107 147.412±1.504147.412\scriptstyle\pm 1.504
ESSr\textbf{ESS}_{r} \uparrow ESSf\textbf{ESS}_{f} \uparrow ESSr\textbf{ESS}_{r} \uparrow ESSf\textbf{ESS}_{f} \uparrow
MFVI 0.077±0.0160.077\scriptstyle\pm 0.016 0.997±0.0000.997\scriptstyle\pm 0.000 0.988±0.0010.988\scriptstyle\pm 0.001 0.286±0.0000.286\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.180±0.0070.180\scriptstyle\pm 0.007 0.031±0.007\mathbf{0.031\scriptstyle\pm 0.007} 0.006±0.001\mathbf{0.006\scriptstyle\pm 0.001} 0.163±0.0000.163\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000
GMMVI 1.000±0.000\mathbf{1.000\scriptstyle\pm 0.000} 1.000±0.000\mathbf{1.000\scriptstyle\pm 0.000} 1.000±0.000\mathbf{1.000\scriptstyle\pm 0.000} 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.997±0.000\mathbf{0.997\scriptstyle\pm 0.000} 0.027±0.0040.027\scriptstyle\pm 0.004 0.006±0.001\mathbf{0.006\scriptstyle\pm 0.001} 0.997±0.000\mathbf{0.997\scriptstyle\pm 0.000} 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000
MCD 0.311±0.0130.311\scriptstyle\pm 0.013 0.001±0.0000.001\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.289±0.0100.289\scriptstyle\pm 0.010 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.332±0.0040.332\scriptstyle\pm 0.004 0.001±0.0000.001\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.352±0.0040.352\scriptstyle\pm 0.004 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000
LDVI 0.207±0.0440.207\scriptstyle\pm 0.044 0.002±0.0000.002\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.269±0.0460.269\scriptstyle\pm 0.046 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.742±0.0060.742\scriptstyle\pm 0.006 0.002±0.0000.002\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.761±0.0040.761\scriptstyle\pm 0.004 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000
PIS 0.529±0.0120.529\scriptstyle\pm 0.012 0.006±0.0010.006\scriptstyle\pm 0.001 0.002±0.0000.002\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.840±0.0040.840\scriptstyle\pm 0.004 0.003±0.0000.003\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.042±0.0000.042\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000
DIS 0.078±0.0250.078\scriptstyle\pm 0.025 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.580±0.0030.580\scriptstyle\pm 0.003 0.002±0.0000.002\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000
DDS 0.338±0.0030.338\scriptstyle\pm 0.003 0.003±0.0000.003\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.780±0.0100.780\scriptstyle\pm 0.010 0.002±0.0000.002\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.032±0.0000.032\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000
GBS 0.405±0.0290.405\scriptstyle\pm 0.029 0.002±0.0000.002\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.380±0.027\mathbf{0.380\scriptstyle\pm 0.027} 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.879±0.0020.879\scriptstyle\pm 0.002 0.001±0.0000.001\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000 0.721±0.1610.721\scriptstyle\pm 0.161 0.000±0.0000.000\scriptstyle\pm 0.000 0.000±0.0000.000\scriptstyle\pm 0.000
Table 9: Results for various sampling methods for MoG and MoS with varying dimensions dd. Evaluation criteria include 2-Wasserstein distance (𝒲2\mathcal{W}_{2}), maximum mean discrepancy (MMD), reverse and forward partition function error (ΔlogZr\Delta\log Z_{r}, ΔlogZf\Delta\log Z_{f}), lower and upper evidence bounds (ELBO, EUBO), reverse and forward effective sample size (ESSr\text{ESS}_{r}, ESSf\text{ESS}_{f}). The best results are highlighted in bold. Arrows (\uparrow, \downarrow) indicate whether higher or lower values are preferable, respectively.
NFE \downarrow
Method d=2d=2 d=50d=50 d=200d=200
MFVI 6.5×1066.5\times 10^{6} 2.3×1062.3\times 10^{6} 1.9×1061.9\times 10^{6}
GMMVI 1.4×1051.4\times 10^{5} 5.9×1055.9\times 10^{5} 7.9×1057.9\times 10^{5}
SMC 2.8×1062.8\times 10^{6} 2.8×1062.8\times 10^{6} 2.8×1062.8\times 10^{6}
AFT 2.0×1052.0\times 10^{5} 2.0×1052.0\times 10^{5} 2.0×1052.0\times 10^{5}
CRAFT 4.5×1094.5\times 10^{9} 4.4×1094.4\times 10^{9} 4.5×1094.5\times 10^{9}
FAB 1.5×1071.5\times 10^{7} 3.4×1073.4\times 10^{7} 3.4×1073.4\times 10^{7}
DDS 6.0×1086.0\times 10^{8} 4.8×1084.8\times 10^{8} 3.6×1083.6\times 10^{8}
MCD 1.3×1091.3\times 10^{9} 1.3×1091.3\times 10^{9} 1.2×1091.2\times 10^{9}
LDVI 1.3×1091.3\times 10^{9} 1.3×1091.3\times 10^{9} 1.3×1091.3\times 10^{9}
Table 10: Number of function evaluations (NFE), that is, the number of times a sampling method queries γ(𝐱)\gamma(\mathbf{x}) until achieving its highest ELBO value, for varying dimensions dd on MoG.

Appendix F Ablation Studies

F.1 Ablation Study: Batchsize and Number of Particles

Experimental Setup. We test the influence of different batch sizes and numbers of particles on ELBO and EMC for various methods on the MoG experiment. We use the parameters detailed in Appendix D.
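To make the role of the batch size explicit, the following is a minimal sketch of a reparameterized Monte Carlo ELBO estimate for a diagonal-Gaussian model, where the batch size only controls the noise of the estimate; the names (`elbo_estimate`, `log_gamma`) are illustrative and not taken from the benchmark code.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def elbo_estimate(key, mean, log_std, log_gamma, batch_size):
    # Reparameterized samples x = mean + std * eps with eps ~ N(0, I).
    eps = jax.random.normal(key, (batch_size, mean.shape[0]))
    x = mean + jnp.exp(log_std) * eps
    log_q = norm.logpdf(x, mean, jnp.exp(log_std)).sum(-1)  # log q(x), diagonal Gaussian
    return jnp.mean(jax.vmap(log_gamma)(x) - log_q)         # Monte Carlo ELBO estimate

# Example with a standard-normal target: a larger batch reduces the variance
# of the estimate but leaves its expectation unchanged.
log_gamma = lambda x: -0.5 * jnp.sum(x ** 2)
print(elbo_estimate(jax.random.PRNGKey(0), jnp.zeros(2), jnp.zeros(2), log_gamma, 512))
```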

Discussion. The results of the batch-size ablation can be found in Table 11. We find that increasing the batch size does not yield significant performance gains for simple methods such as MFVI. For more complex methods such as MCD or DDS, larger batch sizes tend to yield consistently better ELBO values across varying dimensionalities of the target density. In contrast, EMC values are unaffected by larger batch sizes (cf. MCD d=200d=200).

The results for the number of particles can be found in Table 12. Surprisingly, ELBO values often do not improve beyond 512512 particles, despite particle interactions through resampling (Del Moral et al., 2006). Moreover, similar to the batch size, EMC does not change significantly when using a larger number of particles.

ELBO \uparrow EMC \uparrow
Method Batchsize d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200
MFVI 64 3.011±0.003\mathbf{-3.011\scriptstyle\pm 0.003} 3.707±0.002-3.707\scriptstyle\pm 0.002 3.746±0.001-3.746\scriptstyle\pm 0.001 0.383±0.002\mathbf{0.383\scriptstyle\pm 0.002} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
128 3.012±0.004\mathbf{-3.012\scriptstyle\pm 0.004} 3.698±0.002\mathbf{-3.698\scriptstyle\pm 0.002} 3.731±0.002-3.731\scriptstyle\pm 0.002 0.382±0.003\mathbf{0.382\scriptstyle\pm 0.003} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
512 3.011±0.004\mathbf{-3.011\scriptstyle\pm 0.004} 3.694±0.0\mathbf{-3.694\scriptstyle\pm 0.0} 3.706±0.0\mathbf{-3.706\scriptstyle\pm 0.0} 0.382±0.002\mathbf{0.382\scriptstyle\pm 0.002} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
1024 3.012±0.003\mathbf{-3.012\scriptstyle\pm 0.003} 3.692±0.001\mathbf{-3.692\scriptstyle\pm 0.001} 3.701±0.002\mathbf{-3.701\scriptstyle\pm 0.002} 0.382±0.002\mathbf{0.382\scriptstyle\pm 0.002} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
2048 3.012±0.003\mathbf{-3.012\scriptstyle\pm 0.003} 3.691±0.0\mathbf{-3.691\scriptstyle\pm 0.0} 3.697±0.001\mathbf{-3.697\scriptstyle\pm 0.001} 0.383±0.002\mathbf{0.383\scriptstyle\pm 0.002} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
MCD 64 3.017±0.2-3.017\scriptstyle\pm 0.2 942.74±8.447-942.74\scriptstyle\pm 8.447 4699.422±269.44-4699.422\scriptstyle\pm 269.44 0.796±0.003\mathbf{0.796\scriptstyle\pm 0.003} 0.994±0.001\mathbf{0.994\scriptstyle\pm 0.001} 0.989±0.0\mathbf{0.989\scriptstyle\pm 0.0}
128 2.685±0.168-2.685\scriptstyle\pm 0.168 889.472±7.41-889.472\scriptstyle\pm 7.41 4145.279±179.564-4145.279\scriptstyle\pm 179.564 0.798±0.001\mathbf{0.798\scriptstyle\pm 0.001} 0.994±0.001\mathbf{0.994\scriptstyle\pm 0.001} 0.988±0.0\mathbf{0.988\scriptstyle\pm 0.0}
512 2.409±0.05-2.409\scriptstyle\pm 0.05 876.718±6.132-876.718\scriptstyle\pm 6.132 3442.883±260.824\mathbf{-3442.883\scriptstyle\pm 260.824} 0.796±0.002\mathbf{0.796\scriptstyle\pm 0.002} 0.994±0.0\mathbf{0.994\scriptstyle\pm 0.0} 0.988±0.0\mathbf{0.988\scriptstyle\pm 0.0}
1024 2.277±0.131-2.277\scriptstyle\pm 0.131 844.588±10.761-844.588\scriptstyle\pm 10.761 OOM 0.797±0.002\mathbf{0.797\scriptstyle\pm 0.002} 0.994±0.0\mathbf{0.994\scriptstyle\pm 0.0} OOM
2048 2.257±0.075\mathbf{-2.257\scriptstyle\pm 0.075} 823.443±18.151\mathbf{-823.443\scriptstyle\pm 18.151} OOM 0.796±0.002\mathbf{0.796\scriptstyle\pm 0.002} 0.994±0.0\mathbf{0.994\scriptstyle\pm 0.0} OOM
DDS 64 0.807±0.036-0.807\scriptstyle\pm 0.036 16.83±0.404-16.83\scriptstyle\pm 0.404 67.053±0.993-67.053\scriptstyle\pm 0.993 0.973±0.0020.973\scriptstyle\pm 0.002 0.992±0.0\mathbf{0.992\scriptstyle\pm 0.0} 0.984±0.001\mathbf{0.984\scriptstyle\pm 0.001}
128 0.716±0.009-0.716\scriptstyle\pm 0.009 16.092±0.247-16.092\scriptstyle\pm 0.247 65.232±0.4-65.232\scriptstyle\pm 0.4 0.978±0.0020.978\scriptstyle\pm 0.002 0.991±0.001\mathbf{0.991\scriptstyle\pm 0.001} 0.983±0.001\mathbf{0.983\scriptstyle\pm 0.001}
512 0.611±0.022-0.611\scriptstyle\pm 0.022 15.61±0.206-15.61\scriptstyle\pm 0.206 63.135±0.348-63.135\scriptstyle\pm 0.348 0.984±0.001\mathbf{0.984\scriptstyle\pm 0.001} 0.992±0.001\mathbf{0.992\scriptstyle\pm 0.001} 0.983±0.001\mathbf{0.983\scriptstyle\pm 0.001}
1024 0.593±0.011-0.593\scriptstyle\pm 0.011 15.414±0.161-15.414\scriptstyle\pm 0.161 62.086±0.258-62.086\scriptstyle\pm 0.258 0.985±0.001\mathbf{0.985\scriptstyle\pm 0.001} 0.992±0.0\mathbf{0.992\scriptstyle\pm 0.0} 0.982±0.002\mathbf{0.982\scriptstyle\pm 0.002}
2048 0.556±0.009\mathbf{-0.556\scriptstyle\pm 0.009} 15.313±0.165\mathbf{-15.313\scriptstyle\pm 0.165} 61.576±0.384\mathbf{-61.576\scriptstyle\pm 0.384} 0.988±0.001\mathbf{0.988\scriptstyle\pm 0.001} 0.992±0.001\mathbf{0.992\scriptstyle\pm 0.001} 0.979±0.001\mathbf{0.979\scriptstyle\pm 0.001}
Table 11: ELBO and EMC values for varying batch sizes for different methods, and dimensions of the MoG target density. Best values are marked with bold font. Here, OOM refers to ‘out of memory’.
ELBO \uparrow EMC \uparrow
Method Particles d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200
SMC 64 9.267±0.217-9.267\scriptstyle\pm 0.217 2622.073±21.637-2622.073\scriptstyle\pm 21.637 17904.276±25.557-17904.276\scriptstyle\pm 25.557 0.824±0.0180.824\scriptstyle\pm 0.018 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
128 9.08±0.041-9.08\scriptstyle\pm 0.041 2647.733±31.935-2647.733\scriptstyle\pm 31.935 16999.909±14.959-16999.909\scriptstyle\pm 14.959 0.879±0.0350.879\scriptstyle\pm 0.035 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
512 8.823±0.033-8.823\scriptstyle\pm 0.033 1911.449±8.527\mathbf{-1911.449\scriptstyle\pm 8.527} 16867.03±44.663-16867.03\scriptstyle\pm 44.663 0.941±0.0040.941\scriptstyle\pm 0.004 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
1024 8.595±0.035\mathbf{-8.595\scriptstyle\pm 0.035} 2323.482±13.52-2323.482\scriptstyle\pm 13.52 15565.314±78.958-15565.314\scriptstyle\pm 78.958 0.971±0.004\mathbf{0.971\scriptstyle\pm 0.004} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
2048 10.317±0.028-10.317\scriptstyle\pm 0.028 2041.686±20.993-2041.686\scriptstyle\pm 20.993 15032.371±46.049\mathbf{-15032.371\scriptstyle\pm 46.049} 0.965±0.002\mathbf{0.965\scriptstyle\pm 0.002} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
CRAFT 64 3.666±0.048-3.666\scriptstyle\pm 0.048 793.354±19.752-793.354\scriptstyle\pm 19.752 4646.891±77.062-4646.891\scriptstyle\pm 77.062 0.986±0.001\mathbf{0.986\scriptstyle\pm 0.001} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
128 3.604±0.039-3.604\scriptstyle\pm 0.039 790.385±23.036-790.385\scriptstyle\pm 23.036 4656.227±80.153-4656.227\scriptstyle\pm 80.153 0.986±0.001\mathbf{0.986\scriptstyle\pm 0.001} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
512 3.6±0.061-3.6\scriptstyle\pm 0.061 784.881±14.364-784.881\scriptstyle\pm 14.364 4624.869±63.618\mathbf{-4624.869\scriptstyle\pm 63.618} 0.987±0.0\mathbf{0.987\scriptstyle\pm 0.0} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
1024 3.552±0.041\mathbf{-3.552\scriptstyle\pm 0.041} 785.251±16.847-785.251\scriptstyle\pm 16.847 4632.063±68.715-4632.063\scriptstyle\pm 68.715 0.986±0.002\mathbf{0.986\scriptstyle\pm 0.002} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
2048 3.553±0.05\mathbf{-3.553\scriptstyle\pm 0.05} 782.068±13.855\mathbf{-782.068\scriptstyle\pm 13.855} 4625.769±55.853-4625.769\scriptstyle\pm 55.853 0.987±0.001\mathbf{0.987\scriptstyle\pm 0.001} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
Table 12: ELBO and EMC values for varying number of particles and dimensions of the MoG target density.

F.2 Ablation Study: Number of Temperatures / Timesteps T

Experimental Setup. We test the influence of the number of temperatures/timesteps TT for methods of a sequential nature, such as sequential importance sampling or SDE-based methods. We use a batch size of 512512. The remaining parameters are set according to Appendix D.
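As a reference for what T controls, the following is a minimal sketch of the geometric annealing path that the temperatures discretize, assuming a linear schedule β_t = t/T and an isotropic Gaussian base; both choices are illustrative, and the actual schedules follow Appendix D.

```python
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal

def annealed_logpdf(x, t, T, log_gamma, sigma0=1.0):
    """log pi_t(x) = (1 - beta_t) * log pi_0(x) + beta_t * log gamma(x)."""
    beta = t / T                                   # linear temperature schedule (assumption)
    d = x.shape[-1]
    log_pi0 = multivariate_normal.logpdf(x, jnp.zeros(d), sigma0 ** 2 * jnp.eye(d))
    return (1.0 - beta) * log_pi0 + beta * log_gamma(x)
```

Larger T places consecutive targets π_{t-1} and π_t closer together, making each intermediate importance-sampling or SDE step easier.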

Discussion. The results are illustrated in Figure 6. We can see that larger values of TT tend to improve ELBO and EUBO values across all methods.

Figure 6: Negative ELBO and EUBO values for varying temperatures/timesteps TT for different dimensions of the MoG target density. Best values are marked with bold font. Missing values for T=256T=256 are caused by out-of-memory problems.

F.3 Ablation Study: Sequential Monte Carlo Design Choices

Experimental Setup. As Sequential Monte Carlo is the basis for many sampling methods such as SNF (Wu et al., 2020a), AFT (Arbel et al., 2021), CRAFT (Matthews et al., 2022), or FAB (Midgley et al., 2022), we perform a thorough ablation of its design choices. In particular, we ablate the influence of the MCMC kernel and whether or not resampling is used. We tested Metropolis-Hastings (MH) and Hamiltonian Monte Carlo (HMC) kernels, using the same number of function evaluations and hand-tuning the step sizes such that we obtained a rejection rate of 0.65\approx 0.65. The results are shown in Table 13.
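For reference, the following sketches a single SMC transition with the two design choices ablated here: an (optional) multinomial resampling step and an MCMC move, shown with a random-walk MH kernel. It is illustrative only; the benchmark follows Del Moral et al. (2006) with the settings of Appendix D.

```python
import jax
import jax.numpy as jnp

def smc_step(key, particles, log_w, log_pi_prev, log_pi_next,
             step_size=0.1, resample=True):
    k1, k2, k3 = jax.random.split(key, 3)
    # Reweight particles when moving from the previous to the next annealed target.
    log_w = log_w + jax.vmap(log_pi_next)(particles) - jax.vmap(log_pi_prev)(particles)
    if resample:
        # Multinomial resampling duplicates high-weight particles and resets weights.
        idx = jax.random.categorical(k1, log_w, shape=(particles.shape[0],))
        particles, log_w = particles[idx], jnp.zeros_like(log_w)
    # One random-walk Metropolis-Hastings move targeting log_pi_next.
    proposal = particles + step_size * jax.random.normal(k2, particles.shape)
    log_accept = jax.vmap(log_pi_next)(proposal) - jax.vmap(log_pi_next)(particles)
    accept = jnp.log(jax.random.uniform(k3, (particles.shape[0],))) < log_accept
    particles = jnp.where(accept[:, None], proposal, particles)
    return particles, log_w
```

An HMC kernel would replace the random-walk proposal with a simulated Hamiltonian trajectory, which typically yields higher acceptance rates for a comparable number of target evaluations.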

Discussion. HMC outperforms MH across all dimensions with respect to both ELBO and EMC values. Surprisingly, not using resampling avoids mode collapse entirely, as indicated by EMC1\text{EMC}\approx 1.

MCMC Re- ELBO \uparrow EMC \uparrow
Kernel sampling d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200
- ✗ 9.473±0.000-9.473\scriptstyle{\pm 0.000} 32034.303±0.000-32034.303\scriptstyle{\pm 0.000} 292642.344±0.000-292642.344\scriptstyle{\pm 0.000} 0.785±0.0000.785\scriptstyle{\pm 0.000} 0.987±0.000\mathbf{0.987\scriptstyle{\pm 0.000}} 0.988±0.000\mathbf{0.988\scriptstyle{\pm 0.000}}
- ✓ 9.28±0.2044-9.28\scriptstyle{\pm 0.2044} 27534.303±72.32-27534.303\scriptstyle{\pm 72.32} 288123.325±108.010-288123.325\scriptstyle{\pm 108.010} 0.618±0.1910.618\scriptstyle{\pm 0.191} 0±00\scriptstyle{\pm 0} 0±00\scriptstyle{\pm 0}
MH ✗ 9.166±0.138-9.166\scriptstyle{\pm 0.138} 26686.496±412.669-26686.496\scriptstyle{\pm 412.669} 275404.367±1375.306-275404.367\scriptstyle{\pm 1375.306} 0.785±0.0030.785\scriptstyle{\pm 0.003} 0.987±0.000\mathbf{0.987\scriptstyle{\pm 0.000}} 0.988±0.000\mathbf{0.988\scriptstyle{\pm 0.000}}
MH ✓ 9.064±0.034-9.064\scriptstyle{\pm 0.034} 22411.798±69.874-22411.798\scriptstyle{\pm 69.874} 251904.734±422.895-251904.734\scriptstyle{\pm 422.895} 0.864±0.0210.864\scriptstyle{\pm 0.021} 0±00\scriptstyle{\pm 0} 0±00\scriptstyle{\pm 0}
HMC ✗ 8.736±0.031\mathbf{-8.736\scriptstyle{\pm 0.031}} 2272.619±96.639-2272.619\scriptstyle{\pm 96.639} 18270.795±91.703-18270.795\scriptstyle{\pm 91.703} 0.798±0.0060.798\scriptstyle{\pm 0.006} 0.986±0.000\mathbf{0.986\scriptstyle{\pm 0.000}} 0.988±0.000\mathbf{0.988\scriptstyle{\pm 0.000}}
HMC ✓ 8.850±0.110{-8.850\scriptstyle{\pm 0.110}} 1931.168±18.844\mathbf{-1931.168\scriptstyle{\pm 18.844}} 16952.94±49.119\mathbf{-16952.94\scriptstyle{\pm 49.119}} 0.940±0.006\mathbf{0.940\scriptstyle{\pm 0.006}} 0±00\scriptstyle{\pm 0} 0±00\scriptstyle{\pm 0}
Table 13: Ablation study for Sequential Monte Carlo (Del Moral et al., 2006). ELBO and EMC values for different MCMC kernels, with (✓) or without (✗) resampling. Here, MH refers to Metropolis-Hastings and HMC to Hamiltonian Monte Carlo (Bishop, 2006). Results are reported for different dimensions of the MoG target density.

F.4 Ablation Study: Initial Model Support

Experimental Setup. We test the influence of the initial model support for different methods of a sequential nature. In particular, we vary the scale σ02\sigma_{0}^{2} of the initial proposal/base distribution π0(𝐱)=𝒩(0,σ02I)\pi_{0}(\mathbf{x})=\mathcal{N}(0,\sigma_{0}^{2}\textbf{I}). To that end, we report ELBO and EMC values on the MoG experiment for varying dimensions. We use the parameters detailed in Appendix D. The results are shown in Table 14.

Discussion. We observe that, in terms of the ELBO, most methods perform poorly with a higher initial scale, particularly in higher dimensions. Conversely, EMC values tend toward 0 for small initial scales and toward 11 for large ones.

Method Initial ELBO \uparrow EMC \uparrow
Scale d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200
SMC 1 3.717±0.056-3.717\scriptstyle\pm 0.056 1800.882±38.348-1800.882\scriptstyle\pm 38.348 12181.939±110.871-12181.939\scriptstyle\pm 110.871 0.002±0.0020.002\scriptstyle\pm 0.002 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
10 0.175±0.408\mathbf{-0.175\scriptstyle\pm 0.408} 313.926±4.847\mathbf{-313.926\scriptstyle\pm 4.847} 2008.238±10.954\mathbf{-2008.238\scriptstyle\pm 10.954} 0.722±0.0220.722\scriptstyle\pm 0.022 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
30 0.666±0.081-0.666\scriptstyle\pm 0.081 674.1±12.747-674.1\scriptstyle\pm 12.747 5014.372±27.284-5014.372\scriptstyle\pm 27.284 0.957±0.007\mathbf{0.957\scriptstyle\pm 0.007} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
60 8.823±0.033-8.823\scriptstyle\pm 0.033 1911.449±8.527-1911.449\scriptstyle\pm 8.527 16867.03±44.663-16867.03\scriptstyle\pm 44.663 0.941±0.0040.941\scriptstyle\pm 0.004 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
CRAFT 1 2.675±0.236-2.675\scriptstyle\pm 0.236 11.333±0.644\mathbf{-11.333\scriptstyle\pm 0.644} 83.301±1.267\mathbf{-83.301\scriptstyle\pm 1.267} 0.143±0.0330.143\scriptstyle\pm 0.033 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
10 0.633±0.538-0.633\scriptstyle\pm 0.538 136.414±2.482-136.414\scriptstyle\pm 2.482 1090.374±22.117-1090.374\scriptstyle\pm 22.117 0.657±0.2330.657\scriptstyle\pm 0.233 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
30 0.229±0.018\mathbf{-0.229\scriptstyle\pm 0.018} 350.247±11.605-350.247\scriptstyle\pm 11.605 2482.919±12.176-2482.919\scriptstyle\pm 12.176 0.974±0.0040.974\scriptstyle\pm 0.004 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
60 3.563±0.057-3.563\scriptstyle\pm 0.057 784.881±14.364-784.881\scriptstyle\pm 14.364 4624.869±63.618-4624.869\scriptstyle\pm 63.618 0.987±0.001\mathbf{0.987\scriptstyle\pm 0.001} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
MCD 1 3.676±0.001-3.676\scriptstyle\pm 0.001 3.292±0.011\mathbf{-3.292\scriptstyle\pm 0.011} 4.281±0.039\mathbf{-4.281\scriptstyle\pm 0.039} 0.005±0.00.005\scriptstyle\pm 0.0 0.187±0.00.187\scriptstyle\pm 0.0 0.005±0.0010.005\scriptstyle\pm 0.001
10 1.653±0.032-1.653\scriptstyle\pm 0.032 87.5±0.519-87.5\scriptstyle\pm 0.519 144.237±4.133-144.237\scriptstyle\pm 4.133 0.613±0.0030.613\scriptstyle\pm 0.003 0.658±0.0040.658\scriptstyle\pm 0.004 0.647±0.0020.647\scriptstyle\pm 0.002
30 1.138±0.064\mathbf{-1.138\scriptstyle\pm 0.064} 441.73±2.245-441.73\scriptstyle\pm 2.245 1265.551±6.991-1265.551\scriptstyle\pm 6.991 0.94±0.001\mathbf{0.94\scriptstyle\pm 0.001} 0.961±0.00.961\scriptstyle\pm 0.0 0.942±0.0020.942\scriptstyle\pm 0.002
60 2.384±0.059-2.384\scriptstyle\pm 0.059 878.12±8.598-878.12\scriptstyle\pm 8.598 3458.28±248.958-3458.28\scriptstyle\pm 248.958 0.798±0.0030.798\scriptstyle\pm 0.003 0.994±0.001\mathbf{0.994\scriptstyle\pm 0.001} 0.988±0.0\mathbf{0.988\scriptstyle\pm 0.0}
DDS 1 3.622±0.012-3.622\scriptstyle\pm 0.012 6.053±0.624\mathbf{-6.053\scriptstyle\pm 0.624} 49.0±10.277-49.0\scriptstyle\pm 10.277 0.0±0.00.0\scriptstyle\pm 0.0 0.187±0.00.187\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
10 0.737±0.024-0.737\scriptstyle\pm 0.024 6.954±0.146-6.954\scriptstyle\pm 0.146 20.149±0.075\mathbf{-20.149\scriptstyle\pm 0.075} 0.85±0.0010.85\scriptstyle\pm 0.001 0.26±0.0310.26\scriptstyle\pm 0.031 0.348±0.0110.348\scriptstyle\pm 0.011
30 0.408±0.01\mathbf{-0.408\scriptstyle\pm 0.01} 10.604±0.165-10.604\scriptstyle\pm 0.165 42.396±0.105-42.396\scriptstyle\pm 0.105 0.989±0.001\mathbf{0.989\scriptstyle\pm 0.001} 0.941±0.0030.941\scriptstyle\pm 0.003 0.841±0.0110.841\scriptstyle\pm 0.011
60 0.612±0.019-0.612\scriptstyle\pm 0.019 15.598±0.106-15.598\scriptstyle\pm 0.106 63.101±0.253-63.101\scriptstyle\pm 0.253 0.984±0.001\mathbf{0.984\scriptstyle\pm 0.001} 0.992±0.001\mathbf{0.992\scriptstyle\pm 0.001} 0.983±0.002\mathbf{0.983\scriptstyle\pm 0.002}
Table 14: ELBO and EMC values for varying initial scales, and dimensions of the MoG target density.

F.5 Ablation Study: Langevin Methods

Experimental Setup. The augmented ELBO allows for end-to-end training of several parameters that otherwise need careful tuning. Geffner & Domke (2022) showed that the mean and variance of the proposal distribution π0\pi_{0}, the time-discretization step size Δt\Delta_{t}, and the annealing schedule (βt)t=1T(\beta_{t})_{t=1}^{T} can be learned by maximizing the extended ELBO. Here, we test the influence of learning vs. fixing these parameters for MCD (Doucet et al., 2022b) on the MoG target for varying dimensions. The results are shown in Table 15. The fixed parameters are chosen according to Appendix D.
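As a point of reference, the following sketches the quantities being fixed or learned: the proposal π_0, per-step diffusion scales σ_t, and a monotone annealing schedule (β_t), obtained here via a softmax-cumsum parameterization, together with one unadjusted Langevin step on the annealed target. The parameterization is an illustrative assumption, not the exact construction of Geffner & Domke (2022).

```python
import jax
import jax.numpy as jnp

def init_params(d, T):
    return {
        "proposal_mean": jnp.zeros(d),     # mean of pi_0 (optionally learned)
        "proposal_log_std": jnp.zeros(d),  # log-std of pi_0 (optionally learned)
        "log_sigma": jnp.zeros(T),         # per-step diffusion scales sigma_t
        "beta_logits": jnp.zeros(T),       # mapped to an increasing schedule in (0, 1]
    }

def betas(params):
    # softmax gives positive increments; cumsum makes beta_t monotone with beta_T = 1.
    return jnp.cumsum(jax.nn.softmax(params["beta_logits"]))

def ula_step(key, x, t, params, log_gamma, log_pi0):
    """One unadjusted Langevin step on the annealed target pi_t."""
    beta = betas(params)[t]
    sigma = jnp.exp(params["log_sigma"][t])
    grad = jax.grad(lambda y: (1 - beta) * log_pi0(y) + beta * log_gamma(y))(x)
    return x + sigma ** 2 * grad + jnp.sqrt(2.0) * sigma * jax.random.normal(key, x.shape)
```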

Discussion. We observe that learning more parameters tends to yield higher ELBO values. However, learning the parameters of the proposal π0\pi_{0} in particular results in low EMC values.

Trainable ELBO \uparrow EMC \uparrow
σt\sigma_{t} βt\beta_{t} π0\pi_{0} d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200
3.519±0.154-3.519\scriptstyle\pm 0.154 2513.292±26.017-2513.292\scriptstyle\pm 26.017 13575.6±414.217-13575.6\scriptstyle\pm 414.217 0.799±0.0040.799\scriptstyle\pm 0.004 0.994±0.0\mathbf{0.994\scriptstyle\pm 0.0} 0.988±0.001\mathbf{0.988\scriptstyle\pm 0.001}
2.441±0.079-2.441\scriptstyle\pm 0.079 1141.639±18.651-1141.639\scriptstyle\pm 18.651 6574.401±114.962-6574.401\scriptstyle\pm 114.962 0.819±0.0040.819\scriptstyle\pm 0.004 0.994±0.0\mathbf{0.994\scriptstyle\pm 0.0} 0.988±0.001\mathbf{0.988\scriptstyle\pm 0.001}
2.384±0.059-2.384\scriptstyle\pm 0.059 878.12±8.598-878.12\scriptstyle\pm 8.598 3458.28±248.958-3458.28\scriptstyle\pm 248.958 0.798±0.0030.798\scriptstyle\pm 0.003 0.994±0.001\mathbf{0.994\scriptstyle\pm 0.001} 0.988±0.0\mathbf{0.988\scriptstyle\pm 0.0}
1.51±0.035-1.51\scriptstyle\pm 0.035 173.002±1.548-173.002\scriptstyle\pm 1.548 825.303±44.797-825.303\scriptstyle\pm 44.797 0.828±0.0030.828\scriptstyle\pm 0.003 0.993±0.0\mathbf{0.993\scriptstyle\pm 0.0} 0.989±0.001\mathbf{0.989\scriptstyle\pm 0.001}
1.621±0.216-1.621\scriptstyle\pm 0.216 38.022±41.035-38.022\scriptstyle\pm 41.035 43.416±7.242-43.416\scriptstyle\pm 7.242 0.927±0.0150.927\scriptstyle\pm 0.015 0.276±0.1320.276\scriptstyle\pm 0.132 0.236±0.0530.236\scriptstyle\pm 0.053
1.235±0.072-1.235\scriptstyle\pm 0.072 29.238±20.912-29.238\scriptstyle\pm 20.912 33.686±4.508\mathbf{-33.686\scriptstyle\pm 4.508} 0.95±0.005\mathbf{0.95\scriptstyle\pm 0.005} 0.309±0.1090.309\scriptstyle\pm 0.109 0.393±0.0140.393\scriptstyle\pm 0.014
1.137±0.118-1.137\scriptstyle\pm 0.118 8.323±1.718\mathbf{-8.323\scriptstyle\pm 1.718} 103.968±68.449-103.968\scriptstyle\pm 68.449 0.936±0.0110.936\scriptstyle\pm 0.011 0.19±0.0040.19\scriptstyle\pm 0.004 0.341±0.1220.341\scriptstyle\pm 0.122
1.05±0.099\mathbf{-1.05\scriptstyle\pm 0.099} 10.526±2.256-10.526\scriptstyle\pm 2.256 36.254±15.018-36.254\scriptstyle\pm 15.018 0.913±0.0170.913\scriptstyle\pm 0.017 0.381±0.0750.381\scriptstyle\pm 0.075 0.435±0.0620.435\scriptstyle\pm 0.062
Table 15: ELBO and EMC values of MCD for learning the mean and variance of the proposal distribution π0\pi_{0}, the diffusion coefficient σt\sigma_{t} and annealing schedule (βt)t=1T(\beta_{t})_{t=1}^{T} by maximizing the extended ELBO for varying dimensions dd on the MoG target.

F.6 Ablation Study: Transport Flow Type

Experimental Setup. We test different flow types as transport maps for CRAFT for different numbers of temperatures TT. In particular, we consider diagonal affine flows, inverse autoregressive flows (Kingma et al., 2016), and neural spline flows (Durkan et al., 2019), where we set the spline bounds to match the support of the MoG target. The results are visualized in Figure 7.
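For concreteness, the following is a minimal sketch of the simplest transport considered, a diagonal affine flow with a closed-form log-determinant; IAF and NSF replace this map with the richer architectures of Kingma et al. (2016) and Durkan et al. (2019).

```python
import jax.numpy as jnp

def diag_affine_forward(x, log_scale, shift):
    # Elementwise map y = exp(log_scale) * x + shift.
    y = jnp.exp(log_scale) * x + shift
    log_det = jnp.sum(log_scale)   # log |det dy/dx| of a diagonal map
    return y, log_det

def diag_affine_inverse(y, log_scale, shift):
    x = (y - shift) * jnp.exp(-log_scale)
    return x, -jnp.sum(log_scale)
```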

Discussion. We found that diagonal affine flows paired with a larger number of temperatures yield better, more robust performance than more sophisticated flow types. Moreover, the latter often cause out-of-memory problems on high-dimensional targets.

Figure 7: ELBO and EUBO values for CRAFT for different flow types and numbers of temperatures T on the two-dimensional MoG target: diagonal affine flows, inverse autoregressive flows (IAF) (Kingma et al., 2016), and neural spline flows (NSF) (Durkan et al., 2019). For larger TT, IAF becomes numerically unstable.

F.7 Ablation Study: Gradient Guidance

Experimental Setup. Zhang & Chen (2021) proposed to use a network of the form fθ(𝐱,t)=f1θ(𝐱,t)+f2θ(t)logγ(𝐱)f^{\theta}(\mathbf{x},t)=f_{1}^{\theta}(\mathbf{x},t)+f^{\theta}_{2}(t)\nabla\log\gamma(\mathbf{x}) and to initialize it such that f1θ(𝐱,t)=0f_{1}^{\theta}(\mathbf{x},t)=0. They showed that this gradient guidance helps with mode collapse and yields overall better results. Vargas et al. (2023a), Berner et al. (2022), and Richter et al. (2023) adopted the approach and reported similar results. Here, we test the network architecture with and without gradient guidance f2θ(t)logγ(𝐱)f^{\theta}_{2}(t)\nabla\log\gamma(\mathbf{x}) on the MoG target for a varying number of dimensions for the denoising diffusion sampler (DDS).
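The following is a sketch of this guided parameterization; for brevity it uses a small MLP for f_1^θ and a scalar, time-independent gate in place of f_2^θ(t), both illustrative assumptions rather than the architecture used in our experiments.

```python
import jax
import jax.numpy as jnp

def init_guided_drift(key, d, hidden=64):
    k1, _ = jax.random.split(key)
    return {
        "w1": 0.1 * jax.random.normal(k1, (d + 1, hidden)),
        "b1": jnp.zeros(hidden),
        "w2": jnp.zeros((hidden, d)),  # zero output layer => f_1(x, t) = 0 at initialization
        "b2": jnp.zeros(d),
        "log_f2": jnp.zeros(()),       # scalar gate standing in for f_2(t) (assumption)
    }

def guided_drift(params, x, t, log_gamma):
    h = jnp.tanh(jnp.concatenate([x, jnp.array([t])]) @ params["w1"] + params["b1"])
    f1 = h @ params["w2"] + params["b2"]
    # Gradient guidance: add a learned multiple of the target score.
    return f1 + jnp.exp(params["log_f2"]) * jax.grad(log_gamma)(x)
```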

Discussion. The results of this examination can be found in Table 16 and indicate that both ELBO and EMC deteriorate significantly without gradient guidance, with the degradation growing in higher dimensions. This aligns with the findings from (Zhang & Chen, 2021; Vargas et al., 2023a; Berner et al., 2022; Richter et al., 2023).

Gradient ELBO \uparrow EMC \uparrow
Guidance d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200
✗ 3.105±0.27-3.105\scriptstyle\pm 0.27 543.099±13.612-543.099\scriptstyle\pm 13.612 247920.463±4258.605-247920.463\scriptstyle\pm 4258.605 0.453±0.0110.453\scriptstyle\pm 0.011 0.0±0.00.0\scriptstyle\pm 0.0 0.243±0.4210.243\scriptstyle\pm 0.421
✓ 0.612±0.019\mathbf{-0.612\scriptstyle\pm 0.019} 15.598±0.106\mathbf{-15.598\scriptstyle\pm 0.106} 63.101±0.253\mathbf{-63.101\scriptstyle\pm 0.253} 0.984±0.001\mathbf{0.984\scriptstyle\pm 0.001} 0.992±0.001\mathbf{0.992\scriptstyle\pm 0.001} 0.983±0.002\mathbf{0.983\scriptstyle\pm 0.002}
Table 16: ELBO and EMC values with (✓) and without (✗) gradient guidance f2θ(t)logγ(𝐱)f^{\theta}_{2}(t)\nabla\log\gamma(\mathbf{x}) as part of the network architecture for the denoising diffusion sampler (DDS) on the MoG target for varying dimension dd.

F.8 Ablation Study: Pre-training the Proposal/Base-Distribution π0\pi_{0}

Experimental Setup. We test the impact of pre-training the mean and covariance matrix of the Gaussian proposal/base distribution π0\pi_{0} using MFVI on the MoG target for varying dimensions. The results are shown in Table 17.
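A minimal sketch of this pre-training stage is given below, assuming a diagonal Gaussian and plain gradient ascent on the reparameterized ELBO; the optimizer, step count, and names are illustrative.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def pretrain_pi0(key, log_gamma, d, steps=2000, lr=1e-2, batch=128):
    params = {"mean": jnp.zeros(d), "log_std": jnp.zeros(d)}

    def neg_elbo(p, k):
        eps = jax.random.normal(k, (batch, d))
        x = p["mean"] + jnp.exp(p["log_std"]) * eps           # reparameterization
        log_q = norm.logpdf(x, p["mean"], jnp.exp(p["log_std"])).sum(-1)
        return -jnp.mean(jax.vmap(log_gamma)(x) - log_q)      # negative ELBO

    grad_fn = jax.jit(jax.grad(neg_elbo))
    for _ in range(steps):
        key, sub = jax.random.split(key)
        grads = grad_fn(params, sub)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params  # mean and log-std handed to the sequential sampler as pi_0
```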

Discussion. Pretraining the mean and covariance matrix of the Gaussian proposal/base distribution π0\pi_{0} yields significantly higher ELBO values at the cost of EMC values close to 0.

Pretrained ELBO \uparrow EMC \uparrow
Method π0\pi_{0} d=2d=2 d=50d=50 d=200d=200 d=2d=2 d=50d=50 d=200d=200
CRAFT ✗ 3.563±0.057\mathbf{-3.563\scriptstyle\pm 0.057} 784.881±14.364-784.881\scriptstyle\pm 14.364 4624.869±63.618-4624.869\scriptstyle\pm 63.618 0.987±0.001\mathbf{0.987\scriptstyle\pm 0.001} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
✓ 3.676±0.007-3.676\scriptstyle\pm 0.007 3.501±0.087\mathbf{-3.501\scriptstyle\pm 0.087} 3.699±0.135\mathbf{-3.699\scriptstyle\pm 0.135} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
MCD ✗ 2.384±0.059\mathbf{-2.384\scriptstyle\pm 0.059} 878.12±8.598-878.12\scriptstyle\pm 8.598 3458.28±248.958-3458.28\scriptstyle\pm 248.958 0.798±0.003\mathbf{0.798\scriptstyle\pm 0.003} 0.994±0.001\mathbf{0.994\scriptstyle\pm 0.001} 0.988±0.0\mathbf{0.988\scriptstyle\pm 0.0}
✓ 3.689±0.0-3.689\scriptstyle\pm 0.0 3.746±0.003\mathbf{-3.746\scriptstyle\pm 0.003} 3.938±0.003\mathbf{-3.938\scriptstyle\pm 0.003} 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0 0.0±0.00.0\scriptstyle\pm 0.0
Table 17: ELBO and EMC values for a pre-trained (✓) vs. fixed (✗) Gaussian proposal/base distribution π0\pi_{0} on the MoG target with varying dimensions dd.