This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Gradient Estimation with Discrete Stein Operators

Jiaxin Shi
Stanford University
jiaxins@stanford.edu
&Yuhao Zhou
Tsinghua University
yuhaoz.cs@gmail.com
Jessica Hwang
Stanford University
jjhwang@stanford.edu
&Michalis K. Titsias
DeepMind
mtitsias@google.com
&Lester Mackey
Microsoft Research New England
lmackey@microsoft.com
Work done while at Microsoft Research New England.
Abstract

Gradient estimation—approximating the gradient of an expectation with respect to the parameters of a distribution—is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.

1 Introduction

Modern machine learning relies heavily on gradient methods to optimize the parameters of a learning system. However, exact gradient computation is often difficult. For example, in variational inference for training latent variable models [43, 29], policy gradient algorithms in reinforcement learning [65], and combinatorial optimization [40], the exact gradient features an intractable sum or integral introduced by an expectation under an evolving probability distribution. To make progress, one resorts to estimating the gradient by drawing samples from that distribution [50, 39].

The two main classes of gradient estimators used in machine learning are the pathwise or reparameterization gradient estimators [29, 47, 58] and the REINFORCE or score function estimators [65, 18]. The pathwise estimators have shown great success in training variational autoencoders [29] but are only applicable to continuous probability distributions. The REINFORCE estimators are more general-purpose and easily accommodate discrete distributions but often suffer from excessively high variance.

To improve the quality of REINFORCE estimators, we develop a new variance reduction technique for discrete expectations. Our method is based upon Stein operators [see, e.g., 55, 1], computable functionals that generate mean-zero functions under a target distribution and provide a natural way of designing control variates (CVs) for stochastic estimation.

We first provide a general recipe for constructing practical Stein operators for discrete distributions (Table 1), generalizing the prior literature [9, 25, 16, 8, 46, 67, 6]. We then develop a gradient estimation framework—RODEO—that augments REINFORCE estimators with mean-zero CVs generated from Stein operators. Finally, inspired by Double CV [60], we extend our method to develop CVs for REINFORCE leave-one-out estimators [49, 30] to further reduce the variance.

The benefits of using Stein operators to construct discrete CVs are twofold. First, the operator structure permits us to learn CVs with a flexible functional form such as those parameterized by neural networks. Second, since our operators are derived from Markov chains on the discrete support, they naturally incorporate information from neighboring states of the process for variance reduction.

We evaluate RODEO on 15 benchmark tasks, including training binary variational autoencoders (VAEs) with one or more stochastic layers. In most cases and with the same number of function evaluations, RODEO delivers lower variance and better training objectives than the state-of-the-art gradient estimators DisARM [14, 69], ARMS [13], Double CV [60], and RELAX [20].

Table 1: Discrete Stein operators that generate mean-zero functions under qq (i.e., 𝔼q[(Ah)(x)]=0\mathbb{E}_{q}[(Ah)(x)]=0).
Stein Operator (𝑨𝒉)(𝒙)\boldsymbol{(Ah)(x)}
Gibbs 4 1di=1dyi=xiq(yi|xi)h(y)h(x)\frac{1}{d}\sum_{i=1}^{d}\sum_{y_{-i}=x_{-i}}q(y_{i}|x_{-i})h(y)-h(x)
MPF 6 y𝒩x,yxq(y)/q(x)(h(y)h(x))\sum_{y\in\mathcal{N}_{x},y\neq x}\sqrt{{q(y)/}{q(x)}}(h(y)-h(x))
Barker 6 y𝒩x,yxq(y)q(x)+q(y)(h(y)h(x))\sum_{y\in\mathcal{N}_{x},y\neq x}\frac{q(y)}{q(x)+q(y)}(h(y)-h(x))
Difference 8 1di=1dh(deci(x))q(inci(x))q(x)h(x)\frac{1}{d}\sum_{i=1}^{d}h(\texttt{dec}_{i}({x}))-\frac{q(\texttt{inc}_{i}({x}))}{q(x)}h(x)

2 Background

We consider the problem of maximizing the objective function 𝔼qη[f(x)]\mathbb{E}_{q_{\eta}}[f(x)] with respect to the parameters η\eta of a discrete distribution qη(x)q_{\eta}(x). Throughout the paper we assume f(x)f(x) is a differentiable function of real-valued inputs xdx\in\mathbb{R}^{d} but is only evaluated at a discrete subset 𝒳d\mathcal{X}^{d} due to the discreteness of qηq_{\eta}.111This assumption holds for many discrete probabilistic models [21] including the binary VAEs of Section 6. Exact computation of the expectation is typically intractable due to the complex nature of f(x)f(x) and qη(x)q_{\eta}(x). The standard workaround is to rewrite the gradient as η𝔼qη[f(x)]=𝔼qη[f(x)ηlogqη(x)]\nabla_{\eta}\mathbb{E}_{q_{\eta}}[f(x)]=\mathbb{E}_{q_{\eta}}[f(x)\nabla_{\eta}\log q_{\eta}(x)] and employ the Monte Carlo gradient estimator known as the score function or REINFORCE estimator [18, 65]:

1Kk=1K(f(x(k))b)ηlogqη(x(k))forx(1),,x(K)i.i.d.qη.\textstyle\frac{1}{K}\sum_{k=1}^{K}(f(x^{(k)})-b)\nabla_{\eta}\log q_{\eta}(x^{(k)})\quad\text{for}\quad x^{(1)},\dots,x^{(K)}\overset{i.i.d.}{\sim}q_{\eta}. (REINFORCE)

Here, bb is a constant called the “baseline” introduced to reduce the variance of the estimator by reducing the scaling effect of f(x(k))f(x^{(k)}). Since 𝔼qη[ηlogqη(x)]=0\mathbb{E}_{q_{\eta}}[\nabla_{\eta}\log q_{\eta}(x)]=0, the REINFORCE estimator is unbiased for any choice of bb, and the term bηlogqη(x)b\nabla_{\eta}\log q_{\eta}(x) is known as a control variate (CV) [42, Ch. 8]. The optimal baseline can be estimated using additional function evaluations [64, 7, 45]. A simpler approach is to use moving averages of ff from historical evaluations or to train function approximators to mimic those values [37]. When K2K\geq 2, a powerful variant of REINFORCE is obtained by replacing bb with the leave-one-out average of function values:

1Kk=1K(f(x(k))1K1jkf(x(j)))ηlogqη(x(k)).\textstyle\frac{1}{K}\sum_{k=1}^{K}\big{(}f(x^{(k)})-\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)})\big{)}\nabla_{\eta}\log q_{\eta}(x^{(k)}). (RLOO)

This approach is often called the REINFORCE leave-one-out (RLOO) estimator [49, 30, 48] and was recently observed to have very strong performance in training discrete latent variable models [48, 14].

All the above methods construct a baseline that is independent of the point x(k)x^{(k)} under consideration, but there are other ways to preserve the unbiasedness of the estimator. We are free to use a sample-dependent baseline b(x(k))b(x^{(k)}) as long as the expectation c𝔼qη[b(x(k))ηlogqη(x(k))]c\triangleq\mathbb{E}_{q_{\eta}}[b(x^{(k)})\nabla_{\eta}\log q_{\eta}(x^{(k)})] is easily computed or has a low-variance unbiased estimate. In this case, we can correct for any bias introduced by b(x(k))b(x^{(k)}) by adding in this expectation: 1Kk=1K(f(x(k))b(x(k)))ηlogqη(x(k))+c.\frac{1}{K}\sum_{k=1}^{K}(f(x^{(k)})-b(x^{(k)}))\nabla_{\eta}\log q_{\eta}(x^{(k)})+c. For example, b(x(k))b(x^{(k)}) can be a lower bound or a Taylor expansion of f(x(k))f(x^{(k)}) [43, 22]. Taking this a step further, the Double CV estimator [60] proposed to treat f(x(k))b(x(k))f(x^{(k)})-b(x^{(k)}) as the effective objective function and apply the leave-one-out idea:

1Kk=1K[(f(x(k))𝒃𝒌(𝒙(𝒌)))1K1jk(f(x(j))𝒃𝒋(𝒙(𝒋)))]ηlogqη(x(k))+c.\textstyle\frac{1}{K}\sum_{k=1}^{K}\big{[}\big{(}f(x^{(k)})\boldsymbol{-b_{k}(x^{(k)})}\big{)}-\frac{1}{K-1}\sum_{j\neq k}\big{(}f(x^{(j)})\boldsymbol{-b_{j}(x^{(j)})}\big{)}\big{]}\nabla_{\eta}\log q_{\eta}(x^{(k)})+c.

The resulting estimator adds two CVs to RLOO: the global CV bk(x(k))ηlogqη(x(k))b_{k}(x^{(k)})\nabla_{\eta}\log q_{\eta}(x^{(k)}) and the local CV bk(x(j))b_{k}(x^{(j)}). Intuitively, bk(x(j))b_{k}(x^{(j)}) is aimed at reducing the variance of the LOO average. This is motivated by the fact that replacing 1K1jkf(x(j))\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)}) with 𝔼qη[f]\mathbb{E}_{q_{\eta}}[f] in RLOO leads to lower variance [60, Prop 1]. To obtain a tractable correction term cc, Titsias and Shi [60] adopt a linear design of bkb_{k}: bk(x)=α1K1jkf(x(j))(xμ)b_{k}(x)=\alpha\cdot\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)})(x-\mu) for μ=𝔼qη[x]\mu=\mathbb{E}_{q_{\eta}}[x] and a coefficient α\alpha.

Although the Double CV framework points to a promising new direction for developing better REINFORCE estimators, one only obtains significant reduction in variance when bkb_{k} is strongly correlated with ff. This may fail to hold for the above linear bkb_{k} and a highly nonlinear ff, especially in applications like training deep generative models.

In the following two sections, we will introduce a method that allows us to use very flexible CVs while still maintaining a tractable correction term. Our method enables online adaptation of CVs to minimize gradient variance (similar to RELAX [20]) but does not assume qηq_{\eta} has a continuous reparameterization. We then apply it to generalize the linear CVs in Double CV to very flexible ones such as neural networks. Moreover, we provide an effective CV design based on surrogate functions that requires no additional evaluation of ff compared to RLOO.

3 Control Variates from Discrete Stein Operators

At the heart of our new estimator is a technique for generating flexible discrete CVs, that is for generating a rich class of functions that have known expectations under a given discrete distribution qq. One way to achieve this is to identify any discrete-time Markov chain (X(t))t=0(X^{(t)})_{t=0}^{\infty} with stationary distribution qq. Then, the transition matrix PP, defined via Pxy=P(X(t+1)=y|X(t)=x)P_{xy}=P(X^{(t+1)}=y|X^{(t)}=x), satisfies Pq=qP^{\top}q=q and hence

𝔼q[(PI)h]=0,\mathbb{E}_{q}[(P-I)h]=0, (1)

for any integrable function hh. We overload our notation so that, for any suitable matrix AA, (Ah)(x)yAxyh(y)(Ah)(x)\triangleq\sum_{y}A_{xy}h(y). In other words, for any integrable hh, the function (PI)h(P-I)h is a valid CV as it has known mean under qq. Moreover, the linear operator PIP-I is an example of a Stein operator [55, 1] in the sense that it generates mean-zero functions under qq. In fact, both Stein et al. [56] and Dellaportas and Kontoyiannis [12] developed CVs of the form (PI)h(P-I)h based on reversible discrete-time Markov chains and linear input functions h(x)=xih(x)=x_{i}.

More generally, Barbour [3] and Henderson [23] observed that if we identify a continuous-time Markov chain (X(t))t0(X^{(t)})_{t\geq 0} with qq as its stationary distribution, then the generator AA, defined via

(Ah)(x)=limt0𝔼[h(X(t))|X(0)=x]h(x)t,\displaystyle(Ah)(x)=\lim_{t\to 0}\textstyle\frac{\mathbb{E}[h(X^{(t)})|X^{(0)}=x]-h(x)}{t}, (2)

satisfies Aq=0A^{\top}q=0 and hence 𝔼q[Ah]=0\mathbb{E}_{q}[Ah]=0 for all integrable hh. Therefore, AA is also a Stein operator suitable for generating CVs. Moreover, since any discrete-time chain with transition matrix PP can be embedded into a continuous-time chain with transition rate matrix A=PIA=P-I, this continuous-time construction is strictly more general [57, Ch. 4].

3.1 Discrete Stein operators

We next present several examples of broadly applicable discrete Stein operators (summarized in Table 1) that can serve as practical defaults.

Gibbs Stein operator   The transition kernel of the random-scan Gibbs sampler with stationary distribution qq [see, e.g., 17] is

Pxy=1di=1dq(yi|xi)𝟏(yi=xi),\textstyle P_{xy}=\frac{1}{d}\sum_{i=1}^{d}q(y_{i}|x_{-i})\mathbf{1}(y_{-i}=x_{-i}), (3)

where 𝟏()\mathbf{1}(\cdot) is the indicator function, xix_{-i} denotes all elements of xx except xix_{i}. The associated Stein operator is A=PIA=P-I with

(Ah)(x)=1di=1dyi=xiq(yi|xi)h(y)h(x).\textstyle(Ah)(x)=\frac{1}{d}\sum_{i=1}^{d}\sum_{y_{-i}=x_{-i}}q(y_{i}|x_{-i})h(y)-h(x). (4)

In the binary variable setting, 4 recovers the operator Bresler and Nagaraj [8], Reinert and Ross [46] used to bound distances between the stationary distributions of Glauber dynamics Markov chains.

Zanella Stein operator   Zanella [70] studied continuous-time Markov chains with generator

Axy=κ(q(y)/q(x))𝟏(y𝒩x,yx)zxAxz 1(y=x),\textstyle A_{xy}=\kappa\left({q(y)/}{q(x)}\right)\mathbf{1}(y\in\mathcal{N}_{x},y\neq x)-\sum_{z\neq x}A_{xz}\,\mathbf{1}(y=x), (5)

where 𝒩x\mathcal{N}_{x} denotes the transition neighborhood of xx and κ\kappa is a continuous positive function satisfying κ(t)=tκ(1/t)\kappa(t)=t\kappa(1/t). Conveniently, the neighborhood structure can be arbitrarily sparse. Moreover, the κ\kappa constraint ensures that the chain satisfies detailed balance (i.e., q(x)Axy=q(y)Ayxq(x)A_{xy}=q(y)A_{yx}) and hence that Aq=0A^{\top}q=0. Hodgkinson et al. [24] call the associated operator the Zanella Stein operator:

(Ah)(x)=y𝒩x,yxκ(q(y)q(x))(h(y)h(x)).\textstyle(Ah)(x)=\sum_{y\in\mathcal{N}_{x},y\neq x}\ \textstyle\kappa\big{(}\frac{q(y)}{q(x)}\big{)}(h(y)-h(x)). (6)

Important special cases include the minimum probability flow [53] (MPF) Stein operator (κ(t)=t\kappa(t)=\sqrt{t}) discussed in Barp et al. [6] and the Barker [4] Stein operator (κ(t)=tt+1\kappa(t)=\frac{t}{t+1}).

Birth-death Stein operator   Let 𝒆1,,𝒆d\boldsymbol{e}_{1},\dots,\boldsymbol{e}_{d} represent the standard basis vectors on d\mathbb{R}^{d}. For any finite-cardinality space, we may index the elements of 𝒳={s0,,sm1}\mathcal{X}=\{s_{0},\dots,s_{m-1}\}, let idx(xi)\texttt{idx}(x_{i}) return the index of the element that xix_{i} represents, and define the increment and decrement operators

inci(x)=x+𝒆i(s(idx(xi)+1)modmxi) and deci(x)=x+𝒆i(s(idx(xi)1)modmxi).\displaystyle\texttt{inc}_{i}({x})=x+\boldsymbol{e}_{i}(s_{(\texttt{idx}(x_{i})+1)\ \mathrm{mod}\ m}-x_{i})\quad\text{ and }\quad\texttt{dec}_{i}({x})=x+\boldsymbol{e}_{i}(s_{(\texttt{idx}(x_{i})-1)\ \mathrm{mod}\ m}-x_{i}).

Then the product birth-death process [28] on 𝒳d\mathcal{X}^{d} with birth rates bi,x=q(inci(x))q(x)b_{i,x}=\frac{q(\texttt{inc}_{i}({x}))}{q(x)}, death rates di,x=1d_{i,x}=1, and generator

Axy=1di=1dbi,x𝟏(y=inci(x))+di,x𝟏(y=deci(x))(bi,x+di,x)𝟏(y=x),\textstyle A_{xy}=\frac{1}{d}\sum_{i=1}^{d}b_{i,x}\mathbf{1}(y=\texttt{inc}_{i}({x}))+d_{i,x}\mathbf{1}(y=\texttt{dec}_{i}({x}))-(b_{i,x}+d_{i,x})\mathbf{1}(y=x),

has qq as a stationary distribution, as Aq=0A^{\top}q=0. This construction yields the birth-death Stein operator

(Ag)(x)=1di=1dbi,x(g(inci(x))g(x))di,x(g(x)g(deci(x))).\textstyle(Ag)(x)=\frac{1}{d}\sum_{i=1}^{d}b_{i,x}(g(\texttt{inc}_{i}({x}))-g(x))-d_{i,x}(g(x)-g(\texttt{dec}_{i}({x}))). (7)

An analogous operator is available for countably infinite 𝒳\mathcal{X} [24], and Brown and Xia [9], Holmes [25], Eichelsbacher and Reinert [16] used related operators to characterize convergence to discrete target distributions. Moreover, by substituting h(x)=g(x)g(inc(x))h(x)=g(x)-g(\texttt{inc}({x})) in 7, we recover the difference Stein operator proposed by Yang et al. [67] without reference to birth-death processes:

(Ah)(x)=1di=1dh(deci(x))bi,xh(x).\textstyle(Ah)(x)=\frac{1}{d}\sum_{i=1}^{d}h(\texttt{dec}_{i}({x}))-b_{i,x}h(x). (8)

Choosing a Stein operator   Despite being better-known, the difference and MPF operators often suffer from large variance and numerical instability due to their use of unbounded probability ratios q(y)/q(x)q(y)/q(x). As a result, we recommend the numerically stable Gibbs operator when each component of xx is binary or takes on a small number of values. When xix_{i} takes on a large number of values (mm), the Gibbs operator suffers from linear scaling with mm. In this case, we recommend the Barker operator where a sparse neighborhood structure can be specified, such as 𝒩x={inci(x),deci(x) for i[d]}\mathcal{N}_{x}=\{\texttt{inc}_{i}(x),\texttt{dec}_{i}(x)\text{ for }i\in[d]\}. The Barker operator is numerically stable as its κ(q(y)q(x))=q(y)q(x)+q(y)\kappa(\frac{q(y)}{q(x)})=\frac{q(y)}{q(x)+q(y)}.

4 Gradient Estimation with Discrete Stein Operators

Recall that REINFORCE estimates the gradient 𝔼qη[f(x)ηlogqη(x)].\mathbb{E}_{q_{\eta}}[f(x)\nabla_{\eta}\log q_{\eta}(x)]. Due to the existence of ηlogqη(x)\nabla_{\eta}\log q_{\eta}(x) as a weighting function, we apply a discrete Stein operator to a vector-valued function h~:𝒳d\tilde{h}:\mathcal{X}\to\mathbb{R}^{d} per dimension to construct the following estimator with a mean-zero CV:

𝔼qη[f(x)ηilogqη(x)+(Ah~i)(x)].\displaystyle\mathbb{E}_{q_{\eta}}[f(x)\nabla_{\eta_{i}}\log q_{\eta}(x)+(A\tilde{h}_{i})(x)]. (9)

Ideally, we want to choose the h~\tilde{h} such that Ah~iA\tilde{h}_{i} strongly correlates with f(x)ηilogqη(x)f(x)\nabla_{\eta_{i}}\log q_{\eta}(x) to reduce its variance. The optimal h~i\tilde{h}_{i} that yields zero variance is given by the solution of Poisson equation

𝔼qη[fηilogqη]fηilogqη=Ah~i.\displaystyle\mathbb{E}_{q_{\eta}}[f\nabla_{\eta_{i}}\log q_{\eta}]-f\nabla_{\eta_{i}}\log q_{\eta}=A\tilde{h}_{i}. (10)

We could learn a separate h~i\tilde{h}_{i} per dimension to approximate the solution of 10. However, this poses a difficult optimization problem that requires simultaneously solving dd Poisson equations. Instead, we will incorporate knowledge about the structure of the solution to simplify the optimization.

To determine a candidate functional form for h~\tilde{h}, we draw inspiration from an “optimal” Markov chain in which the current state is ignored entirely and the new state is generated independently from qηq_{\eta}. In this case, (PI)h~i(P-I)\tilde{h}_{i} becomes 𝔼qη[h~i]h~i\mathbb{E}_{q_{\eta}}[\tilde{h}_{i}]-\tilde{h}_{i}, and the optimal solution is h~i=fηilogqη\tilde{h}_{i}=f\nabla_{\eta_{i}}\log q_{\eta}. Inspired by this, we consider h~\tilde{h} of the form

h~(x)=h(x)ηlogqη(x),\textstyle\tilde{h}(x)=h(x)\nabla_{\eta}\log q_{\eta}(x), (11)

where we now only need to learn a scalar-valued function hh. Notably, when hh exactly equals ff and A=PIA=P-I for any discrete time Markov chain kernel PP, our CV adjustment amounts to Rao-Blackwellization [42, Sec. 8.7], as we end up replacing f(x)ηilogqη(x)f(x)\nabla_{\eta_{i}}\log q_{\eta}(x) with its conditional expectation P(fηilogqη)(x)=𝔼Xt+1|Xt=x[f(Xt+1)ηilogqη(Xt+1)]P(f\nabla_{\eta_{i}}\log q_{\eta})(x)=\mathbb{E}_{X_{t+1}|X_{t}=x}[f(X_{t+1})\nabla_{\eta_{i}}\log q_{\eta}(X_{t+1})]. This yields a guaranteed variance reduction.

Surrogate function design   Based on the above reasoning, we can view hh as a surrogate for ff. We avoid directly setting h=fh=f because our Stein operators evaluate hh at all neighbors of the sample points, and ff can be very expensive to evaluate [see, e.g., 51]. To avoid this problem, we first observe that h~\tilde{h} 11 can be made sample-specific, i.e., we can use a different h~k\tilde{h}_{k} for each sample point x(k)x^{(k)}:

1Kk=1K[f(x(k))ηlogqη(x(k))+(Ah~k)(x(k))].\textstyle\frac{1}{K}\sum_{k=1}^{K}[f(x^{(k)})\nabla_{\eta}\log q_{\eta}(x^{(k)})+(A\tilde{h}_{k})(x^{(k)})]. (12)

We then consider the following choices of hkh_{k} that are informed about ff while being cheap to evaluate:

hk(y)=1K1jkH(f(x(j)),f(x(j))(yx(j))).\textstyle h_{k}(y)=\frac{1}{K-1}\sum_{j\neq k}\!H(f(x^{(j)}),\nabla f(x^{(j)})^{\top}(y-x^{(j)})). (13)

Here HH belongs to a flexible parametric family of functions such as neural networks and is chosen to have significantly lower cost than ff. yy will take on the values of x(k)x^{(k)} and its neighbors in the Markov chain. We omit x(k)x^{(k)} in the sum 13 to ensure that the function hkh_{k} is independent of x(k)x^{(k)} and hence that 𝔼qη[1Kk=1K(Ah~k)(x(k))]=0\mathbb{E}_{q_{\eta}}[\frac{1}{K}\sum_{k=1}^{K}(A\tilde{h}_{k})(x^{(k)})]=0. Moreover, this surrogate function design introduces no additional evaluations of ff beyond those required for the usual RLOO estimator. Also, as observed by Titsias and Shi [60], for many applications, including VAE training, where ff has parameters θ\theta learned through gradient-based optimization, {f(x(k))}k=1K\{\nabla f(x^{(k)})\}_{k=1}^{K} can be obtained “for free” from the same backpropagation used to compute θfθ(x(k))\nabla_{\theta}f_{\theta}(x^{(k)}) (see Algorithm 1).

RODEO  We can further improve the estimator by leveraging discrete Stein operators and the above surrogate function design to construct both the global and local CVs in the Double CV framework [60] (see Section 2). We call our final estimator RODEO for RLOO with Discrete StEin Operators:

1Kk=1K[(f(x(k))\displaystyle{\textstyle\frac{1}{K}}\operatorname*{\textstyle\sum}_{k=1}^{K}\![(f(x^{(k)})-\! 1K1jk(f(x(j))+(𝑨𝒉𝒋)(𝒙(𝒋))))ηlogqη(x(k))+(𝑨𝒉~𝒌)(𝒙(𝒌))],\displaystyle{\textstyle\frac{1}{K-1}}\operatorname*{\textstyle\sum}_{j\neq k}(f(x^{(j)})\!+\!\boldsymbol{(Ah_{j})(x^{(j)})}))\nabla_{\eta}\log q_{\eta}(x^{(k)})\!+\!\boldsymbol{(A\tilde{h}_{k}^{\star})(x^{(k)})}], (RODEO)

where h~k(y)=hk(y)ηlogqη(y)\tilde{h}^{\star}_{k}(y)=h^{\star}_{k}(y)\nabla_{\eta}\log q_{\eta}(y) and {hk,hk}k=1K\{h_{k},h_{k}^{\star}\}_{k=1}^{K} are scalar-valued functions. Here, (Ahj)(x(j))(Ah_{j})(x^{(j)}) is a scalar-valued CV introduced to reduce the variance of the leave-one-out baseline 1K1jkf(x(j))\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)}), while (Ah~k)(x(k))(A\tilde{h}_{k}^{\star})(x^{(k)}) acts as a global CV to further reduce the variance of the gradient estimate. We adopt the aforementioned design of hh 13 and h~\tilde{h} 11 for the two CVs. In Appendix B, we show that the RODEO estimator is unbiased for η𝔼qη[f(x)]\nabla_{\eta}\mathbb{E}_{q_{\eta}}[f(x)] when each hkh_{k} is defined as in 13 and

hk(y)=1K1jkH(f(x(j)),f(x(j))(yx(j))).\textstyle h_{k}^{\star}(y)=\frac{1}{K-1}\sum_{j\neq k}\!H^{\star}(f(x^{(j)}),\nabla f(x^{(j)})^{\top}(y-x^{(j)})). (14)
Algorithm 1 Optimizing 𝔼qη[fθ(x)]\mathbb{E}_{q_{\eta}}[f_{\theta}(x)] with RODEO gradients
input: Objective fθf_{\theta}, sample points x(1:K)i.i.d.qηx^{(1:K)}\overset{i.i.d.}{\sim}q_{\eta}, Stein operator AA, step sizes αt,βt\alpha_{t},\beta_{t}
for t=1:Tt=1:T do
1:{fθ(x(k)),θfθ(x(k)),fθ(x(k))}k=1Kautodiff(fθ,x(1:K))\{f_{\theta}(x^{(k)}),\nabla_{\theta}f_{\theta}(x^{(k)}),\nabla f_{\theta}(x^{(k)})\}_{k=1}^{K}\leftarrow\text{autodiff}(f_{\theta},x^{(1:K)}).
2:Compute the surrogates hk(x(j)),hk(x(j))h_{k}(x^{(j)}),h^{\star}_{k}(x^{(j)}) of 13, 14 for jkj\neq k and j,k=1,,Kj,k=1,\cdots,K.
3:Compute the RODEO gradient estimator gγ(x(1:K))g_{\gamma}(x^{(1:K)}).
4:θθ+αt1Kk=1Kθf(x(k))\theta\leftarrow\theta+\alpha_{t}\frac{1}{K}\sum_{k=1}^{K}\nabla_{\theta}f(x^{(k)}).
5:ηη+αtgγ(x(1:K))\eta\leftarrow\eta+\alpha_{t}g_{\gamma}(x^{(1:K)}).
6:Update hyperparameters: γγβtγgγ(x(1:K))22\gamma\leftarrow\gamma-\beta_{t}\nabla_{\gamma}\|g_{\gamma}(x^{(1:K)})\|_{2}^{2}.

Optimization with RODEO   In practice, we let the functions HH and HH^{\star} share a neural network architecture with two output units. Since the estimator is unbiased, we can optimize the parameters γ\gamma of the network in an online fashion to minimize the variance of the estimator (similar to [20]). Specifically, if we denote the RODEO gradient estimate by gγ(x(1:K))g_{\gamma}(x^{(1:K)}), then γTr(Var(gγ(x(1:K))))=𝔼[γgγ(x(1:K))22].\nabla_{\gamma}\mathrm{Tr}(\mathrm{Var}(g_{\gamma}(x^{(1:K)})))=\mathbb{E}[\nabla_{\gamma}\|g_{\gamma}(x^{(1:K)})\|_{2}^{2}]. In Algorithm 1, we use an unbiased estimate of this hyperparameter gradient, γgγ(x(1:K))22\nabla_{\gamma}\|g_{\gamma}(x^{(1:K)})\|_{2}^{2}, to update γ\gamma after each optimization step of η\eta.

5 Related Work

As we have seen in Section 2, there is a long history of designing CVs for REINFORCE estimators using “baselines” [64, 7, 45, 37]. Recent progress is mostly driven by leave-one-out [49, 38, 30, 48] and sample-dependent baselines [43, 22, 60, 62, 20]. REBAR [62] constructs the baseline through continuous relaxation of the discrete distribution [26, 35] and applies reparameterization gradients to the correction term. As a result REBAR uses three evaluations of ff for each x(k)x^{(k)} instead of the usual single evaluation (see Appendix A for a detailed explanation). The RELAX [20] estimator generalizes REBAR by noticing that their continuous relaxation can be replaced with a free-form CV. However, in order to get strong performance, RELAX still includes the continuous relaxation in their CV and only adds a small deviation to it. Therefore, RELAX also uses three evaluations of ff for each x(k)x^{(k)} and is usually considered more expensive than other estimators.

An attractive property of the RODEO estimator is that it incorporates information from neighboring states thanks to the Stein operator while avoiding additional costly ff evaluations thanks to the learned surrogate functions hkh_{k}. Estimators based on analytic local expectation [59, 61] and GO gradients [11] also use neighborhood information but only at the cost of many additional target function evaluations. In fact, the local expectation gradient [59] can be viewed as a Stein CV adjustment RODEO with a Gibbs Stein operator and the target function ff used directly instead of the surrogate hh. The downside of these approaches is that ff must be evaluated KdKd times per training step instead of KK times as in RODEO, a prohibitive cost when ff is expensive and d200d\geq 200 as in Section 6.

Prior work has also studied variance reduction methods based on sampling without replacement [31] and antithetic sampling [68, 14, 13, 15] for gradient estimation. DisARM [14, 69] was shown to outperform RLOO estimators when K=2K=2 and ARMS [13] generalizes DisARM to K>2K>2.

Stein operators for continuous distributions have also been leveraged for effective variance reduction in a variety of learning tasks [2, 36, 41, 5, 54, 52] including gradient estimation [34, 27]. In particular, the gradient estimator of Liu et al. [34] is based on the Langevin Stein operator [19] for continuous distributions and coincides with the continuous counterpart of RELAX [20]. In contrast, our approach considers discrete Stein operators for Monte Carlo estimation in discrete distributions with exponentially large state spaces. Recently, Parmas and Sugiyama [44, App. E.4] used a probability flow perspective to characterize all unbiased gradient estimators satisfying a mild technical condition; our estimators fall into this broad class but were not specifically investigated.

Table 2: Training binary latent VAEs with K=2,3K=2,3 (except for RELAX which uses 33 evaluations) on MNIST, Fashion-MNIST, and Omniglot. We report the average ELBO (±1\pm 1 standard error) on the training set after 1M steps over 5 independent runs. Test data bounds are reported in Table 4.

Bernoulli Likelihoods Gaussian Likelihoods
MNIST Fashion-MNIST Omniglot MNIST Fashion-MNIST Omniglot
K=2K=2 DisARM 102.75±0.08-102.75\pm 0.08 237.68±0.13-237.68\pm 0.13 116.50±0.04-116.50\pm 0.04 668.03±0.61668.03\pm 0.61 182.65±0.47182.65\pm 0.47 446.61±1.16446.61\pm 1.16
Double CV 102.14±0.06-102.14\pm 0.06 237.55±0.16-237.55\pm 0.16 116.39±0.10-116.39\pm 0.10 676.87±1.18676.87\pm 1.18 186.35±0.64186.35\pm 0.64 447.65±0.87447.65\pm 0.87
RODEO (Ours) 101.89±0.17\bf-101.89\pm 0.17 237.44±0.09\bf-237.44\pm 0.09 115.93±0.06\bf-115.93\pm 0.06 681.95±0.37\bf 681.95\pm 0.37 191.81±0.67\bf 191.81\pm 0.67 454.74±1.11\bf 454.74\pm 1.11
K=3K=3 ARMS 100.84±0.14-100.84\pm 0.14 237.05±0.12-237.05\pm 0.12 115.21±0.07-115.21\pm 0.07 683.55±1.01683.55\pm 1.01 193.07±0.34193.07\pm 0.34 457.98±1.03457.98\pm 1.03
Double CV 100.94±0.09-100.94\pm 0.09 237.40±0.11-237.40\pm 0.11 115.06±0.12-115.06\pm 0.12 686.48±0.68686.48\pm 0.68 193.93±0.20193.93\pm 0.20 457.44±0.79457.44\pm 0.79
RODEO (Ours) 100.46±0.13\bf-100.46\pm 0.13 236.88±0.12\bf-236.88\pm 0.12 115.01±0.05\bf-115.01\pm 0.05 692.37±0.39\bf 692.37\pm 0.39 196.56±0.42\bf 196.56\pm 0.42 461.87±0.90461.87\pm 0.90
RELAX (3 evals) 101.99±0.04-101.99\pm 0.04 237.74±0.12-237.74\pm 0.12 115.70±0.08-115.70\pm 0.08 688.58±0.52688.58\pm 0.52 196.38±0.66196.38\pm 0.66 462.23±0.63\bf 462.23\pm 0.63

6 Experiments

Python code replicating all experiments can be found at https://github.com/thjashin/rodeo.

6.1 Training Bernoulli VAEs

Following Dong et al. [14] and Titsias and Shi [60], we conduct experiments on training variational auto-encoders [29, 47] (VAEs) with Bernoulli latent variables. VAEs are models with a joint density pθ(y,x)p_{\theta}(y,x), where xx is the latent variable and θ\theta denotes model parameters. They are typically learned through maximizing the evidence lower bound (ELBO) 𝔼qη(x|y)[f(x)]\mathbb{E}_{q_{\eta}(x|y)}[f(x)] for an auxiliary inference network qη(x|y)q_{\eta}(x|y) and f(x)logpθ(y,x)logqη(x|y)f(x)\triangleq\log p_{\theta}(y,x)-\log q_{\eta}(x|y). In our experiments, p(x)p(x) and qη(x|y)q_{\eta}(x|y) are high-dimensional distributions where each dimension of the random variable is an independent Bernoulli. Since exact gradient computations are intractable, we will use gradient estimators to learn the parameters η\eta of the inference network. The VAE architecture and training experimental setup follows Titsias and Shi [60], and details are given in Appendix D. The dimensionality of the latent variable xx is d=200d=200. The functions HH 13 and HH^{*} 14 share a neural network architecture with two output units and a single hidden layer with 100 units. For the numerical stability and variance reasons discussed in Section 3, we use the Gibbs Stein operator 4 as a default choice in our experiments, but we revisit this choice in Section 6.3.

We consider the MNIST [33], Fashion-MNIST [66] and Omniglot [32] datasets using their standard train, validation, and test splits. We use both binary and continuous VAE outputs (yy) as in Titsias and Shi [60]. In the binary output setting, data are dynamically binarized, and the Bernoulli likelihood is used; in the continuous output setting, data are centered between [1,1][-1,1], and the Gaussian likelihood with learnable diagonal covariance parameters is used.

Refer to caption Refer to caption
Figure 1: Training binary latent VAEs with 2 or 3 ff evaluations per step on binarized MNIST.

𝑲=𝟐\boldsymbol{K=2}   In the first set of experiments we focus on the most common setting of K=2K=2 sample points and compare our variance reduction method to its counterparts including DisARM [14] and Double CV [60]. Results for RLOO are omitted since it is consistently outperformed by Double CV [see 60]. Table 2 shows, on all three datasets, RODEO achieves the best training ELBOs. In Figure 1 (left), we plot the gradient variance and average training ELBOs against training steps for all estimators on dynamically binarized MNIST. RODEO outperforms DisARM and Double CV by a large margin in gradient variance.

Refer to caption
Refer to caption
Figure 2: Training binary latent VAEs with Gaussian likelihoods, K=2K=2, and non-binarized datasets.

Next, we consider VAEs with Gaussian likelihoods trained on non-binarized datasets. In this case the gradient estimates suffer from even higher variance due to the large range of values f(x)f(x) can take. The results are plotted in Figure 2. We see that RODEO has substantially lower variance than DisARM and Double CV, leading to significant improvements in training ELBOs. In Appendix Figure 7, we find that RODEO also consistently yields the best test set performance in all six settings.

𝑲=𝟑\boldsymbol{K=3}   In the second set of experiments we compare RELAX [20], which uses three evaluations of ff per training step, with RODEO, Double CV, and ARMS [13] for K=3K=3. Figure 1 (right) and Table 2 demonstrate that RODEO outperforms the three previous methods and generally leads to the lowest variance. Although RELAX was often observed to have very strong performance in prior work [14, 60], our results in Figure 1 suggest that, for dynamically binarized datasets, much larger gains can be achieved by using the same number of function evaluations in other estimators.

For the experiments mentioned above, we report final training ELBOs in Table 2, test log-likelihood bounds in Appendix Table 4, and binarized MNIST average running time in Appendix Table 5. For K=3K=3, RODEO has the best final performance in 5 out of 6 tasks and runtime nearly identical to RELAX. For K=2K=2, RODEO has the best final performance for all tasks and is 1.5 times slower than Double CV and DisARM. We attribute the runtime gap to the Gibbs operator 4 which performs 2d2d evaluations of the auxiliary functions HH and HH^{*} in 13 and 14. While this results in some extra cost, the network parameterizing HH and HH^{*} takes only 2-dimensional inputs, produces scalar outputs, and is typically small relative to the VAE model. As a result, the evaluation of HH and HH^{*} is significantly cheaper than that of ff, and its relative contribution to runtime shrinks as the cost of ff grows. To demonstrate this, we include a wall clock time comparison of our method with RLOO in Section C.1. In this experiment we replace the two-layer MLP-based VAE with a ResNet VAE, where the cost of ff is significantly higher than the single-layer MLP of H,HH,H^{*}. In this case, RODEO and RLOO have very close per-iteration time (0.025s vs. 0.023s). And RODEO achieves better training ELBOs than RLOO for the same amount of time.

6.2 Training hierarchical Bernoulli VAEs

To investigate the performance of RODEO when scaling to hierarchical discrete latent variable models, we follow DisARM [14, 68] to train VAEs with 2/3/4 stochastic layers, each of which consists of 200200 Bernoulli variables. We set K=2K=2 and compare our estimator with DisARM and Double CV on dynamically binarized MNIST, Fashion-MNIST, and Omniglot. For each stochastic layer, we use a different CV network which has the same architecture as those in our VAE experiments from the previous section. More details are presented in Appendix D.

Refer to caption
Figure 3: Training hierarchical binary latent VAEs with four stochastic layers on Fashion-MNIST. In this experiment, the estimators have very different behaviors towards the beginning and the end of training. We show this on the right by zooming into the first 50K steps of the gradient variance plot.

We plot the results on VAEs with four stochastic layers on Fashion-MNIST in Figure 3. The results for other datasets and for 2 and 3 stochastic layers can be found in Appendix E. RODEO generally achieves the best training ELBOs. One difference we noticed when comparing the variance results with those obtained from single-stochastic-layer VAEs is that these estimators have very different behaviors towards the beginning and the end of training. For example, in Figure 3, Double CV starts with lower variance than DisARM, but the gap diminishes after around 100K steps, and DisARM start to perform better as the training proceeds. In contrast, RODEO has the lowest variance in both phases for all datasets save Omniglot, where DisARM overtakes it in the long run.

6.3 Ablation study

Refer to caption
Refer to caption
Refer to caption
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 4: Ablation study of impact of RODEO components: (a) Stein operators, (b) LOO baseline and global and local CVs, (c) surrogate functions on binary VAE training performance.

Finally, we conduct an ablation study to gain more insight into the impact of each component of RODEO. In each experiment we train binary latent VAEs with K=2K=2.

Impact of Stein operator   Figure 4(a) explores the impact of RODEO Stein operator choice on non-binarized MNIST. As expected, the less stable difference 8 and MPF 6 operators lead to significantly higher gradient variances and worse training ELBOs. In fact, the same operators led to divergent training on binarized MNIST. The Barker operator 6 with neighborhoods defined by having one differing coordinate yields results very similar to Gibbs 4 as the operators themselves are very similar for binary variables (notice that q(y)q(y)+q(x)=q(yi|xi)\frac{q(y)}{q(y)+q(x)}=q(y_{i}|x_{-i}) when xi=yix_{-i}=y_{-i}).

Impact of LOO baseline and global and local CVs   Figure 4(b) explores binarized MNIST performance when RODEO is modified to retain a) only the LOO baseline (this is equivalent to standard RLOO), b) only the global CV, c) the LOO baseline + the global CV, or d) the LOO baseline + the local CV. We observe that the global and local Stein CVs both contribute to variance reduction and have complementary effects. Moreover, remarkably, the global Stein CV alone outperforms RLOO.

Impact of surrogate functions   To tease apart the benefits of our new surrogate functions 13 and the remaining RODEO components, we replace the surrogate function cϕ(z)c_{\phi}(z) in RELAX [20] with our surrogates, which only requires a single evaluation of ff per x(k)x^{(k)} (see Appendix A for more details). Figure 4(c) compares the performance of RODEO and this modified RELAX on binarized MNIST. Since the surrogate functions are matched, the consistent improvements over modified RELAX can be attributed to the Stein and double CV components of RODEO. In Section C.2, we also experiment with increasing the complexity of HH and HH^{*} by using two hidden layers instead of a single hidden layer in the MLP. We did not observe significant improvements in variance reduction.

7 Conclusions, Limitations, and Future Work

This work tackles the gradient estimation problem for discrete distributions. We proposed a variance reduction technique which exploits Stein operators to generate control variates for REINFORCE leave-one-out estimators. Our RODEO estimator does not rely on continuous reparameterization of the distribution, requires no additional function evaluations per sample point, and can be adapted online to learn very flexible control variates parameterized by neural networks.

One potential drawback of our surrogate function constructions 13 and 14 is the need to evaluate an auxiliary function (HH or HH^{*}) at K1K-1 locations. This cost can be comfortably borne when HH and HH^{*} are much cheaper to evaluate than ff, such as in the ResNet VAE example in Section C.1 and many large VAE models used in practice [63]. And, in our experiments with the most common sample sizes, the runtime of RODEO was no worse than that of RELAX. To obtain a more favorable cost for large KK, one could employ alternative surrogates that require only a constant number of auxiliary function evaluations, e.g.,

hk(y)=H(1K1jkf(x(j)),1K1jkf(x(j))(yx(j))).\textstyle h_{k}(y)=H\big{(}\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)}),\frac{1}{K-1}\sum_{j\neq k}\nabla f(x^{(j)})^{\top}(y-x^{(j)})\big{)}.

The runtime of RODEO could also be improved by varying the Stein operator employed. For example, the Gibbs operator (Ah)(x)(Ah)(x) 4 used in our experiments evaluated its surrogate function hh at dd neighboring locations of xx. This evaluation complexity could be reduced by subsampling neighbors, resulting in a cheaper but still valid Stein operator, or by employing the numerically stable Barker operator 6 with fewer neighbors. Either strategy would introduce a speed-variance trade-off worthy of study in follow-up work. Finally, we have restricted our focus in this work to differentiable target functions ff. In future work, this limitation could be overcome by designing effective surrogate functions that make no use of derivative information.

Acknowledgments and Disclosure of Funding

We thank Heishiro Kanagawa for suggesting appropriate names for the Barker Stein operator.

References

  • Anastasiou et al. [2021] Andreas Anastasiou, Alessandro Barp, François-Xavier Briol, Bruno Ebner, Robert E Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, and Lester Mackey. Stein’s method meets statistics: A review of some recent developments. arXiv preprint arXiv:2105.03481, 2021.
  • Assaraf and Caffarel [1999] Roland Assaraf and Michel Caffarel. Zero-variance principle for Monte Carlo algorithms. Phys. Rev. Lett., 83:4682–4685, Dec 1999.
  • Barbour [1988] A. D. Barbour. Stein’s method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. ISSN 0021-9002. A celebration of applied probability.
  • Barker [1965] Av A Barker. Monte Carlo calculations of the radial distribution functions for a proton? electron plasma. Australian Journal of Physics, 18(2):119–134, 1965.
  • Barp et al. [2018] Alessandro Barp, Chris Oates, Emilio Porcu, and Mark Girolami. A Riemann-Stein kernel method. arXiv preprint arXiv:1810.04946, 2018.
  • Barp et al. [2019] Alessandro Barp, Francois-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. Advances in Neural Information Processing Systems, 32:12964–12976, 2019.
  • Bengio et al. [2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Bresler and Nagaraj [2019] Guy Bresler and Dheeraj Nagaraj. Stein’s method for stationary distributions of Markov chains and application to Ising models. The Annals of Applied Probability, 29(5):3230–3265, 2019.
  • Brown and Xia [2001] Timothy C. Brown and Aihua Xia. Stein’s Method and Birth-Death Processes. The Annals of Probability, 29(3):1373 – 1403, 2001.
  • Burda et al. [2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations, 2015.
  • Cong et al. [2019] Yulai Cong, Miaoyun Zhao, Ke Bai, and Lawrence Carin. GO gradient for expectation-based objectives. In International Conference on Learning Representations, 2019.
  • Dellaportas and Kontoyiannis [2012] Petros Dellaportas and Ioannis Kontoyiannis. Control variates for estimation based on reversible Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):133–161, 2012.
  • Dimitriev and Zhou [2021] Aleksandar Dimitriev and Mingyuan Zhou. ARMS: Antithetic-REINFORCE-Multi-Sample gradient for binary variables. In International Conference on Machine Learning, volume 139, pages 2717–2727, 2021.
  • Dong et al. [2020] Zhe Dong, Andriy Mnih, and George Tucker. DisARM: An antithetic gradient estimator for binary latent variables. In Advances in Neural Information Processing Systems, volume 33, pages 18637–18647, 2020.
  • Dong et al. [2021] Zhe Dong, Andriy Mnih, and George Tucker. Coupled gradient estimators for discrete latent variables. arXiv preprint arXiv:2106.08056, 2021.
  • Eichelsbacher and Reinert [2008] Peter Eichelsbacher and Gesine Reinert. Stein’s method for discrete Gibbs measures. The Annals of Applied Probability, 18(4):1588–1618, 2008.
  • Geyer [2011] Charles J. Geyer. Introduction to Markov chain Monte Carlo. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, pages 3–48. CRC Press, 2011.
  • Glynn [1990] Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Commun. ACM, 33(10):75–84, October 1990.
  • Gorham and Mackey [2015] Jackson Gorham and Lester Mackey. Measuring sample quality with Stein’s method. In Advances in Neural Information Processing Systems, pages 226–234, 2015.
  • Grathwohl et al. [2018] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
  • Grathwohl et al. [2021] Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops I took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pages 3831–3841, 2021.
  • Gu et al. [2016] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Learning Representations, 2016.
  • Henderson [1997] Shane G Henderson. Variance reduction via an approximating Markov process. PhD thesis, Stanford University, 1997.
  • Hodgkinson et al. [2020] Liam Hodgkinson, Robert Salomone, and Fred Roosta. The reproducing stein kernel approach for post-hoc corrected sampling. arXiv preprint arXiv:2001.09266, 2020.
  • Holmes [2004] Susan Holmes. Stein’s method for birth and death chains. In Persi Diaconis and Susan Holmes, editors, Stein’s Method: Expository Lectures and Applications, pages 42–65. 2004.
  • Jang et al. [2017] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations, 2017.
  • Jankowiak and Obermeyer [2018] Martin Jankowiak and Fritz Obermeyer. Pathwise derivatives beyond the reparameterization trick. In International Conference on Machine Learning, pages 2235–2244. PMLR, 2018.
  • Karlin and McGregor [1957] Samuel Karlin and James McGregor. The classification of birth and death processes. Transactions of the American Mathematical Society, 86(2):366–400, 1957.
  • Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Kool et al. [2019] W. Kool, H. V. Hoof, and M. Welling. Buy 4 REINFORCE samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019.
  • Kool et al. [2020] Wouter Kool, Herke van Hoof, and Max Welling. Estimating gradients for discrete random variables by sampling without replacement. In International Conference on Learning Representations, 2020.
  • Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Liu et al. [2017] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-depedent control variates for policy optimization via Stein’s identity. arXiv preprint arXiv:1710.11198, 2017.
  • Maddison et al. [2017] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
  • Mira et al. [2013] Antonietta Mira, Reza Solgi, and Daniele Imparato. Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662, 2013.
  • Mnih and Gregor [2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791–1799, 2014.
  • Mnih and Rezende [2016] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.
  • Mohamed et al. [2020] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020.
  • Niepert et al. [2021] Mathias Niepert, Pasquale Minervini, and Luca Franceschi. Implicit MLE: backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34:14567–14579, 2021.
  • Oates et al. [2017] Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.
  • Owen [2013] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
  • Paisley et al. [2012] John Paisley, David M Blei, and Michael I Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, pages 1363–1370, 2012.
  • Parmas and Sugiyama [2021] Paavo Parmas and Masashi Sugiyama. A unified view of likelihood ratio and reparameterization gradients. In International Conference on Artificial Intelligence and Statistics, pages 4078–4086. PMLR, 2021.
  • Ranganath et al. [2014] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, page 814–822, 2014.
  • Reinert and Ross [2019] Gesine Reinert and Nathan Ross. Approximating stationary distributions of fast mixing Glauber dynamics, with applications to exponential random graphs. The Annals of Applied Probability, 29(5):3201–3229, 2019.
  • Rezende et al. [2014] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • Richter et al. [2020] Lorenz Richter, Ayman Boustati, Nikolas Nüsken, Francisco Ruiz, and Omer Deniz Akyildiz. VarGrad: A low-variance gradient estimator for variational inference. In Advances in Neural Information Processing Systems, volume 33, pages 13481–13492, 2020.
  • Salimans and Knowles [2014] Tim Salimans and David A Knowles. On using control variates with stochastic approximation for variational Bayes and its connection to stochastic linear regression. arXiv preprint arXiv:1401.1022, 2014.
  • Schulman et al. [2015] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. Advances in Neural Information Processing Systems, 28, 2015.
  • Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • Si et al. [2020] Shijing Si, Chris Oates, Andrew B Duncan, Lawrence Carin, and François-Xavier Briol. Scalable control variates for Monte Carlo methods via stochastic optimization. arXiv preprint arXiv:2006.07487, 2020.
  • Sohl-Dickstein et al. [2011] Jascha Sohl-Dickstein, Peter Battaglino, and Michael R DeWeese. Minimum probability flow learning. In International Conference on Machine Learning, pages 905–912, 2011.
  • South et al. [2018] Leah F South, Chris J Oates, Antonietta Mira, and Christopher Drovandi. Regularised zero-variance control variates for high-dimensional variance reduction. arXiv preprint arXiv:1811.05073, 2018.
  • Stein [1972] Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
  • Stein et al. [2004] Charles Stein, Persi Diaconis, Susan Holmes, and Gesine Reinert. Use of exchangeable pairs in the analysis of simulations. Lecture Notes-Monograph Series, pages 1–26, 2004.
  • Stirzaker [2005] David Stirzaker. Stochastic processes and models. OUP Oxford, 2005.
  • Titsias and Lázaro-Gredilla [2014] Michalis K. Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
  • Titsias and Lázaro-Gredilla [2015] Michalis K Titsias and Miguel Lázaro-Gredilla. Local expectation gradients for black box variational inference. Advances in neural information processing systems, 28:2638–2646, 2015.
  • Titsias and Shi [2022] Michalis K Titsias and Jiaxin Shi. Double control variates for gradient estimation in discrete latent variable models. International Conference on Artificial Intelligence and Statistics, 2022.
  • Tokui and Sato [2017] Seiya Tokui and Issei Sato. Evaluating the variance of likelihood-ratio gradient estimators. In International Conference on Machine Learning, pages 3414–3423, 2017.
  • Tucker et al. [2017] G. Tucker, A. Mnih, C. J. Maddison, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, 2017.
  • Vahdat and Kautz [2020] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33:19667–19679, 2020.
  • Weaver and Tao [2001] Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pages 538–545, 2001.
  • Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Yang et al. [2018] Jiasen Yang, Qiang Liu, Vinayak Rao, and Jennifer Neville. Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning, pages 5561–5570. PMLR, 2018.
  • Yin and Zhou [2019] Mingzhang Yin and Mingyuan Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations, 2019.
  • Yin et al. [2020] Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic best subset selection via gradient-based optimization, 2020.
  • Zanella [2020] Giacomo Zanella. Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, 115(530):852–865, 2020.

Appendix A Sample-dependent Baselines in REBAR and RELAX

We start with the REINFORCE estimator with the sample-dependent baseline bkb_{k}:

1Kk=1K(f(x(k))bk)ηlogqη(x(k))+𝔼[bkηlogqη(x(k))].\frac{1}{K}\sum_{k=1}^{K}(f(x^{(k)})-b_{k})\nabla_{\eta}\log q_{\eta}(x^{(k)})+\mathbb{E}[b_{k}\nabla_{\eta}\log q_{\eta}(x^{(k)})]. (15)

REBAR [62] introduces indirect independence on x(k)x^{(k)} in bkb_{k} through the continuous reparameterization x=H(z),z(k)qη(z|x=x(k))x=H(z),z^{(k)}\sim q_{\eta}(z|x=x^{(k)}), where zz is a continuous variable and HH is an argmax-like thresholding function. Specifically, bk=f(σλ(z(k)))b_{k}=f(\sigma_{\lambda}(z^{(k)})), where σλ\sigma_{\lambda} is a continuous relaxation of HH controlled by the parameter λ\lambda. The correction term decomposes into two parts:

𝔼x(k)[𝔼z(k)|x(k)[f(σλ(z(k)))]ηlogqη(x(k))]\displaystyle\mathbb{E}_{x^{(k)}}[\mathbb{E}_{z^{(k)}|x^{(k)}}[f(\sigma_{\lambda}(z^{(k)}))]\nabla_{\eta}\log q_{\eta}(x^{(k)})]
=\displaystyle= η𝔼qη(z)[f(σλ(z))]𝔼x(k)[η𝔼qη(z(k)|x(k))[f(σλ(z(k)))]].\displaystyle\;\nabla_{\eta}\mathbb{E}_{q_{\eta}(z)}[f(\sigma_{\lambda}(z))]-\mathbb{E}_{x^{(k)}}[\nabla_{\eta}\mathbb{E}_{q_{\eta}(z^{(k)}|x^{(k)})}[f(\sigma_{\lambda}(z^{(k)}))]].

Both parts can be estimated with the reparameterization trick [29, 47, 58] which often has low variance. The RELAX [20] estimator generalizes REBAR by noticing that f(σλ(z))f(\sigma_{\lambda}(z)) can be replaced with a free-form differentiable function cϕ(z)c_{\phi}(z). However, RELAX still relies on parameterizing cϕ(z)c_{\phi}(z) as f(σλ(z))+rθ(z)f(\sigma_{\lambda}(z))+r_{\theta}(z) to achieve strong performance, as noted in Dong et al. [14].

To form modified RELAX in Section 6.3, we replace bk=cϕ(z(k))b_{k}=c_{\phi}(z^{(k)}) with bk=hk(σλ(z(k)))b_{k}=h_{k}(\sigma_{\lambda}(z^{(k)})) for hkh_{k} defined in 13.

Appendix B Proof of Unbiasedness of RODEO

Recall our estimator defined in RODEO is

1Kk=1K[(f(x(k))1K1jk(f(x(j))+(Ahj)(x(j))))ηlogqη(x(k))+(Ah~k)(x(k))].\frac{1}{K}\sum_{k=1}^{K}[(f(x^{(k)})-\frac{1}{K-1}\sum_{j\neq k}(f(x^{(j)})+(Ah_{j})(x^{(j)})))\cdot\nabla_{\eta}\log q_{\eta}(x^{(k)})+(A\tilde{h}_{k}^{\star})(x^{(k)})]. (16)

To show the unbiasedness, we compute its expectation under qηq_{\eta} as

1Kk=1K𝔼qη[f(x(k))ηlogqη(x(k))]\displaystyle\phantom{{}={}}\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{q_{\eta}}[f(x^{(k)})\nabla_{\eta}\log q_{\eta}(x^{(k)})]
1K(K1)k=1Kjk𝔼qη[(f(x(j))+(Ahj)(x(j)))ηlogqη(x(k))]\displaystyle-\frac{1}{K(K-1)}\sum_{k=1}^{K}\sum_{j\neq k}\mathbb{E}_{q_{\eta}}[(f(x^{(j)})+(Ah_{j})(x^{(j)}))\nabla_{\eta}\log q_{\eta}(x^{(k)})]
+1Kk=1K𝔼qη[(Ah~k)(x(k))].\displaystyle+\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{q_{\eta}}[(A\tilde{h}_{k}^{\star})(x^{(k)})].

Since the first term is the desired gradient η𝔼qη[f(x)]\nabla_{\eta}\mathbb{E}_{q_{\eta}}[f(x)] and the third term is zero, it suffices to show that the second term also vanishes. Using the law of total expectations, we find for jkj\neq k,

𝔼qη[(f(x(j))+(Ahj)(x(j)))ηlogqη(x(k))]\displaystyle\mathbb{E}_{q_{\eta}}[(f(x^{(j)})+(Ah_{j})(x^{(j)}))\nabla_{\eta}\log q_{\eta}(x^{(k)})]
=𝔼x(k)qη[𝔼qη[f(x(j))+(Ahj)(x(j))x(k)]ηlogqη(x(k))]\displaystyle=\mathbb{E}_{x^{(k)}\sim q_{\eta}}[\mathbb{E}_{q_{\eta}}[f(x^{(j)})+(Ah_{j})(x^{(j)})\mid x^{(k)}]\nabla_{\eta}\log q_{\eta}(x^{(k)})]
=𝔼x(k)qη[𝔼qη[f(x(j))x(k)]ηlogqη(x(k))]\displaystyle=\mathbb{E}_{x^{(k)}\sim q_{\eta}}[\mathbb{E}_{q_{\eta}}[f(x^{(j)})\mid x^{(k)}]\nabla_{\eta}\log q_{\eta}(x^{(k)})]
=𝔼qη[f(x(j))ηlogqη(x(k))]\displaystyle=\mathbb{E}_{q_{\eta}}[f(x^{(j)})\nabla_{\eta}\log q_{\eta}(x^{(k)})]
=𝔼x(j)qη[f(x(j))𝔼x(k)qη[ηlogqη(x(k))x(j)]]=0,\displaystyle=\mathbb{E}_{x^{(j)}\sim q_{\eta}}[f(x^{(j)})\mathbb{E}_{x^{(k)}\sim q_{\eta}}[\nabla_{\eta}\log q_{\eta}(x^{(k)})\mid x^{(j)}]]=0,

which completes the proof.

Appendix C Additional Experiments

C.1 Wall clock time comparison with RLOO

Besides necessary target function evaluations, the RODEO estimator comes with the additional cost of evaluating the neural network-based H,HH,H^{*}. Therefore, RODEO is most suited to the problems where the cost of evaluating ff dominates that of evaluating H,HH,H^{*}. This is often the case in practice. For example, state-of-the-art variational autoencoders [e.g., 63] are often built on expensive neural architectures such as deep residual networks (ResNets). Here, to demonstrate the practical advantage of our method as the complexity of ff grows, we replace the two-layer MLP VAEs used in previous experiments with a ResNet VAE (architecture shown in Table 3), while the neural network used by H,HH,H^{*} remains a single-layer MLP with 100 hidden units. We then compare the wall clock performance of RODEO with RLOO. The latent variables in this experiment remain binary and have 200 dimensions.

The results are shown in Figure 5. RODEO achieves better training ELBOs than RLOO in the same amount of time. In fact, for this VAE architecture, the per-iteration time of RODEO is 25.2ms, which is very close to the 23.1ms of RLOO. This indicates that the cost of ff is significantly higher than that of H,HH,H^{*}.

Refer to caption
Figure 5: Comparing the performance of RODEO and RLOO on more expensive ResNet VAE models trained on binarized MNIST with K=2K=2. The middle plot shows the average wall clock performance over 5 trials.
Table 3: Architecture of the ResNet VAE in Section C.1. 3x3xCC means kernel size 3x3 and CC output channels. Each (De)conv Res block is composed of two (de)convolutional layers with strides 1, same padding, and ReLU activations, plus a skip connection with identity map. For Res blocks with downsample and upsample functions, the first convolutional layer has strides 2, and the skip connection is replaced by a convolutional layer with 2x2 kernel size and strides 2.

Encoder Decoder
Conv 3x3x16, strides 1, padding 1 Fully connected, 7x7x64 units
Conv Res block 3x3x16 Deconv Res block 3x3x64
Conv Res block 3x3x16 Deconv Res block 3x3x64
Conv Res block 3x3x32 (downsample by 2) Deconv Res block 3x3x32 (upsample by 2)
Conv Res block 3x3x32 Deconv Res block 3x3x32
Conv Res block 3x3x64 (downsample by 2) Deconv Res block 3x3x16 (upsample by 2)
Conv Res block 3x3x64 Deconv Res block 3x3x16
Fully connected, 200 units Deconv 3x3x1, strides 1, padding 1

C.2 Impact of neural network architectures of surrogate functions

We conduct one more ablation study to investigate the impact of neural network architectures used by H,HH,H^{*}. Specifically, we replace the single-hidden-layer control variate network used in previous experiments with a two-hidden-layer MLP (each layer has 100 units) and compare their performance on binarized MNIST with K=2K=2. We keep other settings the same as in Section 6.1. The results are plotted in Figure 6. We do not observe significant difference between the two versions of RODEO.

Refer to caption
Figure 6: Comparing the performance of RODEO with single-hidden-layer and two-hidden-layer neural network architectures for H,HH,H^{*} on binary VAEs.

Appendix D Experimental Details

Our implementation is based on the open-source code of DisARM [14] (Apache license) and Double CV [60] (MIT license). Our figures display 1M training steps and our tables report performance after 1M training steps to replicate the experimental settings of DisARM [14].

D.1 Details of VAE experiments

VAEs are models with a joint density p(y,x)=p(y|x)p(x)p(y,x)=p(y|x)p(x), where xx denotes the latent variable. xx is assigned a uniform factorized Bernoulli prior. The likelihood pθ(y|x)p_{\theta}(y|x) is parameterized by the output of a neural network with xx as input and parameters θ\theta. The VAE has two hidden layers with 200200 units activated by LeakyReLU with the coefficient 0.30.3. To optimize the VAE we use Adam with base learning rate 10410^{-4} for non-binarized data and 10310^{-3} for dynamically binarized data, except for binarized Fashion-MNIST we decreased the learning rate to 3×1043\times 10^{-4} because otherwise the training is very unstable for all estimators. We use Adam with the same learning rate 10310^{-3} for adapting our control variate network in all experiments. The batch size is 100100. The settings of other estimators are kept the same with Titsias and Shi [60].

In the minimum probability flow (MPF) and Barker Stein operator 6, we choose the neighborhood 𝒩x\mathcal{N}_{x} to be the states that differ in only one coordinate from xx. Let y𝒩xy\in\mathcal{N}_{x} be an element in this neighborhood such that yixiy_{i}\neq x_{i} and yi=xiy_{-i}=x_{-i}. For the MPF Stein estimator and the difference Stein estimator 8, the density ratio q(y)q(x)\frac{q(y)}{q(x)} can be simplified to q(yi|xi)q(xi|xi)\frac{q(y_{i}|x_{-i})}{q(x_{i}|x_{-i})}. We further replace it with q(yi|xi)q(xi|xi)+103\frac{q(y_{i}|x_{-i})}{q(x_{i}|x_{-i})+10^{-3}} to alleviate numerical instability. The Barker Stein estimator does not suffer from the numerical issue since the coefficient is bounded in the Bernoulli case: q(y)q(x)+q(y)=q(yi|xi)q(xi|xi)+q(yi|xi)=q(yi|xi)\frac{q(y)}{q(x)+q(y)}=\frac{q(y_{i}|x_{-i})}{q(x_{i}|x_{-i})+q(y_{i}|x_{-i})}=q(y_{i}|x_{-i}). In our experiments, we find that the difference Stein estimator is highly unstable and may diverge as the iteration proceeds.

D.2 Details of hierarchical VAE experiments

We optimize the hierarchical VAE using Adam with base learning rate 10410^{-4}. Our control variate network is optimized using Adam with learning rate 10310^{-3}. Settings of training multilayer VAEs are kept the same with Dong et al. [14], except that we do not optimize the prior distribution of the VAE hidden layer and use a larger batch size 100100.

Appendix E Additional Results

In this section, we measure test set performance using 100 test points and the marginal log-likelihood bound of Burda et al. [10], which provides a tighter estimate of marginal log likelihood than the ELBO. Throughout, we call this the “test log-likelihood bound.”

Table 4: Average 100-point test log-likelihood bounds of binary latent VAEs trained with K=2,3K=2,3 (except for RELAX which uses 33 evaluations per step) on MNIST, Fashion-MNIST, and Omniglot. We report the average value ±1\pm 1 standard error after 1M steps over 5 independent runs.

Bernoulli Likelihoods Gaussian Likelihoods
MNIST Fashion-MNIST Omniglot MNIST Fashion-MNIST Omniglot
K=2K=2 DisARM 101.61±0.07-101.61\pm 0.07 239.11±0.11-239.11\pm 0.11 118.34±0.05-118.34\pm 0.05 669.26±0.53669.26\pm 0.53 163.40±0.59163.40\pm 0.59 305.32±0.89305.32\pm 0.89
Double CV 100.91±0.04-100.91\pm 0.04 239.00±0.17-239.00\pm 0.17 118.45±0.09-118.45\pm 0.09 677.02±0.93677.02\pm 0.93 164.99±0.71164.99\pm 0.71 304.72±1.39304.72\pm 1.39
RODEO (Ours) 100.78±0.16\bf-100.78\pm 0.16 238.97±0.09\bf-238.97\pm 0.09 118.09±0.05\bf-118.09\pm 0.05 681.11±0.31\bf 681.11\pm 0.31 168.26±0.73\bf 168.26\pm 0.73 308.55±1.02\bf 308.55\pm 1.02
K=3K=3 ARMS 99.08±0.12-99.08\pm 0.12 238.19±0.11-238.19\pm 0.11 116.78±0.13-116.78\pm 0.13 688.61±0.84688.61\pm 0.84 174.14±0.44174.14\pm 0.44 320.45±1.07320.45\pm 1.07
Double CV 99.16±0.12-99.16\pm 0.12 238.54±0.16-238.54\pm 0.16 116.75±0.15-116.75\pm 0.15 690.28±0.49690.28\pm 0.49 173.67±0.30173.67\pm 0.30 322.88±1.10322.88\pm 1.10
RODEO (Ours) 98.72±0.14\bf-98.72\pm 0.14 237.97±0.12\bf-237.97\pm 0.12 116.69±0.09\bf-116.69\pm 0.09 695.11±0.33\bf 695.11\pm 0.33 174.57±0.30\bf 174.57\pm 0.30 323.92±1.24\bf 323.92\pm 1.24
RELAX (3 evals) 100.80±0.09-100.80\pm 0.09 239.03±0.11-239.03\pm 0.11 117.60±0.06-117.60\pm 0.06 686.21±0.57686.21\pm 0.57 171.43±0.61171.43\pm 0.61 317.78±1.25317.78\pm 1.25

Binarized

Refer to caption

Non-binarized

Refer to caption
Figure 7: Average 100-point test log-likelihood bounds for binary latent VAEs trained on (top) dynamically binarized and (bottom) non-binarized MNIST, Fashion-MNIST, and Omniglot using K=2K=2.
Refer to caption
Refer to caption
Figure 8: Training binary latent VAEs with Gaussian likelihoods with three evaluations of ff per step using RODEO/Double CV/ARMS with K=3K=3 or RELAX on non-binarized MNIST, Fashion-MNIST, and Omniglot. (Top) variance of gradient estimates. (Bottom) the plot of average ELBO on training examples against training steps.

Binarized

Refer to caption

Non-binarized

Refer to caption
Figure 9: Average 100-point test log-likelihood bounds for binary latent VAEs trained on (top) dynamically binarized and (bottom) non-binarized MNIST, Fashion-MNIST, and Omniglot with three evaluations of ff per step using RODEO/Double CV/ARMS with K=3K=3 or RELAX.

MNIST

Refer to caption

Fashion-MNIST

Refer to caption

Omniglot

Refer to caption
Figure 10: Training hierarchical binary latent VAEs with two stochastic layers on dynamically binarized MNIST, Fashion-MNIST and Omniglot. We plot (left) the average ELBO on training examples and (middle) variance of gradient estimates. We zoom into the first 50K steps of the variance plot on the right figure.

MNIST

Refer to caption

Fashion-MNIST

Refer to caption

Omniglot

Refer to caption
Figure 11: Training hierarchical binary latent VAEs with three stochastic layers. on dynamically binarized MNIST, Fashion-MNIST and Omniglot. We plot the average ELBO on training examples (left) and variance of gradient estimates (middle). We zoom into the first 50K steps of the variance plot on the right figure.

MNIST

Refer to caption

Fashion-MNIST

Refer to caption

Omniglot

Refer to caption
Figure 12: Training hierarchical binary latent VAEs with four stochastic layers on dynamically binarized MNIST, Fashion-MNIST and Omniglot. We plot the average ELBO on training examples (left) and variance of gradient estimates (middle). We zoom into the first 50K steps of the variance plot on the right figure.
Table 5: Average running time across 10410^{4} steps on an NVIDIA 3080Ti GPU with an AMD 5950X CPU for the VAE experiment on binary MNIST in Section 6.1.

Double CV DisARM/ARMS RODEO (Ours) RELAX (3 evals)
K=2K=2 2.11 ms/step 1.89 ms/step 3.08 ms/step 4.71 ms/step
K=3K=3 2.28 ms/step 1.91 ms/step 4.72 ms/step
Table 6: Average running time across 10410^{4} steps on an NVIDIA 3080Ti GPU with an AMD 5950X CPU when training hierarchical VAEs with K=2K=2.

Double CV DisARM RODEO (Ours)
Two layers 4.33 ms/step 3.54 ms/step 6.79 ms/step
Three layers 7.69 ms/step 6.09 ms/step 10.61 ms/step
Four layers 11.67 ms/step 9.53 ms/step 14.91 ms/step
Table 7: Training hierarchical binary latent VAEs on dynamically binarized MNIST, Fashion-MNIST, and Omniglot. We report the average (±1\pm 1 standard error) training ELBOs and 100-point test log-likelihood bounds after 1M steps over 5 independent runs.

Training ELBO Test Log-Likelihood Bound
MNIST Fashion-MNIST Omniglot MNIST Fashion-MNIST Omniglot
Two layers Double CV 103.52±0.06-103.52\pm 0.06 239.82±0.07-239.82\pm 0.07 114.06±0.04-114.06\pm 0.04 97.62±0.08-97.62\pm 0.08 237.65±0.06-237.65\pm 0.06 110.48±0.04-110.48\pm 0.04
DisARM 103.39±0.12-103.39\pm 0.12 239.67±0.06\mathbf{-239.67\pm 0.06} 113.67±0.05\mathbf{-113.67\pm 0.05} 97.56±0.07-97.56\pm 0.07 237.61±0.06\mathbf{-237.61\pm 0.06} 110.15±0.04\mathbf{-110.15\pm 0.04}
RODEO (Ours) 103.15±0.07\mathbf{-103.15\pm 0.07} 239.76±0.09-239.76\pm 0.09 113.84±0.11-113.84\pm 0.11 97.43±0.03\mathbf{-97.43\pm 0.03} 237.63±0.07-237.63\pm 0.07 110.32±0.10-110.32\pm 0.10
Three layers Double CV 97.59±0.15-97.59\pm 0.15 234.34±0.07-234.34\pm 0.07 108.66±0.06-108.66\pm 0.06 93.71±0.12-93.71\pm 0.12 234.34±0.07-234.34\pm 0.07 107.48±0.07-107.48\pm 0.07
DisARM 97.95±0.30-97.95\pm 0.30 234.45±0.05-234.45\pm 0.05 108.60±0.08-108.60\pm 0.08 94.12±0.28-94.12\pm 0.28 234.46±0.06-234.46\pm 0.06 107.32±0.10-107.32\pm 0.10
RODEO (Ours) 97.21±0.17\mathbf{-97.21\pm 0.17} 234.11±0.10\mathbf{-234.11\pm 0.10} 108.51±0.04\mathbf{-108.51\pm 0.04} 93.52±0.16\mathbf{-93.52\pm 0.16} 234.19±0.07\mathbf{-234.19\pm 0.07} 107.26±0.06\mathbf{-107.26\pm 0.06}
Four layers Double CV 98.73±0.06-98.73\pm 0.06 235.69±0.07-235.69\pm 0.07 110.92±0.06-110.92\pm 0.06 93.28±0.03-93.28\pm 0.03 234.63±0.03-234.63\pm 0.03 107.86±0.03-107.86\pm 0.03
DisARM 98.97±0.02-98.97\pm 0.02 235.50±0.04-235.50\pm 0.04 110.85±0.07-110.85\pm 0.07 93.56±0.04-93.56\pm 0.04 234.52±0.04-234.52\pm 0.04 107.87±0.05-107.87\pm 0.05
RODEO (Ours) 98.67±0.14\mathbf{-98.67\pm 0.14} 235.39±0.05\mathbf{-235.39\pm 0.05} 110.79±0.03\mathbf{-110.79\pm 0.03} 93.27±0.09\mathbf{-93.27\pm 0.09} 234.39±0.06\mathbf{-234.39\pm 0.06} 107.77±0.02\mathbf{-107.77\pm 0.02}