Gradient Estimation with Discrete Stein Operators

Jiaxin Shi
Stanford University
jiaxins@stanford.edu
&Yuhao Zhou
Tsinghua University
yuhaoz.cs@gmail.com
Jessica Hwang
Stanford University
jjhwang@stanford.edu
&Michalis K. Titsias
DeepMind
mtitsias@google.com
&Lester Mackey
Microsoft Research New England
lmackey@microsoft.com
Work done while at Microsoft Research New England.

Abstract

Gradient estimation—approximating the gradient of an expectation with respect to the parameters of a distribution—is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.

1 Introduction

Modern machine learning relies heavily on gradient methods to optimize the parameters of a learning system. However, exact gradient computation is often difficult. For example, in variational inference for training latent variable models [43, 29], policy gradient algorithms in reinforcement learning [65], and combinatorial optimization [40], the exact gradient features an intractable sum or integral introduced by an expectation under an evolving probability distribution. To make progress, one resorts to estimating the gradient by drawing samples from that distribution [50, 39].

The two main classes of gradient estimators used in machine learning are the pathwise or reparameterization gradient estimators [29, 47, 58] and the REINFORCE or score function estimators [65, 18]. The pathwise estimators have shown great success in training variational autoencoders [29] but are only applicable to continuous probability distributions. The REINFORCE estimators are more general-purpose and easily accommodate discrete distributions but often suffer from excessively high variance.

To improve the quality of REINFORCE estimators, we develop a new variance reduction technique for discrete expectations. Our method is based upon Stein operators [see, e.g., 55, 1], computable functionals that generate mean-zero functions under a target distribution and provide a natural way of designing control variates (CVs) for stochastic estimation.

We first provide a general recipe for constructing practical Stein operators for discrete distributions (Table 1), generalizing the prior literature [9, 25, 16, 8, 46, 67, 6]. We then develop a gradient estimation framework—RODEO—that augments REINFORCE estimators with mean-zero CVs generated from Stein operators. Finally, inspired by Double CV [60], we extend our method to develop CVs for REINFORCE leave-one-out estimators [49, 30] to further reduce the variance.

The benefits of using Stein operators to construct discrete CVs are twofold. First, the operator structure permits us to learn CVs with a flexible functional form such as those parameterized by neural networks. Second, since our operators are derived from Markov chains on the discrete support, they naturally incorporate information from neighboring states of the process for variance reduction.

We evaluate RODEO on 15 benchmark tasks, including training binary variational autoencoders (VAEs) with one or more stochastic layers. In most cases and with the same number of function evaluations, RODEO delivers lower variance and better training objectives than the state-of-the-art gradient estimators DisARM [14, 69], ARMS [13], Double CV [60], and RELAX [20].

Table 1: Discrete Stein operators that generate mean-zero functions under

q

(i.e.,

\mathbb{E}_{q}[(Ah)(x)]=0

Stein Operator	$\boldsymbol{(Ah)(x)}$
Gibbs 4	$\frac{1}{d}\sum_{i=1}^{d}\sum_{y_{-i}=x_{-i}}q(y_{i}\|x_{-i})h(y)-h(x)$
MPF 6	$\sum_{y\in\mathcal{N}_{x},y\neq x}\sqrt{{q(y)/}{q(x)}}(h(y)-h(x))$
Barker 6	$\sum_{y\in\mathcal{N}_{x},y\neq x}\frac{q(y)}{q(x)+q(y)}(h(y)-h(x))$
Difference 8	$\frac{1}{d}\sum_{i=1}^{d}h(\texttt{dec}_{i}({x}))-\frac{q(\texttt{inc}_{i}({x}))}{q(x)}h(x)$

2 Background

We consider the problem of maximizing the objective function $\mathbb{E}_{q_{\eta}}[f(x)]$ with respect to the parameters $\eta$ of a discrete distribution $q_{\eta}(x)$ . Throughout the paper we assume $f(x)$ is a differentiable function of real-valued inputs $x\in\mathbb{R}^{d}$ but is only evaluated at a discrete subset $\mathcal{X}^{d}$ due to the discreteness of $q_{\eta}$ .¹¹1This assumption holds for many discrete probabilistic models [21] including the binary VAEs of Section 6. Exact computation of the expectation is typically intractable due to the complex nature of $f(x)$ and $q_{\eta}(x)$ . The standard workaround is to rewrite the gradient as $\nabla_{\eta}\mathbb{E}_{q_{\eta}}[f(x)]=\mathbb{E}_{q_{\eta}}[f(x)\nabla_{\eta}\log q_{\eta}(x)]$ and employ the Monte Carlo gradient estimator known as the score function or REINFORCE estimator [18, 65]:

\textstyle\frac{1}{K}\sum_{k=1}^{K}(f(x^{(k)})-b)\nabla_{\eta}\log q_{\eta}(x^{(k)})\quad\text{for}\quad x^{(1)},\dots,x^{(K)}\overset{i.i.d.}{\sim}q_{\eta}.

(REINFORCE)

Here, $b$ is a constant called the “baseline” introduced to reduce the variance of the estimator by reducing the scaling effect of $f(x^{(k)})$ . Since $\mathbb{E}_{q_{\eta}}[\nabla_{\eta}\log q_{\eta}(x)]=0$ , the REINFORCE estimator is unbiased for any choice of $b$ , and the term $b\nabla_{\eta}\log q_{\eta}(x)$ is known as a control variate (CV) [42, Ch. 8]. The optimal baseline can be estimated using additional function evaluations [64, 7, 45]. A simpler approach is to use moving averages of $f$ from historical evaluations or to train function approximators to mimic those values [37]. When $K\geq 2$ , a powerful variant of REINFORCE is obtained by replacing $b$ with the leave-one-out average of function values:

\textstyle\frac{1}{K}\sum_{k=1}^{K}\big{(}f(x^{(k)})-\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)})\big{)}\nabla_{\eta}\log q_{\eta}(x^{(k)}).

(RLOO)

This approach is often called the REINFORCE leave-one-out (RLOO) estimator [49, 30, 48] and was recently observed to have very strong performance in training discrete latent variable models [48, 14].

All the above methods construct a baseline that is independent of the point $x^{(k)}$ under consideration, but there are other ways to preserve the unbiasedness of the estimator. We are free to use a sample-dependent baseline $b(x^{(k)})$ as long as the expectation $c\triangleq\mathbb{E}_{q_{\eta}}[b(x^{(k)})\nabla_{\eta}\log q_{\eta}(x^{(k)})]$ is easily computed or has a low-variance unbiased estimate. In this case, we can correct for any bias introduced by $b(x^{(k)})$ by adding in this expectation: $\frac{1}{K}\sum_{k=1}^{K}(f(x^{(k)})-b(x^{(k)}))\nabla_{\eta}\log q_{\eta}(x^{(k)})+c.$ For example, $b(x^{(k)})$ can be a lower bound or a Taylor expansion of $f(x^{(k)})$ [43, 22]. Taking this a step further, the Double CV estimator [60] proposed to treat $f(x^{(k)})-b(x^{(k)})$ as the effective objective function and apply the leave-one-out idea:

\textstyle\frac{1}{K}\sum_{k=1}^{K}\big{[}\big{(}f(x^{(k)})\boldsymbol{-b_{k}(x^{(k)})}\big{)}-\frac{1}{K-1}\sum_{j\neq k}\big{(}f(x^{(j)})\boldsymbol{-b_{j}(x^{(j)})}\big{)}\big{]}\nabla_{\eta}\log q_{\eta}(x^{(k)})+c.

The resulting estimator adds two CVs to RLOO: the global CV $b_{k}(x^{(k)})\nabla_{\eta}\log q_{\eta}(x^{(k)})$ and the local CV $b_{k}(x^{(j)})$ . Intuitively, $b_{k}(x^{(j)})$ is aimed at reducing the variance of the LOO average. This is motivated by the fact that replacing $\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)})$ with $\mathbb{E}_{q_{\eta}}[f]$ in RLOO leads to lower variance [60, Prop 1]. To obtain a tractable correction term $c$ , Titsias and Shi [60] adopt a linear design of $b_{k}$ : $b_{k}(x)=\alpha\cdot\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)})(x-\mu)$ for $\mu=\mathbb{E}_{q_{\eta}}[x]$ and a coefficient $\alpha$ .

Although the Double CV framework points to a promising new direction for developing better REINFORCE estimators, one only obtains significant reduction in variance when $b_{k}$ is strongly correlated with $f$ . This may fail to hold for the above linear $b_{k}$ and a highly nonlinear $f$ , especially in applications like training deep generative models.

In the following two sections, we will introduce a method that allows us to use very flexible CVs while still maintaining a tractable correction term. Our method enables online adaptation of CVs to minimize gradient variance (similar to RELAX [20]) but does not assume $q_{\eta}$ has a continuous reparameterization. We then apply it to generalize the linear CVs in Double CV to very flexible ones such as neural networks. Moreover, we provide an effective CV design based on surrogate functions that requires no additional evaluation of $f$ compared to RLOO.

3 Control Variates from Discrete Stein Operators

At the heart of our new estimator is a technique for generating flexible discrete CVs, that is for generating a rich class of functions that have known expectations under a given discrete distribution $q$ . One way to achieve this is to identify any discrete-time Markov chain $(X^{(t)})_{t=0}^{\infty}$ with stationary distribution $q$ . Then, the transition matrix $P$ , defined via $P_{xy}=P(X^{(t+1)}=y|X^{(t)}=x)$ , satisfies $P^{\top}q=q$ and hence

\mathbb{E}_{q}[(P-I)h]=0,

(1)

for any integrable function $h$ . We overload our notation so that, for any suitable matrix $A$ , $(Ah)(x)\triangleq\sum_{y}A_{xy}h(y)$ . In other words, for any integrable $h$ , the function $(P-I)h$ is a valid CV as it has known mean under $q$ . Moreover, the linear operator $P-I$ is an example of a Stein operator [55, 1] in the sense that it generates mean-zero functions under $q$ . In fact, both Stein et al. [56] and Dellaportas and Kontoyiannis [12] developed CVs of the form $(P-I)h$ based on reversible discrete-time Markov chains and linear input functions $h(x)=x_{i}$ .

More generally, Barbour [3] and Henderson [23] observed that if we identify a continuous-time Markov chain $(X^{(t)})_{t\geq 0}$ with $q$ as its stationary distribution, then the generator $A$ , defined via

\displaystyle(Ah)(x)=\lim_{t\to 0}\textstyle\frac{\mathbb{E}[h(X^{(t)})|X^{(0)}=x]-h(x)}{t},

(2)

satisfies $A^{\top}q=0$ and hence $\mathbb{E}_{q}[Ah]=0$ for all integrable $h$ . Therefore, $A$ is also a Stein operator suitable for generating CVs. Moreover, since any discrete-time chain with transition matrix $P$ can be embedded into a continuous-time chain with transition rate matrix $A=P-I$ , this continuous-time construction is strictly more general [57, Ch. 4].

3.1 Discrete Stein operators

We next present several examples of broadly applicable discrete Stein operators (summarized in Table 1) that can serve as practical defaults.

Gibbs Stein operator The transition kernel of the random-scan Gibbs sampler with stationary distribution $q$ [see, e.g., 17] is

\textstyle P_{xy}=\frac{1}{d}\sum_{i=1}^{d}q(y_{i}|x_{-i})\mathbf{1}(y_{-i}=x_{-i}),

(3)

where $\mathbf{1}(\cdot)$ is the indicator function, $x_{-i}$ denotes all elements of $x$ except $x_{i}$ . The associated Stein operator is $A=P-I$ with

\textstyle(Ah)(x)=\frac{1}{d}\sum_{i=1}^{d}\sum_{y_{-i}=x_{-i}}q(y_{i}|x_{-i})h(y)-h(x).

(4)

In the binary variable setting, 4 recovers the operator Bresler and Nagaraj [8], Reinert and Ross [46] used to bound distances between the stationary distributions of Glauber dynamics Markov chains.

Zanella Stein operator Zanella [70] studied continuous-time Markov chains with generator

\textstyle A_{xy}=\kappa\left({q(y)/}{q(x)}\right)\mathbf{1}(y\in\mathcal{N}_{x},y\neq x)-\sum_{z\neq x}A_{xz}\,\mathbf{1}(y=x),

(5)

where $\mathcal{N}_{x}$ denotes the transition neighborhood of $x$ and $\kappa$ is a continuous positive function satisfying $\kappa(t)=t\kappa(1/t)$ . Conveniently, the neighborhood structure can be arbitrarily sparse. Moreover, the $\kappa$ constraint ensures that the chain satisfies detailed balance (i.e., $q(x)A_{xy}=q(y)A_{yx}$ ) and hence that $A^{\top}q=0$ . Hodgkinson et al. [24] call the associated operator the Zanella Stein operator:

\textstyle(Ah)(x)=\sum_{y\in\mathcal{N}_{x},y\neq x}\ \textstyle\kappa\big{(}\frac{q(y)}{q(x)}\big{)}(h(y)-h(x)).

(6)

Important special cases include the minimum probability flow [53] (MPF) Stein operator ( $\kappa(t)=\sqrt{t}$ ) discussed in Barp et al. [6] and the Barker [4] Stein operator ( $\kappa(t)=\frac{t}{t+1}$ ).

Birth-death Stein operator Let $\boldsymbol{e}_{1},\dots,\boldsymbol{e}_{d}$ represent the standard basis vectors on $\mathbb{R}^{d}$ . For any finite-cardinality space, we may index the elements of $\mathcal{X}=\{s_{0},\dots,s_{m-1}\}$ , let $\texttt{idx}(x_{i})$ return the index of the element that $x_{i}$ represents, and define the increment and decrement operators

\displaystyle\texttt{inc}_{i}({x})=x+\boldsymbol{e}_{i}(s_{(\texttt{idx}(x_{i})+1)\ \mathrm{mod}\ m}-x_{i})\quad\text{ and }\quad\texttt{dec}_{i}({x})=x+\boldsymbol{e}_{i}(s_{(\texttt{idx}(x_{i})-1)\ \mathrm{mod}\ m}-x_{i}).

Then the product birth-death process [28] on $\mathcal{X}^{d}$ with birth rates $b_{i,x}=\frac{q(\texttt{inc}_{i}({x}))}{q(x)}$ , death rates $d_{i,x}=1$ , and generator

\textstyle A_{xy}=\frac{1}{d}\sum_{i=1}^{d}b_{i,x}\mathbf{1}(y=\texttt{inc}_{i}({x}))+d_{i,x}\mathbf{1}(y=\texttt{dec}_{i}({x}))-(b_{i,x}+d_{i,x})\mathbf{1}(y=x),

has $q$ as a stationary distribution, as $A^{\top}q=0$ . This construction yields the birth-death Stein operator

\textstyle(Ag)(x)=\frac{1}{d}\sum_{i=1}^{d}b_{i,x}(g(\texttt{inc}_{i}({x}))-g(x))-d_{i,x}(g(x)-g(\texttt{dec}_{i}({x}))).

(7)

An analogous operator is available for countably infinite $\mathcal{X}$ [24], and Brown and Xia [9], Holmes [25], Eichelsbacher and Reinert [16] used related operators to characterize convergence to discrete target distributions. Moreover, by substituting $h(x)=g(x)-g(\texttt{inc}({x}))$ in 7, we recover the difference Stein operator proposed by Yang et al. [67] without reference to birth-death processes:

\textstyle(Ah)(x)=\frac{1}{d}\sum_{i=1}^{d}h(\texttt{dec}_{i}({x}))-b_{i,x}h(x).

(8)

Choosing a Stein operator Despite being better-known, the difference and MPF operators often suffer from large variance and numerical instability due to their use of unbounded probability ratios $q(y)/q(x)$ . As a result, we recommend the numerically stable Gibbs operator when each component of $x$ is binary or takes on a small number of values. When $x_{i}$ takes on a large number of values ( $m$ ), the Gibbs operator suffers from linear scaling with $m$ . In this case, we recommend the Barker operator where a sparse neighborhood structure can be specified, such as $\mathcal{N}_{x}=\{\texttt{inc}_{i}(x),\texttt{dec}_{i}(x)\text{ for }i\in[d]\}$ . The Barker operator is numerically stable as its $\kappa(\frac{q(y)}{q(x)})=\frac{q(y)}{q(x)+q(y)}$ .

4 Gradient Estimation with Discrete Stein Operators

Recall that REINFORCE estimates the gradient $\mathbb{E}_{q_{\eta}}[f(x)\nabla_{\eta}\log q_{\eta}(x)].$ Due to the existence of $\nabla_{\eta}\log q_{\eta}(x)$ as a weighting function, we apply a discrete Stein operator to a vector-valued function $\tilde{h}:\mathcal{X}\to\mathbb{R}^{d}$ per dimension to construct the following estimator with a mean-zero CV:

\displaystyle\mathbb{E}_{q_{\eta}}[f(x)\nabla_{\eta_{i}}\log q_{\eta}(x)+(A\tilde{h}_{i})(x)].

(9)

Ideally, we want to choose the $\tilde{h}$ such that $A\tilde{h}_{i}$ strongly correlates with $f(x)\nabla_{\eta_{i}}\log q_{\eta}(x)$ to reduce its variance. The optimal $\tilde{h}_{i}$ that yields zero variance is given by the solution of Poisson equation

\displaystyle\mathbb{E}_{q_{\eta}}[f\nabla_{\eta_{i}}\log q_{\eta}]-f\nabla_{\eta_{i}}\log q_{\eta}=A\tilde{h}_{i}.

(10)

We could learn a separate $\tilde{h}_{i}$ per dimension to approximate the solution of 10. However, this poses a difficult optimization problem that requires simultaneously solving $d$ Poisson equations. Instead, we will incorporate knowledge about the structure of the solution to simplify the optimization.

To determine a candidate functional form for $\tilde{h}$ , we draw inspiration from an “optimal” Markov chain in which the current state is ignored entirely and the new state is generated independently from $q_{\eta}$ . In this case, $(P-I)\tilde{h}_{i}$ becomes $\mathbb{E}_{q_{\eta}}[\tilde{h}_{i}]-\tilde{h}_{i}$ , and the optimal solution is $\tilde{h}_{i}=f\nabla_{\eta_{i}}\log q_{\eta}$ . Inspired by this, we consider $\tilde{h}$ of the form

\textstyle\tilde{h}(x)=h(x)\nabla_{\eta}\log q_{\eta}(x),

(11)

where we now only need to learn a scalar-valued function $h$ . Notably, when $h$ exactly equals $f$ and $A=P-I$ for any discrete time Markov chain kernel $P$ , our CV adjustment amounts to Rao-Blackwellization [42, Sec. 8.7], as we end up replacing $f(x)\nabla_{\eta_{i}}\log q_{\eta}(x)$ with its conditional expectation $P(f\nabla_{\eta_{i}}\log q_{\eta})(x)=\mathbb{E}_{X_{t+1}|X_{t}=x}[f(X_{t+1})\nabla_{\eta_{i}}\log q_{\eta}(X_{t+1})]$ . This yields a guaranteed variance reduction.

Surrogate function design Based on the above reasoning, we can view $h$ as a surrogate for $f$ . We avoid directly setting $h=f$ because our Stein operators evaluate $h$ at all neighbors of the sample points, and $f$ can be very expensive to evaluate [see, e.g., 51]. To avoid this problem, we first observe that $\tilde{h}$ 11 can be made sample-specific, i.e., we can use a different $\tilde{h}_{k}$ for each sample point $x^{(k)}$ :

\textstyle\frac{1}{K}\sum_{k=1}^{K}[f(x^{(k)})\nabla_{\eta}\log q_{\eta}(x^{(k)})+(A\tilde{h}_{k})(x^{(k)})].

(12)

We then consider the following choices of $h_{k}$ that are informed about $f$ while being cheap to evaluate:

\textstyle h_{k}(y)=\frac{1}{K-1}\sum_{j\neq k}\!H(f(x^{(j)}),\nabla f(x^{(j)})^{\top}(y-x^{(j)})).

(13)

Here $H$ belongs to a flexible parametric family of functions such as neural networks and is chosen to have significantly lower cost than $f$ . $y$ will take on the values of $x^{(k)}$ and its neighbors in the Markov chain. We omit $x^{(k)}$ in the sum 13 to ensure that the function $h_{k}$ is independent of $x^{(k)}$ and hence that $\mathbb{E}_{q_{\eta}}[\frac{1}{K}\sum_{k=1}^{K}(A\tilde{h}_{k})(x^{(k)})]=0$ . Moreover, this surrogate function design introduces no additional evaluations of $f$ beyond those required for the usual RLOO estimator. Also, as observed by Titsias and Shi [60], for many applications, including VAE training, where $f$ has parameters $\theta$ learned through gradient-based optimization, $\{\nabla f(x^{(k)})\}_{k=1}^{K}$ can be obtained “for free” from the same backpropagation used to compute $\nabla_{\theta}f_{\theta}(x^{(k)})$ (see Algorithm 1).

RODEO We can further improve the estimator by leveraging discrete Stein operators and the above surrogate function design to construct both the global and local CVs in the Double CV framework [60] (see Section 2). We call our final estimator RODEO for RLOO with Discrete StEin Operators:

\displaystyle{\textstyle\frac{1}{K}}\operatorname*{\textstyle\sum}_{k=1}^{K}\![(f(x^{(k)})-\!

\displaystyle{\textstyle\frac{1}{K-1}}\operatorname*{\textstyle\sum}_{j\neq k}(f(x^{(j)})\!+\!\boldsymbol{(Ah_{j})(x^{(j)})}))\nabla_{\eta}\log q_{\eta}(x^{(k)})\!+\!\boldsymbol{(A\tilde{h}_{k}^{\star})(x^{(k)})}],

(RODEO)

where $\tilde{h}^{\star}_{k}(y)=h^{\star}_{k}(y)\nabla_{\eta}\log q_{\eta}(y)$ and $\{h_{k},h_{k}^{\star}\}_{k=1}^{K}$ are scalar-valued functions. Here, $(Ah_{j})(x^{(j)})$ is a scalar-valued CV introduced to reduce the variance of the leave-one-out baseline $\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)})$ , while $(A\tilde{h}_{k}^{\star})(x^{(k)})$ acts as a global CV to further reduce the variance of the gradient estimate. We adopt the aforementioned design of $h$ 13 and $\tilde{h}$ 11 for the two CVs. In Appendix B, we show that the RODEO estimator is unbiased for $\nabla_{\eta}\mathbb{E}_{q_{\eta}}[f(x)]$ when each $h_{k}$ is defined as in 13 and

\textstyle h_{k}^{\star}(y)=\frac{1}{K-1}\sum_{j\neq k}\!H^{\star}(f(x^{(j)}),\nabla f(x^{(j)})^{\top}(y-x^{(j)})).

(14)

Algorithm 1 Optimizing

\mathbb{E}_{q_{\eta}}[f_{\theta}(x)]

with RODEO gradients

input: Objective

f_{\theta}

, sample points

x^{(1:K)}\overset{i.i.d.}{\sim}q_{\eta}

, Stein operator

A

, step sizes

\alpha_{t},\beta_{t}

for

t=1:T

\{f_{\theta}(x^{(k)}),\nabla_{\theta}f_{\theta}(x^{(k)}),\nabla f_{\theta}(x^{(k)})\}_{k=1}^{K}\leftarrow\text{autodiff}(f_{\theta},x^{(1:K)})

2:Compute the surrogates

h_{k}(x^{(j)}),h^{\star}_{k}(x^{(j)})

of 13, 14 for

j\neq k

and

j,k=1,\cdots,K

3:Compute the RODEO gradient estimator

g_{\gamma}(x^{(1:K)})

\theta\leftarrow\theta+\alpha_{t}\frac{1}{K}\sum_{k=1}^{K}\nabla_{\theta}f(x^{(k)})

\eta\leftarrow\eta+\alpha_{t}g_{\gamma}(x^{(1:K)})

6:Update hyperparameters:

\gamma\leftarrow\gamma-\beta_{t}\nabla_{\gamma}\|g_{\gamma}(x^{(1:K)})\|_{2}^{2}

Optimization with RODEO In practice, we let the functions $H$ and $H^{\star}$ share a neural network architecture with two output units. Since the estimator is unbiased, we can optimize the parameters $\gamma$ of the network in an online fashion to minimize the variance of the estimator (similar to [20]). Specifically, if we denote the RODEO gradient estimate by $g_{\gamma}(x^{(1:K)})$ , then $\nabla_{\gamma}\mathrm{Tr}(\mathrm{Var}(g_{\gamma}(x^{(1:K)})))=\mathbb{E}[\nabla_{\gamma}\|g_{\gamma}(x^{(1:K)})\|_{2}^{2}].$ In Algorithm 1, we use an unbiased estimate of this hyperparameter gradient, $\nabla_{\gamma}\|g_{\gamma}(x^{(1:K)})\|_{2}^{2}$ , to update $\gamma$ after each optimization step of $\eta$ .

5 Related Work

As we have seen in Section 2, there is a long history of designing CVs for REINFORCE estimators using “baselines” [64, 7, 45, 37]. Recent progress is mostly driven by leave-one-out [49, 38, 30, 48] and sample-dependent baselines [43, 22, 60, 62, 20]. REBAR [62] constructs the baseline through continuous relaxation of the discrete distribution [26, 35] and applies reparameterization gradients to the correction term. As a result REBAR uses three evaluations of $f$ for each $x^{(k)}$ instead of the usual single evaluation (see Appendix A for a detailed explanation). The RELAX [20] estimator generalizes REBAR by noticing that their continuous relaxation can be replaced with a free-form CV. However, in order to get strong performance, RELAX still includes the continuous relaxation in their CV and only adds a small deviation to it. Therefore, RELAX also uses three evaluations of $f$ for each $x^{(k)}$ and is usually considered more expensive than other estimators.

An attractive property of the RODEO estimator is that it incorporates information from neighboring states thanks to the Stein operator while avoiding additional costly $f$ evaluations thanks to the learned surrogate functions $h_{k}$ . Estimators based on analytic local expectation [59, 61] and GO gradients [11] also use neighborhood information but only at the cost of many additional target function evaluations. In fact, the local expectation gradient [59] can be viewed as a Stein CV adjustment RODEO with a Gibbs Stein operator and the target function $f$ used directly instead of the surrogate $h$ . The downside of these approaches is that $f$ must be evaluated $Kd$ times per training step instead of $K$ times as in RODEO, a prohibitive cost when $f$ is expensive and $d\geq 200$ as in Section 6.

Prior work has also studied variance reduction methods based on sampling without replacement [31] and antithetic sampling [68, 14, 13, 15] for gradient estimation. DisARM [14, 69] was shown to outperform RLOO estimators when $K=2$ and ARMS [13] generalizes DisARM to $K>2$ .

Stein operators for continuous distributions have also been leveraged for effective variance reduction in a variety of learning tasks [2, 36, 41, 5, 54, 52] including gradient estimation [34, 27]. In particular, the gradient estimator of Liu et al. [34] is based on the Langevin Stein operator [19] for continuous distributions and coincides with the continuous counterpart of RELAX [20]. In contrast, our approach considers discrete Stein operators for Monte Carlo estimation in discrete distributions with exponentially large state spaces. Recently, Parmas and Sugiyama [44, App. E.4] used a probability flow perspective to characterize all unbiased gradient estimators satisfying a mild technical condition; our estimators fall into this broad class but were not specifically investigated.

Table 2: Training binary latent VAEs with

K=2,3

(except for RELAX which uses

3

evaluations) on MNIST, Fashion-MNIST, and Omniglot. We report the average ELBO (

\pm 1

standard error) on the training set after 1M steps over 5 independent runs. Test data bounds are reported in Table 4.

		Bernoulli Likelihoods			Gaussian Likelihoods
		MNIST	Fashion-MNIST	Omniglot	MNIST	Fashion-MNIST	Omniglot
$K=2$	DisARM	$-102.75\pm 0.08$	$-237.68\pm 0.13$	$-116.50\pm 0.04$	$668.03\pm 0.61$	$182.65\pm 0.47$	$446.61\pm 1.16$
	Double CV	$-102.14\pm 0.06$	$-237.55\pm 0.16$	$-116.39\pm 0.10$	$676.87\pm 1.18$	$186.35\pm 0.64$	$447.65\pm 0.87$
	RODEO (Ours)	$\bf-101.89\pm 0.17$	$\bf-237.44\pm 0.09$	$\bf-115.93\pm 0.06$	$\bf 681.95\pm 0.37$	$\bf 191.81\pm 0.67$	$\bf 454.74\pm 1.11$
$K=3$	ARMS	$-100.84\pm 0.14$	$-237.05\pm 0.12$	$-115.21\pm 0.07$	$683.55\pm 1.01$	$193.07\pm 0.34$	$457.98\pm 1.03$
	Double CV	$-100.94\pm 0.09$	$-237.40\pm 0.11$	$-115.06\pm 0.12$	$686.48\pm 0.68$	$193.93\pm 0.20$	$457.44\pm 0.79$
	RODEO (Ours)	$\bf-100.46\pm 0.13$	$\bf-236.88\pm 0.12$	$\bf-115.01\pm 0.05$	$\bf 692.37\pm 0.39$	$\bf 196.56\pm 0.42$	$461.87\pm 0.90$
RELAX (3 evals)		$-101.99\pm 0.04$	$-237.74\pm 0.12$	$-115.70\pm 0.08$	$688.58\pm 0.52$	$196.38\pm 0.66$	$\bf 462.23\pm 0.63$

6 Experiments

Python code replicating all experiments can be found at https://github.com/thjashin/rodeo.

6.1 Training Bernoulli VAEs

Following Dong et al. [14] and Titsias and Shi [60], we conduct experiments on training variational auto-encoders [29, 47] (VAEs) with Bernoulli latent variables. VAEs are models with a joint density $p_{\theta}(y,x)$ , where $x$ is the latent variable and $\theta$ denotes model parameters. They are typically learned through maximizing the evidence lower bound (ELBO) $\mathbb{E}_{q_{\eta}(x|y)}[f(x)]$ for an auxiliary inference network $q_{\eta}(x|y)$ and $f(x)\triangleq\log p_{\theta}(y,x)-\log q_{\eta}(x|y)$ . In our experiments, $p(x)$ and $q_{\eta}(x|y)$ are high-dimensional distributions where each dimension of the random variable is an independent Bernoulli. Since exact gradient computations are intractable, we will use gradient estimators to learn the parameters $\eta$ of the inference network. The VAE architecture and training experimental setup follows Titsias and Shi [60], and details are given in Appendix D. The dimensionality of the latent variable $x$ is $d=200$ . The functions $H$ 13 and $H^{*}$ 14 share a neural network architecture with two output units and a single hidden layer with 100 units. For the numerical stability and variance reasons discussed in Section 3, we use the Gibbs Stein operator 4 as a default choice in our experiments, but we revisit this choice in Section 6.3.

We consider the MNIST [33], Fashion-MNIST [66] and Omniglot [32] datasets using their standard train, validation, and test splits. We use both binary and continuous VAE outputs ( $y$ ) as in Titsias and Shi [60]. In the binary output setting, data are dynamically binarized, and the Bernoulli likelihood is used; in the continuous output setting, data are centered between $[-1,1]$ , and the Gaussian likelihood with learnable diagonal covariance parameters is used.

Refer to caption — Figure 1: Training binary latent VAEs with 2 or 3 $f$ evaluations per step on binarized MNIST.

$\boldsymbol{K=2}$ In the first set of experiments we focus on the most common setting of $K=2$ sample points and compare our variance reduction method to its counterparts including DisARM [14] and Double CV [60]. Results for RLOO are omitted since it is consistently outperformed by Double CV [see 60]. Table 2 shows, on all three datasets, RODEO achieves the best training ELBOs. In Figure 1 (left), we plot the gradient variance and average training ELBOs against training steps for all estimators on dynamically binarized MNIST. RODEO outperforms DisARM and Double CV by a large margin in gradient variance.

Next, we consider VAEs with Gaussian likelihoods trained on non-binarized datasets. In this case the gradient estimates suffer from even higher variance due to the large range of values $f(x)$ can take. The results are plotted in Figure 2. We see that RODEO has substantially lower variance than DisARM and Double CV, leading to significant improvements in training ELBOs. In Appendix Figure 7, we find that RODEO also consistently yields the best test set performance in all six settings.

$\boldsymbol{K=3}$ In the second set of experiments we compare RELAX [20], which uses three evaluations of $f$ per training step, with RODEO, Double CV, and ARMS [13] for $K=3$ . Figure 1 (right) and Table 2 demonstrate that RODEO outperforms the three previous methods and generally leads to the lowest variance. Although RELAX was often observed to have very strong performance in prior work [14, 60], our results in Figure 1 suggest that, for dynamically binarized datasets, much larger gains can be achieved by using the same number of function evaluations in other estimators.

For the experiments mentioned above, we report final training ELBOs in Table 2, test log-likelihood bounds in Appendix Table 4, and binarized MNIST average running time in Appendix Table 5. For $K=3$ , RODEO has the best final performance in 5 out of 6 tasks and runtime nearly identical to RELAX. For $K=2$ , RODEO has the best final performance for all tasks and is 1.5 times slower than Double CV and DisARM. We attribute the runtime gap to the Gibbs operator 4 which performs $2d$ evaluations of the auxiliary functions $H$ and $H^{*}$ in 13 and 14. While this results in some extra cost, the network parameterizing $H$ and $H^{*}$ takes only 2-dimensional inputs, produces scalar outputs, and is typically small relative to the VAE model. As a result, the evaluation of $H$ and $H^{*}$ is significantly cheaper than that of $f$ , and its relative contribution to runtime shrinks as the cost of $f$ grows. To demonstrate this, we include a wall clock time comparison of our method with RLOO in Section C.1. In this experiment we replace the two-layer MLP-based VAE with a ResNet VAE, where the cost of $f$ is significantly higher than the single-layer MLP of $H,H^{*}$ . In this case, RODEO and RLOO have very close per-iteration time (0.025s vs. 0.023s). And RODEO achieves better training ELBOs than RLOO for the same amount of time.

6.2 Training hierarchical Bernoulli VAEs

To investigate the performance of RODEO when scaling to hierarchical discrete latent variable models, we follow DisARM [14, 68] to train VAEs with 2/3/4 stochastic layers, each of which consists of $200$ Bernoulli variables. We set $K=2$ and compare our estimator with DisARM and Double CV on dynamically binarized MNIST, Fashion-MNIST, and Omniglot. For each stochastic layer, we use a different CV network which has the same architecture as those in our VAE experiments from the previous section. More details are presented in Appendix D.

We plot the results on VAEs with four stochastic layers on Fashion-MNIST in Figure 3. The results for other datasets and for 2 and 3 stochastic layers can be found in Appendix E. RODEO generally achieves the best training ELBOs. One difference we noticed when comparing the variance results with those obtained from single-stochastic-layer VAEs is that these estimators have very different behaviors towards the beginning and the end of training. For example, in Figure 3, Double CV starts with lower variance than DisARM, but the gap diminishes after around 100K steps, and DisARM start to perform better as the training proceeds. In contrast, RODEO has the lowest variance in both phases for all datasets save Omniglot, where DisARM overtakes it in the long run.

6.3 Ablation study

Finally, we conduct an ablation study to gain more insight into the impact of each component of RODEO. In each experiment we train binary latent VAEs with $K=2$ .

Impact of Stein operator Figure 4(a) explores the impact of RODEO Stein operator choice on non-binarized MNIST. As expected, the less stable difference 8 and MPF 6 operators lead to significantly higher gradient variances and worse training ELBOs. In fact, the same operators led to divergent training on binarized MNIST. The Barker operator 6 with neighborhoods defined by having one differing coordinate yields results very similar to Gibbs 4 as the operators themselves are very similar for binary variables (notice that $\frac{q(y)}{q(y)+q(x)}=q(y_{i}|x_{-i})$ when $x_{-i}=y_{-i}$ ).

Impact of LOO baseline and global and local CVs Figure 4(b) explores binarized MNIST performance when RODEO is modified to retain a) only the LOO baseline (this is equivalent to standard RLOO), b) only the global CV, c) the LOO baseline + the global CV, or d) the LOO baseline + the local CV. We observe that the global and local Stein CVs both contribute to variance reduction and have complementary effects. Moreover, remarkably, the global Stein CV alone outperforms RLOO.

Impact of surrogate functions To tease apart the benefits of our new surrogate functions 13 and the remaining RODEO components, we replace the surrogate function $c_{\phi}(z)$ in RELAX [20] with our surrogates, which only requires a single evaluation of $f$ per $x^{(k)}$ (see Appendix A for more details). Figure 4(c) compares the performance of RODEO and this modified RELAX on binarized MNIST. Since the surrogate functions are matched, the consistent improvements over modified RELAX can be attributed to the Stein and double CV components of RODEO. In Section C.2, we also experiment with increasing the complexity of $H$ and $H^{*}$ by using two hidden layers instead of a single hidden layer in the MLP. We did not observe significant improvements in variance reduction.

7 Conclusions, Limitations, and Future Work

This work tackles the gradient estimation problem for discrete distributions. We proposed a variance reduction technique which exploits Stein operators to generate control variates for REINFORCE leave-one-out estimators. Our RODEO estimator does not rely on continuous reparameterization of the distribution, requires no additional function evaluations per sample point, and can be adapted online to learn very flexible control variates parameterized by neural networks.

One potential drawback of our surrogate function constructions 13 and 14 is the need to evaluate an auxiliary function ( $H$ or $H^{*}$ ) at $K-1$ locations. This cost can be comfortably borne when $H$ and $H^{*}$ are much cheaper to evaluate than $f$ , such as in the ResNet VAE example in Section C.1 and many large VAE models used in practice [63]. And, in our experiments with the most common sample sizes, the runtime of RODEO was no worse than that of RELAX. To obtain a more favorable cost for large $K$ , one could employ alternative surrogates that require only a constant number of auxiliary function evaluations, e.g.,

\textstyle h_{k}(y)=H\big{(}\frac{1}{K-1}\sum_{j\neq k}f(x^{(j)}),\frac{1}{K-1}\sum_{j\neq k}\nabla f(x^{(j)})^{\top}(y-x^{(j)})\big{)}.

The runtime of RODEO could also be improved by varying the Stein operator employed. For example, the Gibbs operator $(Ah)(x)$ 4 used in our experiments evaluated its surrogate function $h$ at $d$ neighboring locations of $x$ . This evaluation complexity could be reduced by subsampling neighbors, resulting in a cheaper but still valid Stein operator, or by employing the numerically stable Barker operator 6 with fewer neighbors. Either strategy would introduce a speed-variance trade-off worthy of study in follow-up work. Finally, we have restricted our focus in this work to differentiable target functions $f$ . In future work, this limitation could be overcome by designing effective surrogate functions that make no use of derivative information.

Acknowledgments and Disclosure of Funding

We thank Heishiro Kanagawa for suggesting appropriate names for the Barker Stein operator.

References

Anastasiou et al. [2021] Andreas Anastasiou, Alessandro Barp, François-Xavier Briol, Bruno Ebner, Robert E Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, and Lester Mackey. Stein’s method meets statistics: A review of some recent developments. arXiv preprint arXiv:2105.03481, 2021.
Assaraf and Caffarel [1999] Roland Assaraf and Michel Caffarel. Zero-variance principle for Monte Carlo algorithms. Phys. Rev. Lett., 83:4682–4685, Dec 1999.
Barbour [1988] A. D. Barbour. Stein’s method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. ISSN 0021-9002. A celebration of applied probability.
Barker [1965] Av A Barker. Monte Carlo calculations of the radial distribution functions for a proton? electron plasma. Australian Journal of Physics, 18(2):119–134, 1965.
Barp et al. [2018] Alessandro Barp, Chris Oates, Emilio Porcu, and Mark Girolami. A Riemann-Stein kernel method. arXiv preprint arXiv:1810.04946, 2018.
Barp et al. [2019] Alessandro Barp, Francois-Xavier Briol, Andrew Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. Advances in Neural Information Processing Systems, 32:12964–12976, 2019.
Bengio et al. [2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Bresler and Nagaraj [2019] Guy Bresler and Dheeraj Nagaraj. Stein’s method for stationary distributions of Markov chains and application to Ising models. The Annals of Applied Probability, 29(5):3230–3265, 2019.
Brown and Xia [2001] Timothy C. Brown and Aihua Xia. Stein’s Method and Birth-Death Processes. The Annals of Probability, 29(3):1373 – 1403, 2001.
Burda et al. [2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations, 2015.
Cong et al. [2019] Yulai Cong, Miaoyun Zhao, Ke Bai, and Lawrence Carin. GO gradient for expectation-based objectives. In International Conference on Learning Representations, 2019.
Dellaportas and Kontoyiannis [2012] Petros Dellaportas and Ioannis Kontoyiannis. Control variates for estimation based on reversible Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):133–161, 2012.
Dimitriev and Zhou [2021] Aleksandar Dimitriev and Mingyuan Zhou. ARMS: Antithetic-REINFORCE-Multi-Sample gradient for binary variables. In International Conference on Machine Learning, volume 139, pages 2717–2727, 2021.
Dong et al. [2020] Zhe Dong, Andriy Mnih, and George Tucker. DisARM: An antithetic gradient estimator for binary latent variables. In Advances in Neural Information Processing Systems, volume 33, pages 18637–18647, 2020.
Dong et al. [2021] Zhe Dong, Andriy Mnih, and George Tucker. Coupled gradient estimators for discrete latent variables. arXiv preprint arXiv:2106.08056, 2021.
Eichelsbacher and Reinert [2008] Peter Eichelsbacher and Gesine Reinert. Stein’s method for discrete Gibbs measures. The Annals of Applied Probability, 18(4):1588–1618, 2008.
Geyer [2011] Charles J. Geyer. Introduction to Markov chain Monte Carlo. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, pages 3–48. CRC Press, 2011.
Glynn [1990] Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Commun. ACM, 33(10):75–84, October 1990.
Gorham and Mackey [2015] Jackson Gorham and Lester Mackey. Measuring sample quality with Stein’s method. In Advances in Neural Information Processing Systems, pages 226–234, 2015.
Grathwohl et al. [2018] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
Grathwohl et al. [2021] Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops I took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pages 3831–3841, 2021.
Gu et al. [2016] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Learning Representations, 2016.
Henderson [1997] Shane G Henderson. Variance reduction via an approximating Markov process. PhD thesis, Stanford University, 1997.
Hodgkinson et al. [2020] Liam Hodgkinson, Robert Salomone, and Fred Roosta. The reproducing stein kernel approach for post-hoc corrected sampling. arXiv preprint arXiv:2001.09266, 2020.
Holmes [2004] Susan Holmes. Stein’s method for birth and death chains. In Persi Diaconis and Susan Holmes, editors, Stein’s Method: Expository Lectures and Applications, pages 42–65. 2004.
Jang et al. [2017] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations, 2017.
Jankowiak and Obermeyer [2018] Martin Jankowiak and Fritz Obermeyer. Pathwise derivatives beyond the reparameterization trick. In International Conference on Machine Learning, pages 2235–2244. PMLR, 2018.
Karlin and McGregor [1957] Samuel Karlin and James McGregor. The classification of birth and death processes. Transactions of the American Mathematical Society, 86(2):366–400, 1957.
Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
Kool et al. [2019] W. Kool, H. V. Hoof, and M. Welling. Buy 4 REINFORCE samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019.
Kool et al. [2020] Wouter Kool, Herke van Hoof, and Max Welling. Estimating gradients for discrete random variables by sampling without replacement. In International Conference on Learning Representations, 2020.
Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Liu et al. [2017] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-depedent control variates for policy optimization via Stein’s identity. arXiv preprint arXiv:1710.11198, 2017.
Maddison et al. [2017] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
Mira et al. [2013] Antonietta Mira, Reza Solgi, and Daniele Imparato. Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662, 2013.
Mnih and Gregor [2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791–1799, 2014.
Mnih and Rezende [2016] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.
Mohamed et al. [2020] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020.
Niepert et al. [2021] Mathias Niepert, Pasquale Minervini, and Luca Franceschi. Implicit MLE: backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34:14567–14579, 2021.
Oates et al. [2017] Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.
Owen [2013] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
Paisley et al. [2012] John Paisley, David M Blei, and Michael I Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, pages 1363–1370, 2012.
Parmas and Sugiyama [2021] Paavo Parmas and Masashi Sugiyama. A unified view of likelihood ratio and reparameterization gradients. In International Conference on Artificial Intelligence and Statistics, pages 4078–4086. PMLR, 2021.
Ranganath et al. [2014] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, page 814–822, 2014.
Reinert and Ross [2019] Gesine Reinert and Nathan Ross. Approximating stationary distributions of fast mixing Glauber dynamics, with applications to exponential random graphs. The Annals of Applied Probability, 29(5):3201–3229, 2019.
Rezende et al. [2014] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
Richter et al. [2020] Lorenz Richter, Ayman Boustati, Nikolas Nüsken, Francisco Ruiz, and Omer Deniz Akyildiz. VarGrad: A low-variance gradient estimator for variational inference. In Advances in Neural Information Processing Systems, volume 33, pages 13481–13492, 2020.
Salimans and Knowles [2014] Tim Salimans and David A Knowles. On using control variates with stochastic approximation for variational Bayes and its connection to stochastic linear regression. arXiv preprint arXiv:1401.1022, 2014.
Schulman et al. [2015] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. Advances in Neural Information Processing Systems, 28, 2015.
Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Si et al. [2020] Shijing Si, Chris Oates, Andrew B Duncan, Lawrence Carin, and François-Xavier Briol. Scalable control variates for Monte Carlo methods via stochastic optimization. arXiv preprint arXiv:2006.07487, 2020.
Sohl-Dickstein et al. [2011] Jascha Sohl-Dickstein, Peter Battaglino, and Michael R DeWeese. Minimum probability flow learning. In International Conference on Machine Learning, pages 905–912, 2011.
South et al. [2018] Leah F South, Chris J Oates, Antonietta Mira, and Christopher Drovandi. Regularised zero-variance control variates for high-dimensional variance reduction. arXiv preprint arXiv:1811.05073, 2018.
Stein [1972] Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
Stein et al. [2004] Charles Stein, Persi Diaconis, Susan Holmes, and Gesine Reinert. Use of exchangeable pairs in the analysis of simulations. Lecture Notes-Monograph Series, pages 1–26, 2004.
Stirzaker [2005] David Stirzaker. Stochastic processes and models. OUP Oxford, 2005.
Titsias and Lázaro-Gredilla [2014] Michalis K. Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
Titsias and Lázaro-Gredilla [2015] Michalis K Titsias and Miguel Lázaro-Gredilla. Local expectation gradients for black box variational inference. Advances in neural information processing systems, 28:2638–2646, 2015.
Titsias and Shi [2022] Michalis K Titsias and Jiaxin Shi. Double control variates for gradient estimation in discrete latent variable models. International Conference on Artificial Intelligence and Statistics, 2022.
Tokui and Sato [2017] Seiya Tokui and Issei Sato. Evaluating the variance of likelihood-ratio gradient estimators. In International Conference on Machine Learning, pages 3414–3423, 2017.
Tucker et al. [2017] G. Tucker, A. Mnih, C. J. Maddison, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, 2017.
Vahdat and Kautz [2020] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33:19667–19679, 2020.
Weaver and Tao [2001] Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pages 538–545, 2001.
Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Yang et al. [2018] Jiasen Yang, Qiang Liu, Vinayak Rao, and Jennifer Neville. Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning, pages 5561–5570. PMLR, 2018.
Yin and Zhou [2019] Mingzhang Yin and Mingyuan Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations, 2019.
Yin et al. [2020] Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic best subset selection via gradient-based optimization, 2020.
Zanella [2020] Giacomo Zanella. Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, 115(530):852–865, 2020.

Appendix A Sample-dependent Baselines in REBAR and RELAX

We start with the REINFORCE estimator with the sample-dependent baseline $b_{k}$ :

\frac{1}{K}\sum_{k=1}^{K}(f(x^{(k)})-b_{k})\nabla_{\eta}\log q_{\eta}(x^{(k)})+\mathbb{E}[b_{k}\nabla_{\eta}\log q_{\eta}(x^{(k)})].

(15)

REBAR [62] introduces indirect independence on $x^{(k)}$ in $b_{k}$ through the continuous reparameterization $x=H(z),z^{(k)}\sim q_{\eta}(z|x=x^{(k)})$ , where $z$ is a continuous variable and $H$ is an argmax-like thresholding function. Specifically, $b_{k}=f(\sigma_{\lambda}(z^{(k)}))$ , where $\sigma_{\lambda}$ is a continuous relaxation of $H$ controlled by the parameter $\lambda$ . The correction term decomposes into two parts:

		$\displaystyle\mathbb{E}_{x^{(k)}}[\mathbb{E}_{z^{(k)}\|x^{(k)}}[f(\sigma_{\lambda}(z^{(k)}))]\nabla_{\eta}\log q_{\eta}(x^{(k)})]$
	$\displaystyle=$	$\displaystyle\;\nabla_{\eta}\mathbb{E}_{q_{\eta}(z)}[f(\sigma_{\lambda}(z))]-\mathbb{E}_{x^{(k)}}[\nabla_{\eta}\mathbb{E}_{q_{\eta}(z^{(k)}\|x^{(k)})}[f(\sigma_{\lambda}(z^{(k)}))]].$

Both parts can be estimated with the reparameterization trick [29, 47, 58] which often has low variance. The RELAX [20] estimator generalizes REBAR by noticing that $f(\sigma_{\lambda}(z))$ can be replaced with a free-form differentiable function $c_{\phi}(z)$ . However, RELAX still relies on parameterizing $c_{\phi}(z)$ as $f(\sigma_{\lambda}(z))+r_{\theta}(z)$ to achieve strong performance, as noted in Dong et al. [14].

To form modified RELAX in Section 6.3, we replace $b_{k}=c_{\phi}(z^{(k)})$ with $b_{k}=h_{k}(\sigma_{\lambda}(z^{(k)}))$ for $h_{k}$ defined in 13.

Appendix B Proof of Unbiasedness of RODEO

Recall our estimator defined in RODEO is

\frac{1}{K}\sum_{k=1}^{K}[(f(x^{(k)})-\frac{1}{K-1}\sum_{j\neq k}(f(x^{(j)})+(Ah_{j})(x^{(j)})))\cdot\nabla_{\eta}\log q_{\eta}(x^{(k)})+(A\tilde{h}_{k}^{\star})(x^{(k)})].

(16)

To show the unbiasedness, we compute its expectation under $q_{\eta}$ as

		$\displaystyle\phantom{{}={}}\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{q_{\eta}}[f(x^{(k)})\nabla_{\eta}\log q_{\eta}(x^{(k)})]$
		$\displaystyle-\frac{1}{K(K-1)}\sum_{k=1}^{K}\sum_{j\neq k}\mathbb{E}_{q_{\eta}}[(f(x^{(j)})+(Ah_{j})(x^{(j)}))\nabla_{\eta}\log q_{\eta}(x^{(k)})]$
		$\displaystyle+\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{q_{\eta}}[(A\tilde{h}_{k}^{\star})(x^{(k)})].$

Since the first term is the desired gradient $\nabla_{\eta}\mathbb{E}_{q_{\eta}}[f(x)]$ and the third term is zero, it suffices to show that the second term also vanishes. Using the law of total expectations, we find for $j\neq k$ ,

		$\displaystyle\mathbb{E}_{q_{\eta}}[(f(x^{(j)})+(Ah_{j})(x^{(j)}))\nabla_{\eta}\log q_{\eta}(x^{(k)})]$
		$\displaystyle=\mathbb{E}_{x^{(k)}\sim q_{\eta}}[\mathbb{E}_{q_{\eta}}[f(x^{(j)})+(Ah_{j})(x^{(j)})\mid x^{(k)}]\nabla_{\eta}\log q_{\eta}(x^{(k)})]$
		$\displaystyle=\mathbb{E}_{x^{(k)}\sim q_{\eta}}[\mathbb{E}_{q_{\eta}}[f(x^{(j)})\mid x^{(k)}]\nabla_{\eta}\log q_{\eta}(x^{(k)})]$
		$\displaystyle=\mathbb{E}_{q_{\eta}}[f(x^{(j)})\nabla_{\eta}\log q_{\eta}(x^{(k)})]$
		$\displaystyle=\mathbb{E}_{x^{(j)}\sim q_{\eta}}[f(x^{(j)})\mathbb{E}_{x^{(k)}\sim q_{\eta}}[\nabla_{\eta}\log q_{\eta}(x^{(k)})\mid x^{(j)}]]=0,$

which completes the proof.

Appendix C Additional Experiments

C.1 Wall clock time comparison with RLOO

Besides necessary target function evaluations, the RODEO estimator comes with the additional cost of evaluating the neural network-based $H,H^{*}$ . Therefore, RODEO is most suited to the problems where the cost of evaluating $f$ dominates that of evaluating $H,H^{*}$ . This is often the case in practice. For example, state-of-the-art variational autoencoders [e.g., 63] are often built on expensive neural architectures such as deep residual networks (ResNets). Here, to demonstrate the practical advantage of our method as the complexity of $f$ grows, we replace the two-layer MLP VAEs used in previous experiments with a ResNet VAE (architecture shown in Table 3), while the neural network used by $H,H^{*}$ remains a single-layer MLP with 100 hidden units. We then compare the wall clock performance of RODEO with RLOO. The latent variables in this experiment remain binary and have 200 dimensions.

The results are shown in Figure 5. RODEO achieves better training ELBOs than RLOO in the same amount of time. In fact, for this VAE architecture, the per-iteration time of RODEO is 25.2ms, which is very close to the 23.1ms of RLOO. This indicates that the cost of $f$ is significantly higher than that of $H,H^{*}$ .

Table 3: Architecture of the ResNet VAE in Section C.1. 3x3x

C

means kernel size 3x3 and

C

output channels. Each (De)conv Res block is composed of two (de)convolutional layers with strides 1, same padding, and ReLU activations, plus a skip connection with identity map. For Res blocks with downsample and upsample functions, the first convolutional layer has strides 2, and the skip connection is replaced by a convolutional layer with 2x2 kernel size and strides 2.

Encoder	Decoder
Conv 3x3x16, strides 1, padding 1	Fully connected, 7x7x64 units
Conv Res block 3x3x16	Deconv Res block 3x3x64
Conv Res block 3x3x16	Deconv Res block 3x3x64
Conv Res block 3x3x32 (downsample by 2)	Deconv Res block 3x3x32 (upsample by 2)
Conv Res block 3x3x32	Deconv Res block 3x3x32
Conv Res block 3x3x64 (downsample by 2)	Deconv Res block 3x3x16 (upsample by 2)
Conv Res block 3x3x64	Deconv Res block 3x3x16
Fully connected, 200 units	Deconv 3x3x1, strides 1, padding 1

C.2 Impact of neural network architectures of surrogate functions

We conduct one more ablation study to investigate the impact of neural network architectures used by $H,H^{*}$ . Specifically, we replace the single-hidden-layer control variate network used in previous experiments with a two-hidden-layer MLP (each layer has 100 units) and compare their performance on binarized MNIST with $K=2$ . We keep other settings the same as in Section 6.1. The results are plotted in Figure 6. We do not observe significant difference between the two versions of RODEO.

Appendix D Experimental Details

Our implementation is based on the open-source code of DisARM [14] (Apache license) and Double CV [60] (MIT license). Our figures display 1M training steps and our tables report performance after 1M training steps to replicate the experimental settings of DisARM [14].

D.1 Details of VAE experiments

VAEs are models with a joint density $p(y,x)=p(y|x)p(x)$ , where $x$ denotes the latent variable. $x$ is assigned a uniform factorized Bernoulli prior. The likelihood $p_{\theta}(y|x)$ is parameterized by the output of a neural network with $x$ as input and parameters $\theta$ . The VAE has two hidden layers with $200$ units activated by LeakyReLU with the coefficient $0.3$ . To optimize the VAE we use Adam with base learning rate $10^{-4}$ for non-binarized data and $10^{-3}$ for dynamically binarized data, except for binarized Fashion-MNIST we decreased the learning rate to $3\times 10^{-4}$ because otherwise the training is very unstable for all estimators. We use Adam with the same learning rate $10^{-3}$ for adapting our control variate network in all experiments. The batch size is $100$ . The settings of other estimators are kept the same with Titsias and Shi [60].

In the minimum probability flow (MPF) and Barker Stein operator 6, we choose the neighborhood $\mathcal{N}_{x}$ to be the states that differ in only one coordinate from $x$ . Let $y\in\mathcal{N}_{x}$ be an element in this neighborhood such that $y_{i}\neq x_{i}$ and $y_{-i}=x_{-i}$ . For the MPF Stein estimator and the difference Stein estimator 8, the density ratio $\frac{q(y)}{q(x)}$ can be simplified to $\frac{q(y_{i}|x_{-i})}{q(x_{i}|x_{-i})}$ . We further replace it with $\frac{q(y_{i}|x_{-i})}{q(x_{i}|x_{-i})+10^{-3}}$ to alleviate numerical instability. The Barker Stein estimator does not suffer from the numerical issue since the coefficient is bounded in the Bernoulli case: $\frac{q(y)}{q(x)+q(y)}=\frac{q(y_{i}|x_{-i})}{q(x_{i}|x_{-i})+q(y_{i}|x_{-i})}=q(y_{i}|x_{-i})$ . In our experiments, we find that the difference Stein estimator is highly unstable and may diverge as the iteration proceeds.

D.2 Details of hierarchical VAE experiments

We optimize the hierarchical VAE using Adam with base learning rate $10^{-4}$ . Our control variate network is optimized using Adam with learning rate $10^{-3}$ . Settings of training multilayer VAEs are kept the same with Dong et al. [14], except that we do not optimize the prior distribution of the VAE hidden layer and use a larger batch size $100$ .

Appendix E Additional Results

In this section, we measure test set performance using 100 test points and the marginal log-likelihood bound of Burda et al. [10], which provides a tighter estimate of marginal log likelihood than the ELBO. Throughout, we call this the “test log-likelihood bound.”

Table 4: Average 100-point test log-likelihood bounds of binary latent VAEs trained with

K=2,3

(except for RELAX which uses

3

evaluations per step) on MNIST, Fashion-MNIST, and Omniglot. We report the average value

\pm 1

standard error after 1M steps over 5 independent runs.

		Bernoulli Likelihoods			Gaussian Likelihoods
		MNIST	Fashion-MNIST	Omniglot	MNIST	Fashion-MNIST	Omniglot
$K=2$	DisARM	$-101.61\pm 0.07$	$-239.11\pm 0.11$	$-118.34\pm 0.05$	$669.26\pm 0.53$	$163.40\pm 0.59$	$305.32\pm 0.89$
	Double CV	$-100.91\pm 0.04$	$-239.00\pm 0.17$	$-118.45\pm 0.09$	$677.02\pm 0.93$	$164.99\pm 0.71$	$304.72\pm 1.39$
	RODEO (Ours)	$\bf-100.78\pm 0.16$	$\bf-238.97\pm 0.09$	$\bf-118.09\pm 0.05$	$\bf 681.11\pm 0.31$	$\bf 168.26\pm 0.73$	$\bf 308.55\pm 1.02$
$K=3$	ARMS	$-99.08\pm 0.12$	$-238.19\pm 0.11$	$-116.78\pm 0.13$	$688.61\pm 0.84$	$174.14\pm 0.44$	$320.45\pm 1.07$
	Double CV	$-99.16\pm 0.12$	$-238.54\pm 0.16$	$-116.75\pm 0.15$	$690.28\pm 0.49$	$173.67\pm 0.30$	$322.88\pm 1.10$
	RODEO (Ours)	$\bf-98.72\pm 0.14$	$\bf-237.97\pm 0.12$	$\bf-116.69\pm 0.09$	$\bf 695.11\pm 0.33$	$\bf 174.57\pm 0.30$	$\bf 323.92\pm 1.24$
RELAX (3 evals)		$-100.80\pm 0.09$	$-239.03\pm 0.11$	$-117.60\pm 0.06$	$686.21\pm 0.57$	$171.43\pm 0.61$	$317.78\pm 1.25$

Table 5: Average running time across

10^{4}

steps on an NVIDIA 3080Ti GPU with an AMD 5950X CPU for the VAE experiment on binary MNIST in Section 6.1.

	Double CV	DisARM/ARMS	RODEO (Ours)	RELAX (3 evals)
$K=2$	2.11 ms/step	1.89 ms/step	3.08 ms/step	4.71 ms/step
$K=3$	2.28 ms/step	1.91 ms/step	4.72 ms/step	4.71 ms/step

Table 6: Average running time across

10^{4}

steps on an NVIDIA 3080Ti GPU with an AMD 5950X CPU when training hierarchical VAEs with

K=2

	Double CV	DisARM	RODEO (Ours)
Two layers	4.33 ms/step	3.54 ms/step	6.79 ms/step
Three layers	7.69 ms/step	6.09 ms/step	10.61 ms/step
Four layers	11.67 ms/step	9.53 ms/step	14.91 ms/step

Table 7: Training hierarchical binary latent VAEs on dynamically binarized MNIST, Fashion-MNIST, and Omniglot. We report the average (

\pm 1

standard error) training ELBOs and 100-point test log-likelihood bounds after 1M steps over 5 independent runs.

		Training ELBO			Test Log-Likelihood Bound
		MNIST	Fashion-MNIST	Omniglot	MNIST	Fashion-MNIST	Omniglot
Two layers	Double CV	$-103.52\pm 0.06$	$-239.82\pm 0.07$	$-114.06\pm 0.04$	$-97.62\pm 0.08$	$-237.65\pm 0.06$	$-110.48\pm 0.04$
	DisARM	$-103.39\pm 0.12$	$\mathbf{-239.67\pm 0.06}$	$\mathbf{-113.67\pm 0.05}$	$-97.56\pm 0.07$	$\mathbf{-237.61\pm 0.06}$	$\mathbf{-110.15\pm 0.04}$
	RODEO (Ours)	$\mathbf{-103.15\pm 0.07}$	$-239.76\pm 0.09$	$-113.84\pm 0.11$	$\mathbf{-97.43\pm 0.03}$	$-237.63\pm 0.07$	$-110.32\pm 0.10$
Three layers	Double CV	$-97.59\pm 0.15$	$-234.34\pm 0.07$	$-108.66\pm 0.06$	$-93.71\pm 0.12$	$-234.34\pm 0.07$	$-107.48\pm 0.07$
	DisARM	$-97.95\pm 0.30$	$-234.45\pm 0.05$	$-108.60\pm 0.08$	$-94.12\pm 0.28$	$-234.46\pm 0.06$	$-107.32\pm 0.10$
	RODEO (Ours)	$\mathbf{-97.21\pm 0.17}$	$\mathbf{-234.11\pm 0.10}$	$\mathbf{-108.51\pm 0.04}$	$\mathbf{-93.52\pm 0.16}$	$\mathbf{-234.19\pm 0.07}$	$\mathbf{-107.26\pm 0.06}$
Four layers	Double CV	$-98.73\pm 0.06$	$-235.69\pm 0.07$	$-110.92\pm 0.06$	$-93.28\pm 0.03$	$-234.63\pm 0.03$	$-107.86\pm 0.03$
	DisARM	$-98.97\pm 0.02$	$-235.50\pm 0.04$	$-110.85\pm 0.07$	$-93.56\pm 0.04$	$-234.52\pm 0.04$	$-107.87\pm 0.05$
	RODEO (Ours)	$\mathbf{-98.67\pm 0.14}$	$\mathbf{-235.39\pm 0.05}$	$\mathbf{-110.79\pm 0.03}$	$\mathbf{-93.27\pm 0.09}$	$\mathbf{-234.39\pm 0.06}$	$\mathbf{-107.77\pm 0.02}$

Binarized
Non-binarized

Binarized
Non-binarized