
Linda S. L. Tan
Department of Statistics and Applied Probability, National University of Singapore
Tel.: +65-6516-8473, Fax: +65-6872-3939
E-mail: statsll@nus.edu.sg

Aishwarya Bhaskaran
Department of Statistics and Applied Probability, National University of Singapore
Tel.: +65-6601-6229, Fax: +65-6872-3939
E-mail: staai@nus.edu.sg

David J. Nott
Department of Statistics and Applied Probability, National University of Singapore
Tel.: +65-6516-2744, Fax: +65-6872-3939
E-mail: standj@nus.edu.sg

Conditionally structured variational Gaussian approximation with importance weights
(Linda Tan and Aishwarya Bhaskaran are supported by the start-up grant R-155-000-190-133.)

Linda S. L. Tan    Aishwarya Bhaskaran    David J. Nott
Abstract

We develop flexible methods of deriving variational inference for models with complex latent variable structure. By splitting the variables in these models into “global” parameters and “local” latent variables, we define a class of variational approximations that exploit this partitioning and go beyond Gaussian variational approximation. This approximation is motivated by the fact that in many hierarchical models, there are global variance parameters which determine the scale of local latent variables in their posterior conditional on the global parameters. We also consider parsimonious parametrizations by using conditional independence structure, and improved estimation of the log marginal likelihood and variational density using importance weights. These methods are shown to improve significantly on Gaussian variational approximation methods for a similar computational cost. Application of the methodology is illustrated using generalized linear mixed models and state space models.

Keywords: Gaussian variational approximation · Sparse precision matrix · Stochastic variational inference · Importance weighted lower bound · Rényi’s divergence

1 Introduction

In many modern statistical applications, it is necessary to model complex dependent data. In these situations, models which employ observation specific latent variables such as random effects and state space models are widely used because of their flexibility, and Bayesian approaches dealing naturally with the hierarchical structure are attractive in principle. However, incorporating observation specific latent variables leads to a parameter dimension increasing with sample size, and standard Bayesian computational methods can be challenging to implement in very high-dimensional settings. For this reason, approximate inference methods are attractive for these models, both in exploratory settings where many models need to be fitted quickly, as well as in applications involving large datasets where exact methods are infeasible. One of the most common approximate inference paradigms is variational inference (Ormerod and Wand, 2010; Blei et al., 2017), which is the approach considered here.

Our main contribution is to consider partitioning the unknowns in a local latent variable model into “global” parameters and “local” latent variables, and to suggest ways of structuring the dependence in a variational approximation that match the specification of these models. We go beyond standard Gaussian approximations by defining the variational approximation sequentially, through a marginal density for the global parameters and a conditional density for local parameters given global parameters. Each term in our approximation is Gaussian, but we allow the conditional covariance matrix for the local parameters to depend on the global parameters, which leads to an approximation that is not jointly Gaussian. We are particularly interested in improved inference on global variance and dependence parameters which determine the scale and dependence structure of local latent variables. With this objective, we suggest a parametrization of our conditional approximation to the local variables that is well-motivated and respects the exact conditional independence structure in the true posterior distribution. Our approximations are parsimonious in terms of the number of required variational parameters, which is important since a high-dimensional variational optimization is computationally burdensome. The methods suggested improve on Gaussian variational approximation methods for a similar computational cost. Besides defining a novel and useful variational family appropriate to local latent variable models, we also employ importance weighted variational inference methods (Burda et al., 2016; Domke and Sheldon, 2018) to further improve the quality of inference, and elaborate further on the connections between this approach and the use of Rényi’s divergence within the variational optimization (Li and Turner, 2016; Regli and Silva, 2018; Yang et al., 2019).

Our method is a contribution to the literature on the development of flexible variational families, and there are many interesting existing methods for this task. One fruitful approach is based on normalizing flows (Rezende and Mohamed, 2015), where a variational family is defined using an invertible transformation of a random vector with some known density function. To be useful, the transformation should have an easily computed Jacobian determinant. In the original work of Rezende and Mohamed (2015), compositions of simple flows called radial and planar flows were considered. Later authors have suggested alternatives, such as autoregressive flows (Germain et al., 2015), inverse autoregressive flows (Kingma et al., 2016), and real-valued non-volume preserving transformations (Dinh et al., 2017), among others. Spantini et al. (2018) gives a theoretical framework connecting Markov properties of a target posterior distribution to representations involving transport maps, with normalizing flows being one way to parametrize such mappings. The variational family we consider here can be thought of as a simple autoregressive flow, but carefully constructed to preserve the conditional independence structure in the true posterior and to achieve parsimony in the representation of dependence between local latent variables and global scale parameters. Our work is also related to the hierarchically structured approximations considered in Salimans and Knowles (2013, Section 7.1); these authors also consider other flexible approximations based on mixture models, and a variety of innovative numerical approaches to the variational optimization. Hoffman and Blei (2015) propose an approach called structured stochastic variational inference which is applicable in conditionally conjugate models. Their approach is similar to ours, in the sense that local variables depend on global variables in the variational posterior. However conditional conjugacy does not hold in the examples we consider.

The methods we describe can be thought of as extending the Gaussian variational approximation (GVA) of Tan and Nott (2018), where parametrization of the variational covariance matrix was considered in terms of a sparse Cholesky factor of the precision matrix. Similar approximations have been considered for state space models in Archer et al. (2016). The sparse structure reduces the number of free variational parameters, and allows matching the exact conditional independence structure in the true posterior. Tan (2018) propose an approach called reparametrized variational Bayes, where the model is reparametrized by applying an invertible affine transformation to the local variables to minimize their posterior dependency on global variables, before applying a mean field approximation. The affine transformation is obtained by considering a second order Taylor series approximation to the posterior of the local variables conditional on the global variables. One way of improving on Gaussian approximations is to consider mixtures of Gaussians (Jaakkola and Jordan, 1998; Salimans and Knowles, 2013; Miller et al., 2016; Guo et al., 2016). However, even with a parsimonious parametrization of component densities, a large number of additional variational parameters are added with each mixture component. Other flexible variational families can be formed using copulas (Tran et al., 2015; Han et al., 2016; Smith et al., 2019), hierarchical variational models (Ranganath et al., 2016) or implicit approaches (Huszár, 2017).

We specify the model and notation in Section 2 and introduce the conditionally structured Gaussian variational approximation (CSGVA) in Section 3. The algorithm for optimizing the variational parameters is described in Section 4 and Section 5 highlights the association between GVA and CSGVA. Section 6 describes how CSGVA can be improved using importance weighting. Experimental results and applications to generalized linear mixed models (GLMMs) and state space models are presented in Sections 7, 8 and 9 respectively. Section 10 gives some concluding discussion.

2 Model specification and notation

Let $y=(y_1,\dots,y_n)^T$ be observations from a model with global variables $\theta_G$ and local variables $\theta_L=(b_1,\dots,b_n)^T$, where $b_i$ contains latent variables specific to $y_i$ for $i=1,\dots,n$. Suppose $\theta_G$ is a vector of length $G$ and each $b_i$ is a vector of length $L$. Let $\theta=(\theta_G^T,\theta_L^T)^T$. We consider models where the joint density is of the form

\[
p(y,\theta) = p(\theta_G)\, p(b_1,\dots,b_\ell \mid \theta_G)\bigg\{\prod_{i=1}^{n} p(y_i \mid b_i,\theta_G)\bigg\}\bigg\{\prod_{i>\ell} p(b_i \mid b_{i-1},\dots,b_{i-\ell},\theta_G)\bigg\}.
\]

The observations $\{y_i\}$ are conditionally independent given $\{b_i\}$ and $\theta_G$. Conditional on $\theta_G$, the local variables $\{b_i\}$ form an $\ell$th order Markov chain if $\ell \geq 1$, and they are conditionally independent if $\ell = 0$. This class of models includes important models such as GLMMs and state space models. Next, we define some mathematical notation before discussing CSGVA for this class of models.

2.1 Notation

For an $r\times r$ matrix $A$, let $\text{diag}(A)$ denote the diagonal elements of $A$ and $\text{dg}(A)$ be the diagonal matrix obtained by setting non-diagonal elements in $A$ to zero. Let $\text{vec}(A)$ be the vector of length $r^2$ obtained by stacking the columns of $A$ under each other from left to right and $v(A)$ be the vector of length $r(r+1)/2$ obtained from $\text{vec}(A)$ by eliminating all superdiagonal elements of $A$. Let $E_r$ be the $r(r+1)/2 \times r^2$ elimination matrix, $K_r$ be the $r^2\times r^2$ commutation matrix and $D_r$ be the $r^2\times r(r+1)/2$ duplication matrix (see Magnus and Neudecker, 1980). Then $K_r\text{vec}(A)=\text{vec}(A^T)$, $E_r\text{vec}(A)=v(A)$, $E_r^T v(A)=\text{vec}(A)$ if $A$ is lower triangular, and $D_r v(A)=\text{vec}(A)$ if $A$ is symmetric. Let ${\bf 1}_r$ be a vector of ones of length $r$. The Kronecker product between any two matrices is denoted by $\otimes$. Scalar functions applied to vector arguments are evaluated element by element. Let $\text{d}$ denote the differential operator (see, e.g., Magnus and Neudecker, 1999).
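
To make the $v(\cdot)$ operator and the elimination matrix concrete, the following Julia sketch (our illustration, not code from the paper; the function names are our own) builds $E_r$ explicitly and checks the identities $E_r\text{vec}(A) = v(A)$ and $E_r^T v(A) = \text{vec}(A)$ for a lower triangular $A$.

\begin{verbatim}
using LinearAlgebra

# v(A): stack the lower-triangular part of A column by column
vech(A) = [A[i, j] for j in 1:size(A, 1) for i in j:size(A, 1)]

# Elimination matrix E_r satisfying E_r * vec(A) == vech(A)
function elimination_matrix(r)
    E = zeros(Int, r * (r + 1) ÷ 2, r^2)
    row = 0
    for j in 1:r, i in j:r              # same ordering as vech
        row += 1
        E[row, (j - 1) * r + i] = 1     # position of A[i, j] in vec(A)
    end
    return E
end

A = tril(randn(3, 3))                   # lower-triangular test matrix
@assert elimination_matrix(3) * vec(A) == vech(A)
@assert elimination_matrix(3)' * vech(A) == vec(A)  # holds since A is lower triangular
\end{verbatim}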

3 Conditionally structured Gaussian variational approximation

We propose to approximate the posterior distribution $p(\theta|y)$ of the model defined in Section 2 by a density of the form

\[
q(\theta) = q(\theta_G)\, q(\theta_L|\theta_G),
\]

where $q(\theta_G) = N(\mu_1, \Omega_1^{-1})$, $q(\theta_L|\theta_G) = N(\mu_2, \Omega_2^{-1})$, and $\Omega_1$ and $\Omega_2$ are the precision (inverse covariance) matrices of $q(\theta_G)$ and $q(\theta_L|\theta_G)$ respectively. Here $\mu_2$ and $\Omega_2$ depend on $\theta_G$, but we do not denote this explicitly for notational conciseness. Let $C_1 C_1^T$ and $C_2 C_2^T$ be unique Cholesky factorizations of $\Omega_1$ and $\Omega_2$ respectively, where $C_1$ and $C_2$ are lower triangular matrices with positive diagonal entries. We further define $C_1^*$ and $C_2^*$ to be lower triangular matrices of order $G$ and $nL$ respectively such that $C_{r,ii}^* = \log(C_{r,ii})$ and $C_{r,ij}^* = C_{r,ij}$ if $i \neq j$, for $r = 1, 2$. The purpose of introducing $C_1^*$ and $C_2^*$ is to allow unconstrained optimization of the variational parameters in the stochastic gradient ascent algorithm, since the diagonal entries of $C_1$ and $C_2$ are constrained to be positive. Note that $C_2$ and $C_2^*$ also depend on $\theta_G$, but again we do not show this explicitly in our notation.

As $\Omega_2$ depends on $\theta_G$, the joint distribution $q(\theta)$ is generally non-Gaussian even though $q(\theta_G)$ and $q(\theta_L|\theta_G)$ are individually Gaussian. Here we consider a first order approximation and assume that $\mu_2$ and $v(C_2^*)$ are linear functions of $\theta_G$:

\[
\mu_2 = d + C_2^{-T}D(\mu_1 - \theta_G), \quad v(C_2^*) = f + F\theta_G. \qquad (1)
\]

In (1), $d$ is a vector of length $nL$, $D$ is an $nL \times G$ matrix, $f$ is a vector of length $nL(nL+1)/2$ and $F$ is an $nL(nL+1)/2 \times G$ matrix. For this specification, $q(\theta)$ is not jointly Gaussian due to the dependence of the covariance matrix of $q(\theta_L|\theta_G)$ on $\theta_G$. It is Gaussian if and only if $F \equiv 0$. The set of variational parameters to be optimized is denoted as

\[
\lambda = (\mu_1^T,\, v(C_1^*)^T,\, d^T,\, \text{vec}(D)^T,\, f^T,\, \text{vec}(F)^T)^T.
\]

As motivation for the linear approximation in (1), consider the linear mixed model,

\[
y_i = X_i\beta + Z_i b_i + \epsilon_i, \quad (i = 1,\dots,n), \qquad
\beta \sim N(0, \sigma_\beta^2 I_p), \quad b_i \sim N(0, \Lambda), \quad \epsilon_i \sim N(0, \sigma_\epsilon^2 I_{n_i}),
\]

where $y_i$ is a vector of responses of length $n_i$ for the $i$th subject, $X_i$ and $Z_i$ are covariate matrices of dimensions $n_i \times p$ and $n_i \times L$ respectively, $\beta$ is a vector of coefficients of length $p$ and $\{b_i\}$ are random effects. Assume $\sigma_\epsilon^2$ is known. Then the global parameters $\theta_G$ consist of $\beta$ and $\Lambda$. The posterior of $\theta_L$ conditional on $\theta_G$ is

\[
p(\theta_L|y,\theta_G) \propto \prod_{i=1}^n p(y_i|\beta,b_i)\,p(b_i|\Lambda)
\propto \prod_{i=1}^n \exp[-\{b_i^T(Z_i^T Z_i/\sigma_\epsilon^2 + \Lambda^{-1})b_i - 2b_i^T Z_i^T(y_i - X_i\beta)/\sigma_\epsilon^2\}/2].
\]

Thus $p(\theta_L|y,\theta_G) = \prod_{i=1}^n p(b_i|y,\theta_G)$, where $p(b_i|y,\theta_G)$ is a normal density with precision matrix $Z_i^T Z_i/\sigma_\epsilon^2 + \Lambda^{-1}$ and mean $(Z_i^T Z_i/\sigma_\epsilon^2 + \Lambda^{-1})^{-1} Z_i^T(y_i - X_i\beta)/\sigma_\epsilon^2$. The precision matrix depends on $\Lambda^{-1}$ linearly and the mean depends on $\beta$ linearly after scaling by the covariance matrix. The linear approximation in (1) tries to mimic this dependence relationship.

The proposed variational density is conditionally structured and highly flexible. Such dependence structure is particularly valuable in constructing variational approximations for hierarchical models, where there are global scale parameters in $\theta_G$ which help to determine the scale of local latent variables in the conditional posterior of $\theta_L|\theta_G$. While marginal posteriors of the global variables are often well approximated by Gaussian densities, marginal posteriors of the local variables tend to exhibit more skewness and kurtosis. This deviation from normality can be captured by $q(\theta_L) = \int q(\theta_G)\, q(\theta_L|\theta_G)\, d\theta_G$, which is a mixture of normal densities. The formulation in (1) also allows for a reduction in the number of variational parameters if conditional independence structure consistent with that in the true posterior is imposed on the variational approximation.

3.1 Using conditional independence structure

Tan and Nott (2018) incorporate the conditional independence structure of the true posterior into Gaussian variational approximations by using the fact that zeros in the precision matrix correspond to conditional independence for Gaussian random vectors. This incorporation achieves sparsity in the precision matrix of the approximation and leads to a large reduction in the number of variational parameters to be optimized. For high-dimensional $\theta$, this sparse structure is especially important because a full Gaussian approximation involves learning a covariance matrix where the number of elements grows quadratically with the dimension of $\theta$.

Recall that $\theta_L = (b_1,\dots,b_n)^T$. Suppose $b_i$ is conditionally independent of $b_j$ in the posterior for $|i-j| > \ell$, given $\theta_G$ and $\{b_j \mid |i-j| \leq \ell\}$. For instance, in a GLMM, $\{b_i\}$ may be subject specific random effects, and these are conditionally independent given the global parameters, so this structure holds with $\ell = 0$. In the case of a state space model for a time series, $\{b_i\}$ are the latent states and this structure holds with $\ell = 1$. Note that ordering of the latent variables is important here.

Now partition the precision matrix $\Omega_2$ of $q(\theta_L|\theta_G)$ into $L\times L$ blocks with $n$ row and $n$ column partitions corresponding to $\theta_L = (b_1,\dots,b_n)^T$. Let $\Omega_{2,ij}$ be the block corresponding to $b_i$ horizontally and $b_j$ vertically for $i,j = 1,\dots,n$. If $b_i$ is conditionally independent of $b_j$ for $|i-j| > \ell$, given $\theta_G$ and $\{b_j \mid |i-j|\leq \ell\}$, then we set $\Omega_{2,ij} = 0$ for all pairs $(i,j)$ with $|i-j| > \ell$. Let $\mathcal{I}$ denote the indices of elements in $v(\Omega_2)$ which are fixed at zero by this conditional independence requirement. If we choose $f_i = 0$ and $F_{ij} = 0$ for all $i \in \mathcal{I}$ and all $j$ in (1), then $C_2^*$ has the same block sparse structure we desire for the lower triangular part of $\Omega_2$. By Proposition 1 of Rothman et al. (2010), this means that $\Omega_2$ will have the desired block sparse structure. Hence we impose the constraints $f_i = 0$ and $F_{ij} = 0$ for $i \in \mathcal{I}$ and all $j$, which reduces the number of variational parameters to be optimized.
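
As an illustration of this constraint (our sketch, not code from the paper), the index set $\mathcal{I}$ for a block-banded structure with $n$ blocks of size $L$ and bandwidth $\ell$ can be computed as follows in Julia; the function and variable names are our own.

\begin{verbatim}
# Indices of v(C2*) fixed at zero when b_i and b_j are conditionally
# independent for |i - j| > ell, given theta_G and the intervening blocks.
function zero_indices(n, L, ell)
    nL = n * L
    block(r) = (r - 1) ÷ L + 1          # which b_i a row/column belongs to
    idx = Int[]
    k = 0
    for j in 1:nL, i in j:nL            # same ordering as v(.)
        k += 1
        if block(i) - block(j) > ell
            push!(idx, k)
        end
    end
    return idx                          # then set f[idx] .= 0 and F[idx, :] .= 0
end
\end{verbatim}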

4 Optimization of variational parameters

To make the dependence on $\lambda$ explicit, write $q(\theta)$ as $q_\lambda(\theta)$. The variational parameters $\lambda$ are optimized by minimizing the Kullback-Leibler divergence between $q_\lambda(\theta)$ and the true posterior $p(\theta|y)$, where

\[
\text{KL}\{q_\lambda || p(\theta|y)\} = \int q_\lambda(\theta)\log\frac{q_\lambda(\theta)}{p(\theta|y)}\,d\theta
= \log p(y) - \int q_\lambda(\theta)\log\frac{p(y,\theta)}{q_\lambda(\theta)}\,d\theta \geq 0.
\]

Minimizing $\text{KL}\{q_\lambda || p(\theta|y)\}$ is therefore equivalent to maximizing an evidence lower bound ${\mathcal L}^{\text{VI}}$ on the log marginal likelihood $\log p(y)$, where

\[
{\mathcal L}^{\text{VI}} = E_{q_\lambda}\{\log p(y,\theta) - \log q_\lambda(\theta)\}. \qquad (2)
\]

In (2), $E_{q_\lambda}$ denotes expectation with respect to $q_\lambda(\theta)$. We seek to maximize ${\mathcal L}^{\text{VI}}$ with respect to $\lambda$ using stochastic gradient ascent. Starting with some initial estimate of $\lambda$, we perform the following update at each iteration $t$,

\[
\lambda_t = \lambda_{t-1} + \rho_t \widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}},
\]

where $\rho_t$ represents a small stepsize taken in the direction of the stochastic gradient $\widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}}$. The sequence $\{\rho_t\}$ should satisfy the conditions $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$ (Spall, 2003).

An unbiased estimate of the gradient $\nabla_\lambda {\mathcal L}^{\text{VI}}$ can be constructed using (2) by simulating $\theta$ from $q_\lambda(\theta)$. However, this approach usually results in large fluctuations in the stochastic gradients. Hence we implement the “reparametrization trick” (Kingma and Welling, 2014; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014), which helps to reduce the variance of the stochastic gradients. This approach writes $\theta \sim q_\lambda(\theta)$ as a function of the variational parameters $\lambda$ and a random vector $s$ having a density not depending on $\lambda$. To explain further, let $s = [s_1^T, s_2^T]^T$, where $s_1$ and $s_2$ are vectors of length $G$ and $nL$ corresponding to $\theta_G$ and $\theta_L$ respectively. Consider a transformation $\theta = r_\lambda(s)$ of the form

\[
\begin{bmatrix}\theta_G\\ \theta_L\end{bmatrix} = \begin{bmatrix}\mu_1 + C_1^{-T}s_1\\ \mu_2 + C_2^{-T}s_2\end{bmatrix}. \qquad (3)
\]

Since $\mu_2$ and $C_2$ are functions of $\theta_G$ from (1),

\[
\begin{aligned}
\mu_2 &= d + C_2^{-T}D(\mu_1 - \theta_G) = d - C_2^{-T}DC_1^{-T}s_1,\\
v(C_2^*) &= f + F\theta_G = f + F(\mu_1 + C_1^{-T}s_1).
\end{aligned}
\]

Hence $\mu_2$ and $C_2$ are functions of $s_1$, and $\theta_L$ is a function of both $s_1$ and $s_2$. This transformation is invertible since, given $\theta$ and $\lambda$, we can first recover $s_1 = C_1^T(\theta_G - \mu_1)$, find $\mu_2$ and $C_2$, and then recover $s_2 = C_2^T(\theta_L - \mu_2)$. Applying this transformation,

\[
{\mathcal L}^{\text{VI}} = \int \phi(s)\{\log p(y,\theta) - \log q_\lambda(\theta)\}\,ds
= E_\phi\{\log p(y,\theta) - \log q_\lambda(\theta)\}, \qquad (4)
\]

where $E_\phi$ denotes expectation with respect to $\phi(s)$ and $\theta = r_\lambda(s)$.
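
A minimal Julia sketch of the transformation $\theta = r_\lambda(s)$ in (3) is given below, assuming $\lambda$ has already been unpacked into $(\mu_1, C_1, d, D, f, F)$; all names are our own and this is an illustration rather than the authors' implementation.

\begin{verbatim}
using LinearAlgebra

function r_lambda(s, mu1, C1, d, D, f, F)
    G, nL = length(mu1), length(d)
    s1, s2 = s[1:G], s[G+1:G+nL]
    theta_G = mu1 + (C1' \ s1)                 # theta_G = mu1 + C1^{-T} s1
    v = f + F * theta_G                        # v(C2*) = f + F theta_G
    C2 = zeros(nL, nL); k = 0
    for j in 1:nL, i in j:nL                   # unpack v(.) into the lower triangle
        k += 1
        C2[i, j] = (i == j ? exp(v[k]) : v[k]) # C*_{ii} = log(C_{ii})
    end
    L2 = LowerTriangular(C2)
    mu2 = d + (L2' \ (D * (mu1 - theta_G)))    # mu2 = d + C2^{-T} D (mu1 - theta_G)
    theta_L = mu2 + (L2' \ s2)                 # theta_L = mu2 + C2^{-T} s2
    return vcat(theta_G, theta_L)
end
\end{verbatim}

Drawing $s \sim N(0, I_{nL+G})$ and applying this map then yields a draw $\theta \sim q_\lambda(\theta)$.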

4.1 Stochastic gradients

Next, we differentiate (4) with respect to $\lambda$ to find unbiased estimates of the gradients. As $\log q_\lambda(\theta)$ depends on $\lambda$ directly as well as through $\theta$, applying the chain rule, we have

\[
\begin{aligned}
\nabla_\lambda {\mathcal L}^{\text{VI}} &= E_\phi[\nabla_\lambda r_\lambda(s)\{\nabla_\theta \log p(y,\theta) - \nabla_\theta \log q_\lambda(\theta)\} - \nabla_\lambda \log q_\lambda(\theta)] &(5)\\
&= E_\phi[\nabla_\lambda r_\lambda(s)\{\nabla_\theta \log p(y,\theta) - \nabla_\theta \log q_\lambda(\theta)\}]. &(6)
\end{aligned}
\]

Note that $E_\phi\{\nabla_\lambda \log q_\lambda(\theta)\} = 0$ as it is the expectation of the score function. Roeder et al. (2017) refer to the expressions inside the expectations in (5) and (6) as the total derivative and path derivative respectively. In (6), the contributions to the gradient from $\log p(y,\theta)$ and $\log q_\lambda(\theta)$ cancel each other if $q_\lambda(\theta)$ approximates the true posterior well (at convergence). However, the score function $\nabla_\lambda \log q_\lambda(\theta)$ is not necessarily small even if $q_\lambda(\theta)$ is a good approximation to $p(\theta|y)$. This term adversely affects the ability of the algorithm to converge and “stick” to the optimal variational parameters, a phenomenon Roeder et al. (2017) refer to as “sticking the landing”. Hence we consider the path derivative,

\[
\widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}} = \nabla_\lambda r_\lambda(s)\{\nabla_\theta \log p(y,\theta) - \nabla_\theta \log q_\lambda(\theta)\} \qquad (7)
\]

as an unbiased estimate of the true gradient $\nabla_\lambda {\mathcal L}^{\text{VI}}$. Tan and Nott (2018) and Tan (2018) also demonstrate that the path derivative has smaller variation about zero when the algorithm is close to convergence.

Let $\nabla_\theta \log p(y,\theta) - \nabla_\theta \log q_\lambda(\theta) = ({\mathcal G}_1^T, {\mathcal G}_2^T)^T$, where ${\mathcal G}_1$ and ${\mathcal G}_2$ are vectors of length $G$ and $nL$ respectively corresponding to the partitioning of $\theta = [\theta_G^T, \theta_L^T]^T$. Then $\widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}} = \nabla_\lambda r_\lambda(s)({\mathcal G}_1^T, {\mathcal G}_2^T)^T$ is given by

\[
\begin{bmatrix}
{\mathcal G}_1 + \nabla_{\mu_1}\theta_L\, {\mathcal G}_2\\
-D_1^* v[C_1^{-T}s_1\{{\mathcal G}_1 + (\nabla_{\mu_1}\theta_L - D^T C_2^{-1}){\mathcal G}_2\}^T C_1^{-T}]\\
{\mathcal G}_2\\
-\text{vec}(C_2^{-1}{\mathcal G}_2 s_1^T C_1^{-1})\\
\nabla_f\theta_L\, {\mathcal G}_2\\
\text{vec}(\nabla_f\theta_L\, {\mathcal G}_2\, \theta_G^T)
\end{bmatrix},
\]

where

\[
\begin{aligned}
\nabla_{\mu_1}\theta_L &= F^T \nabla_f\theta_L,\\
\nabla_f\theta_L\, {\mathcal G}_2 &= -D_2^* v\{C_2^{-T}(s_2 - DC_1^{-T}s_1){\mathcal G}_2^T C_2^{-T}\}.
\end{aligned}
\]

Here $D_1^*$ and $D_2^*$ are diagonal matrices of order $G(G+1)/2$ and $nL(nL+1)/2$ respectively such that $\text{d}v(C_r) = D_r^*\,\text{d}v(C_r^*)$ for $r = 1, 2$. Formally, $D_1^* = \text{diag}\{v(\text{dg}(C_1) + {\bf 1}_G{\bf 1}_G^T - I_G)\}$ and $D_2^* = \text{diag}\{v(\text{dg}(C_2) + {\bf 1}_{nL}{\bf 1}_{nL}^T - I_{nL})\}$. The full expression and derivation of $\nabla_\lambda r_\lambda(s)$ are given in Appendix A. In addition, we show (in Appendix A) that

\[
\nabla_\theta \log q_\lambda(\theta) = \begin{bmatrix}\nabla_{\theta_G}\log q_\lambda(\theta)\\ \nabla_{\theta_L}\log q_\lambda(\theta)\end{bmatrix}
= \begin{bmatrix}F^T[v(I_{nL}) - D_2^* v\{(\theta_L - d)s_2^T\}] - C_1 s_1 - D^T s_2\\ -C_2 s_2\end{bmatrix}.
\]

$\nabla_\theta \log p(y,\theta)$ is model specific and we discuss the application to GLMMs and state space models in Sections 8 and 9 respectively.

4.2 Stochastic variational algorithm

The stochastic gradient ascent algorithm for CSGVA is outlined in Algorithm 1. For computing the stepsize, we use Adam (Kingma and Ba, 2015), which uses bias-corrected estimates of the first and second moments of the stochastic gradients to compute adaptive learning rates.

Algorithm 1: CSGVA algorithm.

Initialize $\lambda_0 = 0$, $m_0 = 0$, $v_0 = 0$.

For $t = 1, \dots, N$:

  1. Generate $s \sim N(0, I_{nL+G})$ and compute $\theta = r_{\lambda_{t-1}}(s)$.
  2. Compute the gradient $g_t = \widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}}$.
  3. Update the biased first moment estimate: $m_t = \tau_1 m_{t-1} + (1-\tau_1)g_t$.
  4. Update the biased second moment estimate: $v_t = \tau_2 v_{t-1} + (1-\tau_2)g_t^2$.
  5. Compute the bias-corrected first moment estimate: $\hat{m}_t = m_t/(1-\tau_1^t)$.
  6. Compute the bias-corrected second moment estimate: $\hat{v}_t = v_t/(1-\tau_2^t)$.
  7. Update $\lambda_t = \lambda_{t-1} + \alpha\hat{m}_t/(\sqrt{\hat{v}_t} + \epsilon)$.

At iteration $t$, the variational parameter $\lambda$ is updated as $\lambda_t = \lambda_{t-1} + \Delta_t$. Let $g_t$ denote the stochastic gradient estimate at iteration $t$. In steps 3 and 4, Adam computes estimates of the mean (first moment) and uncentered variance (second moment) of the gradients using exponential moving averages, where $\tau_1, \tau_2 \in [0,1)$ control the decay rates. In step 4, $g_t^2$ is evaluated as $g_t \odot g_t$, where $\odot$ denotes the Hadamard (element-wise) product. As $m_t$ and $v_t$ are initialized as zero, these moment estimates tend to be biased towards zero, especially at the beginning of the algorithm if $\tau_1$, $\tau_2$ are close to one. As $m_t = (1-\tau_1)\sum_{i=1}^t \tau_1^{t-i} g_i$,

\[
E(m_t) = E(g_t)(1-\tau_1^t) + \zeta,
\]

where $\zeta = 0$ if $E(g_i) = E(g_t)$ for $1 \leq i < t$. Otherwise, $\zeta$ can be kept small since the weights for past gradients decrease exponentially. An analogous argument holds for $v_t$. Thus the bias can be corrected by using the estimates $\hat{m}_t$ and $\hat{v}_t$ in steps 5 and 6. The change is then computed as

\[
\Delta_t = \frac{\alpha\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\]

where $\alpha$ controls the magnitude of the stepsize and $\epsilon$ is a small positive constant which ensures that the denominator is positive. In our experiments, we set $\alpha = 0.001$, $\tau_1 = 0.9$, $\tau_2 = 0.99$ and $\epsilon = 10^{-8}$, values close to what is recommended by Kingma and Ba (2015).
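
For concreteness, a minimal Julia sketch of one Adam update (steps 3 to 7 of Algorithm 1) with the above hyperparameter values is shown below; this is our illustration and the function name is our own.

\begin{verbatim}
function adam_update!(lambda, m, v, g, t; alpha=0.001, tau1=0.9, tau2=0.99, epsilon=1e-8)
    m .= tau1 .* m .+ (1 - tau1) .* g              # biased first moment estimate
    v .= tau2 .* v .+ (1 - tau2) .* g .^ 2         # biased second moment estimate
    mhat = m ./ (1 - tau1^t)                       # bias corrections
    vhat = v ./ (1 - tau2^t)
    lambda .+= alpha .* mhat ./ (sqrt.(vhat) .+ epsilon)
    return lambda
end
\end{verbatim}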

At each iteration $t$, we can also compute an unbiased estimate of the lower bound,

\[
\hat{{\mathcal L}}^{\text{VI}} = \log p(y,\theta) - \log q_{\lambda_{t-1}}(\theta),
\]

where $\theta$ is computed in step 1. Since these estimates are stochastic, we follow the path traced by $\bar{{\mathcal L}}^{\text{VI}}$, the average of the lower bound estimates over every 1000 iterations, as a means to diagnose the convergence of Algorithm 1. $\bar{{\mathcal L}}^{\text{VI}}$ tends to increase monotonically at the start, but as the algorithm comes close to convergence, the values of $\bar{{\mathcal L}}^{\text{VI}}$ fluctuate close to and about the true maximum lower bound. Hence, we fit a least squares regression line to the past $\kappa$ values of $\bar{{\mathcal L}}^{\text{VI}}$ and terminate Algorithm 1 once the gradient of the regression line becomes negative (see Tan, 2018). For our experiments, we set $\kappa = 6$.
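
The stopping rule can be coded in a few lines; the Julia sketch below (our own, with assumed variable names) fits a least squares line to the last $\kappa$ averaged lower bounds and signals termination when the fitted slope is negative.

\begin{verbatim}
using Statistics

# lbar: vector of lower bound values averaged over blocks of 1000 iterations
function stop_rule(lbar; kappa=6)
    length(lbar) < kappa && return false
    y = lbar[end-kappa+1:end]
    x = collect(1.0:kappa)
    slope = sum((x .- mean(x)) .* (y .- mean(y))) / sum((x .- mean(x)) .^ 2)
    return slope < 0
end
\end{verbatim}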

5 Links to Gaussian variational approximation

CSGVA is an extension of Gaussian variational approximation (GVA, Tan and Nott, 2018). In both approaches, the conditional posterior independence structure of the local latent variables is used to introduce sparsity in the precision matrix of the approximation. Below we demonstrate that GVA is a special case of CSGVA when $F \equiv 0$.

Tan and Nott (2018) consider a GVA of the form

\[
\begin{bmatrix}\theta_L\\ \theta_G\end{bmatrix} \sim {\mathcal N}\left(\begin{bmatrix}\mu_L\\ \mu_G\end{bmatrix}, T^{-T}T^{-1}\right)
\quad\text{where}\quad T = \begin{bmatrix}T_{LL} & 0\\ T_{GL} & T_{GG}\end{bmatrix}.
\]

Note that $T_{LL}$ and $T_{GG}$ are lower triangular matrices. Using a vector $s = [s_L^T, s_G^T]^T \sim N(0, I_{nL+G})$, we can write

\[
\begin{bmatrix}\theta_L\\ \theta_G\end{bmatrix} = \begin{bmatrix}\mu_L\\ \mu_G\end{bmatrix} + T^{-T}\begin{bmatrix}s_L\\ s_G\end{bmatrix},
\quad\text{where}\quad T^{-T} = \begin{bmatrix}T_{LL}^{-T} & -T_{LL}^{-T}T_{GL}^T T_{GG}^{-T}\\ 0 & T_{GG}^{-T}\end{bmatrix}.
\]

Assuming $F \equiv 0$ for CSGVA, we have from (3) that

\[
\begin{bmatrix}\theta_L\\ \theta_G\end{bmatrix} = \begin{bmatrix}d\\ \mu_1\end{bmatrix} + \begin{bmatrix}C_2^{-T} & -C_2^{-T}DC_1^{-T}\\ 0 & C_1^{-T}\end{bmatrix}\begin{bmatrix}s_2\\ s_1\end{bmatrix},
\]

where $[s_2^T, s_1^T]^T \sim N(0, I_{nL+G})$. Hence we can identify

\[
\mu_1 = \mu_G, \quad d = \mu_L, \quad C_1 = T_{GG}, \quad C_2 = T_{LL}, \quad D = T_{GL}^T.
\]

If the standard way of initializing Algorithm 1 (by setting $\lambda = 0$) does not work well, we can use this association to initialize Algorithm 1 with the fit from GVA. This informative initialization can reduce computation time significantly, although there may be a risk of getting stuck in a local mode.
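
A sketch of this initialization in Julia is given below; muG, muL, TGG, TLL and TGL denote the fitted GVA quantities and all names are our own assumptions.

\begin{verbatim}
# Map a GVA fit to CSGVA starting values via the identification above (F = 0).
vstar(C) = [i == j ? log(C[i, j]) : C[i, j] for j in 1:size(C, 1) for i in j:size(C, 1)]

function init_from_gva(muG, muL, TGG, TLL, TGL)
    mu1, d = muG, muL
    C1, C2, D = TGG, TLL, Matrix(TGL')   # D = T_GL'
    f = vstar(C2)                        # v(C2*) with logged diagonal
    F = zeros(length(f), length(mu1))    # F = 0 reproduces the GVA exactly
    return mu1, C1, d, D, f, F
end
\end{verbatim}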

6 Importance weighted variational inference

Here we discuss how CSGVA can be improved by maximizing an importance weighted lower bound (IWLB, Burda et al., 2016), which leads to a tighter lower bound on the log marginal likelihood, and a variational approximation less prone to underestimation of the true posterior variance. We also relate the IWLB with Rényi’s divergence (Rényi, 1961; van Erven and Harremos, 2014) between $q_\lambda(\theta)$ and $p(\theta|y)$, demonstrating that maximizing the IWLB instead of the usual evidence lower bound leads to a transition in the behavior of the variational approximation from “mode-seeking” to “mass-covering”. We first define Rényi’s divergence and the variational Rényi bound (Li and Turner, 2016), before introducing the IWLB as the expectation of a Monte Carlo approximation of the variational Rényi bound.

6.1 Rényi’s divergence and variational Rényi bound

Rényi’s divergence provides a measure of the distance between two densities $q$ and $p$, and it is defined as

\[
D_\alpha(q||p) = \frac{1}{\alpha-1}\log\int q(\theta)^\alpha p(\theta)^{1-\alpha}\,d\theta,
\]

for $0 < \alpha < \infty$, $\alpha \neq 1$. This definition can be extended by continuity to the orders 0, 1 and $\infty$, as well as to negative orders $\alpha < 0$. Note that $D_\alpha(q||p)$ is no longer a divergence measure if $\alpha < 0$, but we can write $D_\alpha(q||p)$ as $\frac{\alpha}{1-\alpha}D_{1-\alpha}(p||q)$ for $\alpha \notin \{0,1\}$ by the skew symmetry property. As $\alpha$ approaches 1, the limit of $D_\alpha(q||p)$ is the Kullback-Leibler divergence, $\text{KL}(q||p)$. In variational inference, minimizing the Kullback-Leibler divergence between the variational density $q_\lambda(\theta)$ and the true posterior $p(\theta|y)$ is equivalent to maximizing a lower bound ${\mathcal L}^{\text{VI}}$ on the log marginal likelihood due to the relationship:

\[
{\mathcal L}^{\text{VI}} = \log p(y) - \text{KL}\{q_\lambda || p(\theta|y)\} = E_{q_\lambda}\left\{\log\frac{p(y,\theta)}{q_\lambda(\theta)}\right\}.
\]

Generalizing this relation using Rényi’s divergence measure, Li and Turner (2016) define the variational Rényi bound ${\mathcal L}_\alpha$ as

\[
{\mathcal L}_\alpha = \log p(y) - D_\alpha\{q_\lambda || p(\theta|y)\}
= \frac{1}{1-\alpha}\log E_{q_\lambda}\left\{\left(\frac{p(y,\theta)}{q_\lambda(\theta)}\right)^{1-\alpha}\right\}.
\]

Note that ${\mathcal L}_1$, the limit of ${\mathcal L}_\alpha$ as $\alpha \rightarrow 1$, is equal to ${\mathcal L}^{\text{VI}}$. A Monte Carlo approximation of ${\mathcal L}_\alpha$ when the expectation is intractable is

\[
\hat{{\mathcal L}}_{\alpha,K} = \frac{1}{1-\alpha}\log\frac{1}{K}\sum_{k=1}^K w_k^{1-\alpha}, \qquad (8)
\]

where $\Theta_K = [\theta_1,\dots,\theta_K]$ is a set of $K$ samples generated randomly from $q_\lambda(\theta)$, and

\[
w_k = w(\theta_k) = \frac{p(y,\theta_k)}{q_\lambda(\theta_k)}, \quad k = 1,\dots,K,
\]

are importance weights. For each $k$, $E_{q_\lambda}(w_k) = p(y)$. The limit of $\hat{{\mathcal L}}_{\alpha,K}$ as $\alpha \rightarrow 1$ is $\frac{1}{K}\sum_{k=1}^K \log\frac{p(y,\theta_k)}{q_\lambda(\theta_k)}$. Hence $\hat{{\mathcal L}}_{1,K}$ is an unbiased estimate of ${\mathcal L}_1$ as $E_{\Theta_K}(\hat{{\mathcal L}}_{1,K}) = {\mathcal L}_1 = {\mathcal L}^{\text{VI}}$, where $E_{\Theta_K}$ denotes expectation with respect to $q(\Theta_K) = \prod_{k=1}^K q_\lambda(\theta_k)$. For $\alpha \neq 1$, $\hat{{\mathcal L}}_{\alpha,K}$ is not an unbiased estimate of ${\mathcal L}_\alpha$.
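
Given log importance weights $\log w_1,\dots,\log w_K$, the estimate $\hat{{\mathcal L}}_{\alpha,K}$ in (8) can be evaluated stably with a log-sum-exp; the Julia sketch below is our own illustration (with $\alpha = 0$ it returns the importance weighted lower bound estimate introduced next, and $\alpha = 1$ is handled by its limit).

\begin{verbatim}
function vr_bound_hat(logw, alpha)
    K = length(logw)
    alpha == 1 && return sum(logw) / K        # limit as alpha -> 1 (ELBO estimate)
    a = (1 - alpha) .* logw
    m = maximum(a)                            # log-sum-exp for numerical stability
    return (m + log(sum(exp.(a .- m)) / K)) / (1 - alpha)
end
\end{verbatim}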

6.2 Importance weighted lower bound

The importance weighted lower bound (IWLB, Burda et al., 2016) is defined as

\[
{\mathcal L}_K^{\text{IW}} = E_{\Theta_K}(\hat{{\mathcal L}}_{0,K}) = E_{\Theta_K}\left(\log\frac{1}{K}\sum_{k=1}^K w_k\right),
\]

where $\alpha = 0$ in (8). It reduces to ${\mathcal L}^{\text{VI}}$ when $K = 1$. By Jensen’s inequality,

\[
{\mathcal L}_K^{\text{IW}} \leq \log E_{\Theta_K}\left(\frac{1}{K}\sum_{k=1}^K w_k\right) = \log p(y).
\]

Thus ${\mathcal L}_K^{\text{IW}}$ provides a lower bound to the log marginal likelihood for any positive integer $K$. From Theorem 6.1 (Burda et al., 2016), this bound becomes tighter as $K$ increases.

Theorem 6.1

${\mathcal L}^{\text{IW}}_K$ increases with $K$ and approaches $\log p(y)$ as $K \rightarrow \infty$ if $w_k$ is bounded.

Proof

Let $I = \{w_{I_1},\dots,w_{I_K}\}$ be selected randomly without replacement from $\{w_1,\dots,w_{K+1}\}$. Then $E_{I|\Theta_{K+1}}(w_{I_j}) = \frac{1}{K+1}\sum_{k=1}^{K+1} w_k$ for any $j = 1,\dots,K$, where $E_{I|\Theta_{K+1}}$ denotes the expectation associated with the randomness in selecting $I$ given $\Theta_{K+1}$. Thus

\[
\begin{aligned}
{\mathcal L}^{\text{IW}}_{K+1} &= E_{\Theta_{K+1}}\left(\log\frac{1}{K+1}\sum\nolimits_{k=1}^{K+1} w_k\right)\\
&= E_{\Theta_{K+1}}\left\{\log E_{I|\Theta_{K+1}}\left(\frac{1}{K}\sum\nolimits_{j=1}^K w_{I_j}\right)\right\}\\
&\geq E_{\Theta_{K+1}}\left\{E_{I|\Theta_{K+1}}\log\left(\frac{1}{K}\sum\nolimits_{j=1}^K w_{I_j}\right)\right\}\\
&= E_{\Theta_K}\log\left(\frac{1}{K}\sum\nolimits_{k=1}^K w_k\right) = {\mathcal L}^{\text{IW}}_K.
\end{aligned}
\]

If $w_k$ is bounded, then $\frac{1}{K}\sum_{k=1}^K w_k \overset{a.s.}{\rightarrow} p(y)$ as $K \rightarrow \infty$ by the law of large numbers. Hence ${\mathcal L}^{\text{IW}}_K = E_{\Theta_K}(\log\frac{1}{K}\sum_{k=1}^K w_k) \rightarrow \log p(y)$ as $K \rightarrow \infty$.

Next, we present some properties of Rényi’s divergence and $E_{\Theta_K}(\hat{{\mathcal L}}_{\alpha,K})$ which are important in understanding the behavior of the variational density arising from maximizing ${\mathcal L}^{\text{IW}}_K$. The proofs of these properties can be found in van Erven and Harremos (2014) and Li and Turner (2016).

Property 1

$D_\alpha$ is increasing in $\alpha$, and is continuous in $\alpha$ on $[0,1] \cup \{\alpha \notin [0,1] \mid |D_\alpha| < \infty\}$.

Property 2

$E_{\Theta_K}(\hat{{\mathcal L}}_{\alpha,K})$ is continuous and decreasing in $\alpha$ for fixed $K$.

Theorem 6.2

There exists $0 \leq \alpha_{q_\lambda,K} \leq 1$ for given $q_\lambda$ and $K$ such that

\[
\log p(y) = D_{\alpha_{q_\lambda,K}}\{q_\lambda || p(\theta|y)\} + {\mathcal L}_K^{\text{IW}}.
\]
Proof

From Property 2,

\[
\begin{gathered}
{\mathcal L}_1 = E_{\Theta_K}(\hat{{\mathcal L}}_{1,K}) \leq E_{\Theta_K}(\hat{{\mathcal L}}_{0,K}) = {\mathcal L}^{\text{IW}}_K \leq {\mathcal L}_0,\\
{\mathcal L}_1 - \log p(y) \leq {\mathcal L}^{\text{IW}}_K - \log p(y) \leq {\mathcal L}_0 - \log p(y),\\
D_0\{q_\lambda || p(\theta|y)\} \leq \log p(y) - {\mathcal L}^{\text{IW}}_K \leq D_1\{q_\lambda || p(\theta|y)\}.
\end{gathered}
\]

From Property 1, since $D_\alpha$ is continuous and increasing for $\alpha \in [0,1]$, there exists $0 \leq \alpha_{q_\lambda,K} \leq 1$ such that $\log p(y) - {\mathcal L}^{\text{IW}}_K = D_{\alpha_{q_\lambda,K}}\{q_\lambda || p(\theta|y)\}$.

Minimizing Rényi’s divergence for $\alpha \gg 1$ tends to produce approximations which are mode-seeking (zero-forcing), while maximizing Rényi’s divergence for $\alpha \ll 0$ encourages mass-covering behavior. Theorem 6.2 suggests that maximizing the IWLB results in a variational approximation $q_\lambda$ whose Rényi’s divergence from the true posterior can be captured with $0 \leq \alpha \leq 1$, which represents a certain balance between mode-seeking and mass-covering behavior (Minka, 2005). In our experiments, we observe that maximizing the IWLB is highly effective in correcting the underestimation of posterior variance in variational inference.

Alternatively, if we approximate ${\mathcal L}_K^{\text{IW}}$ by considering a second-order Taylor expansion of $\log\bar{w}_K$ about $E_{\Theta_K}(\bar{w}_K) = p(y)$, where $\bar{w}_K = \frac{1}{K}\sum_{k=1}^K w_k$, and then take expectations, we have

\[
{\mathcal L}_K^{\text{IW}} \approx \log p(y) - \frac{\text{var}(w_k)}{2Kp(y)^2}.
\]

Maddison et al. (2017) and Domke and Sheldon (2018) provide bounds on the order of the remainder term in the Taylor approximation above, and demonstrate that the “looseness” of the IWLB is given by $\text{var}(w_k)$ as $K \rightarrow \infty$. Minimizing $\text{var}(w_k)$ is equivalent to minimizing the $\chi^2$ divergence $D_2(p||q)$. Note that if $q_\lambda(\theta)$ has thin tails compared to $p(\theta|y)$, then $\text{var}(w_k)$ will be large. Hence minimizing $\text{var}(w_k)$ attempts to match $q_\lambda(\theta)$ with $p(\theta|y)$ so that $q_\lambda(\theta)$ is able to cover the tails.

6.3 Unbiased gradient estimate of importance weighted lower bound

To maximize the IWLB in CSGVA, we need to find an unbiased estimate of $\nabla_\lambda{\mathcal L}_K^{\text{IW}}$ using the transformation in (3). Let $s_k \sim N(0, I_{G+nL})$, $\theta_k = r_\lambda(s_k)$ for $k = 1,\dots,K$, and $S_K = [s_1,\dots,s_K]^T$.

\[
\begin{aligned}
\nabla_\lambda {\mathcal L}_K^{\text{IW}} &= \nabla_\lambda E_{\Theta_K}(\log\bar{w}_K) = \nabla_\lambda E_{S_K}(\log\bar{w}_K)\\
&= E_{S_K}\left[\sum_{k=1}^K \frac{\nabla_\lambda w_k}{\sum_{k'=1}^K w_{k'}}\right]
= E_{S_K}\left[\sum_{k=1}^K \frac{w_k \nabla_\lambda \log w_k}{\sum_{k'=1}^K w_{k'}}\right]\\
&= E_{S_K}\left[\sum\nolimits_{k=1}^K \widetilde{w}_k \nabla_\lambda \log w_k\right], \qquad (9)
\end{aligned}
\]

where $w_k = w(\theta_k) = w\{r_\lambda(s_k)\}$ and $\widetilde{w}_k = w_k/\sum_{k'=1}^K w_{k'}$ for $k = 1,\dots,K$ are normalized importance weights. Applying the chain rule,

\[
\nabla_\lambda \log w_k = \nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\log w_k - \nabla_\lambda \log q_\lambda(\theta_k). \qquad (10)
\]

In Section 4.1, we note that $E_\phi\{\nabla_\lambda \log q_\lambda(\theta)\} = 0$ as it is the expectation of the score function and hence we can omit $\nabla_\lambda \log q_\lambda(\theta)$ to obtain an unbiased estimate of $\nabla_\lambda {\mathcal L}^{\text{VI}}$. However, in this case, it is unclear if

\[
E_{S_K}\left[\sum_{k=1}^K \widetilde{w}_k \nabla_\lambda \log q_\lambda(\theta_k)\right] = 0. \qquad (11)
\]

Roeder et al. (2017) conjecture that (11) is true and report improved results when omitting the term $\nabla_\lambda \log q_\lambda(\theta_k)$ from $\nabla_\lambda \log w_k$ in computing gradient estimates. However, Tucker et al. (2018) demonstrated via simulations that (11) does not hold generally and that such an omission will result in biased gradient estimates. Our own simulations using CSGVA also suggest that (11) does not hold, even though omission of $\nabla_\lambda \log q_\lambda(\theta_k)$ does lead to improved results. As the stochastic gradient algorithm is not guaranteed to converge with biased gradient estimates, we turn to the doubly reparametrized gradient estimate proposed by Tucker et al. (2018), which allows unbiased gradient estimates to be constructed with the omission of $\nabla_\lambda \log q_\lambda(\theta_k)$, albeit with revised weights.

Since $\tilde{w}_k$ depends on $\lambda$ directly as well as through $\theta_k$, we use the chain rule to obtain

\[
\nabla_\lambda E_{\theta_k}(\tilde{w}_k) = \nabla_\lambda E_{s_k}(\tilde{w}_k)
= E_{s_k}(\nabla_\lambda\theta_k\,\nabla_{\theta_k}\tilde{w}_k) + E_{s_k}(\nabla_\lambda\tilde{w}_k), \qquad (12)
\]

where

\[
\nabla_{\theta_k}\tilde{w}_k = \left\{\frac{1}{\sum_{k'=1}^K w_{k'}} - \frac{w_k}{(\sum_{k'=1}^K w_{k'})^2}\right\}\nabla_{\theta_k}w_k
= (\tilde{w}_k - \tilde{w}_k^2)\nabla_{\theta_k}\log w_k.
\]

Alternatively,

\[
\begin{aligned}
\nabla_\lambda E_{\theta_k}(\tilde{w}_k) &= \nabla_\lambda\int q_\lambda(\theta_k)\tilde{w}_k\,d\theta_k\\
&= \int \tilde{w}_k\nabla_\lambda q_\lambda(\theta_k) + q_\lambda(\theta_k)\nabla_\lambda\tilde{w}_k\,d\theta_k\\
&= \int \tilde{w}_k q_\lambda(\theta_k)\nabla_\lambda\log q_\lambda(\theta_k)\,d\theta_k + E_{\theta_k}(\nabla_\lambda\tilde{w}_k)\\
&= E_{\theta_k}[\tilde{w}_k\nabla_\lambda\log q_\lambda(\theta_k)] + E_{s_k}(\nabla_\lambda\tilde{w}_k). \qquad (13)
\end{aligned}
\]

Comparing (12) and (13), we have

\[
\begin{aligned}
E_{\Theta_K}\left(\sum\nolimits_{k=1}^K \tilde{w}_k\nabla_\lambda\log q_\lambda(\theta_k)\right)
&= \sum\nolimits_{k=1}^K E_{\Theta_K\backslash\theta_k}E_{\theta_k}[\tilde{w}_k\nabla_\lambda\log q_\lambda(\theta_k)]\\
&= \sum\nolimits_{k=1}^K E_{S_K\backslash s_k}E_{s_k}\left(\nabla_\lambda\theta_k(\tilde{w}_k - \tilde{w}_k^2)\nabla_{\theta_k}\log w_k\right)\\
&= E_{S_K}\left\{\sum\nolimits_{k=1}^K(\tilde{w}_k - \tilde{w}_k^2)\nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\log w_k\right\}.
\end{aligned}
\]

Combining the above expression with (9) and (10), we find that

\[
\nabla_\lambda {\mathcal L}_K^{\text{IW}} = E_{S_K}\left(\sum\nolimits_{k=1}^K \tilde{w}_k^2\,\nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\log w_k\right).
\]

An unbiased gradient estimate is thus given by

\[
\widehat{\nabla}_\lambda {\mathcal L}_K^{\text{IW}} = \sum_{k=1}^K \tilde{w}_k^2\,\nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\{\log p(y,\theta_k) - \log q_\lambda(\theta_k)\}.
\]

Thus, to use CSGVA with importance weights, we only need to modify Algorithm 1 by drawing $K$ samples $s_1,\dots,s_K$ in step 1 instead of a single sample, and then compute the unbiased gradient estimate $g_t = \widehat{\nabla}_\lambda {\mathcal L}_K^{\text{IW}}$ in step 2. The rest of the steps in Algorithm 1 remain unchanged. In the importance weighted version of CSGVA, the gradient estimate based on a single sample $s$ is replaced by a weighted sum of the gradients in (7) based on $K$ samples $s_1,\dots,s_K$. However, these weights do not necessarily sum to 1. An unbiased estimate of ${\mathcal L}_K^{\text{IW}}$ itself is given by $\hat{{\mathcal L}}_K^{\text{IW}} = \log\frac{1}{K}\sum_{k=1}^K w_k$.
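
A sketch of this doubly reparametrized estimator in Julia is given below; grad_path is an assumed callback that returns, for a sample $s_k$, the pair $(\log w_k,\ \nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\log w_k)$, and all names are our own.

\begin{verbatim}
function iwlb_gradient(grad_path, S)
    logw = Float64[]
    grads = Vector{Float64}[]
    for s in S                                     # S holds the K samples s_1,...,s_K
        lw, g = grad_path(s)
        push!(logw, lw)
        push!(grads, g)
    end
    a = exp.(logw .- maximum(logw))
    wtilde = a ./ sum(a)                           # normalized importance weights
    return sum(wtilde[k]^2 .* grads[k] for k in eachindex(grads))
end
\end{verbatim}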

7 Experimental results

We apply CSGVA to GLMMs and state space models and compare the results with those of GVA in terms of computation time and accuracy of the posterior approximation. Lower bounds reported exclude constants which are independent of the model variables. We also illustrate how CSGVA can be improved using importance weighting (IW-CSGVA), considering $K \in \{5, 20, 100\}$. We find that IW-CSGVA performs poorly if it is initialized in the standard manner using $\lambda = 0$. This is because, when $q_\lambda(\theta)$ is still far from optimal, a few of the importance weights tend to dominate with the rest close to zero, thus producing inaccurate estimates of the gradients. Hence we initialize IW-CSGVA using the CSGVA fit, and the algorithm is terminated after a short run of 1000 iterations as IW-CSGVA is computationally intensive and improvements in the IWLB and variational approximation seem to be negligible thereafter. Posterior distributions estimated using MCMC via RStan are regarded as the ground truth. Code for all variational algorithms is written in Julia and is available as supplementary material. All experiments are run on an Intel Core i9-9900K CPU @3.60 GHz with 16GB RAM. As the computation time of IW-CSGVA increases linearly with $K$, we also investigate the speedup that can be achieved through parallel computing on a machine with 8 cores. Julia retains one worker (or core) as the master process and parallel computing is implemented using the remaining seven workers.

The parametrization of a hierarchical model plays a major role in the rate of convergence of both GVA and CSGVA. In some cases, it can even affect the ability of the algorithm to converge (to a local mode). We have attempted both the centered and noncentered parametrizations (Papaspiliopoulos et al., 2003, 2007), which are known to have a huge impact on the rate of convergence of MCMC algorithms. These two parametrizations are complementary and neither is superior to the other. If an algorithm converges slowly under one parametrization, it often converges much faster under the other. Which parametrization works better usually depends on the nature of the data. For the datasets that we use in the experiments, the centered parametrization was found to have better convergence properties than the noncentered parametrization for GLMMs, while the noncentered parametrization is preferred for stochastic volatility models. These observations are further discussed below.

8 Generalized linear mixed models

Let $y_i = (y_{i1},\dots,y_{in_i})^T$ denote the vector of responses of length $n_i$ for the $i$th subject for $i = 1,\dots,n$, where $y_{ij}$ is generated from some distribution in the exponential family. The mean $\mu_{ij} = E(y_{ij})$ is connected to the linear predictor $\eta_{ij}$ via

\[
g(\mu_{ij}) = \eta_{ij} = X_{ij}^T\beta + Z_{ij}^T b_i,
\]

for some smooth invertible link function $g(\cdot)$. Here $\beta$ is a $p \times 1$ vector of fixed effects and $b_i \sim N(0,\Lambda)$ is an $L\times 1$ vector of random effects specific to the $i$th subject. $X_{ij}$ and $Z_{ij}$ are vectors of covariates of length $p$ and $L$ respectively. Let $\eta_i = [\eta_{i1},\dots,\eta_{in_i}]^T$, $X_i = [X_{i1},\dots,X_{in_i}]^T$ and $Z_i = [Z_{i1},\dots,Z_{in_i}]^T$. We focus on the one-parameter exponential family with canonical links. This includes the Bernoulli model, $y_{ij} \sim \text{Bern}(\mu_{ij})$, with the logit link $g(\mu_{ij}) = \log\{\mu_{ij}/(1-\mu_{ij})\}$, and the Poisson model, $y_{ij} \sim \text{Pois}(\mu_{ij})$, with the log link $g(\mu_{ij}) = \log(\mu_{ij})$. Let $WW^T$ be the unique Cholesky decomposition of $\Lambda^{-1}$, where $W$ is a lower triangular matrix with positive diagonal entries. Define $W^*$ such that $W^*_{ii} = \log(W_{ii})$ and $W^*_{ij} = W_{ij}$ if $i \neq j$, and let $\omega = v(W^*)$. We consider normal priors, $\beta \sim N(0,\sigma_\beta^2 I)$ and $\omega \sim N(0,\sigma_\omega^2 I)$, where $\sigma_\beta^2$ and $\sigma_\omega^2$ are set to 100.

The above parametrization of the GLMM is noncentered since $b_i$ has mean 0. Alternatively, we can consider the centered parametrization proposed by Tan and Nott (2013). Suppose the covariates for the random effects are a subset of the covariates for the fixed effects, and the first columns of $X_i$ and $Z_i$ are ones corresponding to an intercept and a random intercept respectively. Then we can write

\[
\eta_i = X_i\beta + Z_i b_i = Z_i\beta_R + X_i^G\beta_G + Z_i b_i,
\]

where $\beta = [\beta_R^T, \beta_G^T]^T$. We further split $X_i^G$ into covariates which are subject specific (varying only with $i$ and taking the same value for $j = 1,\dots,n_i$) and those which are not, and split $\beta_G = [\beta_{G_1}^T, \beta_{G_2}^T]^T$ accordingly, where $\beta_{G_1}$, $\beta_{G_2}$ are vectors of length $g_1$ and $g_2$ respectively. Then

\[
\eta_i = Z_i\beta_R + {\bf 1}_{n_i}{x_i^{G_1}}^T\beta_{G_1} + X_i^{G_2}\beta_{G_2} + Z_i b_i
= Z_i(C_i\beta_{RG_1} + b_i) + X_i^{G_2}\beta_{G_2},
\]

where

\[
C_i = \begin{bmatrix} I_L & \begin{matrix} {x_i^{G_1}}^T \\ 0_{(L-1)\times g_1} \end{matrix} \end{bmatrix}
\quad\text{and}\quad
\beta_{RG_1} = \begin{bmatrix}\beta_R\\ \beta_{G_1}\end{bmatrix}.
\]

Let $\tilde{b}_i = C_i\beta_{RG_1} + b_i$. The centered parametrization is represented as

\[
\eta_i = Z_i\tilde{b}_i + X_i^{G_2}\beta_{G_2}, \quad \tilde{b}_i \sim N(C_i\beta_{RG_1}, \Lambda), \qquad (14)
\]

for $i = 1,\dots,n$.

Tan and Nott (2013) compare the centered, noncentered and partially noncentered parametrizations for GLMMs in the context of variational Bayes, showing that the choice of parametrization affects not only the rate of convergence, but also the accuracy of the variational approximation. For CSGVA, we observe that the accuracy of the variational approximation and the rate of convergence can also be greatly affected. Tan and Nott (2013) demonstrate that the centered parametrization is preferred when the observations are highly informative about the latent variables. In practice, a general guideline is to use the centered parametrization for Poisson models when observed counts are large and the noncentered parametrization when most counts are close to zero. For Bernoulli models, differences between using centered and noncentered parametrizations are usually minor. Here we use the centered parametrization in (14), which has been observed to yield gains in convergence rates for the datasets used for illustration.

The global parameters are $\theta_G = (\beta^T, \omega^T)^T$ of dimension $G = p + L(L+1)/2$, and the local variables are $\theta_L = (\tilde{b}_1,\dots,\tilde{b}_n)^T$. The joint density is

\[
p(y,\theta) = p(\beta)\,p(\omega)\prod_{i=1}^n\bigg\{p(\tilde{b}_i|\omega,\beta)\prod_{j=1}^{n_i}p(y_{ij}|\beta,\tilde{b}_i)\bigg\}.
\]

The log of the joint density is given by

\[
\begin{aligned}
\log p(y,\theta) &= \sum\nolimits_{i=1}^n\{y_i^T\eta_i - {\bf 1}^T h(\eta_i) - (\tilde{b}_i - C_i\beta_{RG_1})^T WW^T(\tilde{b}_i - C_i\beta_{RG_1})/2\}\\
&\quad - \beta^T\beta/(2\sigma_\beta^2) - \omega^T\omega/(2\sigma_\omega^2) + n\log|W| + c,
\end{aligned}
\]

where $h(\cdot)$ is the log-partition function and $c$ is a constant independent of $\theta$. For the Poisson model with log link, $h(x)=\exp(x)$. For the Bernoulli model with logit link, $h(x)=\log\{1+\exp(x)\}$. The gradient $\nabla_\theta\log p(y,\theta)$ is given in Appendix B.
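For concreteness, a minimal sketch of this log joint density for the Poisson case (log link, so $h(x)=\exp(x)$) is given below; it is our own illustration with hypothetical argument names, where the per-subject quantities are passed as lists and the additive constant $c$ is dropped.

```python
import numpy as np

def log_joint_poisson_glmm(y, Z, X_G2, C, b_tilde, beta_RG1, beta_G2, W, omega,
                           sigma2_beta=100.0, sigma2_omega=100.0):
    """log p(y, theta) up to the constant c for the centered Poisson GLMM."""
    n = len(y)
    beta = np.concatenate([beta_RG1, beta_G2])
    out = (-beta @ beta / (2.0 * sigma2_beta)
           - omega @ omega / (2.0 * sigma2_omega)
           + n * np.sum(np.log(np.diag(W))))          # n log|W|, W lower triangular
    for i in range(n):
        eta_i = Z[i] @ b_tilde[i] + X_G2[i] @ beta_G2
        resid = b_tilde[i] - C[i] @ beta_RG1
        out += y[i] @ eta_i - np.sum(np.exp(eta_i)) \
               - 0.5 * resid @ (W @ W.T) @ resid      # quadratic form in the random effects
    return out
```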

For the GLMM, $b_i$ and $b_j$ are conditionally independent given $\theta_G$ for $i\neq j$ in $p(\theta|y)$. Hence we impose the following sparsity structure on $\Omega_2$ and $C_2$,

\Omega_{2}=\begin{bmatrix}\Omega_{2,11}&0&\dots&0\\ 0&\Omega_{2,22}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\Omega_{2,nn}\end{bmatrix},
C_{2}^{*}=\begin{bmatrix}C^{*}_{2,11}&0&\dots&0\\ 0&C^{*}_{2,22}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&C^{*}_{2,nn}\end{bmatrix},

where each $\Omega_{2,ii}$ is an $L\times L$ block matrix and each $C^*_{2,ii}$ is an $L\times L$ lower triangular matrix. We set $f_i=0$ and $F_{ij}=0$ for all $i\in\mathcal{I}$ and all $j$, where $\mathcal{I}$ denotes the set of indices in $v(C_2^*)$ which are fixed at zero. The number of nonzero elements in $v(C_2^*)$ is $nL(L+1)/2$. Hence the number of variational parameters to be optimized is reduced from $nL(nL+1)/2$ to $nL(L+1)/2$ for $f$ and from $nL(nL+1)G/2$ to $nL(L+1)G/2$ for $F$.
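The index set $\mathcal{I}$ can be precomputed. The sketch below (our own helper, assuming a column-major vech ordering of $v(\cdot)$, which is an assumption) returns the positions in $v(C_2^*)$ that are allowed to be nonzero; its complement gives $\mathcal{I}$.

```python
import numpy as np

def blockdiag_free_indices(n, L):
    # Positions in v(C_2^*) (vech ordering of the nL x nL lower triangle)
    # that lie inside one of the n diagonal L x L blocks.
    m, keep, pos = n * L, [], 0
    for j in range(m):                # column of the lower triangle
        for i in range(j, m):         # row
            if i // L == j // L:      # same diagonal L x L block
                keep.append(pos)
            pos += 1
    return np.array(keep)

# sanity check: n L (L + 1) / 2 free entries remain
assert blockdiag_free_indices(n=3, L=2).size == 3 * 2 * 3 // 2
```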

8.1 Epilepsy data

In this epilepsy dataset (Thall and Vail, 1990), $n=59$ patients are involved in a clinical trial to investigate the effects of the anti-epileptic drug Progabide. The patients are randomly assigned to receive either the drug (Trt = 1) or a placebo (Trt = 0). The response $y_{ij}$ denotes the number of epileptic attacks patient $i$ had during the $j$th of four follow-up periods of two weeks each. Covariates include the log of the age of the patients (Age), the log of 1/4 of the baseline seizure count recorded over an eight-week period prior to the trial (Base) and Visit (coded as Visit$_1=-0.3$, Visit$_2=-0.1$, Visit$_3=0.1$ and Visit$_4=0.3$). Note that Age is centered about its mean. Consider $y_{ij}\sim\text{Pois}(\mu_{ij})$, where

\log\mu_{ij}=\beta_{0}+\beta_{\text{Base}}\text{Base}_{i}+\beta_{\text{Trt}}\text{Trt}_{i}+\beta_{\text{Age}}\text{Age}_{i}+\beta_{\text{Base}\times\text{Trt}}\text{Base}_{i}\times\text{Trt}_{i}+\beta_{\text{Visit}}\text{Visit}_{ij}+b_{i1}+b_{i2}\text{Visit}_{ij},

for $i=1,\dots,59$, $j=1,\dots,4$ (Breslow and Clayton, 1993).
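To fix ideas, the fixed- and random-effects design matrices for one patient under this model could be assembled as below; this is a hypothetical helper with placeholder covariate values, and the column ordering is our assumption.

```python
import numpy as np

VISIT = np.array([-0.3, -0.1, 0.1, 0.3])   # coding of the four visits

def design_epilepsy(base_i, trt_i, age_i):
    # X_i: intercept, Base, Trt, Age, Base x Trt, Visit   (4 x 6)
    X_i = np.column_stack([np.ones(4), np.full(4, base_i), np.full(4, trt_i),
                           np.full(4, age_i), np.full(4, base_i * trt_i), VISIT])
    # Z_i: random intercept and random Visit slope        (4 x 2)
    Z_i = np.column_stack([np.ones(4), VISIT])
    return X_i, Z_i
```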

Table 1 shows the results obtained from applying the variational algorithms to this data. The lower bounds are estimated using 1000 simulations in each case, and the mean and standard deviation from these simulations are reported. CSGVA produced an improvement in the estimate of the lower bound (3139.2) as compared to GVA (3138.3), and maximizing the IWLB led to further improvements. Using a larger $K$ of 20 or 100 resulted in minimal improvements compared with $K=5$. As this dataset is small, parallel computing is slower than serial computing for small $K$. This is because, even though the importance weights and gradients for $K$ samples are computed in parallel, the cost of sending and fetching data from the workers at each iteration dwarfs the cost of computation when $K$ is small. For $K=100$, parallel computing reduces the computation time by about half.

Table 1: Epilepsy data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$   time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    31   13.9       -      3138.3 (1.8)
CSGVA         1    39   16.2       -      3139.2 (1.5)
IW-CSGVA      5     1    2.5      6.1     3139.9 (0.7)
IW-CSGVA     20     1    6.9      8.1     3140.1 (0.4)
IW-CSGVA    100     1   33.5     16.0     3140.1 (0.3)

The estimated marginal posterior distributions of the global parameters are shown in Figure 1. The plots show that CSGVA (red) produces improved estimates of the posterior distribution as compared to GVA (blue), especially in estimating the posterior variance of the precision parameters $\omega_2$ and $\omega_3$. The posteriors estimated using IW-CSGVA for the different values of $K$ are very close. By using just $K=5$, we are able to obtain estimates that are virtually indistinguishable from those of MCMC.

Figure 1: Epilepsy data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

8.2 Madras data

These data come from the Madras longitudinal schizophrenia study (Diggle et al., 2002) for detecting a psychiatric symptom called “thought disorder”. Monthly records showing whether the symptom is present in a patient are kept for $n=86$ patients over 12 months. The response $y_{ij}$ is a binary indicator for the presence of the symptom. Covariates include the time in months since initial hospitalization ($t$), gender of the patient (Gender = 0 if male and 1 if female) and age of the patient (Age = 0 if the patient is younger than 20 years and 1 otherwise). Consider $y_{ij}\sim\text{Bern}(\mu_{ij})$ and

\text{logit}(\mu_{ij})=\beta_{0}+\beta_{\text{Age}}\text{Age}_{i}+\beta_{\text{Gender}}\text{Gender}_{i}+\beta_{t}t_{ij}+\beta_{\text{Age}\times t}\text{Age}_{i}\times t_{ij}+\beta_{\text{Gender}\times t}\text{Gender}_{i}\times t_{ij}+b_{i},

for $i=1,\dots,86$, $1\leq j\leq 12$.

The results in Table 2 are quite similar to those for the epilepsy data. CSGVA yields an improvement in the lower bound estimate as compared to GVA, and IW-CSGVA provides further improvements, which are evident with a $K$ as small as 5. Parallel computing halved the computation time for $K=100$ but did not yield any benefits for $K\in\{5,20\}$. From Figure 2, CSGVA and IW-CSGVA are again able to capture the posterior variance of the precision parameter $\omega_1$ better than GVA.

Table 2: Madras data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$   time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    25   13.1       -      -383.4 (1.4)
CSGVA         1    35   12.6       -      -383.1 (1.4)
IW-CSGVA      5     1    2.4      7.1     -382.5 (0.7)
IW-CSGVA     20     1    6.8      8.9     -382.4 (0.4)
IW-CSGVA    100     1   33.9     16.8     -382.3 (0.2)
Figure 2: Madras data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

8.3 Six cities data

In the six cities data (Fitzmaurice and Laird, 1993), $n=537$ children from Steubenville, Ohio, are involved in a longitudinal study to investigate the health effects of air pollution. Each child is examined yearly from age 7 to 10 and the response $y_{ij}$ is a binary indicator for wheezing. There are two covariates, $\text{Smoke}_i$ (a binary indicator for the smoking status of the mother of child $i$) and $\text{Age}_{ij}$ (age of child $i$ at time point $j$, centered at 9 years). Consider $y_{ij}\sim\text{Bern}(\mu_{ij})$, where

\text{logit}(\mu_{ij})=\beta_{0}+\beta_{\text{Smoke}}\text{Smoke}_{i}+\beta_{\text{Age}}\text{Age}_{ij}+\beta_{\text{Smoke}\times\text{Age}}\text{Smoke}_{i}\times\text{Age}_{ij}+b_{i},

for $i=1,\dots,537$, $j=1,\dots,4$.

Table 3: Six cities data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$    time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    26    60.3       -      -816.4 (4.0)
CSGVA         1    28    36.5       -      -816.0 (3.9)
IW-CSGVA      5     1     6.5     16.3     -812.6 (2.5)
IW-CSGVA     20     1    23.1     24.5     -811.0 (1.9)
IW-CSGVA    100     1   115.5     61.4     -809.8 (1.5)

From Table 3, CSGVA managed to achieve a higher lower bound than GVA in about half the runtime. As $K$ increases, IW-CSGVA produces tighter lower bounds for the log marginal likelihood. As in the previous two examples, parallel computing is beneficial only when $K=100$, cutting the runtime by about half. From Figure 3, there is slight overestimation of the posterior means of $\beta_0$ and $\omega_1$ by all the variational methods. However, CSGVA and IW-CSGVA are able to capture the posterior variance of these parameters much better than GVA, especially for $\omega_1$.

Figure 3: Six cities data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

9 Application to state space models

Here we consider the stochastic volatility model (SVM) widely used for modeling financial time series. Let each observation $y_i$, for $i=1,\dots,n$, be generated from a zero-mean Gaussian distribution whose error variance evolves stochastically over time. The unobserved log volatility $b_i$ is modeled using an autoregressive process of order one with Gaussian disturbances:

\begin{gathered}y_{i}|b_{i},\sigma,\kappa\sim N(0,{\rm e}^{\sigma b_{i}+\kappa}),\quad(i=1,\dots,n),\\ b_{i}|b_{i-1},\phi\sim N(\phi b_{i-1},1),\quad(i=2,\dots,n),\\ b_{1}|\phi\sim N(0,1/(1-\phi^{2})),\end{gathered}

where $\kappa\in\mathbb{R}$, $\sigma>0$ and $0<\phi<1$. Here, $y_i$ represents the mean-corrected return at time $i$ and the states $\{b_i\}$ come from a stationary process, with $b_1$ drawn from the stationary distribution. The parametrization of the SVM above is noncentered. The centered parametrization can be obtained by replacing $b_i$ by $(\tilde{b}_i-\kappa)/\sigma$. The performance of GVA and CSGVA is sensitive to the parametrization, both in terms of the rate of convergence and the local mode attained. For the datasets below, the noncentered parametrization was found to have better convergence properties. The sensitivity of the stochastic volatility model to parametrization when fitted using MCMC algorithms is well known in the literature. To “combine the best of different worlds”, Kastner and Frühwirth-Schnatter (2014) introduce a strategy that samples parameters from the centered and noncentered parametrizations alternately. Tan (2017) studies optimal partially noncentered parametrizations for Gaussian state space models fitted using EM, MCMC or variational algorithms.
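As an illustration of the generative model, data from the noncentered SVM can be simulated as in the sketch below (our own code; the function name and seed are arbitrary).

```python
import numpy as np

def simulate_svm(n, sigma, kappa, phi, seed=0):
    # b_1 ~ N(0, 1/(1 - phi^2)), b_i | b_{i-1} ~ N(phi b_{i-1}, 1),
    # y_i | b_i ~ N(0, exp(sigma b_i + kappa))
    rng = np.random.default_rng(seed)
    b = np.empty(n)
    b[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - phi ** 2))
    for i in range(1, n):
        b[i] = phi * b[i - 1] + rng.normal()
    y = rng.normal(0.0, np.exp(0.5 * (sigma * b + kappa)))  # sd = exp((sigma*b + kappa)/2)
    return y, b

y, b = simulate_svm(n=1000, sigma=0.3, kappa=-1.0, phi=0.95)
```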

We use the following transformations to map the constrained parameters to $\mathbb{R}$:

\alpha=\log(\exp(\sigma)-1),\quad\psi=\text{logit}(\phi).

This transformation for $\alpha$ works better than $\alpha=\log(\sigma)$, which more often leads to erratic fluctuations in the lower bound and convergence issues. The global variables are $\theta_G=[\alpha,\kappa,\psi]^T$ of dimension $G=3$ and the local variables are $\theta_L=[b_1,\dots,b_n]^T$ of length $n$. We assume normal priors for the global parameters, $\alpha\sim N(0,\sigma_\alpha^2)$, $\kappa\sim N(0,\sigma_\kappa^2)$ and $\psi\sim N(0,\sigma_\psi^2)$, where $\sigma_\alpha^2=\sigma_\kappa^2=\sigma_\psi^2=10$. The joint density can be written as

p(y,\theta)=p(\alpha)p(\kappa)p(\psi)p(b_{1}|\psi)\bigg\{\prod_{i=2}^{n}p(b_{i}|b_{i-1},\psi)\bigg\}\bigg\{\prod_{i=1}^{n}p(y_{i}|b_{i},\alpha,\kappa)\bigg\}.

The log joint density is

\log p(y,\theta)=-\frac{n\kappa}{2}-\frac{\sigma}{2}\sum_{i=1}^{n}b_{i}-\frac{1}{2}\sum_{i=1}^{n}y_{i}^{2}{\rm e}^{-\sigma b_{i}-\kappa}
\quad-\frac{1}{2}\sum_{i=2}^{n}(b_{i}-\phi b_{i-1})^{2}-\frac{1}{2}b_{1}^{2}(1-\phi^{2})
\quad+\frac{1}{2}\log(1-\phi^{2})-\frac{\alpha^{2}}{2\sigma_{\alpha}^{2}}-\frac{\kappa^{2}}{2\sigma_{\kappa}^{2}}-\frac{\psi^{2}}{2\sigma_{\psi}^{2}}+c,

where $\phi=\exp(\psi)/\{1+\exp(\psi)\}$, $\sigma=\log\{\exp(\alpha)+1\}$ and $c$ is a constant independent of $\theta$. The gradient $\nabla_\theta\log p(y,\theta)$ is given in Appendix C. For this model, $b_i$ is conditionally independent of $b_j$ in the posterior given $\theta_G$ if $|i-j|>1$. Thus, the sparsity structure imposed on $\Omega_2$ and $C_2$ is

\Omega_{2}=\begin{bmatrix}\Omega_{2,11}&\Omega_{2,12}&0&\dots&0\\ \Omega_{2,21}&\Omega_{2,22}&\Omega_{2,23}&\dots&0\\ 0&\Omega_{2,32}&\Omega_{2,33}&\dots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\dots&\Omega_{2,nn}\end{bmatrix},
C_{2}=\begin{bmatrix}C_{2,11}&0&0&\dots&0\\ C_{2,21}&C_{2,22}&0&\dots&0\\ 0&C_{2,32}&C_{2,33}&\dots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\dots&C_{2,nn}\end{bmatrix}.

The number of nonzero elements in $v(C_2^*)$ is $2n-1$. Setting $f_i=0$ and $F_{ij}=0$ for all $i\in\mathcal{I}$ and all $j$, where $\mathcal{I}$ denotes the set of indices in $v(C_2^*)$ which are fixed at zero, the number of variational parameters to be optimized is reduced from $n(n+1)/2$ to $2n-1$ for $f$, and from $n(n+1)G/2$ to $(2n-1)G$ for $F$.
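Returning to the log joint density of the SVM given above, a minimal sketch evaluating it on the unconstrained scale is shown below (our own illustration with hypothetical argument names, not the authors' implementation; the constant $c$ is dropped).

```python
import numpy as np

def log_joint_svm(y, b, alpha, kappa, psi, s2_alpha=10.0, s2_kappa=10.0, s2_psi=10.0):
    sigma = np.log1p(np.exp(alpha))        # softplus: sigma = log(exp(alpha) + 1) > 0
    phi = 1.0 / (1.0 + np.exp(-psi))       # logistic: 0 < phi < 1
    n = len(y)
    return (-0.5 * n * kappa
            - 0.5 * sigma * np.sum(b)
            - 0.5 * np.sum(y**2 * np.exp(-sigma * b - kappa))
            - 0.5 * np.sum((b[1:] - phi * b[:-1])**2)      # AR(1) transitions
            - 0.5 * b[0]**2 * (1.0 - phi**2)               # stationary initial state
            + 0.5 * np.log(1.0 - phi**2)
            - alpha**2 / (2.0 * s2_alpha)
            - kappa**2 / (2.0 * s2_kappa)
            - psi**2 / (2.0 * s2_psi))
```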

9.1 GBP/USD exchange rate data

These data contain 946 observations of the exchange rate of the US Dollar (USD) against the British Pound (GBP), recorded daily from 1 October 1981 to 28 June 1985, and are available as the “Garch” dataset in the R package Ecdat. The mean-corrected responses $\{y_t\,|\,t=1,\dots,n\}$ are computed from the exchange rates $\{r_t\,|\,t=0,\dots,n\}$ as

y_{t}=100\left\{\log\left(\frac{r_{t}}{r_{t-1}}\right)-\frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{r_{i}}{r_{i-1}}\right)\right\},

where $n=945$.
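This preprocessing step is straightforward to reproduce; a minimal sketch (our own, assuming the exchange rates are supplied as a numeric array) follows.

```python
import numpy as np

def mean_corrected_returns(rates):
    # y_t = 100 * { log(r_t / r_{t-1}) - mean of the log returns }
    log_ret = np.diff(np.log(np.asarray(rates, dtype=float)))
    return 100.0 * (log_ret - log_ret.mean())
```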

For this dataset, CSGVA failed to achieve a higher lower bound than GVA when it was initialized using $\lambda=0$. Hence we initialize CSGVA using the fit from GVA instead, by using the association discussed in Section 5. With this informative starting point, CSGVA converged in 16000 iterations and managed to improve upon the GVA fit, attaining a higher lower bound. IW-CSGVA led to further improvements with increasing $K$. As this dataset contains a large number of observations, with correspondingly more variational parameters to be optimized, the computation is more intensive and parallel computing becomes very useful, reducing the computation time by factors of 1.8, 2.9 and 4.2 for $K=5$, 20 and 100 respectively.

Table 4: GBP data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$    time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    61   239.7       -      -138.2 (1.3)
CSGVA         1    16    58.6       -      -137.8 (1.3)
IW-CSGVA      5     1    18.3     10.2     -137.4 (1.0)
IW-CSGVA     20     1    71.2     24.4     -137.0 (0.5)
IW-CSGVA    100     1   355.3     84.3     -136.8 (0.4)

Figure 4 shows the estimated marginal posteriors of the global parameters. CSGVA provides significant improvements in estimating the posterior variance of $\alpha$ and $\psi$ as compared to GVA. With the application of IW-CSGVA, we are able to obtain posterior estimates that are quite close to those of MCMC even with a small $K$.

Figure 4: GBP data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

Figure 5 shows the estimated marginal posteriors of the latent states $\{b_i\}$, summarized using the mean (solid line) and one standard deviation from the mean (dotted line), as estimated by MCMC (black) and IW-CSGVA ($K=5$, purple).

Figure 5: GBP data. Top: Posterior means (solid line) of the latent states and one standard deviation from the mean (dotted line) estimated using MCMC (black) and IW-CSGVA ($K=5$, purple). Bottom: Boxplots of $\text{mean}_{\text{MCMC}}-\text{mean}_{\text{VA}}$ and $\text{sd}_{\text{MCMC}}-\text{sd}_{\text{VA}}$.

For IW-CSGVA, the means and standard deviations are estimated based on 2000 samples, generated by drawing $\theta_G$ from $q(\theta_G)$ followed by $\theta_L$ from $q(\theta_L|\theta_G)$. For MCMC, estimation was based on 5000 samples. IW-CSGVA estimated the means quite accurately (with a little overestimation), but the standard deviations are underestimated when compared to MCMC. The boxplots show the differences between the means and standard deviations estimated by MCMC and those estimated by IW-CSGVA ($K=5$) and GVA. We can see that IW-CSGVA estimated both the means and standard deviations more accurately than GVA.

9.2 New York stock exchange data

Next we consider 2000 observations of the returns of the New York Stock Exchange (NYSE) from February 2, 1984 to December 31, 1991. These data are available as the dataset “nyse” in the R package astsa. We take 100 times the mean-corrected returns as the responses $\{y_i\}$.

From Table 5, CSGVA was able to attain a higher lower bound than GVA when initialized in the standard manner using $\lambda=0$. Applying IW-CSGVA led to further improvements as $K$ increases. For this massive dataset, parallel computing yields significant reductions in computation time, by factors of 2.9, 4.5 and 5.5 corresponding to $K=5$, 20 and 100 respectively.

Table 5: NYSE data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$     time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    43    679.0       -      -570.8 (1.8)
CSGVA         1    49    749.2       -      -570.7 (2.0)
IW-CSGVA      5     1     76.0     26.1     -569.4 (1.1)
IW-CSGVA     20     1    305.0     67.9     -569.0 (0.7)
IW-CSGVA    100     1   1503.0    274.0     -568.7 (0.4)

Figure 6 shows that the marginal posteriors estimated using CSGVA are quite close to those of MCMC, while GVA underestimates the posterior variance of $\alpha$ and $\psi$ quite severely. Posteriors estimated by IW-CSGVA are virtually indistinguishable from those of MCMC.

Figure 6: NYSE data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

From Figure 7, we can see that the marginal posteriors of the latent states are also estimated very well using IW-CSGVA ($K=5$), and there is no systematic underestimation of the posterior variance, unlike in the previous example. GVA captures the posterior means very well but does not perform as well as IW-CSGVA in estimating the posterior variance.

Figure 7: NYSE data. Top: Posterior means (solid line) of the latent states and one standard deviation from the mean (dotted line) estimated using MCMC (black) and IW-CSGVA ($K=5$, purple). Bottom: Boxplots of $\text{mean}_{\text{MCMC}}-\text{mean}_{\text{VA}}$ and $\text{sd}_{\text{MCMC}}-\text{sd}_{\text{VA}}$.

10 Conclusion

In this article, we have proposed a Gaussian variational approximation for hierarchical models that adopts a conditional structure $q(\theta)=q(\theta_G)q(\theta_L|\theta_G)$. The dependence of the local variables $\theta_L$ on the global variables $\theta_G$ is then captured using a linear approximation. This structure is very useful when there are global scale parameters in $\theta_G$ which help to determine the scale of the local variables in the conditional posterior of $\theta_L|\theta_G$. We further demonstrate how CSGVA can be improved by maximizing the importance weighted lower bound. In our experiments, a $K$ as small as 5 can lead to significant improvements in the variational approximation with just a short run. Moreover, for massive datasets, computation time can be further reduced through parallel computing. Our experiments indicate that CSGVA coupled with importance weighting is particularly useful in improving the estimation of the posterior variance of the precision parameters $\omega$ in GLMMs, and of the persistence $\phi$ and volatility $\sigma$ of the log-variance in SVMs.

References

  • Archer et al. (2016) Archer E, Park IM, Buesing L, Cunningham J, Paninski L (2016) Black box variational inference for state space models. arXiv:1511.07367
  • Blei et al. (2017) Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: A review for statisticians. Journal of the American Statistical Association 112:859–877
  • Breslow and Clayton (1993) Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88:9–25
  • Burda et al. (2016) Burda Y, Grosse R, Salakhutdinov R (2016) Importance weighted autoencoders. In: Proceedings of the 4th International Conference on Learning Representations (ICLR)
  • Diggle et al. (2002) Diggle PJ, Heagerty P, Liang KY, Zeger SL (2002) The analysis of Longitudinal Data, 2nd edn. Oxford University Press, Oxford
  • Dinh et al. (2017) Dinh L, Sohl-Dickstein J, Bengio S (2017) Density estimation using real NVP. In: Proceedings of the 5th International Conference on Learning Representations (ICLR)
  • Domke and Sheldon (2018) Domke J, Sheldon DR (2018) Importance weighting and variational inference. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in Neural Information Processing Systems 31, Curran Associates, Inc., pp 4470–4479
  • Fitzmaurice and Laird (1993) Fitzmaurice GM, Laird NM (1993) A likelihood-based method for analysing longitudinal binary responses. Biometrika 80:141–151
  • Germain et al. (2015) Germain M, Gregor K, Murray I, Larochelle H (2015) MADE: Masked autoencoder for distribution estimation. In: Bach F, Blei D (eds) Proceedings of the 32nd International Conference on Machine Learning, PMLR, Lille, France, Proceedings of Machine Learning Research, vol 37, pp 881–889
  • Guo et al. (2016) Guo F, Wang X, Broderick T, Dunson DB (2016) Boosting variational inference. arXiv: 1611.05559
  • Han et al. (2016) Han S, Liao X, Dunson DB, Carin LC (2016) Variational Gaussian copula inference. In: Gretton A, Robert CC (eds) Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, vol 51, pp 829–838
  • Hoffman and Blei (2015) Hoffman M, Blei D (2015) Stochastic structured variational inference. In: Lebanon G, Vishwanathan S (eds) Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, vol 38, pp 361–369
  • Huszár (2017) Huszár F (2017) Variational inference using implicit distributions. arXiv:1702.08235
  • Jaakkola and Jordan (1998) Jaakkola TS, Jordan MI (1998) Improving the mean field approximation via the use of mixture distributions, Springer Netherlands, Dordrecht, pp 163–173
  • Kastner and Frühwirth-Schnatter (2014) Kastner G, Frühwirth-Schnatter S (2014) Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics and Data Analysis 76:408–423
  • Kingma and Ba (2015) Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR)
  • Kingma and Welling (2014) Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR)
  • Kingma et al. (2016) Kingma DP, Salimans T, Jozefowicz R, Chen X, Sutskever I, Welling M (2016) Improved variational inference with inverse autoregressive flow. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pp 4743–4751
  • Li and Turner (2016) Li Y, Turner RE (2016) Rényi divergence variational inference. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, Curran Associates Inc., USA, NIPS’16, pp 1081–1089
  • Maddison et al. (2017) Maddison CJ, Lawson J, Tucker G, Heess N, Norouzi M, Mnih A, Doucet A, Teh Y (2017) Filtering variational objectives. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp 6573–6583
  • Magnus and Neudecker (1980) Magnus JR, Neudecker H (1980) The elimination matrix: Some lemmas and applications. SIAM Journal on Algebraic Discrete Methods 1:422–449
  • Magnus and Neudecker (1999) Magnus JR, Neudecker H (1999) Matrix differential calculus with applications in statistics and econometrics, 3rd edn. John Wiley & Sons, New York
  • Miller et al. (2016) Miller AC, Foti N, Adams RP (2016) Variational boosting: Iteratively refining posterior approximations. arXiv: 1611.06585
  • Minka (2005) Minka T (2005) Divergence measures and message passing. Tech. rep.
  • Ormerod and Wand (2010) Ormerod JT, Wand MP (2010) Explaining variational approximations. The American Statistician 64:140–153
  • Papaspiliopoulos et al. (2003) Papaspiliopoulos O, Roberts GO, Sköld M (2003) Non-centered parameterisations for hierarchical models and data augmentation. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian Statistics 7, Oxford University Press, New York, pp 307–326
  • Papaspiliopoulos et al. (2007) Papaspiliopoulos O, Roberts GO, Sköld M (2007) A general framework for the parametrization of hierarchical models. Statist Sci 22:59–73
  • Ranganath et al. (2016) Ranganath R, Tran D, Blei DM (2016) Hierarchical variational models. In: Balcan MF, Weinberger KQ (eds) Proceedings of The 33rd International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, vol 37, pp 324–333
  • Regli and Silva (2018) Regli JB, Silva R (2018) Alpha-Beta divergence for variational inference. arXiv: 1805.01045
  • Rényi (1961) Rényi A (1961) On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, University of California Press, Berkeley, Calif., pp 547–561
  • Rezende and Mohamed (2015) Rezende DJ, Mohamed S (2015) Variational inference with normalizing flows. In: Bach F, Blei D (eds) Proceedings of The 32nd International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, vol 37, pp 1530–1538
  • Rezende et al. (2014) Rezende DJ, Mohamed S, Wierstra D (2014) Stochastic backpropagation and approximate inference in deep generative models. In: Xing EP, Jebara T (eds) Proceedings of The 31st International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, vol 32, pp 1278–1286
  • Roeder et al. (2017) Roeder G, Wu Y, Duvenaud DK (2017) Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In: Guyon I, Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30
  • Rothman et al. (2010) Rothman AJ, Levina E, Zhu J (2010) A new approach to Cholesky-based covariance regularization in high dimensions. Biometrika 97:539–550
  • Salimans and Knowles (2013) Salimans T, Knowles DA (2013) Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis 8:837–882
  • Smith et al. (2019) Smith MS, Loaiza-Maya R, Nott DJ (2019) High-dimensional copula variational approximation through transformation. arXiv:1904.07495
  • Spall (2003) Spall JC (2003) Introduction to stochastic search and optimization: estimation, simulation and control. Wiley, New Jersey
  • Spantini et al. (2018) Spantini A, Bigoni D, Marzouk Y (2018) Inference via low-dimensional couplings. Journal of Machine Learning Research 19:1–71
  • Tan (2017) Tan LSL (2017) Efficient data augmentation techniques for Gaussian state space models. arXiv:1712.08887
  • Tan (2018) Tan LSL (2018) Use of model reparametrization to improve variational Bayes. arXiv:1805.07267
  • Tan and Nott (2013) Tan LSL, Nott DJ (2013) Variational inference for generalized linear mixed models using partially non-centered parametrizations. Statistical Science 28:168–188
  • Tan and Nott (2018) Tan LSL, Nott DJ (2018) Gaussian variational approximation with sparse precision matrices. Statistics and Computing 28:259–275
  • Thall and Vail (1990) Thall PF, Vail SC (1990) Some covariance models for longitudinal count data with overdispersion. Biometrics 46:657–671
  • Titsias and Lázaro-Gredilla (2014) Titsias M, Lázaro-Gredilla M (2014) Doubly stochastic variational Bayes for non-conjugate inference. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp 1971–1979
  • Tran et al. (2015) Tran D, Blei DM, Airoldi EM (2015) Copula variational inference. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp 3564–3572
  • Tucker et al. (2018) Tucker G, Lawson D, Gu S, Maddison CJ (2018) Doubly reparametrized gradient estimators for Monte Carlo objectives. arXiv: 1810.04152
  • van Erven and Harremos (2014) van Erven T, Harremos P (2014) Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60:3797–3820
  • Yang et al. (2019) Yang Y, Pati D, Bhattacharya A (2019) $\alpha$-variational inference with statistical guarantees. Annals of Statistics, to appear

Appendix A Derivation of stochastic gradient

We have

r_{\lambda}(s)=\begin{bmatrix}\theta_{G}\\ \theta_{L}\end{bmatrix}=\begin{bmatrix}\mu_{1}+C_{1}^{-T}s_{1}\\ d+C_{2}^{-T}(s_{2}-DC_{1}^{-T}s_{1})\end{bmatrix},

where $v(C_2^*)=f+F(\mu_1+C_1^{-T}s_1)$. Differentiating $r_\lambda(s)$ with respect to $\lambda$, $\nabla_\lambda r_\lambda(s)$ is given by

\begin{bmatrix}\nabla_{\mu_{1}}\theta_{G}&\nabla_{\mu_{1}}\theta_{L}\\ \nabla_{v(C_{1}^{*})}\theta_{G}&\nabla_{v(C_{1}^{*})}\theta_{L}\\ \nabla_{d}\theta_{G}&\nabla_{d}\theta_{L}\\ \nabla_{\text{vec}(D)}\theta_{G}&\nabla_{\text{vec}(D)}\theta_{L}\\ \nabla_{f}\theta_{G}&\nabla_{f}\theta_{L}\\ \nabla_{\text{vec}(F)}\theta_{G}&\nabla_{\text{vec}(F)}\theta_{L}\end{bmatrix}.

Since $\theta_G$ does not depend on $d$, $D$, $f$ and $F$, we have

\begin{gathered}\nabla_{d}\theta_{G}=0_{nL\times G},\quad\nabla_{\text{vec}(D)}\theta_{G}=0_{nLG\times G},\\ \nabla_{f}\theta_{G}=0_{nL(nL+1)/2\times G},\quad\nabla_{\text{vec}(F)}\theta_{G}=0_{nLG(nL+1)/2\times G}.\end{gathered}

It is easy to see that $\nabla_{\mu_1}\theta_G=I_G$ and $\nabla_d\theta_L=I_{nL}$. The rest of the terms are derived as follows.

Differentiating $\theta_G$ with respect to $v(C_1^*)$,

\text{d}\theta_{G}=-C_{1}^{-T}\text{d}(C_{1}^{T})C_{1}^{-T}s_{1}
=-(s_{1}^{T}C_{1}^{-1}\otimes C_{1}^{-T})K_{G}E_{G}^{T}D_{1}^{*}\,\text{d}v(C_{1}^{*})
=-(C_{1}^{-T}\otimes s_{1}^{T}C_{1}^{-1})E_{G}^{T}D_{1}^{*}\,\text{d}v(C_{1}^{*}).
\therefore\;\nabla_{v(C_{1}^{*})}\theta_{G}=-D_{1}^{*}E_{G}(C_{1}^{-1}\otimes C_{1}^{-T}s_{1}).

Differentiating $\theta_L$ with respect to $f$,

\text{d}\theta_{L}=-C_{2}^{-T}\text{d}(C_{2}^{T})C_{2}^{-T}(s_{2}-DC_{1}^{-T}s_{1})
=-\{(s_{2}-DC_{1}^{-T}s_{1})^{T}C_{2}^{-1}\otimes C_{2}^{-T}\}K_{nL}E_{nL}^{T}D_{2}^{*}\,\text{d}f.
\therefore\;\nabla_{f}\theta_{L}=-D_{2}^{*}E_{nL}\{C_{2}^{-1}\otimes C_{2}^{-T}(s_{2}-DC_{1}^{-T}s_{1})\}.

Differentiating $\theta_L$ with respect to $F$,

\text{d}\theta_{L}=(\nabla_{f}\theta_{L})^{T}\,\text{d}F\,\theta_{G}=\{\theta_{G}^{T}\otimes(\nabla_{f}\theta_{L})^{T}\}\,\text{d}\text{vec}(F).
\therefore\;\nabla_{\text{vec}(F)}\theta_{L}=\theta_{G}\otimes\nabla_{f}\theta_{L}.

Differentiating $\theta_L$ with respect to $D$,

\text{d}\theta_{L}=-C_{2}^{-T}\,\text{d}D\,C_{1}^{-T}s_{1}=-(s_{1}^{T}C_{1}^{-1}\otimes C_{2}^{-T})\,\text{d}\text{vec}(D).
\therefore\;\nabla_{\text{vec}(D)}\theta_{L}=-(C_{1}^{-T}s_{1}\otimes C_{2}^{-1}).

Differentiating $\theta_L$ with respect to $\mu_1$,

\text{d}\theta_{L}=(\nabla_{f}\theta_{L})^{T}F\,\text{d}\mu_{1}.
\therefore\;\nabla_{\mu_{1}}\theta_{L}=F^{T}(\nabla_{f}\theta_{L}).

Differentiating $\theta_L$ with respect to $v(C_1^*)$,

\text{d}\theta_{L}=-C_{2}^{-T}\text{d}(C_{2}^{T})C_{2}^{-T}(s_{2}-DC_{1}^{-T}s_{1})-C_{2}^{-T}D\,\text{d}(C_{1}^{-T})s_{1}
=(\nabla_{f}\theta_{L})^{T}F\,\text{d}(C_{1}^{-T})s_{1}-C_{2}^{-T}D\,\text{d}(C_{1}^{-T})s_{1}
=\{(\nabla_{f}\theta_{L})^{T}F-C_{2}^{-T}D\}(\nabla_{v(C_{1}^{*})}\theta_{G})^{T}\,\text{d}v(C_{1}^{*}).
\therefore\;\nabla_{v(C_{1}^{*})}\theta_{L}=\nabla_{v(C_{1}^{*})}\theta_{G}\{F^{T}\nabla_{f}\theta_{L}-D^{T}C_{2}^{-1}\}
=\nabla_{v(C_{1}^{*})}\theta_{G}\{\nabla_{\mu_{1}}\theta_{L}-D^{T}C_{2}^{-1}\}.

Since $s_1=C_1^T(\theta_G-\mu_1)$ and $s_2=C_2^T(\theta_L-\mu_2)$, we have

\log q_{\lambda}(\theta)=\log q(\theta_{G})+\log q(\theta_{L}|\theta_{G})
=-\frac{G}{2}\log(2\pi)+\log|C_{1}|-\frac{1}{2}(\theta_{G}-\mu_{1})^{T}C_{1}C_{1}^{T}(\theta_{G}-\mu_{1})
\quad-\frac{nL}{2}\log(2\pi)+\log|C_{2}|-\frac{1}{2}(\theta_{L}-\mu_{2})^{T}C_{2}C_{2}^{T}(\theta_{L}-\mu_{2})
=-\frac{nL+G}{2}\log(2\pi)+\log|C_{1}C_{2}|-\frac{1}{2}s^{T}s.

As $\mu_2=d+C_2^{-T}D(\mu_1-\theta_G)$ and $v(C_2^*)=f+F\theta_G$, differentiating $\log q_\lambda(\theta)$ with respect to $\theta_G$,

\text{d}\log q_{\lambda}(\theta)
=-(\theta_{G}-\mu_{1})^{T}C_{1}C_{1}^{T}\text{d}\theta_{G}-(\theta_{L}-\mu_{2})^{T}C_{2}C_{2}^{T}(-\text{d}\mu_{2})
\quad-(\theta_{L}-\mu_{2})^{T}\text{d}C_{2}\,s_{2}+\text{tr}(C_{2}^{-1}\text{d}C_{2})
=-s_{1}^{T}C_{1}^{T}\text{d}\theta_{G}+s_{2}^{T}C_{2}^{T}\{-C_{2}^{-T}D\,\text{d}\theta_{G}+\text{d}(C_{2}^{-T})D(\mu_{1}-\theta_{G})\}
\quad-\text{vec}(C_{2}^{-T}s_{2}s_{2}^{T})^{T}\text{d}\text{vec}(C_{2})+\text{vec}(C_{2}^{-T})^{T}\text{d}\text{vec}(C_{2})
=\text{vec}(C_{2}^{-T}-\{C_{2}^{-T}s_{2}+(\mu_{2}-d)\}s_{2}^{T})^{T}\text{d}\text{vec}(C_{2})
\quad-s_{1}^{T}C_{1}^{T}\text{d}\theta_{G}-s_{2}^{T}D\,\text{d}\theta_{G}
=\text{vec}(C_{2}^{-T}-(\theta_{L}-d)s_{2}^{T})^{T}E_{nL}^{T}D_{2}^{*}F\,\text{d}\theta_{G}
\quad-s_{1}^{T}C_{1}^{T}\text{d}\theta_{G}-s_{2}^{T}D\,\text{d}\theta_{G}.

Therefore

\nabla_{\theta_{G}}\log q_{\lambda}(\theta)=F^{T}D_{2}^{*}v(C_{2}^{-T}-(\theta_{L}-d)s_{2}^{T})-C_{1}s_{1}-D^{T}s_{2}.

Note that $D_2^*v(C_2^{-T})=v(I_{nL})$ as $C_2^{-T}$ is upper triangular and $v(C_2^{-T})$ only retains the diagonal elements of $C_2^{-T}$.

Differentiating $\log q_\lambda(\theta)$ with respect to $\theta_L$,

\text{d}\log q_{\lambda}(\theta)=-(\theta_{L}-\mu_{2})^{T}C_{2}C_{2}^{T}\text{d}\theta_{L}=-s_{2}^{T}C_{2}^{T}\text{d}\theta_{L}.
\therefore\;\nabla_{\theta_{L}}\log q_{\lambda}(\theta)=-C_{2}s_{2}.

Appendix B Gradients for generalized linear mixed models

Since $\theta=[\beta^T,\omega^T,\tilde{b}_1^T,\dots,\tilde{b}_n^T]^T$, we require

\nabla_{\theta}\log p(y,\theta)=[\nabla_{\beta}\log p(y,\theta),\nabla_{\omega}\log p(y,\theta),\nabla_{\tilde{b}_{1}}\log p(y,\theta),\dots,\nabla_{\tilde{b}_{n}}\log p(y,\theta)]^{T}.

For the centered parametrization, the components in $\nabla_\theta\log p(y,\theta)$ are given below. Note that $\beta=[\beta_{RG_1}^T,\beta_{G_2}^T]^T$.

\nabla_{\beta_{G_{2}}}\log p(y,\theta)=\sum_{i=1}^{n}{X_{i}^{G_{2}}}^{T}\{y_{i}-h^{\prime}(\eta_{i})\}-\beta_{G_{2}}/\sigma_{\beta}^{2}.
\nabla_{\beta_{RG_{1}}}\log p(y,\theta)=\sum_{i=1}^{n}C_{i}^{T}WW^{T}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})-\beta_{RG_{1}}/\sigma_{\beta}^{2}.

Differentiating $\log p(y,\theta)$ with respect to $\omega$,

\text{d}\log p(y,\theta)=-\sum_{i=1}^{n}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})^{T}\text{d}W\,W^{T}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})+n\,\text{tr}(W^{-1}\text{d}W)-\omega^{T}\text{d}\omega/\sigma_{\omega}^{2}
=\text{vec}\bigg\{-\sum_{i=1}^{n}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})^{T}W+nW^{-T}\bigg\}^{T}E_{L}^{T}D_{L}^{*}\,\text{d}\omega-\omega^{T}\text{d}\omega/\sigma_{\omega}^{2},

where $\text{d}v(W)=D_L^*\,\text{d}\omega$ and $D_L^*=\text{diag}\{v(\text{dg}(W)+{\bf 1}_L{\bf 1}_L^T-I_L)\}$. Hence

\nabla_{\omega}\log p(y,\theta)=-D^{*}_{L}\sum_{i=1}^{n}v\{(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})^{T}W\}+nv(I_{L})-\omega/\sigma_{\omega}^{2}.

Note that $D_L^*v(W^{-T})=v(I_L)$ because $W^{-T}$ is upper triangular and $v(W^{-T})$ only retains the diagonal elements.

\nabla_{\tilde{b}_{i}}\log p(y,\theta)=Z_{i}^{T}\{y_{i}-h^{\prime}(\eta_{i})\}-WW^{T}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}}).

Appendix C Gradients for state space models

Since $\theta=[\alpha,\kappa,\psi,b_1,\dots,b_n]^T$, we require

\nabla_{\theta}\log p(y,\theta)=[\nabla_{\alpha}\log p(y,\theta),\nabla_{\kappa}\log p(y,\theta),\nabla_{\psi}\log p(y,\theta),\nabla_{b_{1}}\log p(y,\theta),\dots,\nabla_{b_{n}}\log p(y,\theta)]^{T}.

The components in $\nabla_\theta\log p(y,\theta)$ are given below.

\nabla_{\alpha}\log p(y,\theta)=\frac{1}{2}\sum_{i=1}^{n}(b_{i}y_{i}^{2}{\rm e}^{-\sigma b_{i}-\kappa}-b_{i})(1-{\rm e}^{-\sigma})-\frac{\alpha}{\sigma_{\alpha}^{2}}.
\nabla_{\kappa}\log p(y,\theta)=\frac{1}{2}\bigg(\sum_{i=1}^{n}y_{i}^{2}{\rm e}^{-\sigma b_{i}-\kappa}-n\bigg)-\kappa/\sigma_{\kappa}^{2}.
\nabla_{\psi}\log p(y,\theta)=\bigg\{\sum_{i=2}^{n}(b_{i}-\phi b_{i-1})b_{i-1}+b_{1}^{2}\phi-\frac{\phi}{1-\phi^{2}}\bigg\}\phi(1-\phi)-\psi/\sigma_{\psi}^{2}.
\nabla_{b_{1}}\log p(y,\theta)=\frac{\sigma}{2}(y_{1}^{2}{\rm e}^{-\sigma b_{1}-\kappa}-1)+\phi(b_{2}-\phi b_{1})-b_{1}(1-\phi^{2}).

For $2\leq i\leq n-1$,

\nabla_{b_{i}}\log p(y,\theta)=\frac{\sigma}{2}(y_{i}^{2}{\rm e}^{-\sigma b_{i}-\kappa}-1)+\phi(b_{i+1}-\phi b_{i})-(b_{i}-\phi b_{i-1}).
\nabla_{b_{n}}\log p(y,\theta)=\frac{\sigma}{2}(y_{n}^{2}{\rm e}^{-\sigma b_{n}-\kappa}-1)-(b_{n}-\phi b_{n-1}).