
Linda S. L. Tan
Department of Statistics and Applied Probability, National University of Singapore
Tel.: +65-6516-8473, Fax: +65-6872-3939
E-mail: statsll@nus.edu.sg

Aishwarya Bhaskaran
Department of Statistics and Applied Probability, National University of Singapore
Tel.: +65-6601-6229, Fax: +65-6872-3939
E-mail: staai@nus.edu.sg

David J. Nott
Department of Statistics and Applied Probability, National University of Singapore
Tel.: +65-6516-2744, Fax: +65-6872-3939
E-mail: standj@nus.edu.sg

Conditionally structured variational Gaussian approximation with importance weights
(Linda Tan and Aishwarya Bhaskaran are supported by the start-up grant R-155-000-190-133.)

Linda S. L. Tan    Aishwarya Bhaskaran    David J. Nott
Abstract

We develop flexible methods of deriving variational inference for models with complex latent variable structure. By splitting the variables in these models into “global” parameters and “local” latent variables, we define a class of variational approximations that exploit this partitioning and go beyond Gaussian variational approximation. This approximation is motivated by the fact that in many hierarchical models, there are global variance parameters which determine the scale of local latent variables in their posterior conditional on the global parameters. We also consider parsimonious parametrizations by using conditional independence structure, and improved estimation of the log marginal likelihood and variational density using importance weights. These methods are shown to improve significantly on Gaussian variational approximation methods for a similar computational cost. Application of the methodology is illustrated using generalized linear mixed models and state space models.

Keywords: Gaussian variational approximation · Sparse precision matrix · Stochastic variational inference · Importance weighted lower bound · Rényi’s divergence

1 Introduction

In many modern statistical applications, it is necessary to model complex dependent data. In these situations, models which employ observation specific latent variables such as random effects and state space models are widely used because of their flexibility, and Bayesian approaches dealing naturally with the hierarchical structure are attractive in principle. However, incorporating observation specific latent variables leads to a parameter dimension increasing with sample size, and standard Bayesian computational methods can be challenging to implement in very high-dimensional settings. For this reason, approximate inference methods are attractive for these models, both in exploratory settings where many models need to be fitted quickly, as well as in applications involving large datasets where exact methods are infeasible. One of the most common approximate inference paradigms is variational inference (Ormerod and Wand, 2010; Blei et al., 2017), which is the approach considered here.

Our main contribution is to consider partitioning the unknowns in a local latent variable model into “global” parameters and “local” latent variables, and to suggest ways of structuring the dependence in a variational approximation that match the specification of these models. We go beyond standard Gaussian approximations by defining the variational approximation sequentially, through a marginal density for the global parameters and a conditional density for local parameters given global parameters. Each term in our approximation is Gaussian, but we allow the conditional covariance matrix for the local parameters to depend on the global parameters, which leads to an approximation that is not jointly Gaussian. We are particularly interested in improved inference on global variance and dependence parameters which determine the scale and dependence structure of local latent variables. With this objective, we suggest a parametrization of our conditional approximation to the local variables that is well-motivated and respects the exact conditional independence structure in the true posterior distribution. Our approximations are parsimonious in terms of the number of required variational parameters, which is important since a high-dimensional variational optimization is computationally burdensome. The methods suggested improve on Gaussian variational approximation methods for a similar computational cost. Besides defining a novel and useful variational family appropriate to local latent variable models, we also employ importance weighted variational inference methods (Burda et al., 2016; Domke and Sheldon, 2018) to further improve the quality of inference, and elaborate further on the connections between this approach and the use of Rényi’s divergence within the variational optimization (Li and Turner, 2016; Regli and Silva, 2018; Yang et al., 2019).

Our method is a contribution to the literature on the development of flexible variational families, and there are many interesting existing methods for this task. One fruitful approach is based on normalizing flows (Rezende and Mohamed, 2015), where a variational family is defined using an invertible transformation of a random vector with some known density function. To be useful, the transformation should have an easily computed Jacobian determinant. In the original work of Rezende and Mohamed (2015), compositions of simple flows called radial and planar flows were considered. Later authors have suggested alternatives, such as autoregressive flows (Germain et al., 2015), inverse autoregressive flows (Kingma et al., 2016), and real-valued non-volume preserving transformations (Dinh et al., 2017), among others. Spantini et al. (2018) gives a theoretical framework connecting Markov properties of a target posterior distribution to representations involving transport maps, with normalizing flows being one way to parametrize such mappings. The variational family we consider here can be thought of as a simple autoregressive flow, but carefully constructed to preserve the conditional independence structure in the true posterior and to achieve parsimony in the representation of dependence between local latent variables and global scale parameters. Our work is also related to the hierarchically structured approximations considered in Salimans and Knowles (2013, Section 7.1); these authors also consider other flexible approximations based on mixture models, and a variety of innovative numerical approaches to the variational optimization. Hoffman and Blei (2015) propose an approach called structured stochastic variational inference which is applicable in conditionally conjugate models. Their approach is similar to ours, in the sense that local variables depend on global variables in the variational posterior. However conditional conjugacy does not hold in the examples we consider.

The methods we describe can be thought of as extending the Gaussian variational approximation (GVA) of Tan and Nott (2018), where parametrization of the variational covariance matrix was considered in terms of a sparse Cholesky factor of the precision matrix. Similar approximations have been considered for state space models in Archer et al. (2016). The sparse structure reduces the number of free variational parameters, and allows matching the exact conditional independence structure in the true posterior. Tan (2018) propose an approach called reparametrized variational Bayes, where the model is reparametrized by applying an invertible affine transformation to the local variables to minimize their posterior dependency on global variables, before applying a mean field approximation. The affine transformation is obtained by considering a second order Taylor series approximation to the posterior of the local variables conditional on the global variables. One way of improving on Gaussian approximations is to consider mixtures of Gaussians (Jaakkola and Jordan, 1998; Salimans and Knowles, 2013; Miller et al., 2016; Guo et al., 2016). However, even with a parsimonious parametrization of component densities, a large number of additional variational parameters are added with each mixture component. Other flexible variational families can be formed using copulas (Tran et al., 2015; Han et al., 2016; Smith et al., 2019), hierarchical variational models (Ranganath et al., 2016) or implicit approaches (Huszár, 2017).

We specify the model and notation in Section 2 and introduce the conditionally structured Gaussian variational approximation (CSGVA) in Section 3. The algorithm for optimizing the variational parameters is described in Section 4 and Section 5 highlights the association between GVA and CSGVA. Section 6 describes how CSGVA can be improved using importance weighting. Experimental results and applications to generalized linear mixed models (GLMMs) and state space models are presented in Sections 7, 8 and 9 respectively. Section 10 gives some concluding discussion.

2 Model specification and notation

Let $y=(y_1,\dots,y_n)^T$ be observations from a model with global variables $\theta_G$ and local variables $\theta_L=(b_1,\dots,b_n)^T$, where $b_i$ contains latent variables specific to $y_i$ for $i=1,\dots,n$. Suppose $\theta_G$ is a vector of length $G$ and each $b_i$ is a vector of length $L$. Let $\theta=(\theta_G^T,\theta_L^T)^T$. We consider models where the joint density is of the form

\[
p(y,\theta) = p(\theta_G)\, p(b_1,\dots,b_\ell \mid \theta_G)\bigg\{\prod_{i=1}^{n} p(y_i \mid b_i,\theta_G)\bigg\}\bigg\{\prod_{i>\ell} p(b_i \mid b_{i-1},\dots,b_{i-\ell},\theta_G)\bigg\}.
\]

The observations $\{y_i\}$ are conditionally independent given $\{b_i\}$ and $\theta_G$. Conditional on $\theta_G$, the local variables $\{b_i\}$ form an $\ell$th order Markov chain if $\ell \geq 1$, and they are conditionally independent if $\ell = 0$. This class of models includes important models such as GLMMs and state space models. Next, we define some mathematical notation before discussing CSGVA for this class of models.

2.1 Notation

For an $r\times r$ matrix $A$, let $\text{diag}(A)$ denote the diagonal elements of $A$ and $\text{dg}(A)$ be the diagonal matrix obtained by setting non-diagonal elements in $A$ to zero. Let $\text{vec}(A)$ be the vector of length $r^2$ obtained by stacking the columns of $A$ under each other from left to right and $v(A)$ be the vector of length $r(r+1)/2$ obtained from $\text{vec}(A)$ by eliminating all superdiagonal elements of $A$. Let $E_r$ be the $r(r+1)/2 \times r^2$ elimination matrix, $K_r$ be the $r^2\times r^2$ commutation matrix and $D_r$ be the $r^2\times r(r+1)/2$ duplication matrix (see Magnus and Neudecker, 1980). Then $K_r\text{vec}(A)=\text{vec}(A^T)$, $E_r\text{vec}(A)=v(A)$, $E_r^T v(A)=\text{vec}(A)$ if $A$ is lower triangular, and $D_r v(A)=\text{vec}(A)$ if $A$ is symmetric. Let ${\bf 1}_r$ be a vector of ones of length $r$. The Kronecker product between any two matrices is denoted by $\otimes$. Scalar functions applied to vector arguments are evaluated element by element. Let $\text{d}$ denote the differential operator (see, e.g., Magnus and Neudecker, 1999).
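
To make the $v(\cdot)$ operator and the elimination matrix concrete, the following Julia sketch (our illustration, not code from the paper; the function names are our own) builds $E_r$ explicitly and checks the identities $E_r\text{vec}(A) = v(A)$ and $E_r^T v(A) = \text{vec}(A)$ for a lower triangular $A$.

\begin{verbatim}
using LinearAlgebra

# v(A): stack the lower-triangular part of A column by column
vech(A) = [A[i, j] for j in 1:size(A, 1) for i in j:size(A, 1)]

# Elimination matrix E_r satisfying E_r * vec(A) == vech(A)
function elimination_matrix(r)
    E = zeros(Int, r * (r + 1) ÷ 2, r^2)
    row = 0
    for j in 1:r, i in j:r              # same ordering as vech
        row += 1
        E[row, (j - 1) * r + i] = 1     # position of A[i, j] in vec(A)
    end
    return E
end

A = tril(randn(3, 3))                   # lower-triangular test matrix
@assert elimination_matrix(3) * vec(A) == vech(A)
@assert elimination_matrix(3)' * vech(A) == vec(A)  # holds since A is lower triangular
\end{verbatim}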

3 Conditionally structured Gaussian variational approximation

We propose to approximate the posterior distribution $p(\theta|y)$ of the model defined in Section 2 by a density of the form

\[
q(\theta) = q(\theta_G)\, q(\theta_L|\theta_G),
\]

where $q(\theta_G) = N(\mu_1, \Omega_1^{-1})$, $q(\theta_L|\theta_G) = N(\mu_2, \Omega_2^{-1})$, and $\Omega_1$ and $\Omega_2$ are the precision (inverse covariance) matrices of $q(\theta_G)$ and $q(\theta_L|\theta_G)$ respectively. Here $\mu_2$ and $\Omega_2$ depend on $\theta_G$, but we do not denote this explicitly for notational conciseness. Let $C_1 C_1^T$ and $C_2 C_2^T$ be unique Cholesky factorizations of $\Omega_1$ and $\Omega_2$ respectively, where $C_1$ and $C_2$ are lower triangular matrices with positive diagonal entries. We further define $C_1^*$ and $C_2^*$ to be lower triangular matrices of order $G$ and $nL$ respectively such that $C_{r,ii}^* = \log(C_{r,ii})$ and $C_{r,ij}^* = C_{r,ij}$ if $i \neq j$, for $r = 1, 2$. The purpose of introducing $C_1^*$ and $C_2^*$ is to allow unconstrained optimization of the variational parameters in the stochastic gradient ascent algorithm, since the diagonal entries of $C_1$ and $C_2$ are constrained to be positive. Note that $C_2$ and $C_2^*$ also depend on $\theta_G$, but again we do not show this explicitly in our notation.

As $\Omega_2$ depends on $\theta_G$, the joint distribution $q(\theta)$ is generally non-Gaussian even though $q(\theta_G)$ and $q(\theta_L|\theta_G)$ are individually Gaussian. Here we consider a first order approximation and assume that $\mu_2$ and $v(C_2^*)$ are linear functions of $\theta_G$:

\[
\mu_2 = d + C_2^{-T}D(\mu_1 - \theta_G), \quad v(C_2^*) = f + F\theta_G. \qquad (1)
\]

In (1), $d$ is a vector of length $nL$, $D$ is an $nL \times G$ matrix, $f$ is a vector of length $nL(nL+1)/2$ and $F$ is an $nL(nL+1)/2 \times G$ matrix. For this specification, $q(\theta)$ is not jointly Gaussian due to the dependence of the covariance matrix of $q(\theta_L|\theta_G)$ on $\theta_G$. It is Gaussian if and only if $F \equiv 0$. The set of variational parameters to be optimized is denoted as

\[
\lambda = (\mu_1^T,\, v(C_1^*)^T,\, d^T,\, \text{vec}(D)^T,\, f^T,\, \text{vec}(F)^T)^T.
\]

As motivation for the linear approximation in (1), consider the linear mixed model,

\[
y_i = X_i\beta + Z_i b_i + \epsilon_i, \quad (i = 1,\dots,n), \qquad
\beta \sim N(0, \sigma_\beta^2 I_p), \quad b_i \sim N(0, \Lambda), \quad \epsilon_i \sim N(0, \sigma_\epsilon^2 I_{n_i}),
\]

where $y_i$ is a vector of responses of length $n_i$ for the $i$th subject, $X_i$ and $Z_i$ are covariate matrices of dimensions $n_i \times p$ and $n_i \times L$ respectively, $\beta$ is a vector of coefficients of length $p$ and $\{b_i\}$ are random effects. Assume $\sigma_\epsilon^2$ is known. Then the global parameters $\theta_G$ consist of $\beta$ and $\Lambda$. The posterior of $\theta_L$ conditional on $\theta_G$ is

\[
p(\theta_L|y,\theta_G) \propto \prod_{i=1}^n p(y_i|\beta,b_i)\,p(b_i|\Lambda)
\propto \prod_{i=1}^n \exp[-\{b_i^T(Z_i^T Z_i/\sigma_\epsilon^2 + \Lambda^{-1})b_i - 2b_i^T Z_i^T(y_i - X_i\beta)/\sigma_\epsilon^2\}/2].
\]

Thus $p(\theta_L|y,\theta_G) = \prod_{i=1}^n p(b_i|y,\theta_G)$, where $p(b_i|y,\theta_G)$ is a normal density with precision matrix $Z_i^T Z_i/\sigma_\epsilon^2 + \Lambda^{-1}$ and mean $(Z_i^T Z_i/\sigma_\epsilon^2 + \Lambda^{-1})^{-1} Z_i^T(y_i - X_i\beta)/\sigma_\epsilon^2$. The precision matrix depends on $\Lambda^{-1}$ linearly and the mean depends on $\beta$ linearly after scaling by the covariance matrix. The linear approximation in (1) tries to mimic this dependence relationship.

The proposed variational density is conditionally structured and highly flexible. Such dependence structure is particularly valuable in constructing variational approximations for hierarchical models, where there are global scale parameters in $\theta_G$ which help to determine the scale of local latent variables in the conditional posterior of $\theta_L|\theta_G$. While marginal posteriors of the global variables are often well approximated by Gaussian densities, marginal posteriors of the local variables tend to exhibit more skewness and kurtosis. This deviation from normality can be captured by $q(\theta_L) = \int q(\theta_G)\, q(\theta_L|\theta_G)\, d\theta_G$, which is a mixture of normal densities. The formulation in (1) also allows for a reduction in the number of variational parameters if conditional independence structure consistent with that in the true posterior is imposed on the variational approximation.

3.1 Using conditional independence structure

Tan and Nott (2018) incorporate the conditional independence structure of the true posterior into Gaussian variational approximations by using the fact that zeros in the precision matrix correspond to conditional independence for Gaussian random vectors. This incorporation achieves sparsity in the precision matrix of the approximation and leads to a large reduction in the number of variational parameters to be optimized. For high-dimensional $\theta$, this sparse structure is especially important because a full Gaussian approximation involves learning a covariance matrix where the number of elements grows quadratically with the dimension of $\theta$.

Recall that $\theta_L = (b_1,\dots,b_n)^T$. Suppose $b_i$ is conditionally independent of $b_j$ in the posterior for $|i-j| > \ell$, given $\theta_G$ and $\{b_j \mid |i-j| \leq \ell\}$. For instance, in a GLMM, $\{b_i\}$ may be subject specific random effects, and these are conditionally independent given the global parameters, so this structure holds with $\ell = 0$. In the case of a state space model for a time series, $\{b_i\}$ are the latent states and this structure holds with $\ell = 1$. Note that ordering of the latent variables is important here.

Now partition the precision matrix $\Omega_2$ of $q(\theta_L|\theta_G)$ into $L\times L$ blocks with $n$ row and $n$ column partitions corresponding to $\theta_L = (b_1,\dots,b_n)^T$. Let $\Omega_{2,ij}$ be the block corresponding to $b_i$ horizontally and $b_j$ vertically for $i,j = 1,\dots,n$. If $b_i$ is conditionally independent of $b_j$ for $|i-j| > \ell$, given $\theta_G$ and $\{b_j \mid |i-j|\leq \ell\}$, then we set $\Omega_{2,ij} = 0$ for all pairs $(i,j)$ with $|i-j| > \ell$. Let $\mathcal{I}$ denote the indices of elements in $v(\Omega_2)$ which are fixed at zero by this conditional independence requirement. If we choose $f_i = 0$ and $F_{ij} = 0$ for all $i \in \mathcal{I}$ and all $j$ in (1), then $C_2^*$ has the same block sparse structure we desire for the lower triangular part of $\Omega_2$. By Proposition 1 of Rothman et al. (2010), this means that $\Omega_2$ will have the desired block sparse structure. Hence we impose the constraints $f_i = 0$ and $F_{ij} = 0$ for $i \in \mathcal{I}$ and all $j$, which reduces the number of variational parameters to be optimized.
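
As an illustration of this constraint (our sketch, not code from the paper), the index set $\mathcal{I}$ for a block-banded structure with $n$ blocks of size $L$ and bandwidth $\ell$ can be computed as follows in Julia; the function and variable names are our own.

\begin{verbatim}
# Indices of v(C2*) fixed at zero when b_i and b_j are conditionally
# independent for |i - j| > ell, given theta_G and the intervening blocks.
function zero_indices(n, L, ell)
    nL = n * L
    block(r) = (r - 1) ÷ L + 1          # which b_i a row/column belongs to
    idx = Int[]
    k = 0
    for j in 1:nL, i in j:nL            # same ordering as v(.)
        k += 1
        if block(i) - block(j) > ell
            push!(idx, k)
        end
    end
    return idx                          # then set f[idx] .= 0 and F[idx, :] .= 0
end
\end{verbatim}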

4 Optimization of variational parameters

To make the dependence on $\lambda$ explicit, write $q(\theta)$ as $q_\lambda(\theta)$. The variational parameters $\lambda$ are optimized by minimizing the Kullback-Leibler divergence between $q_\lambda(\theta)$ and the true posterior $p(\theta|y)$, where

\[
\text{KL}\{q_\lambda || p(\theta|y)\} = \int q_\lambda(\theta)\log\frac{q_\lambda(\theta)}{p(\theta|y)}\,d\theta
= \log p(y) - \int q_\lambda(\theta)\log\frac{p(y,\theta)}{q_\lambda(\theta)}\,d\theta \geq 0.
\]

Minimizing $\text{KL}\{q_\lambda || p(\theta|y)\}$ is therefore equivalent to maximizing an evidence lower bound ${\mathcal L}^{\text{VI}}$ on the log marginal likelihood $\log p(y)$, where

\[
{\mathcal L}^{\text{VI}} = E_{q_\lambda}\{\log p(y,\theta) - \log q_\lambda(\theta)\}. \qquad (2)
\]

In (2), $E_{q_\lambda}$ denotes expectation with respect to $q_\lambda(\theta)$. We seek to maximize ${\mathcal L}^{\text{VI}}$ with respect to $\lambda$ using stochastic gradient ascent. Starting with some initial estimate of $\lambda$, we perform the following update at each iteration $t$,

\[
\lambda_t = \lambda_{t-1} + \rho_t \widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}},
\]

where $\rho_t$ represents a small stepsize taken in the direction of the stochastic gradient $\widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}}$. The sequence $\{\rho_t\}$ should satisfy the conditions $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$ (Spall, 2003).

An unbiased estimate of the gradient $\nabla_\lambda {\mathcal L}^{\text{VI}}$ can be constructed using (2) by simulating $\theta$ from $q_\lambda(\theta)$. However, this approach usually results in large fluctuations in the stochastic gradients. Hence we implement the “reparametrization trick” (Kingma and Welling, 2014; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014), which helps to reduce the variance of the stochastic gradients. This approach writes $\theta \sim q_\lambda(\theta)$ as a function of the variational parameters $\lambda$ and a random vector $s$ having a density not depending on $\lambda$. To explain further, let $s = [s_1^T, s_2^T]^T$, where $s_1$ and $s_2$ are vectors of length $G$ and $nL$ corresponding to $\theta_G$ and $\theta_L$ respectively. Consider a transformation $\theta = r_\lambda(s)$ of the form

\[
\begin{bmatrix}\theta_G\\ \theta_L\end{bmatrix} = \begin{bmatrix}\mu_1 + C_1^{-T}s_1\\ \mu_2 + C_2^{-T}s_2\end{bmatrix}. \qquad (3)
\]

Since $\mu_2$ and $C_2$ are functions of $\theta_G$ from (1),

\[
\begin{aligned}
\mu_2 &= d + C_2^{-T}D(\mu_1 - \theta_G) = d - C_2^{-T}DC_1^{-T}s_1,\\
v(C_2^*) &= f + F\theta_G = f + F(\mu_1 + C_1^{-T}s_1).
\end{aligned}
\]

Hence $\mu_2$ and $C_2$ are functions of $s_1$, and $\theta_L$ is a function of both $s_1$ and $s_2$. This transformation is invertible since, given $\theta$ and $\lambda$, we can first recover $s_1 = C_1^T(\theta_G - \mu_1)$, find $\mu_2$ and $C_2$, and then recover $s_2 = C_2^T(\theta_L - \mu_2)$. Applying this transformation,

\[
{\mathcal L}^{\text{VI}} = \int \phi(s)\{\log p(y,\theta) - \log q_\lambda(\theta)\}\,ds
= E_\phi\{\log p(y,\theta) - \log q_\lambda(\theta)\}, \qquad (4)
\]

where $E_\phi$ denotes expectation with respect to $\phi(s)$ and $\theta = r_\lambda(s)$.
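
A minimal Julia sketch of the transformation $\theta = r_\lambda(s)$ in (3) is given below, assuming $\lambda$ has already been unpacked into $(\mu_1, C_1, d, D, f, F)$; all names are our own and this is an illustration rather than the authors' implementation.

\begin{verbatim}
using LinearAlgebra

function r_lambda(s, mu1, C1, d, D, f, F)
    G, nL = length(mu1), length(d)
    s1, s2 = s[1:G], s[G+1:G+nL]
    theta_G = mu1 + (C1' \ s1)                 # theta_G = mu1 + C1^{-T} s1
    v = f + F * theta_G                        # v(C2*) = f + F theta_G
    C2 = zeros(nL, nL); k = 0
    for j in 1:nL, i in j:nL                   # unpack v(.) into the lower triangle
        k += 1
        C2[i, j] = (i == j ? exp(v[k]) : v[k]) # C*_{ii} = log(C_{ii})
    end
    L2 = LowerTriangular(C2)
    mu2 = d + (L2' \ (D * (mu1 - theta_G)))    # mu2 = d + C2^{-T} D (mu1 - theta_G)
    theta_L = mu2 + (L2' \ s2)                 # theta_L = mu2 + C2^{-T} s2
    return vcat(theta_G, theta_L)
end
\end{verbatim}

Drawing $s \sim N(0, I_{nL+G})$ and applying this map then yields a draw $\theta \sim q_\lambda(\theta)$.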

4.1 Stochastic gradients

Next, we differentiate (4) with respect to $\lambda$ to find unbiased estimates of the gradients. As $\log q_\lambda(\theta)$ depends on $\lambda$ directly as well as through $\theta$, applying the chain rule, we have

\[
\begin{aligned}
\nabla_\lambda {\mathcal L}^{\text{VI}} &= E_\phi[\nabla_\lambda r_\lambda(s)\{\nabla_\theta \log p(y,\theta) - \nabla_\theta \log q_\lambda(\theta)\} - \nabla_\lambda \log q_\lambda(\theta)] &(5)\\
&= E_\phi[\nabla_\lambda r_\lambda(s)\{\nabla_\theta \log p(y,\theta) - \nabla_\theta \log q_\lambda(\theta)\}]. &(6)
\end{aligned}
\]

Note that $E_\phi\{\nabla_\lambda \log q_\lambda(\theta)\} = 0$ as it is the expectation of the score function. Roeder et al. (2017) refer to the expressions inside the expectations in (5) and (6) as the total derivative and path derivative respectively. In (6), the contributions to the gradient from $\log p(y,\theta)$ and $\log q_\lambda(\theta)$ cancel each other if $q_\lambda(\theta)$ approximates the true posterior well (at convergence). However, the score function $\nabla_\lambda \log q_\lambda(\theta)$ is not necessarily small even if $q_\lambda(\theta)$ is a good approximation to $p(\theta|y)$. This term adversely affects the ability of the algorithm to converge and “stick” to the optimal variational parameters, a phenomenon Roeder et al. (2017) refer to as “sticking the landing”. Hence we consider the path derivative,

\[
\widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}} = \nabla_\lambda r_\lambda(s)\{\nabla_\theta \log p(y,\theta) - \nabla_\theta \log q_\lambda(\theta)\} \qquad (7)
\]

as an unbiased estimate of the true gradient $\nabla_\lambda {\mathcal L}^{\text{VI}}$. Tan and Nott (2018) and Tan (2018) also demonstrate that the path derivative has smaller variation about zero when the algorithm is close to convergence.

Let $\nabla_\theta \log p(y,\theta) - \nabla_\theta \log q_\lambda(\theta) = ({\mathcal G}_1^T, {\mathcal G}_2^T)^T$, where ${\mathcal G}_1$ and ${\mathcal G}_2$ are vectors of length $G$ and $nL$ respectively corresponding to the partitioning of $\theta = [\theta_G^T, \theta_L^T]^T$. Then $\widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}} = \nabla_\lambda r_\lambda(s)({\mathcal G}_1^T, {\mathcal G}_2^T)^T$ is given by

\[
\begin{bmatrix}
{\mathcal G}_1 + \nabla_{\mu_1}\theta_L\, {\mathcal G}_2\\
-D_1^* v[C_1^{-T}s_1\{{\mathcal G}_1 + (\nabla_{\mu_1}\theta_L - D^T C_2^{-1}){\mathcal G}_2\}^T C_1^{-T}]\\
{\mathcal G}_2\\
-\text{vec}(C_2^{-1}{\mathcal G}_2 s_1^T C_1^{-1})\\
\nabla_f\theta_L\, {\mathcal G}_2\\
\text{vec}(\nabla_f\theta_L\, {\mathcal G}_2\, \theta_G^T)
\end{bmatrix},
\]

where

\[
\begin{aligned}
\nabla_{\mu_1}\theta_L &= F^T \nabla_f\theta_L,\\
\nabla_f\theta_L\, {\mathcal G}_2 &= -D_2^* v\{C_2^{-T}(s_2 - DC_1^{-T}s_1){\mathcal G}_2^T C_2^{-T}\}.
\end{aligned}
\]

Here $D_1^*$ and $D_2^*$ are diagonal matrices of order $G(G+1)/2$ and $nL(nL+1)/2$ respectively such that $\text{d}v(C_r) = D_r^*\,\text{d}v(C_r^*)$ for $r = 1, 2$. Formally, $D_1^* = \text{diag}\{v(\text{dg}(C_1) + {\bf 1}_G{\bf 1}_G^T - I_G)\}$ and $D_2^* = \text{diag}\{v(\text{dg}(C_2) + {\bf 1}_{nL}{\bf 1}_{nL}^T - I_{nL})\}$. The full expression and derivation of $\nabla_\lambda r_\lambda(s)$ are given in Appendix A. In addition, we show (in Appendix A) that

\[
\nabla_\theta \log q_\lambda(\theta) = \begin{bmatrix}\nabla_{\theta_G}\log q_\lambda(\theta)\\ \nabla_{\theta_L}\log q_\lambda(\theta)\end{bmatrix}
= \begin{bmatrix}F^T[v(I_{nL}) - D_2^* v\{(\theta_L - d)s_2^T\}] - C_1 s_1 - D^T s_2\\ -C_2 s_2\end{bmatrix}.
\]

$\nabla_\theta \log p(y,\theta)$ is model specific and we discuss the application to GLMMs and state space models in Sections 8 and 9 respectively.

4.2 Stochastic variational algorithm

The stochastic gradient ascent algorithm for CSGVA is outlined in Algorithm 1. For computing the stepsize, we use Adam (Kingma and Ba, 2015), which uses bias-corrected estimates of the first and second moments of the stochastic gradients to compute adaptive learning rates.

Algorithm 1: CSGVA algorithm.

Initialize $\lambda_0 = 0$, $m_0 = 0$, $v_0 = 0$.

For $t = 1, \dots, N$:

  1. Generate $s \sim N(0, I_{nL+G})$ and compute $\theta = r_{\lambda_{t-1}}(s)$.
  2. Compute the gradient $g_t = \widehat{\nabla}_\lambda {\mathcal L}^{\text{VI}}$.
  3. Update the biased first moment estimate: $m_t = \tau_1 m_{t-1} + (1-\tau_1)g_t$.
  4. Update the biased second moment estimate: $v_t = \tau_2 v_{t-1} + (1-\tau_2)g_t^2$.
  5. Compute the bias-corrected first moment estimate: $\hat{m}_t = m_t/(1-\tau_1^t)$.
  6. Compute the bias-corrected second moment estimate: $\hat{v}_t = v_t/(1-\tau_2^t)$.
  7. Update $\lambda_t = \lambda_{t-1} + \alpha\hat{m}_t/(\sqrt{\hat{v}_t} + \epsilon)$.

At iteration $t$, the variational parameter $\lambda$ is updated as $\lambda_t = \lambda_{t-1} + \Delta_t$. Let $g_t$ denote the stochastic gradient estimate at iteration $t$. In steps 3 and 4, Adam computes estimates of the mean (first moment) and uncentered variance (second moment) of the gradients using exponential moving averages, where $\tau_1, \tau_2 \in [0,1)$ control the decay rates. In step 4, $g_t^2$ is evaluated as $g_t \odot g_t$, where $\odot$ denotes the Hadamard (element-wise) product. As $m_t$ and $v_t$ are initialized as zero, these moment estimates tend to be biased towards zero, especially at the beginning of the algorithm if $\tau_1$, $\tau_2$ are close to one. As $m_t = (1-\tau_1)\sum_{i=1}^t \tau_1^{t-i} g_i$,

\[
E(m_t) = E(g_t)(1-\tau_1^t) + \zeta,
\]

where $\zeta = 0$ if $E(g_i) = E(g_t)$ for $1 \leq i < t$. Otherwise, $\zeta$ can be kept small since the weights for past gradients decrease exponentially. An analogous argument holds for $v_t$. Thus the bias can be corrected by using the estimates $\hat{m}_t$ and $\hat{v}_t$ in steps 5 and 6. The change is then computed as

\[
\Delta_t = \frac{\alpha\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\]

where $\alpha$ controls the magnitude of the stepsize and $\epsilon$ is a small positive constant which ensures that the denominator is positive. In our experiments, we set $\alpha = 0.001$, $\tau_1 = 0.9$, $\tau_2 = 0.99$ and $\epsilon = 10^{-8}$, values close to what is recommended by Kingma and Ba (2015).
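
For concreteness, a minimal Julia sketch of one Adam update (steps 3 to 7 of Algorithm 1) with the above hyperparameter values is shown below; this is our illustration and the function name is our own.

\begin{verbatim}
function adam_update!(lambda, m, v, g, t; alpha=0.001, tau1=0.9, tau2=0.99, epsilon=1e-8)
    m .= tau1 .* m .+ (1 - tau1) .* g              # biased first moment estimate
    v .= tau2 .* v .+ (1 - tau2) .* g .^ 2         # biased second moment estimate
    mhat = m ./ (1 - tau1^t)                       # bias corrections
    vhat = v ./ (1 - tau2^t)
    lambda .+= alpha .* mhat ./ (sqrt.(vhat) .+ epsilon)
    return lambda
end
\end{verbatim}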

At each iteration $t$, we can also compute an unbiased estimate of the lower bound,

\[
\hat{{\mathcal L}}^{\text{VI}} = \log p(y,\theta) - \log q_{\lambda_{t-1}}(\theta),
\]

where $\theta$ is computed in step 1. Since these estimates are stochastic, we follow the path traced by $\bar{{\mathcal L}}^{\text{VI}}$, the average of the lower bound estimates over every 1000 iterations, as a means to diagnose the convergence of Algorithm 1. $\bar{{\mathcal L}}^{\text{VI}}$ tends to increase monotonically at the start, but as the algorithm comes close to convergence, the values of $\bar{{\mathcal L}}^{\text{VI}}$ fluctuate close to and about the true maximum lower bound. Hence, we fit a least squares regression line to the past $\kappa$ values of $\bar{{\mathcal L}}^{\text{VI}}$ and terminate Algorithm 1 once the gradient of the regression line becomes negative (see Tan, 2018). For our experiments, we set $\kappa = 6$.
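
The stopping rule can be coded in a few lines; the Julia sketch below (our own, with assumed variable names) fits a least squares line to the last $\kappa$ averaged lower bounds and signals termination when the fitted slope is negative.

\begin{verbatim}
using Statistics

# lbar: vector of lower bound values averaged over blocks of 1000 iterations
function stop_rule(lbar; kappa=6)
    length(lbar) < kappa && return false
    y = lbar[end-kappa+1:end]
    x = collect(1.0:kappa)
    slope = sum((x .- mean(x)) .* (y .- mean(y))) / sum((x .- mean(x)) .^ 2)
    return slope < 0
end
\end{verbatim}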

5 Links to Gaussian variational approximation

CSGVA is an extension of Gaussian variational approximation (GVA, Tan and Nott, 2018). In both approaches, the conditional posterior independence structure of the local latent variables is used to introduce sparsity in the precision matrix of the approximation. Below we demonstrate that GVA is a special case of CSGVA when $F \equiv 0$.

Tan and Nott (2018) consider a GVA of the form

\[
\begin{bmatrix}\theta_L\\ \theta_G\end{bmatrix} \sim {\mathcal N}\left(\begin{bmatrix}\mu_L\\ \mu_G\end{bmatrix}, T^{-T}T^{-1}\right)
\quad\text{where}\quad T = \begin{bmatrix}T_{LL} & 0\\ T_{GL} & T_{GG}\end{bmatrix}.
\]

Note that $T_{LL}$ and $T_{GG}$ are lower triangular matrices. Using a vector $s = [s_L^T, s_G^T]^T \sim N(0, I_{nL+G})$, we can write

\[
\begin{bmatrix}\theta_L\\ \theta_G\end{bmatrix} = \begin{bmatrix}\mu_L\\ \mu_G\end{bmatrix} + T^{-T}\begin{bmatrix}s_L\\ s_G\end{bmatrix},
\quad\text{where}\quad T^{-T} = \begin{bmatrix}T_{LL}^{-T} & -T_{LL}^{-T}T_{GL}^T T_{GG}^{-T}\\ 0 & T_{GG}^{-T}\end{bmatrix}.
\]

Assuming $F \equiv 0$ for CSGVA, we have from (3) that

\[
\begin{bmatrix}\theta_L\\ \theta_G\end{bmatrix} = \begin{bmatrix}d\\ \mu_1\end{bmatrix} + \begin{bmatrix}C_2^{-T} & -C_2^{-T}DC_1^{-T}\\ 0 & C_1^{-T}\end{bmatrix}\begin{bmatrix}s_2\\ s_1\end{bmatrix},
\]

where $[s_2^T, s_1^T]^T \sim N(0, I_{nL+G})$. Hence we can identify

\[
\mu_1 = \mu_G, \quad d = \mu_L, \quad C_1 = T_{GG}, \quad C_2 = T_{LL}, \quad D = T_{GL}^T.
\]

If the standard way of initializing Algorithm 1 (by setting $\lambda = 0$) does not work well, we can use this association to initialize Algorithm 1 with the fit from GVA. This informative initialization can reduce computation time significantly, although there may be a risk of getting stuck in a local mode.
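
A sketch of this initialization in Julia is given below; muG, muL, TGG, TLL and TGL denote the fitted GVA quantities and all names are our own assumptions.

\begin{verbatim}
# Map a GVA fit to CSGVA starting values via the identification above (F = 0).
vstar(C) = [i == j ? log(C[i, j]) : C[i, j] for j in 1:size(C, 1) for i in j:size(C, 1)]

function init_from_gva(muG, muL, TGG, TLL, TGL)
    mu1, d = muG, muL
    C1, C2, D = TGG, TLL, Matrix(TGL')   # D = T_GL'
    f = vstar(C2)                        # v(C2*) with logged diagonal
    F = zeros(length(f), length(mu1))    # F = 0 reproduces the GVA exactly
    return mu1, C1, d, D, f, F
end
\end{verbatim}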

6 Importance weighted variational inference

Here we discuss how CSGVA can be improved by maximizing an importance weighted lower bound (IWLB, Burda et al., 2016), which leads to a tighter lower bound on the log marginal likelihood, and a variational approximation less prone to underestimation of the true posterior variance. We also relate the IWLB with Rényi’s divergence (Rényi, 1961; van Erven and Harremos, 2014) between $q_\lambda(\theta)$ and $p(\theta|y)$, demonstrating that maximizing the IWLB instead of the usual evidence lower bound leads to a transition in the behavior of the variational approximation from “mode-seeking” to “mass-covering”. We first define Rényi’s divergence and the variational Rényi bound (Li and Turner, 2016), before introducing the IWLB as the expectation of a Monte Carlo approximation of the variational Rényi bound.

6.1 Rényi’s divergence and variational Rényi bound

Rényi’s divergence provides a measure of the distance between two densities $q$ and $p$, and it is defined as

\[
D_\alpha(q||p) = \frac{1}{\alpha-1}\log\int q(\theta)^\alpha p(\theta)^{1-\alpha}\,d\theta,
\]

for $0 < \alpha < \infty$, $\alpha \neq 1$. This definition can be extended by continuity to the orders 0, 1 and $\infty$, as well as to negative orders $\alpha < 0$. Note that $D_\alpha(q||p)$ is no longer a divergence measure if $\alpha < 0$, but we can write $D_\alpha(q||p)$ as $\frac{\alpha}{1-\alpha}D_{1-\alpha}(p||q)$ for $\alpha \notin \{0,1\}$ by the skew symmetry property. As $\alpha$ approaches 1, the limit of $D_\alpha(q||p)$ is the Kullback-Leibler divergence, $\text{KL}(q||p)$. In variational inference, minimizing the Kullback-Leibler divergence between the variational density $q_\lambda(\theta)$ and the true posterior $p(\theta|y)$ is equivalent to maximizing a lower bound ${\mathcal L}^{\text{VI}}$ on the log marginal likelihood due to the relationship:

\[
{\mathcal L}^{\text{VI}} = \log p(y) - \text{KL}\{q_\lambda || p(\theta|y)\} = E_{q_\lambda}\left\{\log\frac{p(y,\theta)}{q_\lambda(\theta)}\right\}.
\]

Generalizing this relation using Rényi’s divergence measure, Li and Turner (2016) define the variational Rényi bound ${\mathcal L}_\alpha$ as

\[
{\mathcal L}_\alpha = \log p(y) - D_\alpha\{q_\lambda || p(\theta|y)\}
= \frac{1}{1-\alpha}\log E_{q_\lambda}\left\{\left(\frac{p(y,\theta)}{q_\lambda(\theta)}\right)^{1-\alpha}\right\}.
\]

Note that ${\mathcal L}_1$, the limit of ${\mathcal L}_\alpha$ as $\alpha \rightarrow 1$, is equal to ${\mathcal L}^{\text{VI}}$. A Monte Carlo approximation of ${\mathcal L}_\alpha$ when the expectation is intractable is

\[
\hat{{\mathcal L}}_{\alpha,K} = \frac{1}{1-\alpha}\log\frac{1}{K}\sum_{k=1}^K w_k^{1-\alpha}, \qquad (8)
\]

where $\Theta_K = [\theta_1,\dots,\theta_K]$ is a set of $K$ samples generated randomly from $q_\lambda(\theta)$, and

\[
w_k = w(\theta_k) = \frac{p(y,\theta_k)}{q_\lambda(\theta_k)}, \quad k = 1,\dots,K,
\]

are importance weights. For each $k$, $E_{q_\lambda}(w_k) = p(y)$. The limit of $\hat{{\mathcal L}}_{\alpha,K}$ as $\alpha \rightarrow 1$ is $\frac{1}{K}\sum_{k=1}^K \log\frac{p(y,\theta_k)}{q_\lambda(\theta_k)}$. Hence $\hat{{\mathcal L}}_{1,K}$ is an unbiased estimate of ${\mathcal L}_1$ as $E_{\Theta_K}(\hat{{\mathcal L}}_{1,K}) = {\mathcal L}_1 = {\mathcal L}^{\text{VI}}$, where $E_{\Theta_K}$ denotes expectation with respect to $q(\Theta_K) = \prod_{k=1}^K q_\lambda(\theta_k)$. For $\alpha \neq 1$, $\hat{{\mathcal L}}_{\alpha,K}$ is not an unbiased estimate of ${\mathcal L}_\alpha$.
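
Given log importance weights $\log w_1,\dots,\log w_K$, the estimate $\hat{{\mathcal L}}_{\alpha,K}$ in (8) can be evaluated stably with a log-sum-exp; the Julia sketch below is our own illustration (with $\alpha = 0$ it returns the importance weighted lower bound estimate introduced next, and $\alpha = 1$ is handled by its limit).

\begin{verbatim}
function vr_bound_hat(logw, alpha)
    K = length(logw)
    alpha == 1 && return sum(logw) / K        # limit as alpha -> 1 (ELBO estimate)
    a = (1 - alpha) .* logw
    m = maximum(a)                            # log-sum-exp for numerical stability
    return (m + log(sum(exp.(a .- m)) / K)) / (1 - alpha)
end
\end{verbatim}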

6.2 Importance weighted lower bound

The importance weighted lower bound (IWLB, Burda et al., 2016) is defined as

\[
{\mathcal L}_K^{\text{IW}} = E_{\Theta_K}(\hat{{\mathcal L}}_{0,K}) = E_{\Theta_K}\left(\log\frac{1}{K}\sum_{k=1}^K w_k\right),
\]

where $\alpha = 0$ in (8). It reduces to ${\mathcal L}^{\text{VI}}$ when $K = 1$. By Jensen’s inequality,

\[
{\mathcal L}_K^{\text{IW}} \leq \log E_{\Theta_K}\left(\frac{1}{K}\sum_{k=1}^K w_k\right) = \log p(y).
\]

Thus ${\mathcal L}_K^{\text{IW}}$ provides a lower bound to the log marginal likelihood for any positive integer $K$. From Theorem 6.1 (Burda et al., 2016), this bound becomes tighter as $K$ increases.

Theorem 6.1

${\mathcal L}^{\text{IW}}_K$ increases with $K$ and approaches $\log p(y)$ as $K \rightarrow \infty$ if $w_k$ is bounded.

Proof

Let $I = \{w_{I_1},\dots,w_{I_K}\}$ be selected randomly without replacement from $\{w_1,\dots,w_{K+1}\}$. Then $E_{I|\Theta_{K+1}}(w_{I_j}) = \frac{1}{K+1}\sum_{k=1}^{K+1} w_k$ for any $j = 1,\dots,K$, where $E_{I|\Theta_{K+1}}$ denotes the expectation associated with the randomness in selecting $I$ given $\Theta_{K+1}$. Thus

\[
\begin{aligned}
{\mathcal L}^{\text{IW}}_{K+1} &= E_{\Theta_{K+1}}\left(\log\frac{1}{K+1}\sum\nolimits_{k=1}^{K+1} w_k\right)\\
&= E_{\Theta_{K+1}}\left\{\log E_{I|\Theta_{K+1}}\left(\frac{1}{K}\sum\nolimits_{j=1}^K w_{I_j}\right)\right\}\\
&\geq E_{\Theta_{K+1}}\left\{E_{I|\Theta_{K+1}}\log\left(\frac{1}{K}\sum\nolimits_{j=1}^K w_{I_j}\right)\right\}\\
&= E_{\Theta_K}\log\left(\frac{1}{K}\sum\nolimits_{k=1}^K w_k\right) = {\mathcal L}^{\text{IW}}_K.
\end{aligned}
\]

If $w_k$ is bounded, then $\frac{1}{K}\sum_{k=1}^K w_k \overset{a.s.}{\rightarrow} p(y)$ as $K \rightarrow \infty$ by the law of large numbers. Hence ${\mathcal L}^{\text{IW}}_K = E_{\Theta_K}(\log\frac{1}{K}\sum_{k=1}^K w_k) \rightarrow \log p(y)$ as $K \rightarrow \infty$.

Next, we present some properties of Rényi’s divergence and $E_{\Theta_K}(\hat{{\mathcal L}}_{\alpha,K})$ which are important in understanding the behavior of the variational density arising from maximizing ${\mathcal L}^{\text{IW}}_K$. The proofs of these properties can be found in van Erven and Harremos (2014) and Li and Turner (2016).

Property 1

$D_\alpha$ is increasing in $\alpha$, and is continuous in $\alpha$ on $[0,1] \cup \{\alpha \notin [0,1] \mid |D_\alpha| < \infty\}$.

Property 2

$E_{\Theta_K}(\hat{{\mathcal L}}_{\alpha,K})$ is continuous and decreasing in $\alpha$ for fixed $K$.

Theorem 6.2

There exists $0 \leq \alpha_{q_\lambda,K} \leq 1$ for given $q_\lambda$ and $K$ such that

\[
\log p(y) = D_{\alpha_{q_\lambda,K}}\{q_\lambda || p(\theta|y)\} + {\mathcal L}_K^{\text{IW}}.
\]
Proof

From Property 2,

\[
\begin{gathered}
{\mathcal L}_1 = E_{\Theta_K}(\hat{{\mathcal L}}_{1,K}) \leq E_{\Theta_K}(\hat{{\mathcal L}}_{0,K}) = {\mathcal L}^{\text{IW}}_K \leq {\mathcal L}_0,\\
{\mathcal L}_1 - \log p(y) \leq {\mathcal L}^{\text{IW}}_K - \log p(y) \leq {\mathcal L}_0 - \log p(y),\\
D_0\{q_\lambda || p(\theta|y)\} \leq \log p(y) - {\mathcal L}^{\text{IW}}_K \leq D_1\{q_\lambda || p(\theta|y)\}.
\end{gathered}
\]

From Property 1, since $D_\alpha$ is continuous and increasing for $\alpha \in [0,1]$, there exists $0 \leq \alpha_{q_\lambda,K} \leq 1$ such that $\log p(y) - {\mathcal L}^{\text{IW}}_K = D_{\alpha_{q_\lambda,K}}\{q_\lambda || p(\theta|y)\}$.

Minimizing Rényi’s divergence for $\alpha \gg 1$ tends to produce approximations which are mode-seeking (zero-forcing), while maximizing Rényi’s divergence for $\alpha \ll 0$ encourages mass-covering behavior. Theorem 6.2 suggests that maximizing the IWLB results in a variational approximation $q_\lambda$ whose Rényi’s divergence from the true posterior can be captured with $0 \leq \alpha \leq 1$, which represents a certain balance between mode-seeking and mass-covering behavior (Minka, 2005). In our experiments, we observe that maximizing the IWLB is highly effective in correcting the underestimation of posterior variance in variational inference.

Alternatively, if we approximate ${\mathcal L}_K^{\text{IW}}$ by considering a second-order Taylor expansion of $\log\bar{w}_K$ about $E_{\Theta_K}(\bar{w}_K) = p(y)$, where $\bar{w}_K = \frac{1}{K}\sum_{k=1}^K w_k$, and then take expectations, we have

\[
{\mathcal L}_K^{\text{IW}} \approx \log p(y) - \frac{\text{var}(w_k)}{2Kp(y)^2}.
\]

Maddison et al. (2017) and Domke and Sheldon (2018) provide bounds on the order of the remainder term in the Taylor approximation above, and demonstrate that the “looseness” of the IWLB is given by $\text{var}(w_k)$ as $K \rightarrow \infty$. Minimizing $\text{var}(w_k)$ is equivalent to minimizing the $\chi^2$ divergence $D_2(p||q)$. Note that if $q_\lambda(\theta)$ has thin tails compared to $p(\theta|y)$, then $\text{var}(w_k)$ will be large. Hence minimizing $\text{var}(w_k)$ attempts to match $q_\lambda(\theta)$ with $p(\theta|y)$ so that $q_\lambda(\theta)$ is able to cover the tails.

6.3 Unbiased gradient estimate of importance weighted lower bound

To maximize the IWLB in CSGVA, we need to find an unbiased estimate of $\nabla_\lambda{\mathcal L}_K^{\text{IW}}$ using the transformation in (3). Let $s_k \sim N(0, I_{G+nL})$, $\theta_k = r_\lambda(s_k)$ for $k = 1,\dots,K$, and $S_K = [s_1,\dots,s_K]^T$.

\[
\begin{aligned}
\nabla_\lambda {\mathcal L}_K^{\text{IW}} &= \nabla_\lambda E_{\Theta_K}(\log\bar{w}_K) = \nabla_\lambda E_{S_K}(\log\bar{w}_K)\\
&= E_{S_K}\left[\sum_{k=1}^K \frac{\nabla_\lambda w_k}{\sum_{k'=1}^K w_{k'}}\right]
= E_{S_K}\left[\sum_{k=1}^K \frac{w_k \nabla_\lambda \log w_k}{\sum_{k'=1}^K w_{k'}}\right]\\
&= E_{S_K}\left[\sum\nolimits_{k=1}^K \widetilde{w}_k \nabla_\lambda \log w_k\right], \qquad (9)
\end{aligned}
\]

where $w_k = w(\theta_k) = w\{r_\lambda(s_k)\}$ and $\widetilde{w}_k = w_k/\sum_{k'=1}^K w_{k'}$ for $k = 1,\dots,K$ are normalized importance weights. Applying the chain rule,

\[
\nabla_\lambda \log w_k = \nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\log w_k - \nabla_\lambda \log q_\lambda(\theta_k). \qquad (10)
\]

In Section 4.1, we note that $E_\phi\{\nabla_\lambda \log q_\lambda(\theta)\} = 0$ as it is the expectation of the score function and hence we can omit $\nabla_\lambda \log q_\lambda(\theta)$ to obtain an unbiased estimate of $\nabla_\lambda {\mathcal L}^{\text{VI}}$. However, in this case, it is unclear if

\[
E_{S_K}\left[\sum_{k=1}^K \widetilde{w}_k \nabla_\lambda \log q_\lambda(\theta_k)\right] = 0. \qquad (11)
\]

Roeder et al. (2017) conjecture that (11) is true and report improved results when omitting the term $\nabla_\lambda \log q_\lambda(\theta_k)$ from $\nabla_\lambda \log w_k$ in computing gradient estimates. However, Tucker et al. (2018) demonstrated via simulations that (11) does not hold generally and that such an omission will result in biased gradient estimates. Our own simulations using CSGVA also suggest that (11) does not hold, even though omission of $\nabla_\lambda \log q_\lambda(\theta_k)$ does lead to improved results. As the stochastic gradient algorithm is not guaranteed to converge with biased gradient estimates, we turn to the doubly reparametrized gradient estimate proposed by Tucker et al. (2018), which allows unbiased gradient estimates to be constructed with the omission of $\nabla_\lambda \log q_\lambda(\theta_k)$, albeit with revised weights.

Since $\tilde{w}_k$ depends on $\lambda$ directly as well as through $\theta_k$, we use the chain rule to obtain

\[
\nabla_\lambda E_{\theta_k}(\tilde{w}_k) = \nabla_\lambda E_{s_k}(\tilde{w}_k)
= E_{s_k}(\nabla_\lambda\theta_k\,\nabla_{\theta_k}\tilde{w}_k) + E_{s_k}(\nabla_\lambda\tilde{w}_k), \qquad (12)
\]

where

\[
\nabla_{\theta_k}\tilde{w}_k = \left\{\frac{1}{\sum_{k'=1}^K w_{k'}} - \frac{w_k}{(\sum_{k'=1}^K w_{k'})^2}\right\}\nabla_{\theta_k}w_k
= (\tilde{w}_k - \tilde{w}_k^2)\nabla_{\theta_k}\log w_k.
\]

Alternatively,

\[
\begin{aligned}
\nabla_\lambda E_{\theta_k}(\tilde{w}_k) &= \nabla_\lambda\int q_\lambda(\theta_k)\tilde{w}_k\,d\theta_k\\
&= \int \tilde{w}_k\nabla_\lambda q_\lambda(\theta_k) + q_\lambda(\theta_k)\nabla_\lambda\tilde{w}_k\,d\theta_k\\
&= \int \tilde{w}_k q_\lambda(\theta_k)\nabla_\lambda\log q_\lambda(\theta_k)\,d\theta_k + E_{\theta_k}(\nabla_\lambda\tilde{w}_k)\\
&= E_{\theta_k}[\tilde{w}_k\nabla_\lambda\log q_\lambda(\theta_k)] + E_{s_k}(\nabla_\lambda\tilde{w}_k). \qquad (13)
\end{aligned}
\]

Comparing (12) and (13), we have

\[
\begin{aligned}
E_{\Theta_K}\left(\sum\nolimits_{k=1}^K \tilde{w}_k\nabla_\lambda\log q_\lambda(\theta_k)\right)
&= \sum\nolimits_{k=1}^K E_{\Theta_K\backslash\theta_k}E_{\theta_k}[\tilde{w}_k\nabla_\lambda\log q_\lambda(\theta_k)]\\
&= \sum\nolimits_{k=1}^K E_{S_K\backslash s_k}E_{s_k}\left(\nabla_\lambda\theta_k(\tilde{w}_k - \tilde{w}_k^2)\nabla_{\theta_k}\log w_k\right)\\
&= E_{S_K}\left\{\sum\nolimits_{k=1}^K(\tilde{w}_k - \tilde{w}_k^2)\nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\log w_k\right\}.
\end{aligned}
\]

Combining the above expression with (9) and (10), we find that

\[
\nabla_\lambda {\mathcal L}_K^{\text{IW}} = E_{S_K}\left(\sum\nolimits_{k=1}^K \tilde{w}_k^2\,\nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\log w_k\right).
\]

An unbiased gradient estimate is thus given by

\[
\widehat{\nabla}_\lambda {\mathcal L}_K^{\text{IW}} = \sum_{k=1}^K \tilde{w}_k^2\,\nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\{\log p(y,\theta_k) - \log q_\lambda(\theta_k)\}.
\]

Thus, to use CSGVA with importance weights, we only need to modify Algorithm 1 by drawing $K$ samples $s_1,\dots,s_K$ in step 1 instead of a single sample, and then compute the unbiased gradient estimate $g_t = \widehat{\nabla}_\lambda {\mathcal L}_K^{\text{IW}}$ in step 2. The rest of the steps in Algorithm 1 remain unchanged. In the importance weighted version of CSGVA, the gradient estimate based on a single sample $s$ is replaced by a weighted sum of the gradients in (7) based on $K$ samples $s_1,\dots,s_K$. However, these weights do not necessarily sum to 1. An unbiased estimate of ${\mathcal L}_K^{\text{IW}}$ itself is given by $\hat{{\mathcal L}}_K^{\text{IW}} = \log\frac{1}{K}\sum_{k=1}^K w_k$.
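
A sketch of this doubly reparametrized estimator in Julia is given below; grad_path is an assumed callback that returns, for a sample $s_k$, the pair $(\log w_k,\ \nabla_\lambda r_\lambda(s_k)\nabla_{\theta_k}\log w_k)$, and all names are our own.

\begin{verbatim}
function iwlb_gradient(grad_path, S)
    logw = Float64[]
    grads = Vector{Float64}[]
    for s in S                                     # S holds the K samples s_1,...,s_K
        lw, g = grad_path(s)
        push!(logw, lw)
        push!(grads, g)
    end
    a = exp.(logw .- maximum(logw))
    wtilde = a ./ sum(a)                           # normalized importance weights
    return sum(wtilde[k]^2 .* grads[k] for k in eachindex(grads))
end
\end{verbatim}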

7 Experimental results

We apply CSGVA to GLMMs and state space models and compare the results with those of GVA in terms of computation time and accuracy of the posterior approximation. Lower bounds reported exclude constants which are independent of the model variables. We also illustrate how CSGVA can be improved using importance weighting (IW-CSGVA), considering $K \in \{5, 20, 100\}$. We find that IW-CSGVA performs poorly if it is initialized in the standard manner using $\lambda = 0$. This is because, when $q_\lambda(\theta)$ is still far from optimal, a few of the importance weights tend to dominate with the rest close to zero, thus producing inaccurate estimates of the gradients. Hence we initialize IW-CSGVA using the CSGVA fit, and the algorithm is terminated after a short run of 1000 iterations as IW-CSGVA is computationally intensive and improvements in the IWLB and variational approximation seem to be negligible thereafter. Posterior distributions estimated using MCMC via RStan are regarded as the ground truth. Code for all variational algorithms is written in Julia and is available as supplementary material. All experiments are run on an Intel Core i9-9900K CPU @3.60 GHz with 16GB RAM. As the computation time of IW-CSGVA increases linearly with $K$, we also investigate the speedup that can be achieved through parallel computing on a machine with 8 cores. Julia retains one worker (or core) as the master process and parallel computing is implemented using the remaining seven workers.

The parametrization of a hierarchical model plays a major role in the rate of convergence of both GVA and CSGVA. In some cases, it can even affect the ability of the algorithm to converge (to a local mode). We have attempted both the centered and noncentered parametrizations (Papaspiliopoulos et al., 2003, 2007), which are known to have a huge impact on the rate of convergence of MCMC algorithms. These two parametrizations are complementary and neither is superior to the other. If an algorithm converges slowly under one parametrization, it often converges much faster under the other. Which parametrization works better usually depends on the nature of the data. For the datasets that we use in the experiments, the centered parametrization was found to have better convergence properties than the noncentered parametrization for GLMMs, while the noncentered parametrization is preferred for stochastic volatility models. These observations are further discussed below.

8 Generalized linear mixed models

Let $y_i = (y_{i1},\dots,y_{in_i})^T$ denote the vector of responses of length $n_i$ for the $i$th subject for $i = 1,\dots,n$, where $y_{ij}$ is generated from some distribution in the exponential family. The mean $\mu_{ij} = E(y_{ij})$ is connected to the linear predictor $\eta_{ij}$ via

\[
g(\mu_{ij}) = \eta_{ij} = X_{ij}^T\beta + Z_{ij}^T b_i,
\]

for some smooth invertible link function $g(\cdot)$. Here $\beta$ is a $p \times 1$ vector of fixed effects and $b_i \sim N(0,\Lambda)$ is an $L\times 1$ vector of random effects specific to the $i$th subject. $X_{ij}$ and $Z_{ij}$ are vectors of covariates of length $p$ and $L$ respectively. Let $\eta_i = [\eta_{i1},\dots,\eta_{in_i}]^T$, $X_i = [X_{i1},\dots,X_{in_i}]^T$ and $Z_i = [Z_{i1},\dots,Z_{in_i}]^T$. We focus on the one-parameter exponential family with canonical links. This includes the Bernoulli model, $y_{ij} \sim \text{Bern}(\mu_{ij})$, with the logit link $g(\mu_{ij}) = \log\{\mu_{ij}/(1-\mu_{ij})\}$, and the Poisson model, $y_{ij} \sim \text{Pois}(\mu_{ij})$, with the log link $g(\mu_{ij}) = \log(\mu_{ij})$. Let $WW^T$ be the unique Cholesky decomposition of $\Lambda^{-1}$, where $W$ is a lower triangular matrix with positive diagonal entries. Define $W^*$ such that $W^*_{ii} = \log(W_{ii})$ and $W^*_{ij} = W_{ij}$ if $i \neq j$, and let $\omega = v(W^*)$. We consider normal priors, $\beta \sim N(0,\sigma_\beta^2 I)$ and $\omega \sim N(0,\sigma_\omega^2 I)$, where $\sigma_\beta^2$ and $\sigma_\omega^2$ are set to 100.

The above parametrization of the GLMM is noncentered since $b_i$ has mean 0. Alternatively, we can consider the centered parametrization proposed by Tan and Nott (2013). Suppose the covariates for the random effects are a subset of the covariates for the fixed effects, and the first columns of $X_i$ and $Z_i$ are ones corresponding to an intercept and a random intercept respectively. Then we can write

\[
\eta_i = X_i\beta + Z_i b_i = Z_i\beta_R + X_i^G\beta_G + Z_i b_i,
\]

where $\beta = [\beta_R^T, \beta_G^T]^T$. We further split $X_i^G$ into covariates which are subject specific (varying only with $i$ and taking the same value for $j = 1,\dots,n_i$) and those which are not, and split $\beta_G = [\beta_{G_1}^T, \beta_{G_2}^T]^T$ accordingly, where $\beta_{G_1}$, $\beta_{G_2}$ are vectors of length $g_1$ and $g_2$ respectively. Then

\[
\eta_i = Z_i\beta_R + {\bf 1}_{n_i}{x_i^{G_1}}^T\beta_{G_1} + X_i^{G_2}\beta_{G_2} + Z_i b_i
= Z_i(C_i\beta_{RG_1} + b_i) + X_i^{G_2}\beta_{G_2},
\]

where

\[
C_i = \begin{bmatrix} I_L & \begin{matrix} {x_i^{G_1}}^T \\ 0_{(L-1)\times g_1} \end{matrix} \end{bmatrix}
\quad\text{and}\quad
\beta_{RG_1} = \begin{bmatrix}\beta_R\\ \beta_{G_1}\end{bmatrix}.
\]

Let $\tilde{b}_i = C_i\beta_{RG_1} + b_i$. The centered parametrization is represented as

\[
\eta_i = Z_i\tilde{b}_i + X_i^{G_2}\beta_{G_2}, \quad \tilde{b}_i \sim N(C_i\beta_{RG_1}, \Lambda), \qquad (14)
\]

for $i = 1,\dots,n$.

Tan and Nott (2013) compare the centered, noncentered and partially noncentered parametrizations for GLMMs in the context of variational Bayes, showing that the choice of parametrization affects not only the rate of convergence, but also the accuracy of the variational approximation. For CSGVA, we observe that the accuracy of the variational approximation and the rate of convergence can also be greatly affected. Tan and Nott (2013) demonstrate that the centered parametrization is preferred when the observations are highly informative about the latent variables. In practice, a general guideline is to use the centered parametrization for Poisson models when observed counts are large and the noncentered parametrization when most counts are close to zero. For Bernoulli models, differences between using centered and noncentered parametrizations are usually minor. Here we use the centered parametrization in (14), which has been observed to yield gains in convergence rates for the datasets used for illustration.

The global parameters are $\theta_G = (\beta^T, \omega^T)^T$ of dimension $G = p + L(L+1)/2$, and the local variables are $\theta_L = (\tilde{b}_1,\dots,\tilde{b}_n)^T$. The joint density is

\[
p(y,\theta) = p(\beta)\,p(\omega)\prod_{i=1}^n\bigg\{p(\tilde{b}_i|\omega,\beta)\prod_{j=1}^{n_i}p(y_{ij}|\beta,\tilde{b}_i)\bigg\}.
\]

The log of the joint density is given by

\[
\begin{aligned}
\log p(y,\theta) &= \sum\nolimits_{i=1}^n\{y_i^T\eta_i - {\bf 1}^T h(\eta_i) - (\tilde{b}_i - C_i\beta_{RG_1})^T WW^T(\tilde{b}_i - C_i\beta_{RG_1})/2\}\\
&\quad - \beta^T\beta/(2\sigma_\beta^2) - \omega^T\omega/(2\sigma_\omega^2) + n\log|W| + c,
\end{aligned}
\]

where $h(\cdot)$ is the log-partition function and $c$ is a constant independent of $\theta$. For the Poisson model with log link, $h(x)=\exp(x)$. For the Bernoulli model with logit link, $h(x)=\log\{1+\exp(x)\}$. The gradient $\nabla_\theta\log p(y,\theta)$ is given in Appendix B.
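For concreteness, a minimal sketch of this log joint density for the Poisson case (log link, so $h(x)=\exp(x)$) is given below; it is our own illustration with hypothetical argument names, where the per-subject quantities are passed as lists and the additive constant $c$ is dropped.

```python
import numpy as np

def log_joint_poisson_glmm(y, Z, X_G2, C, b_tilde, beta_RG1, beta_G2, W, omega,
                           sigma2_beta=100.0, sigma2_omega=100.0):
    """log p(y, theta) up to the constant c for the centered Poisson GLMM."""
    n = len(y)
    beta = np.concatenate([beta_RG1, beta_G2])
    out = (-beta @ beta / (2.0 * sigma2_beta)
           - omega @ omega / (2.0 * sigma2_omega)
           + n * np.sum(np.log(np.diag(W))))          # n log|W|, W lower triangular
    for i in range(n):
        eta_i = Z[i] @ b_tilde[i] + X_G2[i] @ beta_G2
        resid = b_tilde[i] - C[i] @ beta_RG1
        out += y[i] @ eta_i - np.sum(np.exp(eta_i)) \
               - 0.5 * resid @ (W @ W.T) @ resid      # quadratic form in the random effects
    return out
```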

For the GLMM, $b_i$ and $b_j$ are conditionally independent given $\theta_G$ for $i\neq j$ in $p(\theta|y)$. Hence we impose the following sparsity structure on $\Omega_2$ and $C_2$,

\Omega_{2}=\begin{bmatrix}\Omega_{2,11}&0&\dots&0\\ 0&\Omega_{2,22}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\Omega_{2,nn}\end{bmatrix},
C_{2}^{*}=\begin{bmatrix}C^{*}_{2,11}&0&\dots&0\\ 0&C^{*}_{2,22}&\dots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&C^{*}_{2,nn}\end{bmatrix},

where each $\Omega_{2,ii}$ is an $L\times L$ block matrix and each $C^*_{2,ii}$ is an $L\times L$ lower triangular matrix. We set $f_i=0$ and $F_{ij}=0$ for all $i\in\mathcal{I}$ and all $j$, where $\mathcal{I}$ denotes the set of indices in $v(C_2^*)$ which are fixed at zero. The number of nonzero elements in $v(C_2^*)$ is $nL(L+1)/2$. Hence the number of variational parameters to be optimized is reduced from $nL(nL+1)/2$ to $nL(L+1)/2$ for $f$ and from $nL(nL+1)G/2$ to $nL(L+1)G/2$ for $F$.
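The index set $\mathcal{I}$ can be precomputed. The sketch below (our own helper, assuming a column-major vech ordering of $v(\cdot)$, which is an assumption) returns the positions in $v(C_2^*)$ that are allowed to be nonzero; its complement gives $\mathcal{I}$.

```python
import numpy as np

def blockdiag_free_indices(n, L):
    # Positions in v(C_2^*) (vech ordering of the nL x nL lower triangle)
    # that lie inside one of the n diagonal L x L blocks.
    m, keep, pos = n * L, [], 0
    for j in range(m):                # column of the lower triangle
        for i in range(j, m):         # row
            if i // L == j // L:      # same diagonal L x L block
                keep.append(pos)
            pos += 1
    return np.array(keep)

# sanity check: n L (L + 1) / 2 free entries remain
assert blockdiag_free_indices(n=3, L=2).size == 3 * 2 * 3 // 2
```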

8.1 Epilepsy data

In this epilepsy dataset (Thall and Vail, 1990), $n=59$ patients are involved in a clinical trial to investigate the effects of the anti-epileptic drug Progabide. The patients are randomly assigned to receive either the drug (Trt = 1) or a placebo (Trt = 0). The response $y_{ij}$ denotes the number of epileptic attacks patient $i$ had during the $j$th of four follow-up periods of two weeks each. Covariates include the log of the age of the patients (Age), the log of 1/4 of the baseline seizure count recorded over an eight-week period prior to the trial (Base) and Visit (coded as Visit$_1=-0.3$, Visit$_2=-0.1$, Visit$_3=0.1$ and Visit$_4=0.3$). Note that Age is centered about its mean. Consider $y_{ij}\sim\text{Pois}(\mu_{ij})$, where

\log\mu_{ij}=\beta_{0}+\beta_{\text{Base}}\text{Base}_{i}+\beta_{\text{Trt}}\text{Trt}_{i}+\beta_{\text{Age}}\text{Age}_{i}+\beta_{\text{Base}\times\text{Trt}}\text{Base}_{i}\times\text{Trt}_{i}+\beta_{\text{Visit}}\text{Visit}_{ij}+b_{i1}+b_{i2}\text{Visit}_{ij},

for $i=1,\dots,59$, $j=1,\dots,4$ (Breslow and Clayton, 1993).
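To fix ideas, the fixed- and random-effects design matrices for one patient under this model could be assembled as below; this is a hypothetical helper with placeholder covariate values, and the column ordering is our assumption.

```python
import numpy as np

VISIT = np.array([-0.3, -0.1, 0.1, 0.3])   # coding of the four visits

def design_epilepsy(base_i, trt_i, age_i):
    # X_i: intercept, Base, Trt, Age, Base x Trt, Visit   (4 x 6)
    X_i = np.column_stack([np.ones(4), np.full(4, base_i), np.full(4, trt_i),
                           np.full(4, age_i), np.full(4, base_i * trt_i), VISIT])
    # Z_i: random intercept and random Visit slope        (4 x 2)
    Z_i = np.column_stack([np.ones(4), VISIT])
    return X_i, Z_i
```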

Table 1 shows the results obtained from applying the variational algorithms to this data. The lower bounds are estimated using 1000 simulations in each case, and the mean and standard deviation from these simulations are reported. CSGVA produced an improvement in the estimate of the lower bound (3139.2) as compared to GVA (3138.3), and maximizing the IWLB led to further improvements. Using a larger $K$ of 20 or 100 resulted in minimal improvements compared with $K=5$. As this dataset is small, parallel computing is slower than serial computing for small $K$. This is because, even though the importance weights and gradients for $K$ samples are computed in parallel, the cost of sending and fetching data from the workers at each iteration dwarfs the cost of computation when $K$ is small. For $K=100$, parallel computing reduces the computation time by about half.

Table 1: Epilepsy data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$   time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    31   13.9       -      3138.3 (1.8)
CSGVA         1    39   16.2       -      3139.2 (1.5)
IW-CSGVA      5     1    2.5      6.1     3139.9 (0.7)
IW-CSGVA     20     1    6.9      8.1     3140.1 (0.4)
IW-CSGVA    100     1   33.5     16.0     3140.1 (0.3)

The estimated marginal posterior distributions of the global parameters are shown in Figure 1. The plots show that CSGVA (red) produces improved estimates of the posterior distribution as compared to GVA (blue), especially in estimating the posterior variance of the precision parameters $\omega_2$ and $\omega_3$. The posteriors estimated using IW-CSGVA for the different values of $K$ are very close. By using just $K=5$, we are able to obtain estimates that are virtually indistinguishable from those of MCMC.

Figure 1: Epilepsy data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

8.2 Madras data

These data come from the Madras longitudinal schizophrenia study (Diggle et al., 2002) for detecting a psychiatric symptom called “thought disorder”. Monthly records showing whether the symptom is present in a patient are kept for $n=86$ patients over 12 months. The response $y_{ij}$ is a binary indicator for the presence of the symptom. Covariates include the time in months since initial hospitalization ($t$), gender of the patient (Gender = 0 if male and 1 if female) and age of the patient (Age = 0 if the patient is younger than 20 years and 1 otherwise). Consider $y_{ij}\sim\text{Bern}(\mu_{ij})$ and

\text{logit}(\mu_{ij})=\beta_{0}+\beta_{\text{Age}}\text{Age}_{i}+\beta_{\text{Gender}}\text{Gender}_{i}+\beta_{t}t_{ij}+\beta_{\text{Age}\times t}\text{Age}_{i}\times t_{ij}+\beta_{\text{Gender}\times t}\text{Gender}_{i}\times t_{ij}+b_{i},

for $i=1,\dots,86$, $1\leq j\leq 12$.

The results in Table 2 are quite similar to those for the epilepsy data. CSGVA yields an improvement in the lower bound estimate as compared to GVA, and IW-CSGVA provides further improvements, which are evident with a $K$ as small as 5. Parallel computing halved the computation time for $K=100$ but did not yield any benefits for $K\in\{5,20\}$. From Figure 2, CSGVA and IW-CSGVA are again able to capture the posterior variance of the precision parameter $\omega_1$ better than GVA.

Table 2: Madras data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$   time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    25   13.1       -      -383.4 (1.4)
CSGVA         1    35   12.6       -      -383.1 (1.4)
IW-CSGVA      5     1    2.4      7.1     -382.5 (0.7)
IW-CSGVA     20     1    6.8      8.9     -382.4 (0.4)
IW-CSGVA    100     1   33.9     16.8     -382.3 (0.2)
Figure 2: Madras data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

8.3 Six cities data

In the six cities data (Fitzmaurice and Laird, 1993), $n=537$ children from Steubenville, Ohio, are involved in a longitudinal study to investigate the health effects of air pollution. Each child is examined yearly from age 7 to 10 and the response $y_{ij}$ is a binary indicator for wheezing. There are two covariates, $\text{Smoke}_i$ (a binary indicator for the smoking status of the mother of child $i$) and $\text{Age}_{ij}$ (age of child $i$ at time point $j$, centered at 9 years). Consider $y_{ij}\sim\text{Bern}(\mu_{ij})$, where

\text{logit}(\mu_{ij})=\beta_{0}+\beta_{\text{Smoke}}\text{Smoke}_{i}+\beta_{\text{Age}}\text{Age}_{ij}+\beta_{\text{Smoke}\times\text{Age}}\text{Smoke}_{i}\times\text{Age}_{ij}+b_{i},

for $i=1,\dots,537$, $j=1,\dots,4$.

Table 3: Six cities data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$    time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    26    60.3       -      -816.4 (4.0)
CSGVA         1    28    36.5       -      -816.0 (3.9)
IW-CSGVA      5     1     6.5     16.3     -812.6 (2.5)
IW-CSGVA     20     1    23.1     24.5     -811.0 (1.9)
IW-CSGVA    100     1   115.5     61.4     -809.8 (1.5)

From Table 3, CSGVA managed to achieve a higher lower bound than GVA in about half the runtime. As $K$ increases, IW-CSGVA produces tighter lower bounds for the log marginal likelihood. As in the previous two examples, parallel computing is beneficial only when $K=100$, cutting the runtime by about half. From Figure 3, there is slight overestimation of the posterior means of $\beta_0$ and $\omega_1$ by all the variational methods. However, CSGVA and IW-CSGVA are able to capture the posterior variance of these parameters much better than GVA, especially for $\omega_1$.

Figure 3: Six cities data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

9 Application to state space models

Here we consider the stochastic volatility model (SVM) widely used for modeling financial time series. Let each observation $y_i$, for $i=1,\dots,n$, be generated from a zero-mean Gaussian distribution whose error variance evolves stochastically over time. The unobserved log volatility $b_i$ is modeled using an autoregressive process of order one with Gaussian disturbances:

\begin{gathered}y_{i}|b_{i},\sigma,\kappa\sim N(0,{\rm e}^{\sigma b_{i}+\kappa}),\quad(i=1,\dots,n),\\ b_{i}|b_{i-1},\phi\sim N(\phi b_{i-1},1),\quad(i=2,\dots,n),\\ b_{1}|\phi\sim N(0,1/(1-\phi^{2})),\end{gathered}

where $\kappa\in\mathbb{R}$, $\sigma>0$ and $0<\phi<1$. Here, $y_i$ represents the mean-corrected return at time $i$ and the states $\{b_i\}$ come from a stationary process, with $b_1$ drawn from the stationary distribution. The parametrization of the SVM above is noncentered. The centered parametrization can be obtained by replacing $b_i$ by $(\tilde{b}_i-\kappa)/\sigma$. The performance of GVA and CSGVA is sensitive to the parametrization, both in terms of the rate of convergence and the local mode attained. For the datasets below, the noncentered parametrization was found to have better convergence properties. The sensitivity of the stochastic volatility model to parametrization when fitted using MCMC algorithms is well known in the literature. To “combine the best of different worlds”, Kastner and Frühwirth-Schnatter (2014) introduce a strategy that samples parameters from the centered and noncentered parametrizations alternately. Tan (2017) studies optimal partially noncentered parametrizations for Gaussian state space models fitted using EM, MCMC or variational algorithms.
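As an illustration of the generative model, data from the noncentered SVM can be simulated as in the sketch below (our own code; the function name and seed are arbitrary).

```python
import numpy as np

def simulate_svm(n, sigma, kappa, phi, seed=0):
    # b_1 ~ N(0, 1/(1 - phi^2)), b_i | b_{i-1} ~ N(phi b_{i-1}, 1),
    # y_i | b_i ~ N(0, exp(sigma b_i + kappa))
    rng = np.random.default_rng(seed)
    b = np.empty(n)
    b[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - phi ** 2))
    for i in range(1, n):
        b[i] = phi * b[i - 1] + rng.normal()
    y = rng.normal(0.0, np.exp(0.5 * (sigma * b + kappa)))  # sd = exp((sigma*b + kappa)/2)
    return y, b

y, b = simulate_svm(n=1000, sigma=0.3, kappa=-1.0, phi=0.95)
```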

We use the following transformations to map the constrained parameters to $\mathbb{R}$:

\alpha=\log(\exp(\sigma)-1),\quad\psi=\text{logit}(\phi).

This transformation for $\alpha$ works better than $\alpha=\log(\sigma)$, which more often leads to erratic fluctuations in the lower bound and convergence issues. The global variables are $\theta_G=[\alpha,\kappa,\psi]^T$ of dimension $G=3$ and the local variables are $\theta_L=[b_1,\dots,b_n]^T$ of length $n$. We assume normal priors for the global parameters, $\alpha\sim N(0,\sigma_\alpha^2)$, $\kappa\sim N(0,\sigma_\kappa^2)$ and $\psi\sim N(0,\sigma_\psi^2)$, where $\sigma_\alpha^2=\sigma_\kappa^2=\sigma_\psi^2=10$. The joint density can be written as

p(y,\theta)=p(\alpha)p(\kappa)p(\psi)p(b_{1}|\psi)\bigg\{\prod_{i=2}^{n}p(b_{i}|b_{i-1},\psi)\bigg\}\bigg\{\prod_{i=1}^{n}p(y_{i}|b_{i},\alpha,\kappa)\bigg\}.

The log joint density is

\log p(y,\theta)=-\frac{n\kappa}{2}-\frac{\sigma}{2}\sum_{i=1}^{n}b_{i}-\frac{1}{2}\sum_{i=1}^{n}y_{i}^{2}{\rm e}^{-\sigma b_{i}-\kappa}
\quad-\frac{1}{2}\sum_{i=2}^{n}(b_{i}-\phi b_{i-1})^{2}-\frac{1}{2}b_{1}^{2}(1-\phi^{2})
\quad+\frac{1}{2}\log(1-\phi^{2})-\frac{\alpha^{2}}{2\sigma_{\alpha}^{2}}-\frac{\kappa^{2}}{2\sigma_{\kappa}^{2}}-\frac{\psi^{2}}{2\sigma_{\psi}^{2}}+c,

where $\phi=\exp(\psi)/\{1+\exp(\psi)\}$, $\sigma=\log\{\exp(\alpha)+1\}$ and $c$ is a constant independent of $\theta$. The gradient $\nabla_\theta\log p(y,\theta)$ is given in Appendix C. For this model, $b_i$ is conditionally independent of $b_j$ in the posterior given $\theta_G$ if $|i-j|>1$. Thus, the sparsity structure imposed on $\Omega_2$ and $C_2$ is

\Omega_{2}=\begin{bmatrix}\Omega_{2,11}&\Omega_{2,12}&0&\dots&0\\ \Omega_{2,21}&\Omega_{2,22}&\Omega_{2,23}&\dots&0\\ 0&\Omega_{2,32}&\Omega_{2,33}&\dots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\dots&\Omega_{2,nn}\end{bmatrix},
C_{2}=\begin{bmatrix}C_{2,11}&0&0&\dots&0\\ C_{2,21}&C_{2,22}&0&\dots&0\\ 0&C_{2,32}&C_{2,33}&\dots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\dots&C_{2,nn}\end{bmatrix}.

The number of nonzero elements in $v(C_2^*)$ is $2n-1$. Setting $f_i=0$ and $F_{ij}=0$ for all $i\in\mathcal{I}$ and all $j$, where $\mathcal{I}$ denotes the set of indices in $v(C_2^*)$ which are fixed at zero, the number of variational parameters to be optimized is reduced from $n(n+1)/2$ to $2n-1$ for $f$, and from $n(n+1)G/2$ to $(2n-1)G$ for $F$.
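Returning to the log joint density of the SVM given above, a minimal sketch evaluating it on the unconstrained scale is shown below (our own illustration with hypothetical argument names, not the authors' implementation; the constant $c$ is dropped).

```python
import numpy as np

def log_joint_svm(y, b, alpha, kappa, psi, s2_alpha=10.0, s2_kappa=10.0, s2_psi=10.0):
    sigma = np.log1p(np.exp(alpha))        # softplus: sigma = log(exp(alpha) + 1) > 0
    phi = 1.0 / (1.0 + np.exp(-psi))       # logistic: 0 < phi < 1
    n = len(y)
    return (-0.5 * n * kappa
            - 0.5 * sigma * np.sum(b)
            - 0.5 * np.sum(y**2 * np.exp(-sigma * b - kappa))
            - 0.5 * np.sum((b[1:] - phi * b[:-1])**2)      # AR(1) transitions
            - 0.5 * b[0]**2 * (1.0 - phi**2)               # stationary initial state
            + 0.5 * np.log(1.0 - phi**2)
            - alpha**2 / (2.0 * s2_alpha)
            - kappa**2 / (2.0 * s2_kappa)
            - psi**2 / (2.0 * s2_psi))
```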

9.1 GBP/USD exchange rate data

These data contain 946 observations of the exchange rate of the US Dollar (USD) against the British Pound (GBP), recorded daily from 1 October 1981 to 28 June 1985, and are available as the “Garch” dataset in the R package Ecdat. The mean-corrected responses $\{y_t\,|\,t=1,\dots,n\}$ are computed from the exchange rates $\{r_t\,|\,t=0,\dots,n\}$ as

y_{t}=100\left\{\log\left(\frac{r_{t}}{r_{t-1}}\right)-\frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{r_{i}}{r_{i-1}}\right)\right\},

where $n=945$.
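This preprocessing step is straightforward to reproduce; a minimal sketch (our own, assuming the exchange rates are supplied as a numeric array) follows.

```python
import numpy as np

def mean_corrected_returns(rates):
    # y_t = 100 * { log(r_t / r_{t-1}) - mean of the log returns }
    log_ret = np.diff(np.log(np.asarray(rates, dtype=float)))
    return 100.0 * (log_ret - log_ret.mean())
```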

For this dataset, CSGVA failed to achieve a higher lower bound than GVA when it was initialized using $\lambda=0$. Hence we initialize CSGVA using the fit from GVA instead, by using the association discussed in Section 5. With this informative starting point, CSGVA converged in 16000 iterations and managed to improve upon the GVA fit, attaining a higher lower bound. IW-CSGVA led to further improvements with increasing $K$. As this dataset contains a large number of observations, with correspondingly more variational parameters to be optimized, the computation is more intensive and parallel computing becomes very useful, reducing the computation time by factors of 1.8, 2.9 and 4.2 for $K=5$, 20 and 100 respectively.

Table 4: GBP data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$    time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    61   239.7       -      -138.2 (1.3)
CSGVA         1    16    58.6       -      -137.8 (1.3)
IW-CSGVA      5     1    18.3     10.2     -137.4 (1.0)
IW-CSGVA     20     1    71.2     24.4     -137.0 (0.5)
IW-CSGVA    100     1   355.3     84.3     -136.8 (0.4)

Figure 4 shows the estimated marginal posteriors of the global parameters. CSGVA provides significant improvements in estimating the posterior variance of $\alpha$ and $\psi$ as compared to GVA. With the application of IW-CSGVA, we are able to obtain posterior estimates that are quite close to those of MCMC even with a small $K$.

Figure 4: GBP data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

Figure 5 shows the estimated marginal posteriors of the latent states $\{b_i\}$, summarized using the mean (solid line) and one standard deviation from the mean (dotted line), as estimated by MCMC (black) and IW-CSGVA ($K=5$, purple).

Figure 5: GBP data. Top: Posterior means (solid line) of the latent states and one standard deviation from the mean (dotted line) estimated using MCMC (black) and IW-CSGVA ($K=5$, purple). Bottom: Boxplots of $\text{mean}_{\text{MCMC}}-\text{mean}_{\text{VA}}$ and $\text{sd}_{\text{MCMC}}-\text{sd}_{\text{VA}}$.

For IW-CSGVA, the means and standard deviations are estimated based on 2000 samples, generated by drawing $\theta_G$ from $q(\theta_G)$ followed by $\theta_L$ from $q(\theta_L|\theta_G)$. For MCMC, estimation was based on 5000 samples. IW-CSGVA estimated the means quite accurately (with a little overestimation), but the standard deviations are underestimated when compared to MCMC. The boxplots show the differences between the means and standard deviations estimated by MCMC and those estimated by IW-CSGVA ($K=5$) and GVA. We can see that IW-CSGVA estimated both the means and standard deviations more accurately than GVA.

9.2 New York stock exchange data

Next we consider 2000 observations of the returns of the New York Stock Exchange (NYSE) from February 2, 1984 to December 31, 1991. These data are available as the dataset “nyse” in the R package astsa. We take 100 times the mean-corrected returns as the responses $\{y_i\}$.

From Table 5, CSGVA was able to attain a higher lower bound than GVA when initialized in the standard manner using $\lambda=0$. Applying IW-CSGVA led to further improvements as $K$ increases. For this massive dataset, parallel computing yields significant reductions in computation time, by factors of 2.9, 4.5 and 5.5 corresponding to $K=5$, 20 and 100 respectively.

Table 5: NYSE data. Number of iterations $I$ (in thousands), runtimes (in seconds) and estimates of the lower bound (standard deviation in brackets) of the variational methods.

            $K$   $I$     time   parallel   $\hat{\mathcal{L}}_K^{\text{IW}}$
GVA           1    43    679.0       -      -570.8 (1.8)
CSGVA         1    49    749.2       -      -570.7 (2.0)
IW-CSGVA      5     1     76.0     26.1     -569.4 (1.1)
IW-CSGVA     20     1    305.0     67.9     -569.0 (0.7)
IW-CSGVA    100     1   1503.0    274.0     -568.7 (0.4)

Figure 6 shows that the marginal posteriors estimated using CSGVA are quite close to those of MCMC, while GVA underestimates the posterior variance of $\alpha$ and $\psi$ quite severely. Posteriors estimated by IW-CSGVA are virtually indistinguishable from those of MCMC.

Figure 6: NYSE data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple ($K=5$), orange ($K=20$), green ($K=100$).

From Figure 7, we can see that the marginal posteriors of the latent states are also estimated very well using IW-CSGVA ($K=5$), and there is no systematic underestimation of the posterior variance, unlike in the previous example. GVA captures the posterior means very well but does not perform as well as IW-CSGVA in estimating the posterior variance.

Figure 7: NYSE data. Top: Posterior means (solid line) of the latent states and one standard deviation from the mean (dotted line) estimated using MCMC (black) and IW-CSGVA ($K=5$, purple). Bottom: Boxplots of $\text{mean}_{\text{MCMC}}-\text{mean}_{\text{VA}}$ and $\text{sd}_{\text{MCMC}}-\text{sd}_{\text{VA}}$.

10 Conclusion

In this article, we have proposed a Gaussian variational approximation for hierarchical models that adopts a conditional structure $q(\theta)=q(\theta_G)q(\theta_L|\theta_G)$. The dependence of the local variables $\theta_L$ on the global variables $\theta_G$ is then captured using a linear approximation. This structure is very useful when there are global scale parameters in $\theta_G$ which help to determine the scale of the local variables in the conditional posterior of $\theta_L|\theta_G$. We further demonstrate how CSGVA can be improved by maximizing the importance weighted lower bound. In our experiments, a $K$ as small as 5 can lead to significant improvements in the variational approximation with just a short run. Moreover, for massive datasets, computation time can be further reduced through parallel computing. Our experiments indicate that CSGVA coupled with importance weighting is particularly useful in improving the estimation of the posterior variance of the precision parameters $\omega$ in GLMMs, and of the persistence $\phi$ and volatility $\sigma$ of the log-variance in SVMs.

References

  • Archer et al. (2016) Archer E, Park IM, Buesing L, Cunningham J, Paninski L (2016) Black box variational inference for state space models. arXiv:1511.07367
  • Blei et al. (2017) Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: A review for statisticians. Journal of the American Statistical Association 112:859–877
  • Breslow and Clayton (1993) Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88:9–25
  • Burda et al. (2016) Burda Y, Grosse R, Salakhutdinov R (2016) Importance weighted autoencoders. In: Proceedings of the 4th International Conference on Learning Representations (ICLR)
  • Diggle et al. (2002) Diggle PJ, Heagerty P, Liang KY, Zeger SL (2002) The analysis of Longitudinal Data, 2nd edn. Oxford University Press, Oxford
  • Dinh et al. (2017) Dinh L, Sohl-Dickstein J, Bengio S (2017) Density estimation using real NVP. In: Proceedings of the 5th International Conference on Learning Representations (ICLR)
  • Domke and Sheldon (2018) Domke J, Sheldon DR (2018) Importance weighting and variational inference. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in Neural Information Processing Systems 31, Curran Associates, Inc., pp 4470–4479
  • Fitzmaurice and Laird (1993) Fitzmaurice GM, Laird NM (1993) A likelihood-based method for analysing longitudinal binary responses. Biometrika 80:141–151
  • Germain et al. (2015) Germain M, Gregor K, Murray I, Larochelle H (2015) MADE: Masked autoencoder for distribution estimation. In: Bach F, Blei D (eds) Proceedings of the 32nd International Conference on Machine Learning, PMLR, Lille, France, Proceedings of Machine Learning Research, vol 37, pp 881–889
  • Guo et al. (2016) Guo F, Wang X, Broderick T, Dunson DB (2016) Boosting variational inference. arXiv: 1611.05559
  • Han et al. (2016) Han S, Liao X, Dunson DB, Carin LC (2016) Variational Gaussian copula inference. In: Gretton A, Robert CC (eds) Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, vol 51, pp 829–838
  • Hoffman and Blei (2015) Hoffman M, Blei D (2015) Stochastic structured variational inference. In: Lebanon G, Vishwanathan S (eds) Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, vol 38, pp 361–369
  • Huszár (2017) Huszár F (2017) Variational inference using implicit distributions. arXiv:1702.08235
  • Jaakkola and Jordan (1998) Jaakkola TS, Jordan MI (1998) Improving the mean field approximation via the use of mixture distributions, Springer Netherlands, Dordrecht, pp 163–173
  • Kastner and Frühwirth-Schnatter (2014) Kastner G, Frühwirth-Schnatter S (2014) Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics and Data Analysis 76:408–423
  • Kingma and Ba (2015) Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR)
  • Kingma and Welling (2014) Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR)
  • Kingma et al. (2016) Kingma DP, Salimans T, Jozefowicz R, Chen X, Sutskever I, Welling M (2016) Improved variational inference with inverse autoregressive flow. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pp 4743–4751
  • Li and Turner (2016) Li Y, Turner RE (2016) Rényi divergence variational inference. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, Curran Associates Inc., USA, NIPS’16, pp 1081–1089
  • Maddison et al. (2017) Maddison CJ, Lawson J, Tucker G, Heess N, Norouzi M, Mnih A, Doucet A, Teh Y (2017) Filtering variational objectives. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp 6573–6583
  • Magnus and Neudecker (1980) Magnus JR, Neudecker H (1980) The elimination matrix: Some lemmas and applications. SIAM Journal on Algebraic Discrete Methods 1:422–449
  • Magnus and Neudecker (1999) Magnus JR, Neudecker H (1999) Matrix differential calculus with applications in statistics and econometrics, 3rd edn. John Wiley & Sons, New York
  • Miller et al. (2016) Miller AC, Foti N, Adams RP (2016) Variational boosting: Iteratively refining posterior approximations. arXiv: 1611.06585
  • Minka (2005) Minka T (2005) Divergence measures and message passing. Tech. rep.
  • Ormerod and Wand (2010) Ormerod JT, Wand MP (2010) Explaining variational approximations. The American Statistician 64:140–153
  • Papaspiliopoulos et al. (2003) Papaspiliopoulos O, Roberts GO, Sköld M (2003) Non-centered parameterisations for hierarchical models and data augmentation. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian Statistics 7, Oxford University Press, New York, pp 307–326
  • Papaspiliopoulos et al. (2007) Papaspiliopoulos O, Roberts GO, Sköld M (2007) A general framework for the parametrization of hierarchical models. Statist Sci 22:59–73
  • Ranganath et al. (2016) Ranganath R, Tran D, Blei DM (2016) Hierarchical variational models. In: Balcan MF, Weinberger KQ (eds) Proceedings of The 33rd International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, vol 37, pp 324–333
  • Regli and Silva (2018) Regli JB, Silva R (2018) Alpha-Beta divergence for variational inference. arXiv: 1805.01045
  • Rényi (1961) Rényi A (1961) On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, University of California Press, Berkeley, Calif., pp 547–561
  • Rezende and Mohamed (2015) Rezende DJ, Mohamed S (2015) Variational inference with normalizing flows. In: Bach F, Blei D (eds) Proceedings of The 32nd International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, vol 37, pp 1530–1538
  • Rezende et al. (2014) Rezende DJ, Mohamed S, Wierstra D (2014) Stochastic backpropagation and approximate inference in deep generative models. In: Xing EP, Jebara T (eds) Proceedings of The 31st International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, vol 32, pp 1278–1286
  • Roeder et al. (2017) Roeder G, Wu Y, Duvenaud DK (2017) Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In: Guyon I, Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30
  • Rothman et al. (2010) Rothman AJ, Levina E, Zhu J (2010) A new approach to Cholesky-based covariance regularization in high dimensions. Biometrika 97:539–550
  • Salimans and Knowles (2013) Salimans T, Knowles DA (2013) Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis 8:837–882
  • Smith et al. (2019) Smith MS, Loaiza-Maya R, Nott DJ (2019) High-dimensional copula variational approximation through transformation. arXiv:1904.07495
  • Spall (2003) Spall JC (2003) Introduction to stochastic search and optimization: estimation, simulation and control. Wiley, New Jersey
  • Spantini et al. (2018) Spantini A, Bigoni D, Marzouk Y (2018) Inference via low-dimensional couplings. Journal of Machine Learning Research 19:1–71
  • Tan (2017) Tan LSL (2017) Efficient data augmentation techniques for Gaussian state space models. arXiv:1712.08887
  • Tan (2018) Tan LSL (2018) Use of model reparametrization to improve variational Bayes. arXiv:1805.07267
  • Tan and Nott (2013) Tan LSL, Nott DJ (2013) Variational inference for generalized linear mixed models using partially non-centered parametrizations. Statistical Science 28:168–188
  • Tan and Nott (2018) Tan LSL, Nott DJ (2018) Gaussian variational approximation with sparse precision matrices. Statistics and Computing 28:259–275
  • Thall and Vail (1990) Thall PF, Vail SC (1990) Some covariance models for longitudinal count data with overdispersion. Biometrics 46:657–671
  • Titsias and Lázaro-Gredilla (2014) Titsias M, Lázaro-Gredilla M (2014) Doubly stochastic variational Bayes for non-conjugate inference. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp 1971–1979
  • Tran et al. (2015) Tran D, Blei DM, Airoldi EM (2015) Copula variational inference. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp 3564–3572
  • Tucker et al. (2018) Tucker G, Lawson D, Gu S, Maddison CJ (2018) Doubly reparametrized gradient estimators for Monte Carlo objectives. arXiv: 1810.04152
  • van Erven and Harremos (2014) van Erven T, Harremos P (2014) Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60:3797–3820
  • Yang et al. (2019) Yang Y, Pati D, Bhattacharya A (2019) $\alpha$-variational inference with statistical guarantees. Annals of Statistics, to appear

Appendix A Derivation of stochastic gradient

We have

r_{\lambda}(s)=\begin{bmatrix}\theta_{G}\\ \theta_{L}\end{bmatrix}=\begin{bmatrix}\mu_{1}+C_{1}^{-T}s_{1}\\ d+C_{2}^{-T}(s_{2}-DC_{1}^{-T}s_{1})\end{bmatrix},

where $v(C_2^*)=f+F(\mu_1+C_1^{-T}s_1)$. Differentiating $r_\lambda(s)$ with respect to $\lambda$, $\nabla_\lambda r_\lambda(s)$ is given by

\begin{bmatrix}\nabla_{\mu_{1}}\theta_{G}&\nabla_{\mu_{1}}\theta_{L}\\ \nabla_{v(C_{1}^{*})}\theta_{G}&\nabla_{v(C_{1}^{*})}\theta_{L}\\ \nabla_{d}\theta_{G}&\nabla_{d}\theta_{L}\\ \nabla_{\text{vec}(D)}\theta_{G}&\nabla_{\text{vec}(D)}\theta_{L}\\ \nabla_{f}\theta_{G}&\nabla_{f}\theta_{L}\\ \nabla_{\text{vec}(F)}\theta_{G}&\nabla_{\text{vec}(F)}\theta_{L}\end{bmatrix}.

Since $\theta_G$ does not depend on $d$, $D$, $f$ and $F$, we have

\begin{gathered}\nabla_{d}\theta_{G}=0_{nL\times G},\quad\nabla_{\text{vec}(D)}\theta_{G}=0_{nLG\times G},\\ \nabla_{f}\theta_{G}=0_{nL(nL+1)/2\times G},\quad\nabla_{\text{vec}(F)}\theta_{G}=0_{nLG(nL+1)/2\times G}.\end{gathered}

It is easy to see that $\nabla_{\mu_1}\theta_G=I_G$ and $\nabla_d\theta_L=I_{nL}$. The rest of the terms are derived as follows.

Differentiating $\theta_G$ with respect to $v(C_1^*)$,

\text{d}\theta_{G}=-C_{1}^{-T}\text{d}(C_{1}^{T})C_{1}^{-T}s_{1}
=-(s_{1}^{T}C_{1}^{-1}\otimes C_{1}^{-T})K_{G}E_{G}^{T}D_{1}^{*}\,\text{d}v(C_{1}^{*})
=-(C_{1}^{-T}\otimes s_{1}^{T}C_{1}^{-1})E_{G}^{T}D_{1}^{*}\,\text{d}v(C_{1}^{*}).
\therefore\;\nabla_{v(C_{1}^{*})}\theta_{G}=-D_{1}^{*}E_{G}(C_{1}^{-1}\otimes C_{1}^{-T}s_{1}).

Differentiating $\theta_L$ with respect to $f$,

\text{d}\theta_{L}=-C_{2}^{-T}\text{d}(C_{2}^{T})C_{2}^{-T}(s_{2}-DC_{1}^{-T}s_{1})
=-\{(s_{2}-DC_{1}^{-T}s_{1})^{T}C_{2}^{-1}\otimes C_{2}^{-T}\}K_{nL}E_{nL}^{T}D_{2}^{*}\,\text{d}f.
\therefore\;\nabla_{f}\theta_{L}=-D_{2}^{*}E_{nL}\{C_{2}^{-1}\otimes C_{2}^{-T}(s_{2}-DC_{1}^{-T}s_{1})\}.

Differentiating $\theta_L$ with respect to $F$,

\text{d}\theta_{L}=(\nabla_{f}\theta_{L})^{T}\,\text{d}F\,\theta_{G}=\{\theta_{G}^{T}\otimes(\nabla_{f}\theta_{L})^{T}\}\,\text{d}\text{vec}(F).
\therefore\;\nabla_{\text{vec}(F)}\theta_{L}=\theta_{G}\otimes\nabla_{f}\theta_{L}.

Differentiating $\theta_L$ with respect to $D$,

\text{d}\theta_{L}=-C_{2}^{-T}\,\text{d}D\,C_{1}^{-T}s_{1}=-(s_{1}^{T}C_{1}^{-1}\otimes C_{2}^{-T})\,\text{d}\text{vec}(D).
\therefore\;\nabla_{\text{vec}(D)}\theta_{L}=-(C_{1}^{-T}s_{1}\otimes C_{2}^{-1}).

Differentiating $\theta_L$ with respect to $\mu_1$,

\text{d}\theta_{L}=(\nabla_{f}\theta_{L})^{T}F\,\text{d}\mu_{1}.
\therefore\;\nabla_{\mu_{1}}\theta_{L}=F^{T}(\nabla_{f}\theta_{L}).

Differentiating $\theta_L$ with respect to $v(C_1^*)$,

\text{d}\theta_{L}=-C_{2}^{-T}\text{d}(C_{2}^{T})C_{2}^{-T}(s_{2}-DC_{1}^{-T}s_{1})-C_{2}^{-T}D\,\text{d}(C_{1}^{-T})s_{1}
=(\nabla_{f}\theta_{L})^{T}F\,\text{d}(C_{1}^{-T})s_{1}-C_{2}^{-T}D\,\text{d}(C_{1}^{-T})s_{1}
=\{(\nabla_{f}\theta_{L})^{T}F-C_{2}^{-T}D\}(\nabla_{v(C_{1}^{*})}\theta_{G})^{T}\,\text{d}v(C_{1}^{*}).
\therefore\;\nabla_{v(C_{1}^{*})}\theta_{L}=\nabla_{v(C_{1}^{*})}\theta_{G}\{F^{T}\nabla_{f}\theta_{L}-D^{T}C_{2}^{-1}\}
=\nabla_{v(C_{1}^{*})}\theta_{G}\{\nabla_{\mu_{1}}\theta_{L}-D^{T}C_{2}^{-1}\}.

Since $s_1=C_1^T(\theta_G-\mu_1)$ and $s_2=C_2^T(\theta_L-\mu_2)$, we have

\log q_{\lambda}(\theta)=\log q(\theta_{G})+\log q(\theta_{L}|\theta_{G})
=-\frac{G}{2}\log(2\pi)+\log|C_{1}|-\frac{1}{2}(\theta_{G}-\mu_{1})^{T}C_{1}C_{1}^{T}(\theta_{G}-\mu_{1})
\quad-\frac{nL}{2}\log(2\pi)+\log|C_{2}|-\frac{1}{2}(\theta_{L}-\mu_{2})^{T}C_{2}C_{2}^{T}(\theta_{L}-\mu_{2})
=-\frac{nL+G}{2}\log(2\pi)+\log|C_{1}C_{2}|-\frac{1}{2}s^{T}s.

As $\mu_2=d+C_2^{-T}D(\mu_1-\theta_G)$ and $v(C_2^*)=f+F\theta_G$, differentiating $\log q_\lambda(\theta)$ with respect to $\theta_G$,

\text{d}\log q_{\lambda}(\theta)
=-(\theta_{G}-\mu_{1})^{T}C_{1}C_{1}^{T}\text{d}\theta_{G}-(\theta_{L}-\mu_{2})^{T}C_{2}C_{2}^{T}(-\text{d}\mu_{2})
\quad-(\theta_{L}-\mu_{2})^{T}\text{d}C_{2}\,s_{2}+\text{tr}(C_{2}^{-1}\text{d}C_{2})
=-s_{1}^{T}C_{1}^{T}\text{d}\theta_{G}+s_{2}^{T}C_{2}^{T}\{-C_{2}^{-T}D\,\text{d}\theta_{G}+\text{d}(C_{2}^{-T})D(\mu_{1}-\theta_{G})\}
\quad-\text{vec}(C_{2}^{-T}s_{2}s_{2}^{T})^{T}\text{d}\text{vec}(C_{2})+\text{vec}(C_{2}^{-T})^{T}\text{d}\text{vec}(C_{2})
=\text{vec}(C_{2}^{-T}-\{C_{2}^{-T}s_{2}+(\mu_{2}-d)\}s_{2}^{T})^{T}\text{d}\text{vec}(C_{2})
\quad-s_{1}^{T}C_{1}^{T}\text{d}\theta_{G}-s_{2}^{T}D\,\text{d}\theta_{G}
=\text{vec}(C_{2}^{-T}-(\theta_{L}-d)s_{2}^{T})^{T}E_{nL}^{T}D_{2}^{*}F\,\text{d}\theta_{G}
\quad-s_{1}^{T}C_{1}^{T}\text{d}\theta_{G}-s_{2}^{T}D\,\text{d}\theta_{G}.

Therefore

\nabla_{\theta_{G}}\log q_{\lambda}(\theta)=F^{T}D_{2}^{*}v(C_{2}^{-T}-(\theta_{L}-d)s_{2}^{T})-C_{1}s_{1}-D^{T}s_{2}.

Note that $D_2^*v(C_2^{-T})=v(I_{nL})$ as $C_2^{-T}$ is upper triangular and $v(C_2^{-T})$ only retains the diagonal elements of $C_2^{-T}$.

Differentiating $\log q_\lambda(\theta)$ with respect to $\theta_L$,

\text{d}\log q_{\lambda}(\theta)=-(\theta_{L}-\mu_{2})^{T}C_{2}C_{2}^{T}\text{d}\theta_{L}=-s_{2}^{T}C_{2}^{T}\text{d}\theta_{L}.
\therefore\;\nabla_{\theta_{L}}\log q_{\lambda}(\theta)=-C_{2}s_{2}.

Appendix B Gradients for generalized linear mixed models

Since $\theta=[\beta^T,\omega^T,\tilde{b}_1^T,\dots,\tilde{b}_n^T]^T$, we require

\nabla_{\theta}\log p(y,\theta)=[\nabla_{\beta}\log p(y,\theta),\nabla_{\omega}\log p(y,\theta),\nabla_{\tilde{b}_{1}}\log p(y,\theta),\dots,\nabla_{\tilde{b}_{n}}\log p(y,\theta)]^{T}.

For the centered parametrization, the components in $\nabla_\theta\log p(y,\theta)$ are given below. Note that $\beta=[\beta_{RG_1}^T,\beta_{G_2}^T]^T$.

\nabla_{\beta_{G_{2}}}\log p(y,\theta)=\sum_{i=1}^{n}{X_{i}^{G_{2}}}^{T}\{y_{i}-h^{\prime}(\eta_{i})\}-\beta_{G_{2}}/\sigma_{\beta}^{2}.
\nabla_{\beta_{RG_{1}}}\log p(y,\theta)=\sum_{i=1}^{n}C_{i}^{T}WW^{T}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})-\beta_{RG_{1}}/\sigma_{\beta}^{2}.

Differentiating $\log p(y,\theta)$ with respect to $\omega$,

\text{d}\log p(y,\theta)=-\sum_{i=1}^{n}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})^{T}\text{d}W\,W^{T}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})+n\,\text{tr}(W^{-1}\text{d}W)-\omega^{T}\text{d}\omega/\sigma_{\omega}^{2}
=\text{vec}\bigg\{-\sum_{i=1}^{n}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})^{T}W+nW^{-T}\bigg\}^{T}E_{L}^{T}D_{L}^{*}\,\text{d}\omega-\omega^{T}\text{d}\omega/\sigma_{\omega}^{2},

where $\text{d}v(W)=D_L^*\,\text{d}\omega$ and $D_L^*=\text{diag}\{v(\text{dg}(W)+{\bf 1}_L{\bf 1}_L^T-I_L)\}$. Hence

\nabla_{\omega}\log p(y,\theta)=-D^{*}_{L}\sum_{i=1}^{n}v\{(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})(\tilde{b}_{i}-C_{i}\beta_{RG_{1}})^{T}W\}+nv(I_{L})-\omega/\sigma_{\omega}^{2}.

Note that $D_L^*v(W^{-T})=v(I_L)$ because $W^{-T}$ is upper triangular and $v(W^{-T})$ only retains the diagonal elements.

\nabla_{\tilde{b}_{i}}\log p(y,\theta)=Z_{i}^{T}\{y_{i}-h^{\prime}(\eta_{i})\}-WW^{T}(\tilde{b}_{i}-C_{i}\beta_{RG_{1}}).

Appendix C Gradients for state space models

Since $\theta=[\alpha,\kappa,\psi,b_1,\dots,b_n]^T$, we require

\nabla_{\theta}\log p(y,\theta)=[\nabla_{\alpha}\log p(y,\theta),\nabla_{\kappa}\log p(y,\theta),\nabla_{\psi}\log p(y,\theta),\nabla_{b_{1}}\log p(y,\theta),\dots,\nabla_{b_{n}}\log p(y,\theta)]^{T}.

The components in $\nabla_\theta\log p(y,\theta)$ are given below.

\nabla_{\alpha}\log p(y,\theta)=\frac{1}{2}\sum_{i=1}^{n}(b_{i}y_{i}^{2}{\rm e}^{-\sigma b_{i}-\kappa}-b_{i})(1-{\rm e}^{-\sigma})-\frac{\alpha}{\sigma_{\alpha}^{2}}.
\nabla_{\kappa}\log p(y,\theta)=\frac{1}{2}\bigg(\sum_{i=1}^{n}y_{i}^{2}{\rm e}^{-\sigma b_{i}-\kappa}-n\bigg)-\kappa/\sigma_{\kappa}^{2}.
\nabla_{\psi}\log p(y,\theta)=\bigg\{\sum_{i=2}^{n}(b_{i}-\phi b_{i-1})b_{i-1}+b_{1}^{2}\phi-\frac{\phi}{1-\phi^{2}}\bigg\}\phi(1-\phi)-\psi/\sigma_{\psi}^{2}.
\nabla_{b_{1}}\log p(y,\theta)=\frac{\sigma}{2}(y_{1}^{2}{\rm e}^{-\sigma b_{1}-\kappa}-1)+\phi(b_{2}-\phi b_{1})-b_{1}(1-\phi^{2}).

For $2\leq i\leq n-1$,

\nabla_{b_{i}}\log p(y,\theta)=\frac{\sigma}{2}(y_{i}^{2}{\rm e}^{-\sigma b_{i}-\kappa}-1)+\phi(b_{i+1}-\phi b_{i})-(b_{i}-\phi b_{i-1}).
\nabla_{b_{n}}\log p(y,\theta)=\frac{\sigma}{2}(y_{n}^{2}{\rm e}^{-\sigma b_{n}-\kappa}-1)-(b_{n}-\phi b_{n-1}).