
Flexible mean field variational inference using mixtures of non-overlapping exponential families

Jeffrey P. Spence
Stanford University
Stanford, CA 94305
jspence@stanford.edu
Abstract

Sparse models are desirable for many applications across diverse domains as they can perform automatic variable selection, aid interpretability, and provide regularization. When fitting sparse models in a Bayesian framework, however, analytically obtaining a posterior distribution over the parameters of interest is intractable for all but the simplest cases. As a result practitioners must rely on either sampling algorithms such as Markov chain Monte Carlo or variational methods to obtain an approximate posterior. Mean field variational inference is a particularly simple and popular framework that is often amenable to analytically deriving closed-form parameter updates. When all distributions in the model are members of exponential families and are conditionally conjugate, optimization schemes can often be derived by hand. Yet, I show that using standard mean field variational inference can fail to produce sensible results for models with sparsity-inducing priors, such as the spike-and-slab. Fortunately, such pathological behavior can be remedied as I show that mixtures of exponential family distributions with non-overlapping support form an exponential family. In particular, any mixture of a diffuse exponential family and a point mass at zero to model sparsity forms an exponential family. Furthermore, specific choices of these distributions maintain conditional conjugacy. I use two applications to motivate these results: one from statistical genetics that has connections to generalized least squares with a spike-and-slab prior on the regression coefficients; and sparse probabilistic principal component analysis. The theoretical results presented here are broadly applicable beyond these two examples.

1 Introduction

Bayesian graphical models [33] are widely used across vast swaths of engineering and science. Graphical models succinctly summarize modeling assumptions, and their posterior distributions naturally quantify uncertainty and produce optimal point estimators to minimize downstream risk. Efficient, exact posterior inference algorithms exist for special cases such as the sum-product algorithm [41] for discrete random variables in models with low tree-width [56] or particular Gaussian graphical models [27]. Outside of these special cases, however, alternative techniques must be used to approximate the posterior. Markov chain Monte Carlo (MCMC), especially Gibbs sampling, can sample from the posterior asymptotically [17, 20] but such techniques are difficult to scale to models with many observations, can have issues with mixing [39], and their convergence can be hard to assess [52].

Variational inference (VI) [26] avoids sampling and instead fits an approximate posterior via optimizing a loss function that acts as a proxy for some measure of divergence between the approximate and true posteriors. Usually this divergence is the “reverse” Kullback-Leibler (KL) divergence between the approximate posterior Q and the true posterior P: \text{KL}(Q||P). Minimizing this reverse KL is equivalent to maximizing the so-called evidence lower bound (ELBo), which depends only on the prior, the likelihood, and the approximate posterior. Importantly, the unknown true posterior does not appear in the ELBo. In many models, by choosing a particular space of approximate posteriors over which to optimize (the “variational family”), maximizing the ELBo is a tractable albeit non-convex optimization problem. In practice, such optimization problems are solved using coordinate ascent, gradient ascent, or natural gradient ascent [2]. These optimization methods and stochastic variants thereof can scale to massive models and datasets [22].
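As a brief reminder of the standard identity behind this equivalence (spelled out here for concreteness, with generic latent variables z, data x, prior p(z), likelihood p(x|z), and variational distribution q(z)):

\log p(x)=\underbrace{\mathbb{E}_{q}\big[\log p(x\mid z)+\log p(z)-\log q(z)\big]}_{\text{ELBo}}+\text{KL}\big(q(z)\,||\,p(z\mid x)\big),

so for fixed data the ELBo and the reverse KL differ only by the constant \log p(x), and maximizing one minimizes the other.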

This formulation introduces a key tension in VI. On one hand, simpler variational families can be more computationally tractable to optimize. On the other hand, a large approximation error will be incurred if the true posterior does not lie within the variational family. VI is particularly tractable for fully conjugate models with all distributions belonging to exponential families, with a variational family that assumes all variables are independent. This approach, called mean field VI, has a number of desirable properties including computationally efficient, often analytic coordinate-wise updates that are equivalent to natural gradient steps [5]. Many useful models–including mixture models–are not fully conjugate, but can be made conjugate by introducing additional auxiliary variables. Unfortunately, introducing more variables and enforcing their independence in the variational family results in larger approximation gaps [50].

Many approaches have been developed to extend VI beyond the mean field approximation. Specific models, such as coupled hidden Markov models, are amenable to more expressive variational families because subsets of the variables form tractable models – an approach called structured VI [47]. For non-conjugate models or models with variational families that are otherwise not amenable to mean field VI, it may be possible to compute gradients numerically [32] or obtain an unbiased noisy estimate of the gradient via the log gradient trick (sometimes called the REINFORCE estimator) [45] or, in certain special cases, the reparameterization trick [31]. Because such approaches are broadly applicable, they have come to be part of the toolkit of “black box VI” [44], which seeks to automatically derive gradient estimators for very general models and variational families. Stochastic gradient-based optimization has also been combined with structured VI [21]. Additionally, approximations to gradients tailored to specific distributions have been developed [24]. Overall, these general approaches are impressive in scope, but the gradient estimators can be noisy and hence optimizing the ELBo can require small step sizes and correspondingly long compute times [46]. Much recent work has been devoted to combining deep learning with Bayesian graphical models, whether purely for speeding up inference, or by using neural networks as flexible models [4, 25, 55].

Here, I take an alternative approach, expanding the utility of mean field VI by showing that more flexible variational families can be constructed by combining component exponential families while maintaining desirable conjugacy properties and still forming an exponential family. This construction is particularly useful in models involving sparsity. The approach I present maintains the analytic simplicity of mean field VI but avoids the need to introduce auxiliary variables to obtain conjugacy.

I begin with a motivating example in Section 2, develop the main theoretical results in Section 3, and show their utility on two examples in Section 4. Implementation details, proofs, and additional theoretical results are presented in the Appendix.

2 A Motivating Example

To better understand the genetics of disease and other complex traits, it is now routine to collect genetic information and measure a trait of interest in many individuals. Initially, such studies hoped to find a small number of genomic locations associated with the trait to learn about the underlying biology or to find suitable drug targets. As more of these genome-wide association studies (GWAS) were performed, however, it became increasingly clear that enormous numbers of locations throughout the genome contribute to most traits, including many diseases [7]. While this makes it difficult to prioritize a small number of candidate genes for biological followup, one can still use GWAS data to predict a trait from an individual’s genotypes, an approach referred to as a polygenic score (PGS) [11]. For many diseases, PGSs are sufficiently accurate to be clinically relevant for stratifying patients by disease risk [29].

Typically PGSs are constructed by estimating effect sizes for each position in the genome and assuming that positions contribute to the trait additively. This assumption is justified by evolutionary arguments [48] and practical concerns about overfitting due to the small number of individuals in GWAS (tens to hundreds of thousands) relative to the number of genotyped positions (millions). More concretely, the predicted value of the trait \hat{Y_{i}} in individual i is

\hat{Y_{i}}:=\sum_{j=1}^{P}G_{ij}\beta_{j}

where G_{ij} is the genotype of individual i at position j and \beta_{j} is the effect size at position j.
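As an illustrative sketch (my own addition, not part of the paper's released implementation; array names are hypothetical), the prediction step is just a matrix–vector product:

```python
import numpy as np

def polygenic_score(genotypes: np.ndarray, effect_sizes: np.ndarray) -> np.ndarray:
    """Predict a trait additively: Y_hat[i] = sum_j G[i, j] * beta[j].

    genotypes: (N, P) matrix of genotypes G.
    effect_sizes: length-P vector of (estimated) effect sizes beta.
    Returns a length-N vector of predicted trait values.
    """
    return genotypes @ effect_sizes

# Hypothetical usage with simulated genotypes coded as 0/1/2 minor-allele counts.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(100, 1_000)).astype(float)
beta = rng.normal(scale=0.1, size=1_000)
y_hat = polygenic_score(G, beta)
```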

Estimating the effect sizes, \beta_{j}, is complicated by the fact that, to protect participant privacy, GWAS typically only release marginal effect size estimates \hat{\beta_{j}} obtained by separately regressing the value of the trait against the genotypes at each position. These marginal estimates thus ignore the correlation in individuals’ genotypes at different positions in the genome. Fortunately, these correlations are stable across sets of individuals from the same population and can be estimated from publicly available genotype data (e.g., [10]) even if that data does not contain measurements of the trait of interest.

An early approach to dealing with genotypic correlations was LDpred [53] which uses MCMC to fit the following model with a spike-and-slab prior on the effect sizes:

\beta_{j} \overset{\text{i.i.d.}}{\sim}p_{0}\delta_{0}+(1-p_{0})\mathcal{N}(0,\sigma_{g}^{2})
\hat{\beta}|\beta \sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}_{e}\mathbf{X}),

where \mathbf{X}\in\mathbb{R}^{P\times P} is the matrix of correlations between genotypes at different positions in the genome, \sigma^{2}_{e} is the variance of the estimated effect sizes from the GWAS, and \sigma^{2}_{g} is the prior variance of the non-zero effect sizes.

While this model was motivated by PGSs, it is also a special case of Bayesian generalized least squares with a spike-and-slab prior on the regression coefficients.

Note that if \mathbf{X} is diagonal, then the likelihood is separable and the true posterior can, in fact, be determined analytically. Unfortunately, in real data genotypes are highly correlated (tightly linked) across positions, so \mathbf{X} is far from diagonal; as a result, integrating over the mixture components for each \beta_{j} must be done simultaneously, resulting in a runtime that is exponential in the number of genotyped positions. To obtain a computationally tractable approximation to the true posterior, LDpred and a number of extensions that incorporate more flexible priors (e.g., [16, 34]) use MCMC. As noted above, however, MCMC has a number of drawbacks, so there may be advantages to deriving and implementing a VI scheme.

A natural mean field VI approach to approximating the posterior would be to split the mixture distribution using an auxiliary random variable as follows

Z_{j} \overset{\text{i.i.d.}}{\sim}\text{Bernoulli}(1-p_{0})
\beta_{j}|Z_{j} \sim\mathcal{N}(0,\sigma^{2}_{Z_{j}}) (1)
\hat{\beta}|\beta \sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}_{e}\mathbf{X}),

with \sigma^{2}_{0}=0 and \sigma^{2}_{1}=\sigma^{2}_{g} (treating a Gaussian with zero variance as a point mass for notational convenience). One can then use categorical variational distributions q_{Z_{j}} for the Zs and Gaussian variational distributions q_{\beta_{j}} for the \betas. Unfortunately, this approach immediately encounters an issue: when calculating the ELBo, \text{KL}(q_{\beta_{j}}||\mathcal{N}(0,0)) is undefined unless q_{\beta_{j}} is a point mass at zero, because otherwise q_{\beta_{j}} is not absolutely continuous with respect to the point mass at zero. This may seem to be merely a technical issue that vanishes if \sigma^{2}_{0} is taken to be some tiny value so that q_{\beta_{j}} is absolutely continuous with respect to \mathcal{N}(0,\sigma^{2}_{0}) and hence has a well-defined KL, while for practical purposes \mathcal{N}(0,\sigma^{2}_{0}) acts like a point mass at zero. Yet, this superficial fix is not enough: the mean field assumption requires Z_{j} and \beta_{j} to be independent under the variational approximation to the posterior, which cannot capture the phenomenon that Z_{j}=0 forces \beta_{j} to be close to zero while Z_{j}=1 allows \beta_{j} to be far from zero. Further intuition is presented in Appendix A, where I analyze the case with only a single position (i.e., P=1) in detail.

Fortunately, this problem can be solved by noting that spike-and-slab distributions like p_{0}\delta_{0}+(1-p_{0})\mathcal{N}(0,\sigma_{g}^{2}) surprisingly still form an exponential family and are conjugate to the likelihood. In this example, by using a spike-and-slab distribution for the variational distributions q_{\beta_{j}}, there is no need to use auxiliary variables to obtain analytical updates for fitting the approximate posterior. A similar approach has been considered for a specific model in a different context [8], but in the next section, I show why the analysis works for this approach and also show that it is more broadly applicable. In Section 4, I return to this motivating example to show that the naive mean field VI scheme does indeed break down and that by using sparse conjugate priors an accurate approximation to the posterior may be obtained.

3 More flexible exponential families

In this section I begin with a simple observation. While it is usually true that mixtures of distributions from an exponential family no longer form an exponential family, that is not the case for mixtures of distributions with distinct support. Mixtures of exponential family members with non-overlapping support always form an exponential family.

Theorem 1.

Let F_{1},\ldots,F_{K} be exponential families of distributions, where the distributions within each family have supports \mathcal{S}_{1},\ldots,\mathcal{S}_{K} such that \mathcal{S}_{i}\cap\mathcal{S}_{j}=\varnothing for all i\neq j. Further, let \eta_{1},\ldots,\eta_{K} be the natural parameters of the exponential families, T_{1},\ldots,T_{K} be the sufficient statistics, A_{1},\ldots,A_{K} be the log-partitions, and H_{1},\ldots,H_{K} be the base measures. Then the family of mixture distributions

F_{\text{mix}}=\left\{\sum_{k=1}^{K}\pi_{k}f_{k}:\sum_{k=1}^{K}\pi_{k}=1,\pi_{i}\geq 0,f_{i}\in F_{i},\forall i\right\}

is an exponential family with natural parameters

\eta_{\text{mix}}=\big(\eta_{1},\ldots,\eta_{K},\log\pi_{1}-A_{1}(\eta_{1})-\log\pi_{K}+A_{K}(\eta_{K}),\ldots,\log\pi_{K-1}-A_{K-1}(\eta_{K-1})-\log\pi_{K}+A_{K}(\eta_{K})\big),

corresponding sufficient statistics

T_{\text{mix}}(x)=\left(\mathbb{I}\left\{x\in\mathcal{S}_{1}\right\}T_{1}(x),\ldots,\mathbb{I}\left\{x\in\mathcal{S}_{K}\right\}T_{K}(x),\mathbb{I}\left\{x\in\mathcal{S}_{1}\right\},\ldots,\mathbb{I}\left\{x\in\mathcal{S}_{K-1}\right\}\right),

log-partition

A_{\text{mix}}(\eta_{\text{mix}})=A_{K}(\eta_{K})-\log\pi_{K},

and base-measure

\frac{dH_{\text{mix}}}{dH}(x)=\prod_{i=1}^{K}\left(\frac{dH_{i}}{dH}(x)\right)^{\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}}

where H is a measure such that H_{i} is absolutely continuous with respect to H for all i, dH_{i}/dH is the usual Radon-Nikodym derivative, and 0^{0} is taken to be 1.

Note that Theorem 1 has an apparent asymmetry with respect to component K. This arises because the mixture weights \pi_{1},\ldots,\pi_{K} are constrained to sum to one, and hence \pi_{K} is completely determined by the other mixture weights. It would be possible to have a “symmetric” version of Theorem 1 but the constraint on the mixture weights would mean that the natural parameters would live on a strictly lower dimensional space. Such exponential families are called curved exponential families, and many of the desirable properties of exponential families do not hold for curved exponential families.

While the restriction in Theorem 1 to exponential families with non-overlapping support may seem particularly limiting, it is important to note that a new exponential family can be formed by restricting all of the distributions within any exponential family to lie within a fixed subset of their usual domain and renormalizing. In particular, one could divide the original full domain into non-overlapping subsets and form mixtures of exponential families restricted to these subsets. Another important class of exponential families is the uniform distributions with fixed support. While each of these exponential families contains only a single distribution, by forming mixtures of these uniform distributions with non-overlapping support, Theorem 1 shows that piece-wise constant densities with fixed break points form an exponential family. In this paper I focus primarily on the case of mixtures of a continuous exponential family member with one or more point masses. By Theorem 1, mixtures of point masses at fixed locations form an exponential family, and diffuse distributions can be trivially restricted to fall outside of this set of measure zero, which makes clear that spike-and-slab distributions where the slab is a non-atomic distribution from an exponential family always form an exponential family.
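As a concrete instance of Theorem 1 (a worked check I have added for illustration, not a display from the original), take K=2 with F_{1}=\{\delta_{0}\} the single-member family containing only the point mass at zero (empty sufficient statistic and A_{1}=0) and F_{2} the zero-mean Gaussians restricted to \mathbb{R}\setminus\{0\} with natural parameter \eta_{2}=-1/(2\sigma_{g}^{2}). The spike-and-slab p_{0}\delta_{0}+(1-p_{0})\mathcal{N}(0,\sigma_{g}^{2}) then has

T_{\text{mix}}(x)=\big(\mathbb{I}\{x\neq 0\}x^{2},\;\mathbb{I}\{x=0\}\big),\qquad\eta_{\text{mix}}=\Big(-\frac{1}{2\sigma_{g}^{2}},\;\log\frac{p_{0}}{1-p_{0}}+\frac{1}{2}\log\sigma_{g}^{2}\Big),\qquad A_{\text{mix}}=\frac{1}{2}\log\sigma_{g}^{2}-\log(1-p_{0}),

with base measure given by the point mass at zero plus (2\pi)^{-1/2} times Lebesgue measure on \mathbb{R}\setminus\{0\}.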

While these substantially more flexible exponential families may be of independent interest, they are of little use in mean field VI unless they are conjugate to widely used likelihood models. The following theorem provides a recipe to take a conjugate exponential family model and create a model with a more flexible prior using Theorem 1 while maintaining conjugacy.

Theorem 2.

Let F_{\text{prior}}(x) be an exponential family of prior distributions with sufficient statistics T_{\text{prior}}(x) with the distributions supported on \mathcal{S}_{\text{prior}} that are conjugate to an exponential family of distributions F_{Y|X}(y|x). For exponential families F_{1},\ldots,F_{K} with supports \mathcal{S}_{1},\ldots,\mathcal{S}_{K} such that \mathcal{S}_{i}\cap\mathcal{S}_{j}=\varnothing for all i\neq j, if there exists a matrix M_{i} and vector v_{i} such that T_{\text{prior}}(x)=M_{i}T_{i}(x)+v_{i} for all x\in\mathcal{S}_{i} for all i, then F_{\text{mix}} as defined in Theorem 1 is also conjugate to F_{Y|X}.

Intuitively, Theorem 2 says that for a given conjugate prior distribution, say \mathcal{P}, we can create a more flexible prior that maintains conjugacy by combining non-overlapping component distributions so long as those component distributions have the same sufficient statistics as \mathcal{P} on their domain up to an affine transformation.

Note that the point mass distribution at a fixed point is the sole member of an exponential family, and hence it has no sufficient statistics. Furthermore, any function is constant on the support of a point mass. This provides an immediate corollary that mixtures of any continuous distribution with a finite number of point masses are conjugate to the same distributions as the original continuous distribution. Another example is mixtures of degenerate distributions where only a subset of parameters are fixed. For instance, consider the mean vector, \mu, of a multivariate Gaussian. One could put a prior on \mu that is a mixture of the usual conjugate multivariate Gaussian along with degenerate distributions such as a multivariate Gaussian with the condition that \mu_{i}=0, or that \mu_{i}=\mu_{j} for dimensions i and j. In fact, any degenerate multivariate Gaussian distribution defined by affine constraints \mathbf{M}\mu=\mathbf{v} for some matrix \mathbf{M} and vector \mathbf{v} could be included in this mixture.
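To make the point-mass corollary concrete, the following is a sketch of the resulting conjugate update in the simplest univariate case, assuming a Gaussian likelihood \hat{\beta}\mid\beta\sim\mathcal{N}(\beta,\sigma_{e}^{2}) with known noise variance (a worked illustration I have added, not a display from the original). Starting from the spike-and-slab prior p_{0}\delta_{0}+(1-p_{0})\mathcal{N}(0,\sigma_{g}^{2}), the posterior is again a spike-and-slab,

\beta\mid\hat{\beta}\sim\tilde{p}_{0}\delta_{0}+(1-\tilde{p}_{0})\mathcal{N}(m,v),\qquad v=\Big(\frac{1}{\sigma_{g}^{2}}+\frac{1}{\sigma_{e}^{2}}\Big)^{-1},\quad m=\frac{v\hat{\beta}}{\sigma_{e}^{2}},\quad\frac{\tilde{p}_{0}}{1-\tilde{p}_{0}}=\frac{p_{0}\,\mathcal{N}(\hat{\beta};0,\sigma_{e}^{2})}{(1-p_{0})\,\mathcal{N}(\hat{\beta};0,\sigma_{e}^{2}+\sigma_{g}^{2})},

so conjugacy amounts to re-weighting the spike and applying the usual Gaussian-Gaussian update to the slab.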

Theorem 2 makes it extremely easy to add sparsity or partial sparsity to any conjugate model. A non-degenerate application is constructing asymmetric priors from symmetric ones. For example, we can construct an asymmetric prior by taking a mixture of a copy of the original prior restricted to the negative reals and a copy restricted to the non-negative reals. These mixtures form an exponential family because the copies have the same sufficient statistics as the original prior but have non-overlapping domains.

4 Numerical results

To show the applicability of the theoretical results presented in Section 3, I test sparse VI schemes using Theorems 1 and 2–which I refer to as the non-overlapping mixtures trick–on two models and compare these schemes to non-sparse and naive VI approximations, showing the superior performance of treating sparsity exactly. The first model is the polygenic score prediction model discussed in Section 2, and the second is a spike-and-slab prior for sparse probabilistic principal component analysis (PCA) [18]. Python implementations of the fitting procedures and the results of the simulations presented here are available at https://github.com/jeffspence/non_overlapping_mixtures. All mean field results were obtained on a Late 2013 MacBook Pro, and fitting the mean field VI schemes took less than five seconds per dataset for the polygenic score model and less than two minutes per dataset for sparse PCA.

4.1 Polygenic Scores

While the LDpred model was originally fit using Gibbs sampling [53], it may be desirable to fit the model using VI for computational reasons. I simulated data under the LDpred model to compare two VI schemes. The first VI scheme is obtained by using the non-overlapping mixtures trick. The second is the naive VI scheme of introducing auxiliary variables to split the spike-and-slab prior into a mixture, and then approximating that mixture as a mixture of Gaussians. The details of the VI schemes are presented in Appendix D. I simulated a 1000-dimensional vector \hat{\beta} from the LDpred model with p_{0}=0.99 so that on average about 10 sites had non-zero effects, a level of sparsity generally consistent with realistic human data [49]. For each simulation I drew \mathbf{X} by simulating from a Wishart distribution with an identity scale matrix and 1000 degrees of freedom and then dividing the draw from the Wishart distribution by 1000. I set \sigma^{2}_{1} to be 1 and then varied \sigma^{2}_{e} from 0.05 to 1.0. For each value of \sigma^{2}_{e}, I simulated 100 replicate datasets and tested the VI schemes as well as a baseline used in statistical genetics of taking the raw values of \hat{\beta} as estimates of \beta. I also tested adaptive random walk MCMC as implemented in NIMBLE [12], which was run for 1000 iterations, and boosting black box VI (BBBVI) [35] as implemented in pyro [3], which sequentially fits a mixture distribution as an approximate posterior. For BBBVI, I used the equivalent formulation \hat{\beta}|\beta,Z\sim\mathcal{N}(\mathbf{X}(\beta*Z),\sigma^{2}\mathbf{X}), where the \beta_{i} are independent Gaussians, the Z_{i} are independent Bernoullis, and * is the component-wise product. For the component distributions of the variational family, I used independent Gaussians for the \beta_{i} and independent Bernoullis for the Z_{i}. I used 2 particles to stochastically estimate gradients, 50 components in the mixture distribution, and 2000 gradient steps per mixture component. The BBBVI objective functions were optimized using the adam optimizer [30] with learning rate 10^{-3} and default parameters otherwise. I evaluated the methods using the mean squared error (MSE) and correlation between the estimated and true values of \beta, using the variational posterior mean as the estimator for the Bayesian approaches (Figure 1).
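For reference, a minimal sketch of the data-generating process just described (my own reconstruction from the text, not the released implementation; names are hypothetical):

```python
import numpy as np
from scipy.stats import wishart

def simulate_ldpred(P=1000, p0=0.99, sigma_g2=1.0, sigma_e2=0.5, seed=0):
    """Draw (X, beta, beta_hat) under the LDpred model described above."""
    rng = np.random.default_rng(seed)
    # Correlation-like matrix: Wishart(I, df=P) scaled by 1/P.
    X = wishart(df=P, scale=np.eye(P)).rvs(random_state=seed) / P
    # Spike-and-slab effect sizes: zero with probability p0.
    nonzero = rng.random(P) > p0
    beta = np.where(nonzero, rng.normal(0.0, np.sqrt(sigma_g2), size=P), 0.0)
    # Marginal GWAS estimates: beta_hat ~ N(X beta, sigma_e2 * X).
    beta_hat = rng.multivariate_normal(X @ beta, sigma_e2 * X)
    return X, beta, beta_hat

X, beta, beta_hat = simulate_ldpred()
```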

Figure 1: Comparison of VI schemes for the LDpred model (two panels, (a) and (b)).
The non-overlapping mixtures trick (new VI scheme) is compared to the naive scheme of introducing an auxiliary variable and approximating the spike-and-slab prior by a mixture of two Gaussians centered at zero, the less dispersed of which has variance \sigma_{0}^{2}. As a baseline the VI schemes are compared against what is sometimes used in practice – the raw observations \hat{\beta} – as well as boosting black box VI [35] and adaptive random walk MCMC. The parameter \sigma_{e}^{2} controls the amount of noise, so \sigma_{e}^{2}=0.05 corresponds to a 20 times higher signal-to-noise ratio than \sigma_{e}^{2}=1.0. Plot (a) compares the correlation between the estimates (posterior mean for the VI schemes, \hat{\beta} for the baseline) and the true simulated values of \beta. Plot (b) compares the mean squared error (MSE). Point clouds are individual simulations, horizontal lines are means across simulations, and whiskers are interquartile ranges. See the main text for simulation details.

For the naive VI scheme, there is an additional parameter \sigma_{0}^{2}. When \sigma_{0}^{2}=1=\sigma_{1}^{2}, the method is equivalent to performing mean field VI on the non-sparse model where the prior on \beta is simply a single Gaussian, which is a Bayesian version of ridge regression. In addition to \sigma_{0}^{2}=1, I used \sigma_{0}^{2}\in\left\{10^{-2},10^{-4},10^{-10}\right\}. The results for \sigma_{0}^{2}=10^{-4} and \sigma_{0}^{2}=10^{-10} are indistinguishable. All of the VI schemes (both the new scheme and the naive scheme with any value of \sigma_{0}^{2}) outperform the baseline of just using \hat{\beta}, except in the extremely high signal-to-noise regime, where the naive model with \sigma_{0}^{2}=1 over-shrinks and cannot take advantage of sparsity. BBBVI performed comparably to the naive VI schemes, but BBBVI required hours to run compared to seconds for the mean field schemes. Meanwhile, by these metrics the non-overlapping mixtures trick performed indistinguishably from MCMC, but again took seconds per run compared to approximately 12 hours per run for MCMC. I also considered the maximum likelihood estimator (MLE) \beta_{\text{MLE}}=\mathbf{X}^{-1}\hat{\beta}, but \mathbf{X} is terribly ill-conditioned, resulting in high variance. Using the MLE almost always resulted in correlations around zero (the maximum correlation across all simulations was 0.13, the mean correlation was 0.005) and extremely large MSE (the minimum MSE across all simulations was 4.19, the mean MSE was \approx 335000 – about six orders of magnitude higher than any other method). None of the naive schemes provide a substantial improvement over the baseline in terms of correlation. In terms of MSE, the naive schemes provide some improvement if \sigma_{0}^{2} is tuned properly, but the non-overlapping mixtures trick outperforms all of the other schemes across signal-to-noise regimes. Exactly accounting for sparsity when the true signal is sparse substantially improves performance.

4.2 Sparse Probabilistic PCA

PCA [42, 23] is a widely-used exploratory data analysis method that projects high dimensional data into a low dimensional subspace defined by orthogonal axes that explain a maximal amount of empirical variance. These axes are defined by “loadings”–weightings of the dimensions that compose a datapoint. Dimensions with high loadings in the first few PCs are deemed “important” for differentiating the data points. Unfortunately, the loadings are dense, making them difficult to interpret, especially for high dimensional data.

To aid interpretability, and to leverage that in many datasets only a few variables are expected to contribute meaningfully to variation between data points, formulations of sparse PCA were developed to encourage sparsity in the loadings, usually by means of \ell_{1} regularization [57].

In a parallel line of work, a Bayesian interpretation of PCA, probabilistic PCA, was developed by showing that classical PCA can be derived as the limiting maximum a posteriori estimate of a particular generative model up to possible scaling and rotation [51]. The probabilistic formulation more naturally extends to non-Gaussian noise models [9], allows principled methods for choosing the number of principal components [37], gracefully handles missing data [51], and enables speedups for structured datasets [1].

The probabilistic PCA model is

Z_{1},\ldots,Z_{N} \sim\mathcal{N}(0,\mathbf{I}_{K})
X_{n}|Z_{n} \sim\mathcal{N}(\mathbf{W}Z_{n},\sigma^{2}_{e}\mathbf{I}_{P})

where K is the number of PCs desired, \mathbf{W}\in\mathbb{R}^{P\times K} is the matrix of loadings, and Z_{n} is the PC score (i.e., projection onto the first K PCs) of the n^{\text{th}} datapoint.
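A minimal sketch of this generative process (added for illustration only; the variable names are hypothetical):

```python
import numpy as np

def simulate_ppca(N=500, P=50, K=2, sigma_e2=1.0, seed=0):
    """Simulate data X (N x P) from the probabilistic PCA model above."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(P, K))          # loadings (here drawn arbitrarily)
    Z = rng.normal(size=(N, K))          # latent PC scores, Z_n ~ N(0, I_K)
    noise = rng.normal(scale=np.sqrt(sigma_e2), size=(N, P))
    X = Z @ W.T + noise                  # X_n | Z_n ~ N(W Z_n, sigma_e2 I_P)
    return X, W, Z

X, W, Z = simulate_ppca()
```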

These two lines of work were then brought together in sparse probabilistic PCA [18], which encourages sparse loadings by putting a Laplace prior on each loading, \mathbf{W}_{pk}. The Laplace prior is not conjugate to the Gaussian noise model; however, the Laplace distribution is a scale mixture of Gaussians, allowing for a hierarchical decomposition that makes this formulation amenable to mean field VI.
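For completeness, the scale-mixture representation being invoked here is the standard identity (reproduced as a reminder; the rate parameterization below is one common convention, not necessarily the one used in [18])

\frac{\lambda}{2}e^{-\lambda|w|}=\int_{0}^{\infty}\mathcal{N}(w;0,\tau)\,\frac{\lambda^{2}}{2}e^{-\lambda^{2}\tau/2}\,d\tau,

so a Laplace prior on a loading is equivalent to a Gaussian prior whose variance \tau is itself exponentially distributed, and each conditional in the resulting hierarchy is conjugate.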

There are a number of conceptually displeasing aspects to this formulation of sparse probabilistic PCA. First, while it is true that the maximum a posteriori estimate of the loadings is sparse under a Laplace prior, the generative model is not sparse: because the Laplace distribution is diffuse, the loadings are non-zero almost surely. Second, the LDpred model discussed in Section 2 is a discrete scale mixture of Gaussians with two components. I showed numerically in Section 4.1 and theoretically in Appendix A that mean field VI breaks down in such a setting, suggesting that performing mean-field VI on a scale mixture of Gaussians may be problematic, especially in sparsity-inducing cases.

Motivated by these considerations I consider a spike-and-slab prior on the loadings:

\mathbf{W}_{pk} \sim p_{0}\delta_{0}+(1-p_{0})\mathcal{N}(0,\sigma^{2}_{1})
Z_{n} \sim\mathcal{N}(0,\mathbf{I}_{K})
X_{n}|Z_{n},\mathbf{W} \sim\mathcal{N}(\mathbf{W}Z_{n},\sigma^{2}_{e}\mathbf{I}_{P})

Note that [18] considered a fully Bayesian model, which here would correspond to putting uninformative conjugate priors on p_{0}, \sigma^{2}_{e}, and \sigma^{2}_{1}. For ease of exposition, I consider those to be fixed hyperparameters, but future work could explore putting priors on them or fitting them by maximizing the ELBo with respect to the hyperparameters in an empirical Bayes-like procedure [6], which has been shown to automatically determine appropriate levels of sparsity in other settings [54].

To fit this model, I used the mean-field VI schemes described in Appendix E. Briefly, I compared the performance of the naive scheme of introducing auxiliary variables, Y_{pk}\sim\text{Bernoulli}(1-p_{0}), to split the prior on (\mathbf{W})_{pk} as (\mathbf{W})_{pk}|Y_{pk}\sim\mathcal{N}(0,\sigma^{2}_{Y_{pk}}), to the scheme where the prior on (\mathbf{W})_{pk} is treated exactly using the non-overlapping mixtures trick. I compared the variational posterior mean estimates of the loadings and scores from both of these schemes to classical PCA based on SVD, as well as an “oracle” version of classical PCA that uses only those variables that are simulated to have non-zero loadings.

For each dataset, I simulated 500 points with 10000 dimensions. The data was split into four clusters of sizes 200, 200, 50, and 50. For 100 of the dimensions, the value of the entry was drawn from \mathcal{N}(\mu_{c},1), where c indexes the cluster and each \mu_{c} was drawn from a standard normal independently for each dimension. The remaining dimensions were drawn from standard normals. The entire matrix was then centered and scaled so that each variable had empirical mean zero and unit empirical standard deviation, causing the simulations to differ from the generative model. For inference I set \sigma^{2}_{1}=0.5, \sigma^{2}_{e}=1, and p_{0}=1-100/10000. For all runs, I used K=2 to project onto a two-dimensional space to facilitate visualization.
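A sketch of this simulation, as I read the description above (my own reconstruction, not the released code):

```python
import numpy as np

def simulate_clusters(seed=0):
    """Four clusters in 10000 dimensions, 100 of which are informative."""
    rng = np.random.default_rng(seed)
    sizes = [200, 200, 50, 50]
    n, p, informative = sum(sizes), 10_000, 100
    labels = np.repeat(np.arange(4), sizes)
    X = rng.normal(size=(n, p))                # all entries start as standard normal
    mu = rng.normal(size=(4, informative))     # cluster means, one per informative dim
    X[:, :informative] += mu[labels]           # informative entries become N(mu_c, 1)
    # Center and scale each variable (this departs from the generative model).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, labels

X, labels = simulate_clusters()
```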

Figure 2: Comparison of VI schemes for sparse PCA (three panels, (a)–(c)).
Data were simulated as described in the main text. Panel (a) shows the projections onto the first two PCs produced by the various schemes. The results of the new VI scheme are visually comparable to those achievable by an “oracle” in this simulation scenario, while all other VI schemes do not cluster the data as well. As \sigma_{0}^{2} becomes small, the naive scheme counter-intuitively recapitulates SVD. Panels (b) and (c) show that the loadings learned by the new VI scheme are sparse, while those of classical PCA and the various naive schemes are not.

In the naive scheme there is an additional hyperparameter, \sigma^{2}_{0}, for which I consider several values. Tuning this hyperparameter is crucial to obtaining reasonable results. For \sigma^{2}_{0}\approx\sigma^{2}_{1} the model is essentially probabilistic PCA with a non-sparsity-inducing Gaussian prior on the loadings, while for \sigma^{2}_{0}\ll\sigma^{2}_{1} the mean-field assumption together with the zero-avoidance of VI causes the approximate posterior to put very little mass on the event Y_{pk}=0, and so again the model reduces to probabilistic PCA with a Gaussian prior on the loadings. On the other hand, the VI scheme based on the non-overlapping mixtures trick produces sensible results without requiring any tuning. Indeed, Figure 2 shows that the new scheme clusters the data better than any of the naive schemes, and that as \sigma^{2}_{0}\downarrow 0 the naive scheme becomes indistinguishable from classical PCA. Furthermore, whereas the posterior mean loadings from the non-overlapping mixtures trick are indeed sparse, the loadings from the other methods are dense (Figures 2(b) and 2(c)). Additional simulations and a more quantitative measure of performance – reconstruction error – are presented in Appendix F.

5 Discussion

VI has made it possible to scale Bayesian inference on increasingly complex models to increasingly massive datasets, but the error induced by obtaining an approximate posterior can be substantial [50]. Some of this error can be mitigated by using more flexible variational families, but doing so can require alternative methods for fitting, like numerical calculation of gradients [32], sampling-based stochastic gradient estimators [31, 45], or other approximations [24]. The results of Theorems 1 and 2 provide an alternative method, using mixtures of non-overlapping exponential families to provide a more flexible variational family while maintaining conjugacy. Even in models that are not fully conjugate, methods have been developed to exploit portions of the model that are conjugate, and the results presented here may prove useful in such schemes [28]. These schemes could be especially useful in obtaining stochastic gradient updates for Bayesian neural networks with spike-and-slab priors on the weights. In particular, a ReLU applied to a Gaussian random variable is a mixture of a point mass and a Gaussian restricted to be positive, which is an exponential family by Theorem 1.
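To spell out that last claim (a routine check added here for clarity, not a display from the original text): if w\sim\mathcal{N}(\mu,\sigma^{2}) and u=\max(0,w), then

\Pr(u=0)=\Phi(-\mu/\sigma),\qquad u\mid u>0\ \text{has the density of }\mathcal{N}(\mu,\sigma^{2})\ \text{restricted to }(0,\infty),

so the law of u is a mixture of a point mass at zero and a Gaussian restricted to the positive reals, exactly the kind of non-overlapping mixture covered by Theorem 1.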

Here I focused on modeling sparse phenomena and found that the non-overlapping mixtures trick is superior to a naive approach of introducing auxiliary variables. Yet, the pitfalls I described occur whenever mean field VI is applied to mixture distributions where the mixture components are very different. This suggests that in some cases, it may be beneficial to use the sparse distributions presented here to approximate non-sparse mixture distributions and then treat the sparse approximation exactly.

Throughout, I assumed that the domains of the non-overlapping mixtures were specified a priori. This assumption could be relaxed, treating the domains as hyperparameters that could then be optimized with respect to the ELBo. Yet, it is not obvious that for arbitrary models the objective function need be differentiable with respect to these hyperparameters, which may necessitate the use of zeroth-order optimization procedures such as Bayesian optimization [14].

Exponential families also play an important role in other forms of variational inference including Expectation Propagation [38]. The non-overlapping mixtures trick may be useful in variational approaches beyond the usual reverse KL-minimizing mean field VI.

While the non-overlapping mixtures trick makes it easy to add sparsity to conjugate models, it is not a panacea for the common pitfalls of VI. For example, I also considered a sparse extension of latent Dirichlet allocation (LDA) [6, 43], where documents can have exactly one topic with positive probability. Unfortunately, the zero-avoiding nature of the reverse KL results in pathological behavior: for a document with only one topic, the prior topic probabilities for each word are sparse, but in the variational posterior they must be dense. Empirically, this results in the VI posterior only putting mass on the non-sparse mixture component and hence being indistinguishable from the usual VI approach to LDA. In general, care should be taken when the sparsity added to a model results in the possibility of variables having zero likelihood conditioned on latent variables coming from the sparse component.

In spite of these drawbacks, providing a recipe to easily model sparsity in otherwise conjugate Bayesian models provides another avenue to model complex phenomena while maintaining the analytical and computational benefits of mean-field VI.

Broader Impact

The primary contribution of this paper is theoretical and so the broader societal impact depends on how the theorems are used. The polygenic score application has the possibility to improve the overall quality of healthcare, but because the majority of GWAS are performed on individuals of European ancestries, PGSs are more accurate for individuals from those ancestry groups, potentially exacerbating health disparities between individuals of different ancestries as PGSs see clinical use [36]. The methods presented here are equally applicable to GWAS data collected from any ancestry group, however, and so efforts to diversify genetic data will ameliorate performance differences across ancestry groups. PGSs used for some traits such as sexual orientation [15], educational attainment [19], or stigmatized psychiatric disorders [11] raise thorny ethical considerations, especially when the application of such PGSs could enable genetic discrimination or fuel dangerous public misconceptions about the genetic basis of such traits [40]. On the other hand, PGSs applied to diseases have the potential to improve health outcomes and so if used responsibly could provide tremendous benefit to society.

Acknowledgments and Disclosure of Funding

I would like to thank Nasa Sinnott-Armstrong and Jonathan Pritchard for piquing my interest in the polygenic score application, and Jeffrey Chan, Amy Ko, Clemens Weiss, and four anonymous reviewers for helpful feedback on the manuscript. I was funded by a National Institutes of Health training grant (5T32HG000044-23).

References

  • [1] Aman Agrawal, Alec M. Chiu, Minh Le, Eran Halperin, and Sriram Sankararaman. Scalable probabilistic PCA for large-scale genetic variation data. bioRxiv, page 729202, 2019.
  • [2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
  • [3] Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978, 2019.
  • [4] Sean R Bittner and John P Cunningham. Approximating exponential family models (not single distributions) with a two-network architecture. arXiv preprint arXiv:1903.07515, 2019.
  • [5] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • [6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
  • [7] Evan A. Boyle, Yang I. Li, and Jonathan K. Pritchard. An expanded view of complex traits: from polygenic to omnigenic. Cell, 169(7):1177–1186, 2017.
  • [8] Peter Carbonetto and Matthew Stephens. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian analysis, 7(1):73–108, 2012.
  • [9] Julien Chiquet, Mahendra Mariadassou, and Stéphane Robin. Variational inference for probabilistic poisson PCA. The Annals of Applied Statistics, 12(4):2674–2698, 2018.
  • [10] 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015.
  • [11] International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460(7256):748, 2009.
  • [12] Perry de Valpine, Daniel Turek, Christopher J Paciorek, Clifford Anderson-Bergman, Duncan Temple Lang, and Rastislav Bodik. Programming with models: writing statistical algorithms for general model structures with nimble. Journal of Computational and Graphical Statistics, 26(2):403–413, 2017.
  • [13] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential families. The Annals of statistics, 7(2):269–281, 1979.
  • [14] Peter I Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • [15] Andrea Ganna, Karin JH Verweij, Michel G Nivard, Robert Maier, Robbee Wedow, Alexander S Busch, Abdel Abdellaoui, Shengru Guo, J Fah Sathirapongsasuti, Paul Lichtenstein, et al. Large-scale gwas reveals insights into the genetic architecture of same-sex sexual behavior. Science, 365(6456):eaat7693, 2019.
  • [16] Tian Ge, Chia-Yen Chen, Yang Ni, Yen-Chen Anne Feng, and Jordan W. Smoller. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature communications, 10(1):1–10, 2019.
  • [17] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741, 1984.
  • [18] Yue Guan and Jennifer Dy. Sparse probabilistic principal component analysis. In Artificial Intelligence and Statistics, pages 185–192, 2009.
  • [19] K Paige Harden, Benjamin W Domingue, Daniel W Belsky, Jason D Boardman, Robert Crosnoe, Margherita Malanchini, Michel Nivard, Elliot M Tucker-Drob, and Kathleen Mullan Harris. Genetic associations with mathematics tracking and persistence in secondary school. NPJ science of learning, 5(1):1–8, 2020.
  • [20] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 04 1970.
  • [21] Matthew D Hoffman and David M Blei. Structured stochastic variational inference. In Artificial Intelligence and Statistics, 2015.
  • [22] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(4):1303–1347, 2013.
  • [23] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
  • [24] Martin Jankowiak and Fritz Obermeyer. Pathwise derivatives beyond the reparameterization trick. In International Conference on Machine Learning, 2018.
  • [25] Matthew J Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pages 2946–2954, 2016.
  • [26] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • [27] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
  • [28] Mohammad Khan and Wu Lin. Conjugate-Computation Variational Inference : Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 878–887, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
  • [29] Amit V. Khera, Mark Chaffin, Krishna G. Aragam, Mary E Haas, Carolina Roselli, Seung Hoan Choi, Pradeep Natarajan, Eric S. Lander, Steven A. Lubitz, Patrick T. Ellinor, and Sekar Kathiresan. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature genetics, 50(9):1219–1224, 2018.
  • [30] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [31] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2013.
  • [32] David A. Knowles and Tom Minka. Non-conjugate variational message passing for multinomial and binary regression. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1701–1709. Curran Associates, Inc., 2011.
  • [33] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), 50(2):157–224, 1988.
  • [34] Luke R. Lloyd-Jones, Jian Zeng, Julia Sidorenko, Loïc Yengo, Gerhard Moser, Kathryn E. Kemper, Huanwei Wang, Zhili Zheng, Reedik Magi, Tonu Esko, et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nature communications, 10(1):1–11, 2019.
  • [35] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Rätsch. Boosting black box variational inference. In Advances in Neural Information Processing Systems, pages 3401–3411, 2018.
  • [36] Alicia R Martin, Masahiro Kanai, Yoichiro Kamatani, Yukinori Okada, Benjamin M Neale, and Mark J Daly. Clinical use of current polygenic risk scores may exacerbate health disparities. Nature genetics, 51(4):584, 2019.
  • [37] Thomas P. Minka. Automatic choice of dimensionality for PCA. In Advances in neural information processing systems, pages 598–604, 2001.
  • [38] Thomas P. Minka. Expectation propagation for approximate bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
  • [39] Elchanan Mossel and Eric Vigoda. Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science, 309(5744):2207–2209, 2005.
  • [40] Andrea C Palk, Shareefa Dalvie, Jantina De Vries, Alicia R Martin, and Dan J Stein. Potential use of clinical polygenic risk scores in psychiatry–ethical implications and communicating high polygenic risk. Philosophy, Ethics, and Humanities in Medicine, 14(1):4, 2019.
  • [41] Judea Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science …, 1982.
  • [42] Karl Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
  • [43] Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
  • [44] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 2014.
  • [45] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • [46] Geoffrey Roeder, Yuhuai Wu, and David K. Duvenaud. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pages 6925–6934, 2017.
  • [47] Lawrence K Saul and Michael I Jordan. Exploiting tractable substructures in intractable networks. In Advances in neural information processing systems, pages 486–492, 1996.
  • [48] Guy Sella and Nicholas H. Barton. Thinking about the evolution of complex traits in the era of genome-wide association studies. Annual Review of Genomics and Human Genetics, 20(1):461–493, 2019. PMID: 31283361.
  • [49] Huwenbo Shi, Kathryn S. Burch, Ruth Johnson, Malika K. Freund, Gleb Kichaev, Nicholas Mancuso, Astrid M. Manuel, Natalie Dong, and Bogdan Pasaniuc. Localizing components of shared transethnic genetic architecture of complex traits from GWAS summary data. bioRxiv, page 858431, 2019.
  • [50] Yee W. Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in neural information processing systems, pages 1353–1360, 2007.
  • [51] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
  • [52] Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. Rank-normalization, folding, and localization: An improved \widehat{R} for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008, 2019.
  • [53] Bjarni J Vilhjálmsson, Jian Yang, Hilary K. Finucane, Alexander Gusev, Sara Lindström, Stephan Ripke, Giulio Genovese, Po-Ru Loh, Gaurav Bhatia, Ron Do, et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. The american journal of human genetics, 97(4):576–592, 2015.
  • [54] Wei Wang and Matthew Stephens. Empirical bayes matrix factorization. arXiv preprint arXiv:1802.06931, 2018.
  • [55] Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 41(8):2008–2026, 2018.
  • [56] Han Zhao, Mazen Melibari, and Pascal Poupart. On the relationship between sum-product networks and Bayesian networks. In International Conference on Machine Learning, pages 116–124, 2015.
  • [57] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.

Appendix A Analysis of naive Mean Field VI for the LDpred Model when P=1

When there is only one mutation, the naive mean field VI approach to the LDpred model simplifies to

Z \sim\text{Bernoulli}(1-p_{0})
\beta|Z \sim\mathcal{N}(0,\sigma^{2}_{Z})
\hat{\beta}|\beta \sim\mathcal{N}(\beta,\sigma^{2}_{e}).

In this simplified setting it is possible to obtain a closed form expression for the posterior. For the purposes of contrasting with VI, I consider the posterior probability that \beta comes from each component of the mixture distribution, p(Z=z|\hat{\beta}), as well as the posterior mean of \beta, \mathbb{E}[\beta|\hat{\beta}]. In particular, for the case \sigma_{0}^{2}=0,

p(Z=0|\hat{\beta}) =\frac{p_{0}}{p_{0}+(1-p_{0})\sqrt{2\pi\frac{1}{\frac{1}{\sigma_{e}^{2}}+\frac{1}{\sigma_{1}^{2}}}}\exp\Big\{\frac{\hat{\beta}^{2}}{2(\sigma_{e}^{2}+\frac{\sigma_{e}^{4}}{\sigma_{1}^{2}})}\Big\}} (2)
\mathbb{E}[\beta|\hat{\beta}] =\Big(1-p(Z=0|\hat{\beta})\Big)\frac{\hat{\beta}}{\frac{\sigma_{e}^{2}}{\sigma_{1}^{2}}+1}. (3)

In this case, the usual approach to mean field VI would be to find an approximate posterior that factorizes q(\beta,Z)=q(\beta)q(Z) and assume that q(\beta) and q(Z) are conditionally conjugate, which in this case would be that q(\beta)=\mathcal{N}(\mu,s^{2}) and q(Z)=\text{Bernoulli}(1-\psi_{0}). As stated in the main text, the ELBo is undefined if \sigma_{0}^{2}=0, so instead consider \sigma_{0}^{2} to be small but nonzero.

Under these assumptions, I show that for any \hat{\beta} there is a \sigma_{0}^{2} small enough such that the probability under q that Z=1 is approximately either 0 or 1, and as a result the variational posterior mean of \beta is either approximately 0 or approximately equal to the non-sparse case where p_{0}=0. That is, mean field VI either over-shrinks effects to zero or provides no more shrinkage than just having a single Gaussian prior on the effect sizes. In contrast, note that p(Z=0|\hat{\beta}) varies smoothly as a function of \hat{\beta}, and consequently by Equation 3, the posterior mean varies smoothly from shrinking tiny effects to zero to providing less shrinkage for large effects.

Theorem 3.

Let q_{\hat{\beta},\sigma_{0}^{2}}(\beta,Z) be the approximate posterior obtained from the LDpred model with P=1 for data \hat{\beta}. For any \delta, there exists an \epsilon such that for all \sigma_{0}^{2}<\epsilon, either:

q_{\hat{\beta},\sigma_{0}^{2}}(Z=0) \geq 1-\delta (4)
\Big|\mathbb{E}_{q_{\hat{\beta},\sigma_{0}^{2}}}[\beta]\Big| \leq\delta (5)

or

q_{\hat{\beta},\sigma_{0}^{2}}(Z=0) \leq\delta (6)
\Big|\mathbb{E}_{q_{\hat{\beta},\sigma_{0}^{2}}}[\beta]\Big| \geq\frac{|\hat{\beta}|}{\frac{\sigma_{e}^{2}}{\sigma_{1}^{2}}+1}-\delta, (7)

with the case depending on the values of p_{0}, \sigma_{e}^{2}, \sigma_{1}^{2}, and \hat{\beta}.

Proof.

I begin by writing the ELBo:

\text{ELBo} =\mathbb{E}_{q_{\hat{\beta},\sigma_{0}^{2}}}[\log p(\hat{\beta}|\beta)]-\text{KL}(q_{\hat{\beta},\sigma_{0}^{2}}(\beta,Z)||p(\beta,Z))
=\text{constant}-\frac{1}{2\sigma_{e}^{2}}(\mu^{2}+s^{2}-2\hat{\beta}\mu)+\frac{1}{2}\log s^{2}-\psi_{0}\log\psi_{0}-(1-\psi_{0})\log(1-\psi_{0})
\quad-\frac{1}{2}\psi_{0}\log\sigma^{2}_{0}-\frac{1}{2}(1-\psi_{0})\log\sigma_{1}^{2}-\Big(\frac{\psi_{0}}{2\sigma_{0}^{2}}+\frac{1-\psi_{0}}{2\sigma_{1}^{2}}\Big)(\mu^{2}+s^{2})
\quad+\psi_{0}\log p_{0}+(1-\psi_{0})\log(1-p_{0}). (8)

Taking partial derivatives I arrive at the equations for the critical points for \mu and s^{2}:

\frac{d\text{ELBo}}{d\mu} =\frac{\hat{\beta}}{\sigma^{2}_{e}}-\Big(\frac{1}{\sigma_{e}^{2}}+\frac{\psi_{0}}{\sigma_{0}^{2}}+\frac{1-\psi_{0}}{\sigma_{1}^{2}}\Big)\mu
\frac{d\text{ELBo}}{ds^{2}} =-\frac{1}{2\sigma_{e}^{2}}+\frac{1}{2s^{2}}-\frac{\psi_{0}}{2\sigma_{0}^{2}}-\frac{1-\psi_{0}}{2\sigma_{1}^{2}}.

Rearranging I obtain

\begin{aligned}
\mu &= \frac{\hat{\beta}}{1+\psi_{0}\frac{\sigma_{e}^{2}}{\sigma_{0}^{2}}+(1-\psi_{0})\frac{\sigma_{e}^{2}}{\sigma_{1}^{2}}} \qquad &&(9)\\
s^{2} &= \frac{1}{1/\sigma_{e}^{2}+\psi_{0}/\sigma_{0}^{2}+(1-\psi_{0})/\sigma_{1}^{2}}. &&(10)
\end{aligned}

Now, I show that the ELBo is larger at $\psi_{0}=0$ or $\psi_{0}=1$ than at any other $\psi_{0}$ for which $\lim_{\sigma_{0}^{2}\downarrow 0}\psi_{0}/\sigma_{0}^{2}>0$. This indicates that the optimal value of $\psi_{0}$ must converge to either 0 or 1 in the limit of small $\sigma_{0}^{2}$ and, furthermore, that if $\psi_{0}$ converges to 0 it must do so faster than $\sigma_{0}^{2}$. Taking limits in Equations 9 and 10 under these conditions gives Equations 4, 5, 6, and 7.

If $\psi_{0}=0$, then plugging $\psi_{0}$ into Equations 9 and 10, it is clear that the values of both $\mu$ and $s^{2}$ are independent of $\sigma_{0}^{2}$. Therefore, $\text{ELBo}(\psi_{0}=0)=O(1)$.

On the other hand, plugging $\psi_{0}=1$ into Equations 9 and 10 gives $\mu=\sigma_{0}^{2}\hat{\beta}/(\sigma_{0}^{2}+\sigma_{e}^{2})$ and $s^{2}=\sigma_{0}^{2}/(1+\frac{\sigma_{0}^{2}}{\sigma_{e}^{2}})$. Therefore, $\log s^{2}=\log\sigma_{0}^{2}-\log(1+\frac{\sigma_{0}^{2}}{\sigma_{e}^{2}})=\log\sigma_{0}^{2}+O(\sigma_{0}^{2})$, and $\mu^{2}+s^{2}=O(\sigma_{0}^{2})$. Plugging these results into the ELBo of Equation 8 gives

\text{ELBo}(\psi_{0}=1) = \frac{1}{2}\log s^{2} - \frac{1}{2}\log\sigma_{0}^{2} - \frac{1}{2\sigma_{0}^{2}}(\mu^{2}+s^{2}) + O(1) = O(1).

Now, for fixed $\psi_{0}\in(0,1)$,

\begin{aligned}
\log s^{2} &= \log\sigma_{0}^{2} - \log\Big(\frac{\sigma_{0}^{2}}{\sigma_{e}^{2}}+\psi_{0}+(1-\psi_{0})\frac{\sigma_{0}^{2}}{\sigma_{1}^{2}}\Big) = \log\sigma_{0}^{2} - \log\psi_{0} + O(\sigma_{0}^{2}) = \log\sigma_{0}^{2} + O(1)\\
\mu^{2}+s^{2} &= O(\sigma_{0}^{2})
\end{aligned}

which gives an ELBo of

\text{ELBo}(0<\psi_{0}<1) = \frac{1}{2}\log s^{2} - \frac{1}{2}\psi_{0}\log\sigma_{0}^{2} - \frac{\psi_{0}}{2\sigma_{0}^{2}}(\mu^{2}+s^{2}) + O(1) = \frac{1}{2}(1-\psi_{0})\log\sigma_{0}^{2} + O(1).

For $\psi_{0}\in(0,1)$,

\lim_{\sigma_{0}^{2}\downarrow 0}\frac{1}{2}(1-\psi_{0})\log\sigma_{0}^{2} = -\infty,

showing that in the limit of small $\sigma_{0}^{2}$, $\psi_{0}$ must converge to either 0 or 1. Now, because $\psi_{0}/\sigma_{0}^{2}$ appears in Equations 9 and 10, some care must be taken in the case where $\psi_{0}$ converges to 0. In particular, I show that the ELBo is larger when $\psi_{0}=0$ than it is when $\lim_{\sigma_{0}^{2}\downarrow 0}\psi_{0}/\sigma_{0}^{2}=c$ for some positive, finite constant $c$, so that terms like $\psi_{0}/\sigma_{0}^{2}$ can be neglected in the limit when obtaining Equation 7 from Equation 9.

Noting that by Equations 9 and 10

\mu = \frac{s^{2}}{\sigma_{e}^{2}}\hat{\beta},

and collecting terms and rearranging the ELBo, assuming that $\psi_{0}<1$ and $\psi_{0}\downarrow 0$, $\sigma_{0}^{2}\downarrow 0$, $\psi_{0}/\sigma_{0}^{2}\rightarrow c$, results in

\begin{aligned}
\text{ELBo} &= -\frac{1}{2} + \frac{\hat{\beta}^{2}s^{2}}{2\sigma_{e}^{4}} + \frac{1}{2}\log s^{2} - \psi_{0}\log\psi_{0} - (1-\psi_{0})\log(1-\psi_{0})\\
&\quad - \frac{1}{2}\psi_{0}\log\sigma_{0}^{2} - \frac{1}{2}(1-\psi_{0})\log\sigma_{1}^{2} + \psi_{0}\log p_{0} + (1-\psi_{0})\log(1-p_{0})\\
&= -\frac{1}{2} + \log(1-p_{0}) - \frac{1}{2}\log\sigma_{1}^{2} + \frac{\hat{\beta}^{2}s^{2}}{2\sigma_{e}^{4}} + \frac{1}{2}\log s^{2} + o(1).
\end{aligned}

Now, considering the ELBo as a function of $\psi_{0}$, I take the difference between the ELBo evaluated at $\psi_{0}=0$ and the ELBo evaluated at $\psi_{0}<1$ with $\psi_{0}/\sigma_{0}^{2}\rightarrow c$, which I denote $\Delta\text{ELBo}$:

\begin{aligned}
\Delta\text{ELBo} &= \frac{\hat{\beta}^{2}}{2\sigma_{e}^{4}}\big(s^{2}|_{\psi_{0}=0}-s^{2}|_{\psi_{0}}\big) + \frac{1}{2}\log s^{2}|_{\psi_{0}=0} - \frac{1}{2}\log s^{2}|_{\psi_{0}} + o(1)\\
&> \frac{\hat{\beta}^{2}}{2\sigma_{e}^{4}}\big(s^{2}|_{\psi_{0}=0}-s^{2}|_{\psi_{0}}\big) + o(1)\\
&= \frac{\hat{\beta}^{2}}{2\sigma_{e}^{4}}\left(\frac{1}{1/\sigma_{e}^{2}+1/\sigma_{1}^{2}} - \frac{1}{1/\sigma_{e}^{2}+\psi_{0}/\sigma_{0}^{2}+(1-\psi_{0})/\sigma_{1}^{2}}\right) + o(1)\\
&= \frac{\hat{\beta}^{2}}{2\sigma_{e}^{4}}\left(\frac{\psi_{0}/\sigma_{0}^{2}-\psi_{0}/\sigma_{1}^{2}}{(1/\sigma_{e}^{2}+1/\sigma_{1}^{2})(1/\sigma_{e}^{2}+\psi_{0}/\sigma_{0}^{2}+(1-\psi_{0})/\sigma_{1}^{2})}\right) + o(1)\\
&= \frac{\hat{\beta}^{2}}{2\sigma_{e}^{4}}\left(\frac{c}{(1/\sigma_{e}^{2}+1/\sigma_{1}^{2})(1/\sigma_{e}^{2}+c+1/\sigma_{1}^{2})}\right) + o(1),
\end{aligned}

where the inequality follows from the fact that $s^{2}$ is largest when $\psi_{0}=0$. This quantity is positive for any sufficiently small $\sigma_{0}^{2}$, and therefore if $\psi_{0}$ converges to 0 it must do so faster than $\sigma_{0}^{2}$, completing the proof. ∎

The fact that under the VI approximate posterior $q(Z=1)$ is either close to 0 or close to 1, while under the true posterior $p(Z=1|\hat{\beta})$ varies smoothly as a function of $\hat{\beta}$, suggests a thresholding phenomenon: for $\hat{\beta}$ slightly below the threshold, the VI approximate posterior dramatically over-shrinks, while for $\hat{\beta}$ slightly above the threshold it dramatically under-shrinks, essentially performing hard thresholding. In Figure 3 I show numerically that this is indeed the case, highlighting the failure of mean field VI to provide a reasonable approximation to the posterior for even this toy model.

Figure 3: Thresholding phenomenon for naive mean field VI for the LDpred model
The approximate posterior mean when using naive mean field VI undergoes a thresholding phenomenon. For values of $\hat{\beta}$ close to zero, the VI scheme over-shrinks the posterior mean essentially to zero, while above the threshold, the VI scheme under-shrinks, essentially matching the model without sparsity. The results here were generated with $p_{0}=0.99$, $\sigma_{1}^{2}=\sigma_{e}^{2}=1$, and the VI model was fit using $\sigma_{0}^{2}\approx 10^{-22}$. The results are qualitatively similar for all $\sigma_{0}^{2}\leq 0.01$, and for larger $\sigma_{0}^{2}$ the VI model significantly under-shrinks for small $\hat{\beta}$.
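The dichotomy established by Theorem 3 can also be checked numerically without running coordinate ascent: for each $\hat{\beta}$, evaluate the ELBo of Equation 8 at the two candidate solutions $\psi_{0}=0$ and $\psi_{0}=1$ (with $\mu$ and $s^{2}$ given by Equations 9 and 10) and keep whichever is larger. The Python sketch below does exactly that; it is illustrative only, the constants follow the Figure 3 caption, and the function names are my own.

```python
import numpy as np

P0, SIGMA_E2, SIGMA_1_2 = 0.99, 1.0, 1.0  # settings from the Figure 3 caption

def critical_point(beta_hat, psi0, sigma_0_2):
    """mu and s^2 from Equations 9 and 10 for a fixed psi0."""
    s2 = 1.0 / (1 / SIGMA_E2 + psi0 / sigma_0_2 + (1 - psi0) / SIGMA_1_2)
    return (beta_hat / SIGMA_E2) * s2, s2

def elbo(beta_hat, psi0, sigma_0_2):
    """Equation 8 (dropping the additive constant), evaluated at the critical point;
    here psi0 is 0 or 1, so the entropy terms vanish."""
    mu, s2 = critical_point(beta_hat, psi0, sigma_0_2)
    return (-(mu**2 + s2 - 2 * beta_hat * mu) / (2 * SIGMA_E2) + 0.5 * np.log(s2)
            - 0.5 * psi0 * np.log(sigma_0_2) - 0.5 * (1 - psi0) * np.log(SIGMA_1_2)
            - (psi0 / (2 * sigma_0_2) + (1 - psi0) / (2 * SIGMA_1_2)) * (mu**2 + s2)
            + psi0 * np.log(P0) + (1 - psi0) * np.log(1 - P0))

def naive_vi_mean(beta_hat, sigma_0_2=1e-22):
    """E_q[beta] = mu at whichever candidate (psi0 = 0 or psi0 = 1) has the larger ELBo."""
    psi0 = max((0.0, 1.0), key=lambda p: elbo(beta_hat, p, sigma_0_2))
    return critical_point(beta_hat, psi0, sigma_0_2)[0]

# The abrupt jump from ~0 to ~beta_hat/2 as beta_hat increases is the hard thresholding
# visible in Figure 3; the exact posterior mean (Equation 3) instead varies smoothly.
print([round(naive_vi_mean(b), 2) for b in np.linspace(0.0, 10.0, 21)])
```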

Appendix B Proof of Theorem 1

Proof.

First, note that any measure $f_{\text{mix}}\in F_{\text{mix}}$ may clearly be written as

\frac{df_{\text{mix}}}{dH}(x) = \sum_{i=1}^{K}\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}\pi_{i}\exp\left\{\langle\eta_{i},T_{i}\rangle - A_{i}(\eta_{i})\right\}\frac{dH_{i}}{dH}(x),

because each $f_{i}$ is a member of an exponential family. Since all of the events in the indicators are mutually exclusive by hypothesis, the resulting measure may be rewritten as

\begin{aligned}
\frac{df_{\text{mix}}}{dH}(x) &= \prod_{i=1}^{K}\exp\left\{\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}\left(\log\pi_{i}+\langle\eta_{i},T_{i}\rangle-A_{i}(\eta_{i})\right)\right\}\left(\frac{dH_{i}}{dH}(x)\right)^{\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}}\\
&= \left[\prod_{i=1}^{K}\left(\frac{dH_{i}}{dH}(x)\right)^{\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}}\right]\exp\left\{\sum_{i=1}^{K}\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}\left(\log\pi_{i}+\langle\eta_{i},T_{i}\rangle-A_{i}(\eta_{i})\right)\right\}\\
&= \left[\prod_{i=1}^{K}\left(\frac{dH_{i}}{dH}(x)\right)^{\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}}\right]\exp\left\{\sum_{i=1}^{K}\left[\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}\left(\log\pi_{i}-A_{i}(\eta_{i})\right)+\langle\eta_{i},\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}T_{i}\rangle\right]\right\}\\
&= \frac{dH_{\text{mix}}}{dH}(x)\exp\left\{\langle\eta_{\text{mix}},T_{\text{mix}}(x)\rangle-A_{\text{mix}}(\eta_{\text{mix}})\right\}
\end{aligned}

completing the proof. ∎
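As a concrete illustration of this construction (spelled out here for exposition; it is the same spike-and-slab family used in Appendix D, with constants absorbed slightly differently into the base measure), take $K=2$ with $f_{1}$ the point mass at zero, a trivial exponential family with no sufficient statistics, and $f_{2}$ a Gaussian $\mathcal{N}(\mu,s^{2})$ restricted to $\mathbb{R}\setminus\{0\}$, with weights $\pi_{1}=\psi$ and $\pi_{2}=1-\psi$. Choosing $H=\delta_{0}+\text{Lebesgue measure}$, the display above specializes to

\frac{df_{\text{mix}}}{dH}(x) = \exp\left\{-\frac{1}{2s^{2}}\,\mathbb{I}\{x\neq 0\}x^{2} + \frac{\mu}{s^{2}}\,\mathbb{I}\{x\neq 0\}x + \left(\log\frac{\psi}{1-\psi}+\frac{\mu^{2}}{2s^{2}}+\frac{1}{2}\log(2\pi s^{2})\right)\mathbb{I}\{x=0\} - A_{\text{mix}}\right\},

with $A_{\text{mix}} = -\log(1-\psi)+\frac{\mu^{2}}{2s^{2}}+\frac{1}{2}\log(2\pi s^{2})$; the sufficient statistics $\big(\mathbb{I}\{x\neq 0\}x^{2},\,\mathbb{I}\{x\neq 0\}x,\,\mathbb{I}\{x=0\}\big)$ and natural parameters match those used in Appendix D up to the factor of $2\pi$ absorbed into the base measure there.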

Appendix C Proof of Theorem 2

Proof.

Begin by noting that, by the conjugacy conditions (see, e.g., [13]), for any measure $f_{Y|X}\in F_{Y|X}$,

df_{Y|X}(y|x) = dH^{*}(y)\exp\left\{\left\langle\left[T^{*}_{Y}(y),\alpha\right],T_{\text{prior}}(x)\right\rangle\right\},

for some $\alpha$ and $H^{*}(y)$, where $T_{\text{prior}}$ is assumed without loss of generality to be ordered in a particular way, and $T^{*}_{Y}$ is the subset of sufficient statistics of $f_{Y|X}$ that are coefficients of $T_{\text{prior}}$.

Now, since the events $\left\{x\in\mathcal{S}_{i}\right\}$ are mutually exclusive and exactly one such event occurs for each $x$,

\begin{aligned}
df_{Y|X}(y|x) &= dH^{*}(y)\exp\left\{\sum_{i=1}^{K}\left\langle\left[T^{*}_{Y}(y),\alpha\right],\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}T_{\text{prior}}(x)\right\rangle\right\}\\
&= dH^{*}(y)\exp\left\{\sum_{i=1}^{K}\left\langle\left[T^{*}_{Y}(y),\alpha\right],\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}(M_{i}T_{i}(x)+v_{i})\right\rangle\right\}\\
&= dH^{*}(y)\exp\Bigg\{\sum_{i=1}^{K}\left\langle M_{i}^{T}\left[T^{*}_{Y}(y),\alpha\right],\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}T_{i}(x)\right\rangle\\
&\qquad\qquad+\sum_{i=1}^{K-1}\left(v_{i}^{T}\left[T^{*}_{Y}(y),\alpha\right]-v_{K}^{T}\left[T^{*}_{Y}(y),\alpha\right]\right)\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}\\
&\qquad\qquad+v_{K}^{T}\left[T^{*}_{Y}(y),\alpha\right]\Bigg\}
\end{aligned}

where I used the hypothesis on the sufficient statistics $T_{1},\ldots,T_{K}$ for the second equality. Multiplying by an arbitrary measure $f_{\text{mix}}\in F_{\text{mix}}$ and collecting terms, I obtain

\begin{aligned}
df_{X|Y}(x|y) &\propto \exp\Bigg\{\sum_{i=1}^{K}\left\langle M_{i}^{T}\left[T^{*}_{Y}(y),\alpha\right]+\eta_{i},\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}T_{i}(x)\right\rangle\\
&\qquad\qquad+\sum_{i=1}^{K-1}\Big(v_{i}^{T}\left[T^{*}_{Y}(y),\alpha\right]-v_{K}^{T}\left[T^{*}_{Y}(y),\alpha\right]\\
&\qquad\qquad\qquad+\log\pi_{i}-A_{i}(\eta_{i})-\log\pi_{K}+A_{K}(\eta_{K})\Big)\mathbb{I}\left\{x\in\mathcal{S}_{i}\right\}\Bigg\}\,dH_{\text{mix}}(x)\\
&= \exp\left\{\langle\eta_{\text{mix}}^{*},T_{\text{mix}}(x)\rangle\right\}dH_{\text{mix}}(x),
\end{aligned}

where $\eta_{\text{mix}}^{*}$ is the updated parameter obtained by collecting terms, showing that the posterior is in the same exponential family as the prior. ∎

Appendix D VI Schemes for the LDpred model

Recall that the naive VI scheme introduces auxiliary variables to approximately model sparsity as in Equation 1. The natural mean field approach would then be to approximate the posterior over $\beta_{j}$ as a Gaussian with mean $\mu_{j}$ and variance $s^{2}_{j}$, and to approximate the posterior over $Z_{j}$ as a Bernoulli with probability $\psi_{j}$ of being zero, to maintain conditional conjugacy.

Routine calculations then show that the coordinate ascent updates are

\begin{aligned}
\psi_{i} &\leftarrow \frac{p_{0}\exp\left\{-\frac{1}{2}\log\sigma_{0}^{2}-\frac{1}{2\sigma_{0}^{2}}(\mu_{i}^{2}+s_{i}^{2})\right\}}{p_{0}\exp\left\{-\frac{1}{2}\log\sigma_{0}^{2}-\frac{1}{2\sigma_{0}^{2}}(\mu_{i}^{2}+s_{i}^{2})\right\}+(1-p_{0})\exp\left\{-\frac{1}{2}\log\sigma_{1}^{2}-\frac{1}{2\sigma_{1}^{2}}(\mu_{i}^{2}+s_{i}^{2})\right\}}\\
\mu_{i} &\leftarrow \frac{\hat{\beta}_{i}-\sum_{j\neq i}\mathbf{X}_{ij}\mu_{j}}{\psi_{i}\sigma_{e}^{2}/\sigma_{0}^{2}+(1-\psi_{i})\sigma_{e}^{2}/\sigma_{1}^{2}+\mathbf{X}_{ii}}\\
s^{2}_{i} &\leftarrow \frac{1}{\psi_{i}/\sigma_{0}^{2}+(1-\psi_{i})/\sigma_{1}^{2}+\mathbf{X}_{ii}/\sigma_{e}^{2}}.
\end{aligned}

Using Theorems 1 and 2 it is possible to derive an alternative VI scheme that eschews the need for auxiliary variables. In particular, the set of distributions containing only the point mass at 0 is trivially an exponential family, and the support of distributions in that family does not overlap with the set of Gaussians supported on $\mathbb{R}\setminus\left\{0\right\}$. Therefore, the set of distributions that are mixtures of a Gaussian and a point mass at 0 is also an exponential family by Theorem 1. Then, by Theorem 2, because a Gaussian prior on the mean of a Gaussian is conjugate, and the sufficient statistics of a Gaussian are constant on the set $\left\{0\right\}$, this mixture distribution is also a conjugate prior for the mean of a Gaussian. The natural approximation to make for the variational posterior over $\beta_{j}$ would then lie in the same exponential family: a mixture of a Gaussian with mean $\mu_{j}$ and variance $s^{2}_{j}$ and a point mass at 0, with probability $\psi_{j}$ of being 0.

Because the model is conjugate and the distributions are in the exponential family, the optimal updates for the natural parameters can be obtained from

q(\beta_{i}) \propto \exp\left\{\mathbb{E}_{-i}\log P(\hat{\beta}|\beta)+\log P(\beta_{i})\right\} \qquad (11)

where $\mathbb{E}_{-i}[\cdot]$ is shorthand for the expectation under the approximate posterior with respect to all variables except $\beta_{i}$. The posterior mean under the variational approximation is $\mathbb{E}_{q}[\beta_{i}]=(1-\psi_{i})\mu_{i}$, and so the first term expands to

\begin{aligned}
\mathbb{E}_{-i}\log P(\hat{\beta}|\beta) &= -\frac{1}{2\sigma^{2}_{e}}\mathbb{E}_{-i}(\hat{\beta}-\mathbf{X}\beta)^{T}\mathbf{X}^{-1}(\hat{\beta}-\mathbf{X}\beta)\\
&= -\frac{1}{2\sigma_{e}^{2}}\mathbf{X}_{ii}\beta_{i}^{2}+\frac{1}{\sigma_{e}^{2}}\left(\hat{\beta}_{i}-\sum_{j\neq i}\mathbf{X}_{ij}(1-\psi_{j})\mu_{j}\right)\beta_{i}+\text{const}\\
&= -\frac{1}{2\sigma_{e}^{2}}\mathbf{X}_{ii}\mathbb{I}\left\{\beta_{i}\neq 0\right\}\beta_{i}^{2}+\frac{1}{\sigma_{e}^{2}}\left(\hat{\beta}_{i}-\sum_{j\neq i}\mathbf{X}_{ij}(1-\psi_{j})\mu_{j}\right)\mathbb{I}\left\{\beta_{i}\neq 0\right\}\beta_{i}+\text{const}.
\end{aligned}

The natural parameters for a Gaussian with mean $\mu_{i}$ and variance $s^{2}_{i}$ are $-\frac{1}{2s_{i}^{2}}$ and $\frac{\mu_{i}}{s_{i}^{2}}$, with log normalizer $\frac{\mu_{i}^{2}}{2s_{i}^{2}}+\frac{1}{2}\log s_{i}^{2}$ and corresponding sufficient statistics $\beta_{i}^{2}$ and $\beta_{i}$. By Theorem 1, the natural parameters for the mixture distribution are therefore $-\frac{1}{2s_{i}^{2}}$, $\frac{\mu_{i}}{s_{i}^{2}}$, and $\log\psi_{i}-\log(1-\psi_{i})+\frac{\mu_{i}^{2}}{2s_{i}^{2}}+\frac{1}{2}\log s_{i}^{2}$, with corresponding sufficient statistics $\mathbb{I}\left\{\beta_{i}\neq 0\right\}\beta_{i}^{2}$, $\mathbb{I}\left\{\beta_{i}\neq 0\right\}\beta_{i}$, and $\mathbb{I}\left\{\beta_{i}=0\right\}$. Matching the coefficients of the sufficient statistics in Equation 11 and performing some algebra produces

\begin{aligned}
\psi_{i} &\leftarrow 1-\frac{1}{1+\frac{p_{0}}{1-p_{0}}\sqrt{1+\mathbf{X}_{ii}\sigma_{1}^{2}/\sigma_{e}^{2}}\exp\left\{\frac{-\big(\hat{\beta}_{i}-\sum_{j\neq i}\mathbf{X}_{ij}\mu_{j}(1-\psi_{j})\big)^{2}}{2\sigma_{e}^{4}/\sigma_{1}^{2}+2\sigma_{e}^{2}\mathbf{X}_{ii}}\right\}}\\
\mu_{i} &\leftarrow \frac{\hat{\beta}_{i}-\sum_{j\neq i}\mathbf{X}_{ij}\mu_{j}(1-\psi_{j})}{\sigma_{e}^{2}/\sigma_{1}^{2}+\mathbf{X}_{ii}}\\
s^{2}_{i} &\leftarrow \frac{1}{1/\sigma_{1}^{2}+\mathbf{X}_{ii}/\sigma_{e}^{2}}.
\end{aligned}

When fitting either VI scheme, I performed 100 iterations of coordinate ascent using the above updates. For the naive scheme, for coordinate $i$, I update $\mu_{i}$ and $s_{i}^{2}$ first, then update $\psi_{i}$ before moving on to coordinate $i+1$. For initialization, $\mu_{i}=0$ for all $i$, and $s_{i}^{2}=\sigma_{1}^{2}+\sigma_{e}^{2}$. For the naive scheme, $\psi_{i}$ was initialized to 1, while for the new scheme, $\psi_{i}$ was initialized to $p_{0}$.

In both VI schemes, the rate-limiting step is clearly computing terms that involve summations of the type $\sum_{j\neq i}$, which take $O(P)$ time, where $P$ is the number of variables. Since there are $O(P)$ variational parameters to update at each iteration, the runtime of each iteration is thus $O(P^{2})$.
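For reference, the updates above are straightforward to implement. The following Python sketch performs one coordinate-ascent sweep of the new scheme; the array names and data layout are my own choices, and the $O(P)$ inner products are computed naively, matching the complexity discussion above.

```python
import numpy as np

def new_scheme_sweep(beta_hat, X, mu, psi, p0, sigma_e2, sigma_1_2):
    """One coordinate-ascent sweep of the VI scheme based on the non-overlapping
    mixtures trick, implementing the updates displayed above. beta_hat is the length-P
    vector of marginal estimates and X the P x P matrix from the LDpred likelihood."""
    P = len(beta_hat)
    s2 = np.empty(P)
    for i in range(P):
        # Residual with the expected contribution of all other variables removed;
        # note E_q[beta_j] = (1 - psi_j) mu_j under the mixture posterior.
        resid = beta_hat[i] - X[i] @ (mu * (1 - psi)) + X[i, i] * mu[i] * (1 - psi[i])
        # Probability that beta_i is exactly zero under the approximate posterior.
        odds = (p0 / (1 - p0)) * np.sqrt(1 + X[i, i] * sigma_1_2 / sigma_e2) \
            * np.exp(-resid**2 / (2 * sigma_e2**2 / sigma_1_2 + 2 * sigma_e2 * X[i, i]))
        psi[i] = 1 - 1 / (1 + odds)
        # Mean and variance of the Gaussian component of the mixture.
        mu[i] = resid / (sigma_e2 / sigma_1_2 + X[i, i])
        s2[i] = 1 / (1 / sigma_1_2 + X[i, i] / sigma_e2)
    return mu, s2, psi
```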

Appendix E VI Schemes for sparse probabilistic PCA

First I derive the naive VI scheme. For the auxiliary model,

\begin{aligned}
Y_{pk} &\sim \text{Bernoulli}(1-p_{0})\\
\mathbf{W}_{pk}\,|\,Y_{pk} &\sim \mathcal{N}(0,\sigma_{Y_{pk}}^{2})\\
Z_{n} &\sim \mathcal{N}(0,\mathbf{I}_{K})\\
X_{n}\,|\,Z_{n},\mathbf{W} &\sim \mathcal{N}(\mathbf{W}Z_{n},\sigma_{e}^{2}\mathbf{I}_{P}).
\end{aligned}

The natural mean field VI scheme for this model would be to assume that all variables are independent and that, under the approximate posterior, $Y_{pk}$ is Bernoulli with probability $\psi_{pk}$ of being zero, $\mathbf{W}_{pk}$ is Gaussian with mean $\mu_{W_{pk}}$ and variance $s^{2}_{W_{pk}}$, and $Z_{n}$ is multivariate normal with mean $\mu_{Z_{n}}$ and covariance matrix $\mathbf{S}_{Z_{n}}$. Below, I use the notation

\mathbf{X} := \begin{pmatrix}|&&|\\ X_{1}&\cdots&X_{n}\\ |&&|\end{pmatrix}.

Routine calculations result in the following updates:

\begin{aligned}
\mu_{Z_{n}} &\leftarrow \frac{1}{\sigma_{e}^{2}}\left(\frac{1}{\sigma_{e}^{2}}\mathbb{E}[\mathbf{W}^{T}\mathbf{W}]+\mathbf{I}_{K}\right)^{-1}\mathbb{E}[\mathbf{W}]^{T}X_{n}\\
\mathbf{S}_{Z_{n}} &\leftarrow \left(\frac{1}{\sigma_{e}^{2}}\mathbb{E}[\mathbf{W}^{T}\mathbf{W}]+\mathbf{I}_{K}\right)^{-1}\\
s^{2}_{W_{pk}} &\leftarrow \left[\frac{1}{\sigma_{e}^{2}}\left(\sum_{n}\mu_{Z_{n},k}^{2}+\mathbf{S}_{Z_{n},kk}\right)+\frac{\psi_{pk}}{\sigma_{0}^{2}}+\frac{1-\psi_{pk}}{\sigma_{1}^{2}}\right]^{-1}\\
\mu_{W_{pk}} &\leftarrow \frac{s^{2}_{W_{pk}}}{\sigma_{e}^{2}}\left[\left(\sum_{n}\mathbf{X}_{np}\mu_{Z_{n},k}\right)-\left(\sum_{n}\sum_{\ell\neq k}\mu_{W_{p\ell}}\mathbf{S}_{Z_{n},k\ell}\right)\right]\\
\psi_{pk} &\leftarrow 1-\frac{1}{1+\frac{p_{0}}{1-p_{0}}\sqrt{\sigma_{1}^{2}/\sigma_{0}^{2}}\exp\left\{-\frac{1}{2}\left(\frac{1}{\sigma_{0}^{2}}-\frac{1}{\sigma_{1}^{2}}\right)(\mu_{W_{pk}}^{2}+s^{2}_{W_{pk}})\right\}}
\end{aligned}

where

\mathbb{E}[\mathbf{W}]_{pk} = \mu_{W_{pk}}

and

\mathbb{E}[\mathbf{W}^{T}\mathbf{W}]_{k\ell} = \sum_{p}\mu_{W_{pk}}\mu_{W_{p\ell}} + \delta_{k\ell}\sum_{p}s^{2}_{W_{pk}}.

Now I derive a VI scheme using Theorems 1 and 2. The calculations are largely the same as in Appendix D, so a number of details are omitted. Because I am again replacing a Gaussian by a mixture of a Gaussian and a point mass at zero, I assume the posterior for $\mathbf{W}_{pk}$ is a mixture of a point mass at zero and a Gaussian with mean $\mu_{W_{pk}}$ and variance $s^{2}_{W_{pk}}$, with probability $\psi_{pk}$ of being zero. Working through the algebra as in the LDpred model results in:

\begin{aligned}
\mu_{Z_{n}} &\leftarrow \frac{1}{\sigma_{e}^{2}}\left(\frac{1}{\sigma_{e}^{2}}\mathbb{E}[\mathbf{W}^{T}\mathbf{W}]+\mathbf{I}_{K}\right)^{-1}\mathbb{E}[\mathbf{W}]^{T}X_{n}\\
\mathbf{S}_{Z_{n}} &\leftarrow \left(\frac{1}{\sigma_{e}^{2}}\mathbb{E}[\mathbf{W}^{T}\mathbf{W}]+\mathbf{I}_{K}\right)^{-1}\\
s^{2}_{W_{pk}} &\leftarrow \left[\frac{1}{\sigma_{e}^{2}}\left(\sum_{n}\mu_{Z_{n},k}^{2}+\mathbf{S}_{Z_{n},kk}\right)+\frac{1}{\sigma_{1}^{2}}\right]^{-1}\\
\mu_{W_{pk}} &\leftarrow \frac{s^{2}_{W_{pk}}}{\sigma_{e}^{2}}\left[\left(\sum_{n}\mathbf{X}_{np}\mu_{Z_{n},k}\right)-\left(\sum_{n}\sum_{\ell\neq k}\mu_{W_{p\ell}}(1-\psi_{p\ell})\mathbf{S}_{Z_{n},k\ell}\right)\right]\\
\psi_{pk} &\leftarrow 1-\frac{1}{1+\frac{p_{0}}{1-p_{0}}\sqrt{\sigma_{1}^{2}/s^{2}_{W_{pk}}}\exp\left\{-\frac{\mu_{W_{pk}}^{2}}{2s^{2}_{W_{pk}}}\right\}}
\end{aligned}

where

\mathbb{E}[\mathbf{W}]_{pk} = \mu_{W_{pk}}(1-\psi_{pk})

and

\mathbb{E}[\mathbf{W}^{T}\mathbf{W}]_{k\ell} = \sum_{p}\mu_{W_{pk}}(1-\psi_{pk})\mu_{W_{p\ell}}(1-\psi_{p\ell}) + \delta_{k\ell}\sum_{p}s^{2}_{W_{pk}}(1-\psi_{pk}).

When fitting both VI schemes, I performed 250 iterations of coordinate ascent. For the naive scheme, I first updated every coordinate of $Z$, then for each coordinate updated $Y_{pk}$ and then $\mathbf{W}_{pk}$. For the new scheme, I first updated $Z$ coordinate-wise and then updated $\mathbf{W}$ coordinate-wise. Using the singular value decomposition $\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$, I initialized $\mu_{Z_{n}}=\mathbf{U}_{n}$, $\mathbf{S}_{Z_{n}}=\mathbf{I}_{2}$, $\mu_{W_{pk}}=\mathbf{V}_{pk}\mathbf{\Sigma}_{kk}$, $s^{2}_{W_{pk}}=1$, and $\psi_{pk}=1\times 10^{-10}$ for both schemes.

The updates for both models require the inversion of a $K\times K$ matrix, which is $O(K^{3})$, and computing $\mathbb{E}[\mathbf{W}^{T}\mathbf{W}]$, which is $O(PK^{2})$, but these can be precomputed before updating each $\mu_{Z_{n}}$ and $\mathbf{S}_{Z_{n}}$. Then, updating each $\mu_{Z_{n}}$ requires $O(K^{2}+PK)$ time. Therefore, updating all $\mu_{Z_{n}}$ and $\mathbf{S}_{Z_{n}}$ requires $O(NPK)$ time, assuming that $K\ll N$ and $K\ll P$. For fixed $\mu_{Z_{n}}$ and $\mathbf{S}_{Z_{n}}$, updating $\mu_{W_{pk}}$, $s^{2}_{W_{pk}}$, and $\psi_{pk}$ is limited by computing $\sum_{n}\sum_{\ell\neq k}\mu_{W_{p\ell}}\mathbf{S}_{Z_{n},k\ell}$ or $\sum_{n}\sum_{\ell\neq k}\mu_{W_{p\ell}}(1-\psi_{p\ell})\mathbf{S}_{Z_{n},k\ell}$, which requires $O(NK)$ time, so updating all $\mu_{W_{pk}}$, $s^{2}_{W_{pk}}$, and $\psi_{pk}$ requires $O(NPK^{2})$ time. Overall, each iteration of coordinate ascent requires $O(NPK^{2})$ time.
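The following Python sketch implements one iteration of the new scheme exactly as displayed above, vectorizing over $n$ and $p$ where the updates allow it; the array names and shapes (observations stored as the rows of an $N\times P$ matrix) are my own conventions.

```python
import numpy as np

def sparse_ppca_iteration(X, mu_W, s2_W, psi, p0, sigma_e2, sigma_1_2):
    """One coordinate-ascent iteration of the sparse probabilistic PCA scheme based on
    the non-overlapping mixtures trick, implementing the updates displayed above.
    Conventions (my own): X is N x P with observations as rows; mu_W, s2_W, psi are P x K."""
    N = X.shape[0]
    K = mu_W.shape[1]
    # Moments of W under the current approximate posterior.
    EW = mu_W * (1 - psi)                                        # E[W], P x K
    EWtW = EW.T @ EW + np.diag((s2_W * (1 - psi)).sum(axis=0))   # E[W^T W], K x K
    # Update the latent factors Z_n; the covariance is shared across n.
    S_shared = np.linalg.inv(EWtW / sigma_e2 + np.eye(K))
    mu_Z = (X @ EW) @ S_shared / sigma_e2                        # N x K
    sum_S = N * S_shared                                         # sum over n of S_{Z_n}
    # Update the loadings W_{pk} column by column.
    EZ2 = (mu_Z**2).sum(axis=0) + np.diag(sum_S)                 # sum over n of E[Z_{n,k}^2]
    for k in range(K):
        s2_W[:, k] = 1.0 / (EZ2[k] / sigma_e2 + 1.0 / sigma_1_2)
        cross = sum(mu_W[:, l] * (1 - psi[:, l]) * sum_S[k, l]
                    for l in range(K) if l != k)
        mu_W[:, k] = s2_W[:, k] / sigma_e2 * (X.T @ mu_Z[:, k] - cross)
        odds = (p0 / (1 - p0)) * np.sqrt(sigma_1_2 / s2_W[:, k]) \
            * np.exp(-mu_W[:, k]**2 / (2 * s2_W[:, k]))
        psi[:, k] = 1 - 1 / (1 + odds)                           # probability W_{pk} = 0
    return mu_Z, S_shared, mu_W, s2_W, psi
```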

Appendix F Additional PCA runs

To ensure that the results presented in the main text are not unusual, I generated five additional datasets as described in the main text and compared the resulting PCA projections and sparsity of the loadings for traditional PCA (based on the singular value decomposition), my naive implementation of sparse probabilistic PCA, and the implementation of sparse probabilistic PCA based on the non-overlapping mixtures trick (Figures 4 and 5). In all five realizations, the new formulation of sparse probabilistic PCA produces the sparsest loadings and subjectively best separates the four clusters using the first two principal components. As before, the naive implementation is indistinguishable from traditional PCA for small values of $\sigma_{0}^{2}$ or values of $\sigma_{0}^{2}$ close to 1.

Figure 4: Projections onto the first two PCs for five replicate simulations
Across all five replicate simulations the data cluster well in their projection onto the first two PCs for the new VI scheme. While the naive scheme can cluster the data well if $\sigma^{2}_{0}$ is tuned properly, the clusters are often not as well-defined as under the new scheme. Furthermore, the loadings are substantially less sparse, as shown in Figure 5. In the limit of $\sigma_{0}^{2}\downarrow 0$, it empirically appears that the naive scheme is indistinguishable from classical PCA.
Figure 5: Distribution of loadings for five replicate simulations
The new VI scheme produces significantly sparser loadings, with about 90% of variables having an absolute loading below $1\times 10^{-5}$ for both PC1 and PC2. The naive scheme, while having a more skewed distribution of loadings than classical PCA, can hardly be considered sparse, with most variables having loadings greater than $1\times 10^{-3}$ regardless of the precise value of $\sigma^{2}_{0}$ used. Note that in both plots, the right-most cluster of curves is over-plotted: for PC1 the naive scheme with $\sigma^{2}_{0}=0.005$ is essentially indistinguishable from classical PCA for all five replicates. For PC2, the naive scheme with $\sigma^{2}_{0}=0.005$ or $\sigma^{2}_{0}=0.01$ is indistinguishable from classical PCA in all but one replicate.

I also computed reconstruction error as a quantitative measure of performance. I defined reconstruction error as the squared Frobenius norm between the reconstructed matrix ($\mathbb{E}[\mathbf{W}\mathbf{Z}]$ for the VI methods) and the signal in the simulated matrix – that is, the matrix obtained by centering and scaling a matrix where each entry is the $\mu_{c}$ for that entry as defined above. The mean reconstruction error across five simulations was 4261 (min: 4072, max: 4517) for the non-overlapping mixtures trick; 3985 (min: 3923, max: 4240) for oracle PCA; and 29407 (min: 28761, max: 30429) for classical PCA. Across the naive schemes, taking $\sigma^{2}_{0}=0.05$ performed best, with a mean reconstruction error of 7191 (min: 7090, max: 7408). Overall, the non-overlapping mixtures trick performed only slightly worse than PCA using knowledge of which variables were non-zero, whereas even the best naive scheme had almost double the reconstruction error; all methods that attempted to account for sparsity outperformed classical PCA.
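For completeness, the reconstruction error described above amounts to a single matrix product and a Frobenius norm; the following is a minimal sketch, assuming the same array conventions as the sketch in Appendix E and a `signal` matrix holding the centered and scaled noiseless entries (names are illustrative).

```python
import numpy as np

def reconstruction_error(signal, mu_W, psi, mu_Z):
    """Squared Frobenius norm between the variational reconstruction E[W]E[Z]^T and the
    noiseless signal; both are N x P here, with observations stored as rows."""
    reconstruction = mu_Z @ (mu_W * (1 - psi)).T  # (N x K) @ (K x P) = N x P
    return np.linalg.norm(signal - reconstruction, ord="fro") ** 2
```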