
Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness

Dazhong Shen1,2    Chuan Qin2    Chao Wang1,2    Hengshu Zhu2,∗    Enhong Chen1    Hui Xiong3,∗
∗This work was done when Dazhong Shen was an intern at the Talent Intelligence Center, Baidu Inc. Hui Xiong and Hengshu Zhu are the corresponding authors.
1School of Computer Science and Technology, University of Science and Technology of China
2Baidu Talent Intelligence Center
3Rutgers, The State University of New Jersey
sdz@mail.ustc.edu.cn, chuanqin0426@gmail.com, wdyx2012@mail.ustc.edu.cn, zhuhengshu@baidu.com, cheneh@ustc.edu.cn, hxiong@rutgers.edu
Abstract

As one of the most popular generative models, Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference. However, when the decoder network is sufficiently expressive, VAE may suffer from posterior collapse; that is, uninformative latent representations may be learned. To this end, in this paper, we propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space, so that the representation can be learned in a meaningful and compact manner. Specifically, we first theoretically demonstrate that controlling the distribution of the posterior's parameters across the whole dataset leads to a better latent space with high diversity and low uncertainty. Then, without introducing new loss terms or modifying training strategies, we propose to exploit Dropout on the variances and Batch-Normalization on the means simultaneously to regularize their distributions implicitly. Furthermore, to evaluate the generalization effect, we also apply DU-VAE empirically to the inverse autoregressive flow-based VAE (VAE-IAF). Finally, extensive experiments on three benchmark datasets clearly show that our approach can outperform state-of-the-art baselines on both likelihood estimation and downstream classification tasks.

1 Introduction

Recent years have witnessed the great success of Variational Autoencoder (VAE) Kingma and Welling (2013) as a generative model for representation learning, which has been widely exploited in various challenging domains, such as natural language modeling and image processing Bowman et al. (2015b); Pu et al. (2016). Indeed, VAE models the generative process of observed data by defining a joint distribution with a latent space, and approximates the posterior of latent variables based on amortized variational inference. While the use of VAE has been well-recognized, it may lead to uninformative latent representations, particularly when expressive and powerful decoder networks are employed, such as LSTMs Hochreiter and Schmidhuber (1997) on text or PixelCNN Van den Oord et al. (2016) on images. This is widely known as the posterior collapse phenomenon Zhao et al. (2019). In other words, the model may fail to diversify the posteriors of different data, simply using a single posterior distribution to model all data instances. Also, the traditional VAE model usually produces representations with redundant information due to the lack of guidance for characterizing the posterior space Bowman et al. (2015a); Chen et al. (2017). Therefore, the learned representation of VAE often yields unsatisfactory performance on downstream tasks, such as classification, even if it can approximate the marginal likelihood of observed data very well.

In the literature, tremendous efforts have been made to improve the representation learning of VAE and alleviate the problem of posterior collapse. One thread of these works attributes posterior collapse to the optimization challenges of VAEs and designs various strategies, including KL annealing Bowman et al. (2015a); Fu et al. (2019), Free-Bits (FB) Kingma et al. (2016), aggressive training He et al. (2018), encoder network pretraining, and decoder network weakening Yang et al. (2017). Among them, BN-VAE Zhu et al. (2020) applies Batch-Normalization (BN) Ioffe and Szegedy (2015) to ensure a positive lower bound of the KL term. However, the theoretical basis of the effectiveness of BN on latent space learning is not yet understood, and further explanations based on a geometric analysis of the latent space are needed. Other studies attempt to modify the objective carefully to direct the latent space learning Makhzani et al. (2016); Zheng et al. (2019). One feasible direction is to add an additional Mutual Information (MI)-based term to enhance the relation between data and latent space. However, due to its intractability, additional designs are always required for approximating MI-based objectives Fang et al. (2019); Zhao et al. (2019). Recently, Mutual Posterior-Divergence (MPD) Ma et al. (2018) was introduced to measure the diversity of the latent space; it is analytic and has a similar goal to MI. However, the scales of MPD and the original objective are unbalanced, which requires deliberate normalization.

In this paper, to improve the representation learning performance of VAE, we propose a novel generative model, DU-VAE, for learning a more Diverse and less Uncertain latent space, which ensures that the representation can be learned in a meaningful and compact manner. To be specific, we first analyze the desired latent space theoretically from two geometric properties, diversity and uncertainty, based on the MPD and Conditional Entropy (CE) metrics, respectively. We demonstrate that controlling the distribution of the posterior's parameters across the whole dataset leads to a better latent space with high diversity and low uncertainty. Then, instead of introducing new loss terms or modifying training strategies, we propose to apply Dropout Srivastava et al. (2014) on the variances and Batch-Normalization on the means simultaneously to regularize their distributions implicitly. In particular, we also discuss and prove the effectiveness of the two regularizations rigorously. Furthermore, to verify the generalization of our approach, we demonstrate that DU-VAE can be extended empirically to VAE-IAF Kingma et al. (2016), a well-known normalizing flow-based VAE. Finally, extensive experiments have been conducted on three benchmark datasets, and the results clearly show that our approach can outperform state-of-the-art baselines on both likelihood estimation and classification tasks. Code and data are available at https://github.com/SmilesDZgk/DU-VAE.

2 Background of VAE

Given the input space $x\in\mathcal{X}$, VAE aims to construct a smooth latent space $z\in\mathcal{Z}$ by learning a generative model $p(x,z)$. Starting from a prior distribution $p(z)$, such as the standard multivariate Gaussian $\mathcal{N}(0,I)$, VAE generates data with a complex conditional distribution $p_{\theta}(x|z)$ parameterized by a neural network $f_{\theta}(\cdot)$. The goal of model training is to maximize the marginal likelihood $E_{p_{\mathcal{D}}(x)}[\log p_{\theta}(x)]$, where $p_{\mathcal{D}}(x)$ is the true underlying data distribution. Since this marginal likelihood is intractable, an amortized inference distribution $q_{\phi}(z|x)$ parameterized by a neural network $f_{\phi}(\cdot)$ is utilized to approximate the true posterior. The model then optimizes the following lower bound:

$$\mathcal{L}=E_{p_{\mathcal{D}}(x)}\big[E_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]-D_{KL}[q_{\phi}(z|x)\,||\,p(z)]\big],\tag{1}$$

where the first term is the reconstruction loss and the second one is the Kullback-Leibler (KL) divergence between the approximated posterior and prior.
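For concreteness, the ELBO in Equation 1 for a diagonal-Gaussian posterior and a standard Gaussian prior can be estimated as in the following PyTorch-style sketch; the `encoder` and `decoder` interfaces are hypothetical placeholders, not the paper's released implementation.

```python
import torch

def elbo(x, encoder, decoder):
    """Single-sample Monte Carlo estimate of the ELBO in Equation 1.

    Assumed interfaces: `encoder(x)` returns the mean and log-variance of
    q_phi(z|x); `decoder(z, x)` returns log p_theta(x|z) per example.
    """
    mu, logvar = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    rec = decoder(z, x)  # one-sample estimate of E_q[log p_theta(x|z)]
    # Analytic KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
    return (rec - kl).mean()  # average over the mini-batch
```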

Unfortunately, in practice, VAE may fail to capture meaningful representations. In particular, when applying auto-regressive models as the decoder network, such as LSTMs or PixelCNN, the decoder is likely to model the data marginal distribution $p_{\mathcal{D}}(x)$ very well even without the latent variable $z$, i.e., $p(x|z)=\Pi_{i}p(x_{i}|x_{<i})$. In this case, VAE degenerates to an auto-regressive model, and the latent variable $z$ tends to be independent of the data $x$. Meanwhile, since the ELBO objective minimizes $D_{KL}[q(z|x)||p(z)]$, $q(z|x)$ collapses to $p(z)$, i.e., $q(z|x_{i})=q(z|x_{j})=p(z)$, $\forall x_{i},x_{j}\in\mathcal{X}$. To solve this problem, we will direct the latent space learning carefully and purposefully toward high diversity and low uncertainty in the following.

3 The Proposed Method

Here, we start with a theoretical analysis of the latent space of VAE based on two geometric properties: diversity and uncertainty. Then, we design Dropout on the variance parameters and Batch-Normalization on the mean parameters to encourage a latent space with high diversity and low uncertainty. In particular, the effectiveness of our approach will be discussed and proved rigorously. Finally, we extend DU-VAE to VAE-IAF Kingma et al. (2016) empirically.

3.1 Geometric Properties of Latent Space

To enable meaningful and compact representation learning in the VAE model, we have two intuitions: 1) for different data samples $x_{1},x_{2}$, the posteriors $q(z_{1}|x_{1})$ and $q(z_{2}|x_{2})$ should mutually diversify from each other, which encourages the posteriors to capture characteristic or discriminative information from the data; 2) given a data sample $x$, the degree of uncertainty of the latent variable $z$ should be minimized, which encourages removing redundant information from $z$. Guided by these intuitions, we first analyze the diversity and uncertainty of the latent space under quantitative metrics, respectively.

3.1.1 Diversity of Latent Space.

Here, we attempt to measure the divergence among the posterior distribution family. One intuitive and reasonable metric is the expectation of the mutual divergence between a pair of posteriors. Following this idea, Ma et al. (2018) proposed the mutual posterior diversity (MPD) to measure the diversity of posteriors, which can be computed by:

$$MPD_{p_{\mathcal{D}}(x)}[z]=E_{p_{\mathcal{D}}(x)}\big[D_{SKL}[q_{\phi}(z_{1}|x_{1})\,||\,q_{\phi}(z_{2}|x_{2})]\big],\tag{2}$$

where $x_{1},x_{2}\sim p_{\mathcal{D}}(x)$ are i.i.d. and $D_{SKL}[q_{1}||q_{2}]$ is the symmetric KL divergence, defined as the mean of $D_{KL}[q_{1}||q_{2}]$ and $D_{KL}[q_{2}||q_{1}]$, which is analytical for Gaussian distributions. Specifically, we have:

$$2\,MPD_{p_{\mathcal{D}}(x)}[z]=\sum_{d=1}^{n}E_{p_{\mathcal{D}}(x)}\Big[\frac{(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}}{\delta_{x_{1},d}^{2}}\Big]+\sum_{d=1}^{n}E_{p_{\mathcal{D}}(x)}[\delta_{x,d}^{2}]\,E_{p_{\mathcal{D}}(x)}\Big[\frac{1}{\delta_{x,d}^{2}}\Big]-1.\tag{3}$$

Interestingly, if the value of $\delta^{2}_{x,d}$ is upper bounded, e.g., by 1 as in most practical cases for VAEs, then MPD has a strict lower bound proportional to $\sum_{d=1}^{n}Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}]$ (see Supplementary).
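As an illustration of how MPD can be estimated within a mini-batch, the following sketch averages the symmetric KL divergence of Equation 2 over all ordered pairs of posteriors; `mu` and `var` stand for the batch of posterior means and variances, an assumed interface.

```python
import torch

def mpd_estimate(mu, var):
    """Monte-Carlo estimate of MPD (Equation 2) over a mini-batch.

    mu, var: [B, n] posterior means and variances for B data samples.
    """
    B = mu.size(0)
    mu1, mu2 = mu.unsqueeze(1), mu.unsqueeze(0)  # broadcast to [B, B, n]
    v1, v2 = var.unsqueeze(1), var.unsqueeze(0)
    # 4 * D_SKL between diagonal Gaussians, summed over the n dimensions
    skl = 0.25 * ((mu1 - mu2).pow(2) * (1 / v1 + 1 / v2)
                  + v1 / v2 + v2 / v1 - 2).sum(dim=-1)
    # Average over distinct pairs (the diagonal contributes exactly zero)
    return skl.sum() / (B * (B - 1))
```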

3.1.2 Uncertainty of Latent Space.

Here, we aim at quantifying the uncertainty about the outcome of the latent variable $z$ given a data sample $x$ and the learned encoder distribution $q_{\phi}(z|x)$. In information theory, conditional entropy measures the average level of uncertainty inherent in a variable's possible outcomes given another variable. With the same goal, we follow this idea and use the Conditional Entropy (CE) $H_{q_{\phi}}(z|x)$ of $z$ conditioned on $x$ to measure the uncertainty of the latent space:

$$H_{q_{\phi}}(z|x)=E_{p_{\mathcal{D}}(x)}[H(q_{\phi}(z|x))],\tag{4}$$

where $H(q_{\phi}(z|x))$ denotes the differential entropy of the posterior $q_{\phi}(z|x)$. Since $H(q_{\phi}(z|x))$ can be computed analytically as $\sum_{d=1}^{n}\frac{1}{2}\log(2\pi e\delta^{2}_{x,d})$, we have:

$$H_{q_{\phi}}(z|x)=\frac{n}{2}\log 2\pi e+\frac{1}{2}\sum_{d=1}^{n}E_{p_{\mathcal{D}}(x)}[\log\delta^{2}_{x,d}].\tag{5}$$

Intuitively, in order to reduce the uncertainty in the latent space, we need to minimize the conditional entropy $H_{q_{\phi}}(z|x)$. However, differential entropies $H(q_{\phi}(z|x))$ defined on continuous spaces are not bounded from below. That is, the variance $\delta^{2}_{x,d}$ can be scaled arbitrarily small, achieving arbitrarily large negative entropy. As a result, the optimization trajectories will invariably end with garbage networks as activations approach zero or infinity. To solve this problem, we enforce non-negative differential entropy by adding noise to the latent variable: we replace each latent variable $z$ with $z+\epsilon$, where $\epsilon\sim\mathcal{N}(0,\alpha)$ is a zero-entropy noise with the constant $\alpha=\frac{1}{2\pi e}$ chosen for convenience. Then, based on the properties of Gaussian distributions, we have $H(q_{\phi}(z|x))>H(\epsilon)=0$ and $\delta^{2}_{x,d}>\alpha$.
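A small sketch of Equation 5 with this zero-entropy noise applied is given below, assuming `var` holds the per-sample posterior variances; the constant follows α = 1/(2πe) from the text.

```python
import math
import torch

ALPHA = 1.0 / (2 * math.pi * math.e)  # variance of the zero-entropy noise

def conditional_entropy(var):
    """CE of Equation 5 over a mini-batch, with noise N(0, ALPHA) added to z.

    var: [B, n] posterior variances; the effective variance var + ALPHA keeps
    every per-dimension differential entropy term non-negative.
    """
    per_dim = 0.5 * (math.log(2 * math.pi * math.e) + (var + ALPHA).log())
    return per_dim.sum(dim=-1).mean()  # = n/2 * log(2*pi*e) + 0.5 * sum_d E[log var]
```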

In sum, in order to encourage diversity and decrease uncertainty in the latent space, we need to constrain both the MPD in Equation 3 and the CE in Equation 4. One feasible solution is to regard them as additional explicit objectives and approximate them by Monte Carlo estimation in each mini-batch. However, the scales of the different objective terms are unbalanced, which requires deliberately designed normalization or careful tuning of weight parameters Ma et al. (2018).

Instead, we propose to control the MPD and CE implicitly, without modifying the objective function. Based on Equation 3 and Equation 4, we note that both MPD and CE depend only on the approximated posterior's parameters, i.e., $\mu_{x,d}$ and $\delta_{x,d}$. This inspires us to select proper regularization on the distribution of the posterior's parameters to encourage higher MPD and lower CE. Specifically, in the following two sub-sections, we introduce the application of Dropout on the variance parameters and Batch-Normalization on the mean parameters, respectively, and provide a theoretical analysis of the effectiveness of our approach.

3.2 Dropout on Variance Parameters

In order to encourage high diversity and low uncertainty of the latent space, we need to increase the MPD in Equation 3 and decrease the CE in Equation 5 simultaneously. Meanwhile, we also need to prevent $E_{p_{\mathcal{D}}(x)}[\delta^{2}_{x,d}]$ from becoming too small, so as to ensure the smoothness of the latent space. In the extreme case where $E_{p_{\mathcal{D}}(x)}[\delta^{2}_{x,d}]$ converges to 0, i.e., $\delta^{2}_{x,d}\approx 0,\forall x,d$, each data point is associated with a delta distribution in latent space and the VAE degenerates into an Autoencoder in this dimension. To accomplish these requirements together, we propose to apply Dropout Srivastava et al. (2014) to regularize the posterior's variance parameters during training as follows,

$$\hat{\delta}^{2}_{x,d}=g_{x,d}(\delta^{2}_{x,d}-\alpha)+\alpha,\tag{6}$$

where $g_{x,d}$ denotes an independent random variable drawn from the normalized Bernoulli distribution $\frac{1}{p}B(1,p),~p\in(0,1)$, so that $E_{B}[g_{x,d}]=1$. Then, we have the following proposition (see Supplementary for the proof):

Proposition 1.

Given the Dropout strategy defined in Equation 6, we have:

$$E_{p_{\mathcal{D}}(x)\cdot B}[\hat{\delta}^{2}_{x,d}]=E_{p_{\mathcal{D}}(x)}[\delta^{2}_{x,d}],\quad MPD_{p_{\mathcal{D}}(x)\cdot B}[z]>MPD_{p_{\mathcal{D}}(x)}[z],\quad H_{q_{\phi}\cdot B}(z|x)<H_{q_{\phi}}(z|x),\tag{7}$$

where both inequalities are strict, and the gaps between the two sides grow as $p$ decreases toward 0. Then, we also have:

$$MPD_{p_{\mathcal{D}}(x)\cdot B}[z]>\frac{1-p}{\alpha}\sum_{d=1}^{n}Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}].\tag{8}$$

Proposition 1 tells us that: 1) Dropout regularization encourages an increase of $MPD_{p_{\mathcal{D}}(x)}[z]$ and a decrease of the conditional entropy $H_{q_{\phi}}(z|x)$ of the latent space while preserving the expectation of the variance parameters, which is exactly the simple but useful property we need. 2) Dropout regularization also provides a lower bound of $MPD_{p_{\mathcal{D}}(x)}[z]$ that is independent of the variance parameters, which makes it possible to ensure a positive MPD by further controlling the variance $\sum_{d=1}^{n}Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}]$.
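In code, the variance Dropout of Equation 6 amounts to the following sketch (the `training` flag and parameter names are illustrative assumptions):

```python
import math
import torch

ALPHA = 1.0 / (2 * math.pi * math.e)

def variance_dropout(var, p, training=True):
    """Equation 6: var_hat = g * (var - ALPHA) + ALPHA with g ~ Bernoulli(p)/p.

    E[g] = 1, so E[var_hat] = E[var]; smaller p raises MPD and lowers CE
    (Proposition 1). At evaluation time the variances are left untouched.
    """
    if not training or p >= 1.0:
        return var
    g = torch.bernoulli(torch.full_like(var, p)) / p  # normalized Bernoulli mask
    return g * (var - ALPHA) + ALPHA
```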

3.3 Batch-Normalization on Mean Parameters

We are inspired by Batch-Normalization (BN) Ioffe and Szegedy (2015), an effective approach to controlling the distribution of the output of a neural network layer. Specifically, we apply BN on the mean parameters $\mu_{x,d}$ to constrain $\sum_{d=1}^{n}Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}]$. Mathematically, our BN is defined as:

$$\hat{\mu}_{x,d}=\gamma_{\mu_{d}}\frac{\mu_{x,d}-\mu_{\mathcal{B}_{d}}}{\delta_{\mathcal{B}_{d}}}+\beta_{\mu_{d}},\tag{9}$$

where $\hat{\mu}_{x,d}$ represents the output of the BN layer, and $\mu_{\mathcal{B}_{d}}$ and $\delta_{\mathcal{B}_{d}}$ denote the mean and standard deviation of $\mu_{x,d}$ estimated within each mini-batch. $\gamma_{\mu_{d}}$ and $\beta_{\mu_{d}}$ are the scale and shift parameters, which ensure that the distribution of $\hat{\mu}_{x,d}$ has variance $\gamma^{2}_{\mu_{d}}$ and mean $\beta_{\mu_{d}}$. Therefore, we can control $\sum_{d=1}^{n}Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}]$ by fixing the mean $E_{d}[\gamma^{2}_{\mu_{d}}]=\gamma^{2}$ over the dimensions $d$. Specifically, we regard each $\gamma_{\mu_{d}}$ as a learnable parameter initialized to $\gamma$. After each training iteration, we re-scale each parameter $\gamma_{\mu_{d}}$ by the coefficient $\gamma/\sqrt{E_{d}[\gamma^{2}_{\mu_{d}}]}$. In addition, all $\beta_{\mu_{d}}$ are learnable, initialized to 0, and unconstrained.
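One possible realization of this constrained BN layer uses `nn.BatchNorm1d` for Equation 9 plus an explicit re-scaling step for the constraint $E_{d}[\gamma^{2}_{\mu_{d}}]=\gamma^{2}$; the class name and interface below are our own illustration, not the released code.

```python
import torch
import torch.nn as nn

class MeanBN(nn.Module):
    """Batch-Normalization on posterior means (Equation 9) with E_d[gamma_d^2] fixed."""

    def __init__(self, n_latent, gamma=0.5):
        super().__init__()
        self.gamma = gamma
        self.bn = nn.BatchNorm1d(n_latent, affine=True)
        nn.init.constant_(self.bn.weight, gamma)  # scale parameters gamma_{mu_d}
        nn.init.zeros_(self.bn.bias)              # shift parameters beta_{mu_d}

    def forward(self, mu):
        return self.bn(mu)

    @torch.no_grad()
    def rescale(self):
        """Call after each optimizer step: keep E_d[gamma_d^2] equal to gamma^2."""
        self.bn.weight.mul_(self.gamma / self.bn.weight.pow(2).mean().sqrt())
```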

Overall, based on the above analysis, we propose our approach, namely DU-VAE, which encourages high diversity and low uncertainty of the latent space by simultaneously applying Dropout regularization on the variance parameters and Batch-Normalization on the mean parameters of the approximated posteriors. Specifically, the training of DU-VAE follows Algorithm 1.

Algorithm 1 Training Procedure of DU-VAE
1:  Initialize $\phi$, $\theta$, $\gamma_{\mu}=\gamma$, and $\beta_{\mu}=0$
2:  while not converged do
3:     Sample a mini-batch $x$
4:     $\mu_{x},\delta_{x}^{2}=f_{\phi}(x)$.
5:     $\hat{\mu}_{x}=BN_{\gamma_{\mu},\beta_{\mu}}(\mu_{x})$, $\hat{\delta}^{2}_{x}=Dropout_{p}(\delta_{x}^{2})$.
6:     Sample $z\sim\mathcal{N}(\hat{\mu}_{x},\hat{\delta}_{x}^{2})$ and generate $x$ from $f_{\theta}(z)$.
7:     Compute gradients $g_{\phi,\theta}\leftarrow\nabla_{\phi,\theta}\mathcal{L}_{ELBO}(x;\phi,\theta)$.
8:     Update $\phi,\theta$, $\gamma_{\mu}$, $\beta_{\mu}$ according to $g_{\phi,\theta}$.
9:     $\gamma_{\mu}=\frac{\gamma}{\sqrt{E[\gamma^{2}_{\mu}]}}\odot\gamma_{\mu}$
10:  end while
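Putting the pieces together, one training iteration of Algorithm 1 could look like the sketch below; it reuses the `variance_dropout` and `MeanBN` sketches above, and the `encoder`/`decoder` interfaces remain hypothetical.

```python
import torch

def duvae_train_step(x, encoder, decoder, mean_bn, optimizer, p=0.8):
    """One DU-VAE update following Algorithm 1 (lines 3-9)."""
    mu, var = encoder(x)                                      # line 4
    mu_hat = mean_bn(mu)                                      # line 5: BN on means
    var_hat = variance_dropout(var, p)                        # line 5: Dropout on variances
    z = mu_hat + var_hat.sqrt() * torch.randn_like(mu_hat)    # line 6: sample z
    rec = decoder(z, x)                                       # log p_theta(x|z)
    kl = 0.5 * (mu_hat.pow(2) + var_hat - var_hat.log() - 1).sum(dim=-1)
    loss = -(rec - kl).mean()                                 # negative ELBO (Equation 1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # lines 7-8: update phi, theta, gamma, beta
    mean_bn.rescale()                                         # line 9: re-scale gamma_mu
    return loss.item()
```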

Connections with BN-VAE. In the literature, BN-VAE Zhu et al. (2020) also applies BN on the mean parameters. Zhu et al. claim that keeping a positive lower bound of the KL term, i.e., of the expectation of the square of the mean parameters $\sum_{d=1}^{n}E_{q_{\phi}}[\mu_{x,d}^{2}]$, is sufficient for preventing posterior collapse. In practice, they ensure $E_{q_{\phi}}[\mu_{x,d}^{2}]>0$ by fixing the scale parameter $\gamma_{\mu_{d}}$ of BN for each dimension $d$. Here, however, we will demonstrate that keeping a positive lower bound of MPD is a more powerful strategy for preventing posterior collapse. As discussed in Section 2, when posterior collapse occurs, we have $q(z|x_{i})=q(z|x_{j})=p(z)$, $\forall x_{i},x_{j}\in\mathcal{X}$. Therefore, to avoid this phenomenon, we need to control the posterior distributions carefully so that:

$$q(z|x)\neq p(z),~\exists~x\in\mathcal{X};\qquad q(z|x_{i})\neq q(z|x_{j}),~\exists~x_{i},x_{j}\in\mathcal{X}.\tag{10}$$

Since the first condition is implied by the second as a necessary condition, and since $D_{KL}[q(z|x)||p(z)]>0$ is equivalent (both sufficient and necessary) to the first condition, keeping a positive lower bound of the KL term is not sufficient for the second condition, and several abnormal cases can arise (a detailed analysis can be found in the Supplementary). By contrast, keeping a positive MPD in the latent space is an equivalent condition for the second condition, which in turn implies the first. Moreover, from the perspective of the diversity of the latent space, we can provide another possible explanation for the effectiveness of BN-VAE: the application of BN on $\mu_{x}$ ensures a positive value of $Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}]$ for each $d$, which is also a lower bound of the MPD defined in Equation 2 whenever the variance parameters have a constant upper bound, e.g., 1 in practice.

3.4 Extension to VAE-IAF

Here, to further examine the generalization of DU-VAE, we aim to extend our approach to other VAE variants, such as VAE-IAF Kingma et al. (2016), a well-known normalizing flow-based VAE. Different from classic VAEs, which assume that the posterior distributions are diagonal Gaussians, VAE-IAF constructs more flexible posterior distributions by applying a chain of invertible transformations, named the IAF chain, to an initial random variable drawn from a diagonal Gaussian distribution. Specifically, the initial random variable $z^{0}$ is sampled from the diagonal Gaussian with parameters $\mu^{0}$ and $\delta^{0}$ output by the encoder network. Then, $T$ invertible transformations are applied to transform $z^{0}$ into the final random variable $z^{T}$. More details can be found in Kingma et al. (2016).

Indeed, noting that the MPD and CE of the initial random variable $z^{0}$ have the same form as those of classic VAEs in Equation 2 and Equation 4, one intuitive idea is to apply Dropout on $\delta^{0}$ and Batch-Normalization on $\mu^{0}$, following Algorithm 1, to control the MPD and CE of $z^{0}$. Perhaps surprisingly, this simple extension of DU-VAE, called DU-IAF, demonstrated competitive performance in our experiments. This may be attributed to the close connection between $z^{0}$ and $z^{T}$. In particular, we find that the CE of $z^{0}$ is an upper bound of the CE of $z^{T}$. Meanwhile, $MPD_{p_{\mathcal{D}}(x)}[z^{0}]$ is closely related to $MPD_{p_{\mathcal{D}}(x)}[z^{T}]$; the two are even equal when each invertible transformation in the IAF chain is independent of the input data. Further discussion and proofs can be found in the Supplementary.
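As a sketch of this extension (our own illustration, under the assumption that each IAF block exposes a `(z, h) -> (z_next, log_det)` interface), DU-IAF simply regularizes $(\mu^{0},(\delta^{0})^{2})$ before running the chain; it reuses `variance_dropout` and `MeanBN` from above.

```python
import torch

def duiaf_posterior_sample(x, encoder, iaf_chain, mean_bn, p=0.7):
    """Draw z^T with BN on mu^0 and Dropout on (delta^0)^2, then apply T IAF blocks."""
    mu0, var0, h = encoder(x)               # encoder also emits the context h
    mu0 = mean_bn(mu0)                      # Batch-Normalization on the initial means
    var0 = variance_dropout(var0, p)        # Dropout on the initial variances
    z = mu0 + var0.sqrt() * torch.randn_like(mu0)
    log_det = torch.zeros(z.size(0), device=z.device)
    for block in iaf_chain:                 # Equation S22, applied T times
        z, ld = block(z, h)
        log_det = log_det + ld
    return z, log_det
```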

Yahoo Yelp OMNIGLOT
Model NLL KL MI AU NLL KL MI AU NLL KL MI AU
VAE 328.5 0.0 0.0 5.0 357.5 0.0 0.0 0.0 89.21 2.20 2.16 5.0
$\beta$-VAE* (0.4/ 0.4/ 0.8) 328.7 6.4 6.0 13.0 357.4 5.8 5.6 4.0 89.15 9.98 3.84 13.0
SA-VAE* 327.2 5.2 2.7 8.6 355.9 2.8 1.7 8.4 89.07 3.32 2.63 8.6
Agg-VAE 326.7 5.7 2.9 6.0 355.9 3.8 2.4 11.3 89.04 2.48 2.50 6.0
FB (0.1) 328.1 3.4 2.5 32.0 357.1 4.8 2.5 32.0 89.17 7.98 6.87 32.0
$\delta$-VAE (0.1) 329.0 3.2 0.0 2.0 357.6 3.2 0.0 0.0 89.62 3.20 2.36 2.0
BN-VAE (0.6/ 0.6/ 0.5) 326.9 8.3 7.0 32.0 355.7 6.0 5.2 32.0 89.26 4.34 4.03 32.0
MAE (1/ 2/ 0.5, 0.2/ 0.2/ 0.2) 332.1 5.8 3.5 28.0 362.8 8.0 4.6 32.0 89.62 15.61 8.90 32.0
DU-VAE (0.5, 0.9) 327.0 5.2 4.3 18.0 355.6 5.3 4.9 18.0 89.00 6.63 5.97 19.0
DU-VAE (0.5, 0.8) 327.0 6.7 6.0 19.0 355.5 6.8 5.9 18.0 89.04 7.46 6.31 32.0
DU-VAE (0.6, 0.8) 326.7 8.7 7.2 28.0 355.8 9.6 7.7 23.0 89.18 10.99 8.22 32.0
IAF+FB (0.15/0.25/0.15) 328.4 5.2 - - 357.1 7.7 - - 88.98 6.77 - -
IAF+BN (0.6/0.7/0.5) 328.1 0.2 - - 356.6 0.6 - - 89.32 1.30 - -
DU-IAF (0.7/0.6/0.5, 0.70/0.70/0.85) 327.4 5.4 - - 356.1 5.1 - - 88.97 6.77 - -
Table 1: The performance on likelihood estimation. Due to the intractability of the MI and AU metrics for IAF-based models, we only report NLL and KL, as in Kingma et al. (2016). * indicates that the results are taken from He et al. (2018). Hyper-parameters are reported in brackets, separated by slashes if they differ across datasets.

4 Experiments

In this section, we evaluate our method on three benchmark datasets in terms of various metrics and tasks. The complete experimental setup can be found in the Supplementary.

4.1 Experimental Setup

Setting. Following the same configuration as He et al. (2018), we evaluated our method on two text benchmark datasets, i.e., the Yahoo and Yelp corpora Yang et al. (2017), and one image benchmark dataset, i.e., OMNIGLOT Lake et al. (2015). For the text datasets, we utilized a single-layer LSTM as both the encoder and decoder networks, where the initial state of the decoder is projected from the latent variable $z$. For images, a 3-layer ResNet He et al. (2016) encoder and a 13-layer Gated PixelCNN Van den Oord et al. (2016) decoder are applied. We set the dimension of $z$ to 32 and utilized SGD to optimize the ELBO objective for text and Adam Kingma and Ba (2015) for images. Following Burda et al. (2016), we utilized dynamically binarized images for training and the fixed binarization as test data. Meanwhile, following Bowman et al. (2015a), we applied a linear annealing strategy that increases the KL weight from 0 to 1 over the first 10 epochs where applicable.

Evaluation Metrics. Following Burda et al. (2016), we computed the approximate negative log-likelihood (NLL) with 500 importance-weighted samples. In addition, we also considered the value of the KL term, the mutual information (MI) $I(x,z)$ Alemi et al. (2016) under the joint distribution $q(x,z)$, and the number of active units (AU) He et al. (2018) as additional metrics. In particular, the activity of each dimension $z_{d}$ is measured as $A_{z,d}=Cov_{x}(E_{z_{d}\sim q(z_{d}|x)}[z_{d}])$, and a dimension is regarded as active when $A_{z,d}>0.01$.
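A small sketch of the AU metric under our reading of this definition (the activity of a dimension is the variance, over the data, of its posterior mean):

```python
import torch

def active_units(mu, threshold=0.01):
    """Count active latent dimensions, following He et al. (2018).

    mu: [N, n] posterior means E_q[z_d] collected over a held-out set; a unit is
    active if its activity A_{z,d} = Cov_x(E_q[z_d]) exceeds the threshold.
    """
    activity = mu.var(dim=0, unbiased=False)   # per-dimension variance over the data
    return int((activity > threshold).sum().item())
```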

Baselines. We compare our method with various VAE-based models, which can be grouped into two categories. 1) Classic VAEs: VAE with annealing Bowman et al. (2015a); Semi-Amortized VAE (SA-VAE) Kim et al. (2018); Agg-VAE He et al. (2018); $\beta$-VAE Higgins et al. (2017) with parameter $\beta$ re-weighting the KL term; FB Kingma et al. (2016) with parameter $\lambda$ constraining the minimum of the KL term in each dimension; $\delta$-VAE Razavi et al. (2018) with parameter $\delta$ constraining the range of the KL term; BN-VAE Zhu et al. (2020) with parameter $\gamma$ keeping the KL term positive; MAE Ma et al. (2018) with parameters $\gamma$ and $\eta$ controlling the diversity and smoothness of the latent space. Note that we implemented MAE with the standard Gaussian prior, instead of the AF prior in Ma et al. (2018), for a fair comparison. 2) IAF-based models: IAF+FB Kingma et al. (2016), which utilizes the FB strategy with parameter $\lambda$ to avoid posterior collapse in VAE-IAF; IAF+BN, where we apply BN regularization on the mean parameters of the distribution of $z^{0}$ with the scale parameters $\gamma$ fixed in each dimension.

4.2 Overall Performance

Log-Likelihood Estimation. Table 1 shows the results in terms of log-likelihood estimation. We note that DU-VAE and DU-IAF achieve the best NLL among classic VAEs and IAF-based VAEs, respectively, on all datasets. Besides, we also have some interesting findings. First, MAE does not perform well on any dataset, which may be caused by the difficulty of balancing the additional training objective terms and the ELBO. Second, although Agg-VAE and SA-VAE also reach strong NLL on both text datasets, they require an additional training procedure on the inference network, leading to a high training time cost Zhu et al. (2020). Third, BN-VAE also achieves competitive performance on the text datasets. However, for images, where posterior collapse may be less of an issue Kim et al. (2018), BN-VAE fails to catch up with the other models and is even worse than the basic VAE on NLL. Fourth, DU-VAE tends to capture higher KL and MI than BN-VAE with the same scale parameter $\gamma$. In other words, DU-VAE converts more information from the observed data into the latent variable. Fifth, based on the results of IAF+BN, we find that the BN strategy used in BN-VAE cannot prevent posterior collapse in VAE-IAF, yielding small KL. By contrast, our approach can be easily extended to VAE-IAF with the best performance. Finally, we also note that IAF-based models may be more suitable for the image dataset and less sound on text, while DU-IAF nevertheless achieves competitive performance.

#label 100 500 1k 2k 10k
AE 84.05 86.82 87.93 88.19 88.75
VAE 71.10 71.43 71.58 72.96 77.11
$\delta$-VAE (0.1) 60.11 60.52 61.46 63.79 64.38
Agg-VAE 75.05 77.16 78.50 79.29 80.07
FB (0.1) 75.19 80.78 81.63 82.28 82.39
BN-VAE(0.6) 84.53 88.22 89.45 89.63 89.72
MAE (2, 0.2) 61.50 61.70 62.42 63.58 63.68
DU-VAE (0.5, 0.8) 88.91 89.63 90.36 90.51 90.77
IAF+FB(0.25) 89.73 90.60 90.94 90.91 91.01
IAF+BN(0.7) 87.98 89.03 89.18 89.35 90.29
DU-IAF (0.6, 0.7) 91.25 91.10 91.52 91.97 92.31
Table 2: The accuracy of the classification on Yelp.
#label for each character 5 10 15
AE 37.28 43.38 46.94
VAE 29.48 37.79 42.24
$\delta$-VAE (0.1) 37.28 43.38 46.94
Agg-VAE 33.72 41.31 46.27
FB (0.1) 33.93 41.05 45.21
BN-VAE (0.5) 31.17 39.15 43.24
MAE (0.5, 0.2) 35.05 41.72 44.95
DU-VAE (0.5, 0.1) 40.54 48.09 52.47
IAF+FB(0.15) 38.33 45.85 49.90
IAF+BN(0.5) 16.58 19.49 21.11
DU-IAF (0.5, 0.15) 41.84 49.86 52.97
Table 3: The average accuracy of classifications on OMNIGLOT.
(a) True Latent Space  (b) VAE  (c) Agg-VAE  (d) BN-VAE (1.0)  (e) DU-VAE (1.0, 0.5)
Figure 1: The visualization of the latent space learned by DU-VAE and other baselines. Figure (a) is the contour plot of the true latent space used to generate the synthetic dataset. For the rest, the first row shows the contour plot of the aggregated posterior $q_{\phi}(z)$; the brighter the color, the higher the probability. Meanwhile, the locations of the mean parameters are displayed in the second row, with colors distinguishing categories generated from different Gaussian components, where the blue points correspond to the central component in Figure (a) and the others denote the remaining four components. All figures cover the same region, i.e., $z\in[-3,3]\times[-3,3]$, with the same scale.

Classification. To evaluate the quality of the learned representations, we train a one-layer linear classifier on the output of the trained model for classification tasks on both text and image datasets. For classic VAEs, the mean parameter $\mu$ of each latent variable is used as the representation vector. For IAF-based models, we first set the initial sample $z^{0}$ in latent space to its mean parameter $\mu^{0}$; then, the combination of $z^{0}$ and $z^{T}$ is used as the representation vector.

Specifically, for the text datasets, following Shen et al. (2017), we work with a downsampled version of the Yelp sentiment dataset for binary classification. Table 2 shows the performance under a varying number of labeled data. For the image dataset, noting that OMNIGLOT contains 1623 different handwritten characters from 50 different alphabets, where each character has 15 images in our training data and 5 images in our test data, we conducted classification on each alphabet with a varying number of training samples per character. Table 3 reports the average accuracy.

We find that DU-VAE and DU-IAF achieve the best accuracy under all settings among classic VAEs and IAF-based models, respectively. Interestingly, we also find that most baselines show inconsistent results on text and image classification. For example, Agg-VAE and BN-VAE may be better at text classification but lack sound accuracy in Table 3. On the contrary, $\delta$-VAE and MAE adapt better to image classification, with uncompetitive performance in Table 2. Meanwhile, we note that the IAF chain tends to improve the classification accuracy of FB and of our approach on both text and image datasets. However, IAF+BN fails to achieve competitive performance on image classification, which again indicates that the application of BN in BN-VAE may not be suitable for images.

(a) Yahoo  (b) Yelp  (c) OMNIGLOT
Figure 2: Parameter Analysis.

Parameter Analysis. Here, we train DU-VAE by varying $\gamma$ from 0.4 to 0.7 and $p$ from 1 to 0.6. As Figure 2 shows, DU-VAE achieves the best NLL with parameters ($\gamma$, $p$) of (0.6, 0.8) for Yahoo, (0.5, 0.8) for Yelp, and (0.5, 0.9) for OMNIGLOT, respectively.

4.3 Case Study–Latent Space Visualization

Here, we aim to provide an intuitive comparison of the latent spaces learned by different models on a simple synthetic dataset. Specifically, following Kim et al. (2018), we first sample a 2-dimensional latent variable $z$ from a mixture of Gaussian distributions with 5 components. Then, a text-like dataset is generated from an LSTM layer conditioned on those latent variables. Based on this synthetic dataset, we trained different VAEs with a 2-dimensional standard Gaussian prior and diagonal Gaussian posteriors. We then visualize the learned latent spaces by displaying the contour plot of the aggregated approximated posterior $q(z)=E_{p_{\mathcal{D}}(x)}[q_{\phi}(z|x)]$ and the locations of the approximated posteriors' mean parameters for different samples $x$.

According to the results in Figure 1, we have some interesting observations. First, due to posterior collapse, VAE learns an almost meaningless latent space where the posteriors $q(z|x)$ for all data are squeezed into the center. It is not surprising that the aggregated posterior matches the prior so closely in this case, because we almost have $q_{\phi}(z|x)=p(z)$, $\forall x$. Second, Agg-VAE, BN-VAE, and DU-VAE all tend to separate samples from different categories, but in different manners and to different degrees. Intuitively, all three models embed the blue category in the center, surrounded by the other four categories. However, only the aggregated posterior learned by DU-VAE has five centers, matching the true latent space. Meanwhile, the Dropout strategy in DU-VAE encourages the aggregated posterior to be more compact, while that of BN-VAE is broader than the prior. These observations demonstrate that DU-VAE tends to guide the latent space to be more diverse and less uncertain.

5 Conclusion

In this paper, we developed a novel generative model, DU-VAE, for learning a more diverse and less uncertain latent space. The goal of DU-VAE is to ensure that more meaningful and compact representations can be learned. Specifically, we first demonstrated theoretically that controlling the distribution of the posterior's parameters across the whole dataset leads to a better latent space with high diversity and low uncertainty. Then, instead of introducing new loss terms or modifying training strategies, we proposed to apply Dropout on the variances and Batch-Normalization on the means simultaneously to regularize their distributions implicitly. Furthermore, we extended DU-VAE to VAE-IAF empirically. The experimental results on three benchmark datasets clearly showed that DU-VAE outperforms state-of-the-art baselines on both likelihood estimation and downstream classification tasks.

Acknowledgements

This work was partially supported by grants from the National Natural Science Foundation of China (Grant No. 91746301, 61836013).

References

  • Alemi et al. [2016] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. ICLR, 2016.
  • Blei et al. [2003] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.
  • Bowman et al. [2015a] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
  • Bowman et al. [2015b] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. CCNLP, 2015.
  • Burda et al. [2016] Yuri Burda, Roger B Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.
  • Chen et al. [2017] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, P. Dhariwal, John Schulman, Ilya Sutskever, and P. Abbeel. Variational lossy autoencoder. ICLR, 2017.
  • Dziugaite et al. [2015] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
  • Fang et al. [2019] Le Fang, Chunyuan Li, Jianfeng Gao, Wen Dong, and Changyou Chen. Implicit deep latent variable models for text generation. In EMNLP, 2019.
  • Fu et al. [2019] Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Celikyilmaz, and Lawrence Carin. Cyclical annealing schedule: A simple approach to mitigating kl vanishing. In ACL, 2019.
  • Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • He et al. [2018] Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. In ICLR, 2018.
  • Higgins et al. [2017] I. Higgins, Loïc Matthey, A. Pal, C. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hoffman and Johnson [2016] Matthew D Hoffman and Matthew J Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
  • Huai et al. [2014] Bao-Xing Huai, Teng-Fei Bao, Heng-Shu Zhu, and Qi Liu. Topic modeling approach to named entity linking. Journal of software, 2014.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
  • Kim et al. [2018] Yoon Kim, Sam Wiseman, Andrew Miller, David Sontag, and Alexander Rush. Semi-amortized variational autoencoders. In ICML, 2018.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2013.
  • Kingma et al. [2016] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In NeurIPS, 2016.
  • Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Liese and Vajda [2006] Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
  • Lin et al. [2017] Hao Lin, Hengshu Zhu, Yuan Zuo, Chen Zhu, Junjie Wu, and Hui Xiong. Collaborative company profiling: Insights from an employee’s perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • Ma et al. [2018] Xuezhe Ma, Chunting Zhou, and Eduard Hovy. Mae: Mutual posterior-divergence regularization for variational autoencoders. In ICLR, 2018.
  • Makhzani et al. [2016] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. Adversarial autoencoders. ICLR, 2016.
  • Pu et al. [2016] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In NeurIPS, 2016.
  • Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Razavi et al. [2018] Ali Razavi, Aaron van den Oord, Ben Poole, and Oriol Vinyals. Preventing posterior collapse with delta-vaes. In ICLR, 2018.
  • Rezaabad and Vishwanath [2020] Ali Lotfi Rezaabad and Sriram Vishwanath. Learning representations by maximizing mutual information in variational autoencoders. In ISIT. IEEE, 2020.
  • Semeniuta et al. [2017] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A hybrid convolutional variational autoencoder for text generation. In EMNLP, 2017.
  • Shen et al. [2017] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In NeurIPS, 2017.
  • Shen et al. [2018] Dazhong Shen, Hengshu Zhu, Chen Zhu, Tong Xu, Chao Ma, and Hui Xiong. A joint learning approach to intelligent job interview assessment. In IJCAI, pages 3542–3548, 2018.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
  • Van den Oord et al. [2016] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In NeurIPS, pages 4790–4798, 2016.
  • Xu et al. [2018] Tong Xu, Hengshu Zhu, Chen Zhu, Pan Li, and Hui Xiong. Measuring the popularity of job skills in recruitment market: A multi-criteria approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Yang et al. [2017] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In ICML, 2017.
  • Zhao et al. [2019] Shengjia Zhao, Jiaming Song, and S. Ermon. Infovae: Balancing learning and inference in variational autoencoders. In AAAI, 2019.
  • Zheng et al. [2019] Huangjie Zheng, Jiangchao Yao, Ya Zhang, Ivor W Tsang, and Jia Wang. Understanding vaes in fisher-shannon plane. In AAAI, 2019.
  • Zhu et al. [2014] Chen Zhu, Hengshu Zhu, Yong Ge, Enhong Chen, and Qi Liu. Tracking the evolution of social emotions: A time-aware topic modeling perspective. In 2014 IEEE International Conference on Data Mining, pages 697–706. IEEE, 2014.
  • Zhu et al. [2020] Qile Zhu, Wei Bi, Xiaojiang Liu, Xiyao Ma, Xiaolin Li, and D. Wu. A batch normalized inference network keeps the kl vanishing away. In ACL, 2020.

Supplementary

Appendix A Related Work

Representation learning, as an important direction of machine learning, has attracted enormous attention and served various applications Blei et al. [2003]; Huai et al. [2014]; Zhu et al. [2014]; Lin et al. [2017]; Xu et al. [2018]; Shen et al. [2018]; Radford et al. [2015]. Among them, the Variational Autoencoder (VAE) Kingma and Welling [2013] has achieved great success in recent years. The literature on enhancing the representation learning of VAE can be roughly divided into two categories based on different motivations: solving the optimization challenge and directing the latent space.

Solving the Optimization Challenge. Bowman et al. [2015a] first detailed that VAE tends to ignore the latent variable when employing an LSTM as the decoder network. They interpreted this problem as an optimization challenge of VAE. Following this idea, various strategies have been proposed to spur the training trajectory to jump out of the local optimum known as posterior collapse.

One direction is to adjust the training strategy. For example, Bowman et al. [2015a] proposed KL annealing to address this problem by slightly increasing the weight of the KL term during the first few epochs of training. Kim et al. [2018] designed Semi-Amortized VAE (SA-VAE) to compose the inference network with additional updates. More recently, He et al. [2018] proposed to aggressively optimize the inference network multiple times before each single decoder update. However, those approaches often suffer from the additional training procedure. Meanwhile, weakening the expressive capacity of the decoder network is another option. Semeniuta et al. [2017] and Yang et al. [2017] implemented the decoder with CNNs without autoregressive modeling. Chen et al. [2017] applied a lossy representation as the input of the autoregressive decoder to force the latent representation to discard irrelevant information. However, those approaches are often problem-specific and require manual design of the decoder network.

Some other works constrain the minimum of the KL term to prevent the KL from vanishing to 0. For instance, $\beta$-VAE Higgins et al. [2017] introduces a hyperparameter to weight the KL term, which constrains its minimum. Free bits (FB) Kingma et al. [2016] replaces the KL term with a hinge loss term that takes the maximum of the original KL and a constant. $\delta$-VAE Razavi et al. [2018] constrains the variational family carefully such that the posterior can never exactly recover the prior. However, those approaches often suffer from performance degradation in likelihood estimation or a non-smooth objective Chen et al. [2017]. Recently, BN-VAE Zhu et al. [2020] directly utilized Batch-Normalization Ioffe and Szegedy [2015] on the mean parameters to keep a positive lower bound of the KL term, which has shown surprising effectiveness in preventing posterior collapse. However, the theoretical basis behind the effect of BN on the latent space is not yet understood. In this paper, we provide one more possible explanation based on a geometric analysis of the latent space.

Directing Latent Space Learning. Some other studies attempt to enhance representation learning by directing the latent space learning. One common idea is to use additional mutual information-based objective terms to enforce the relationship between the latent variable and the observed data. However, since the MI term cannot be computed analytically, auxiliary networks Fang et al. [2019]; Rezaabad and Vishwanath [2020], Monte-Carlo estimation Hoffman and Johnson [2016], or Maximum-Mean Discrepancy methods Dziugaite et al. [2015]; Zhao et al. [2019] are required for approximation, which introduces additional training effort or expensive computation. Recently, MAE Ma et al. [2018] further proposed Mutual Posterior-Divergence (MPD) regularization to control the geometric diversity of the latent space, which shares a similar goal with MI of measuring the average divergence between each conditional distribution and the marginal distribution. However, the scales of MPD and the original objective of VAE are unbalanced, which requires additional deliberate normalization. In this paper, we also aim at directing the latent space learning and encouraging diversity. Different from MAE, we control MPD through network regularizations, i.e., Dropout and Batch-Normalization, without adding explicit training objectives.

Appendix B Proof of Equation 3

Proof.

First, we can derive the formula of $D_{SKL}[q_{\phi}(z_{1}|x_{1})||q_{\phi}(z_{2}|x_{2})]$ under Gaussian distributions as follows:

$$4D_{SKL}[q_{\phi}(z_{1}|x_{1})||q_{\phi}(z_{2}|x_{2})]=(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}\Big(\frac{1}{\delta_{x_{1},d}^{2}}+\frac{1}{\delta_{x_{2},d}^{2}}\Big)+\frac{\delta^{2}_{x_{1},d}}{\delta^{2}_{x_{2},d}}+\frac{\delta^{2}_{x_{2},d}}{\delta^{2}_{x_{1},d}}-2.\tag{S11}$$

Then, computing the expectation of both sides of this equality with respect to i.i.d. $x_{1},x_{2}\sim p_{\mathcal{D}}(x)$, we obtain Equation 3.

In addition, when the value of $\delta_{x,d}^{2}$ is upper bounded by a constant $C$, we further have:

$$4D_{SKL}[q_{\phi}(z_{1}|x_{1})||q_{\phi}(z_{2}|x_{2})]\geq(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}\Big(\frac{1}{\delta_{x_{1},d}^{2}}+\frac{1}{\delta_{x_{2},d}^{2}}\Big)\geq(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}\frac{2}{C},\tag{S12}$$

where the equality in the second line holds when $\delta_{x_{1},d}^{2}=\delta_{x_{2},d}^{2}$. Then, computing the expectation of both sides of the inequality, we have:

$$MPD_{p_{\mathcal{D}}(x)}[z]\geq\frac{1}{C}\sum_{d=1}^{n}Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}],~~\text{if}~~\forall~\delta^{2}_{x,d}\leq C.\tag{S13}$$

Appendix C Proof of Proposition 1

Proof.

Given the Dropout regularization defined in our paper, it is not hard to derive the following equations:

$$E_{B}[\hat{\delta}^{2}_{x,d}]=\delta^{2}_{x,d},\qquad E_{B}\Big[\frac{1}{\hat{\delta}^{2}_{x,d}}\Big]=\frac{p^{2}}{\delta_{x,d}^{2}+(p-1)\alpha}+\frac{1-p}{\alpha},\qquad E_{B}[\log\hat{\delta}^{2}_{x,d}]=p\log\Big(\frac{\delta_{x,d}^{2}+(p-1)\alpha}{p\alpha}\Big)+\log\alpha.\tag{S14}$$

Interestingly, we note that the right-hand sides of the second and third equations are strictly increasing and decreasing, respectively, as $p$ decreases from 1 to 0, which can be verified by analyzing their derivatives. Then, we have:

$$\frac{1}{\alpha}>E_{B}\Big[\frac{1}{\hat{\delta}^{2}_{x,d}}\Big]>\frac{1}{\delta_{x,d}^{2}},\qquad\log\alpha<E_{B}[\log\hat{\delta}^{2}_{x,d}]<\log\delta_{x,d}^{2}.\tag{S15}$$

Meanwhile, by regarding $MPD_{p_{\mathcal{D}}(x)\cdot B}[z]$ and $H_{q_{\phi}\cdot B}(z|x)$ as functions of $E_{B}[\frac{1}{\hat{\delta}^{2}_{x,d}}]$ and $E_{B}[\log\hat{\delta}^{2}_{x,d}]$, respectively, we find that they are also monotonically increasing and decreasing as $p$ decreases. We now turn to prove the three inequalities in Equations 7 and 8.

First, by computing the expectation of the second inequality with respect to $x\sim p_{\mathcal{D}}(x)$, we can derive:

$$E_{p_{\mathcal{D}}(x)\cdot B}[\log\hat{\delta}^{2}_{x,d}]<E_{p_{\mathcal{D}}(x)}[\log\delta^{2}_{x,d}].\tag{S16}$$

Meanwhile, based on Equation 5, we have:

$$H_{q_{\phi}\cdot B}(z|x)<H_{q_{\phi}}(z|x).\tag{S17}$$

In addition, due to the general observation:

$$E_{p_{\mathcal{D}}(x)\cdot B}\Big[\frac{(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}}{\hat{\delta}^{2}_{x_{1},d}}\Big]=E_{p_{\mathcal{D}}(x)}\Big[(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}E_{B}\Big[\frac{1}{\hat{\delta}^{2}_{x_{1},d}}\Big]\Big],\tag{S18}$$

we have the following derivation:

$$E_{p_{\mathcal{D}}(x)\cdot B}\Big[\frac{(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}}{\hat{\delta}^{2}_{x_{1},d}}\Big]<\frac{2}{\alpha}Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}],\qquad E_{p_{\mathcal{D}}(x)\cdot B}\Big[\frac{(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}}{\hat{\delta}^{2}_{x_{1},d}}\Big]>E_{p_{\mathcal{D}}(x)}\Big[\frac{(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}}{\delta^{2}_{x_{1},d}}\Big].\tag{S19}$$

Then, combining Equations 3, S14, S15, and S19, we have:

$$\begin{aligned}2MPD_{p_{\mathcal{D}}(x)\cdot B}[z]&=\sum_{d=1}^{n}E_{p_{\mathcal{D}}(x)\cdot B}\Big[\frac{(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}}{\hat{\delta}_{x_{1},d}^{2}}\Big]+\sum_{d=1}^{n}E_{p_{\mathcal{D}}(x)\cdot B}[\hat{\delta}_{x,d}^{2}]\,E_{p_{\mathcal{D}}(x)\cdot B}\Big[\frac{1}{\hat{\delta}_{x,d}^{2}}\Big]-1\\&>\sum_{d=1}^{n}E_{p_{\mathcal{D}}(x)}\Big[\frac{(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}}{\delta_{x_{1},d}^{2}}\Big]+\sum_{d=1}^{n}E_{p_{\mathcal{D}}(x)}[\delta_{x,d}^{2}]\,E_{p_{\mathcal{D}}(x)}\Big[\frac{1}{\delta_{x,d}^{2}}\Big]-1=2MPD_{p_{\mathcal{D}}(x)}[z].\end{aligned}\tag{S20}$$

Meanwhile, we note that $E_{B}[\frac{1}{\hat{\delta}^{2}_{x,d}}]>\frac{1-p}{\alpha}$ based on Equation S14, so we also have:

$$MPD_{p_{\mathcal{D}}(x)\cdot B}[z]>\sum_{d=1}^{n}E_{p_{\mathcal{D}}(x)\cdot B}\Big[\frac{(\mu_{x_{1},d}-\mu_{x_{2},d})^{2}}{2\hat{\delta}^{2}_{x_{1},d}}\Big]>\frac{1-p}{\alpha}\sum_{d=1}^{n}Var_{p_{\mathcal{D}}(x)}[\mu_{x,d}].\tag{S21}$$
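As an informal numerical check of Equation S14 (not part of the proof), the three closed forms can be compared against Monte-Carlo averages for an arbitrary variance value; the numbers used below are illustrative choices.

```python
import math
import torch

ALPHA = 1.0 / (2 * math.pi * math.e)

def check_s14(var=0.5, p=0.8, n_samples=1_000_000):
    """Compare the closed forms in Equation S14 with Monte-Carlo estimates."""
    g = torch.bernoulli(torch.full((n_samples,), p)) / p      # normalized Bernoulli
    var_hat = g * (var - ALPHA) + ALPHA                       # Equation 6
    print(var_hat.mean().item(), var)                         # E_B[var_hat] vs. var
    print((1 / var_hat).mean().item(),
          p ** 2 / (var + (p - 1) * ALPHA) + (1 - p) / ALPHA) # second line of S14
    print(var_hat.log().mean().item(),
          p * math.log((var + (p - 1) * ALPHA) / (p * ALPHA)) + math.log(ALPHA))  # third line
```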

Appendix D Connection between z0z^{0} and zTz^{T} in VAE-IAF

Here, we first introduce the background knowledge of VAE-IAF. Then we discuss the strong relation between $z^{0}$ and $z^{T}$ in VAE-IAF with respect to CE and MPD, respectively, which provides one explanation for the effectiveness of DU-IAF. In addition, Algorithm S1 shows the detailed training procedure of DU-IAF.

Background knowledge of VAE-IAF. VAE-IAF aims to construct more flexible and expressive posterior distributions with the help of normalizing flows. Here, we briefly introduce the main steps of constructing the posterior distribution in VAE-IAF. In practice, the encoder network first outputs $\mu^{0}$ and $\delta^{0}$, in addition to an extra embedding $h$ of the input data. The initial random sample is then drawn from the diagonal Gaussian $q_{\phi}(z^{0}|x)=\mathcal{N}(\mu^{0},(\delta^{0})^{2})$. Second, a chain of nonlinear invertible transformations is defined with $T$ IAF blocks:

$$(m^{t},s^{t})=\text{AutoregressiveNeuralNet}_{t}(z^{t-1},h;\psi_{t}),\qquad\delta^{t}=\text{sigmoid}(s^{t}),\quad z^{t}=\delta^{t}\odot z^{t-1}+(1-\delta^{t})\odot m^{t},\tag{S22}$$

where each IAF block is a different autoregressive neural network, structured to be autoregressive w.r.t. $z^{t-1}$. Finally, the final iterate $z^{T}$ is taken as the approximate posterior sample and fed into the decoder network. As a result, $\frac{dz^{t}}{dz^{t-1}}$ is triangular with $\delta^{t}$ on the diagonal. In other words, all transformations are invertible with positive determinant $\prod_{d=1}^{D}\delta^{t}_{d}$ everywhere.
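A minimal sketch of one IAF transformation in Equation S22 is given below; `autoregressive_nn` is a placeholder for AutoregressiveNeuralNet_t (e.g., a MADE-style network), assumed to return $m^{t}$ and $s^{t}$ with the same shape as $z^{t-1}$.

```python
import torch

def iaf_step(z_prev, h, autoregressive_nn):
    """One IAF block (Equation S22), returning z^t and log|det dz^t/dz^{t-1}|."""
    m, s = autoregressive_nn(z_prev, h)
    gate = torch.sigmoid(s)                    # delta^t in (0, 1)
    z = gate * z_prev + (1.0 - gate) * m
    log_det = gate.log().sum(dim=-1)           # determinant is prod_d delta^t_d
    return z, log_det
```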

$H_{q_{\phi}}(z^{0}|x)$ vs. $H_{q_{\phi}}(z^{T}|x)$. Here, we prove that the CE of $z^{0}$ is an upper bound of that of $z^{T}$, i.e., $H_{q_{\phi}}(z^{0}|x)>H_{q_{\phi}}(z^{T}|x)$.

Proof.

Specifically, we first compute the differential entropy $H(q(z^{T}|x))$ of the posterior $q(z^{T}|x)$ given $x$:

$$\begin{aligned}H(q(z^{T}|x))&=-\int q(z^{T}|x)\log q(z^{T}|x)\,dz^{T}\\&=-\int q(z^{0}|x)\det\left|\frac{dz^{0}}{dz^{T}}\right|\log\Big(q(z^{0}|x)\det\left|\frac{dz^{0}}{dz^{T}}\right|\Big)dz^{T}\\&=-\int q(z^{0}|x)\log q(z^{0}|x)\,dz^{0}-\int q(z^{0}|x)\log\det\left|\frac{dz^{0}}{dz^{T}}\right|dz^{0}\\&=H(q(z^{0}|x))+E_{q(z^{0}|x)}\Big[\sum_{d=1}^{D}\sum_{t=1}^{T}\log\delta^{t}_{x,d}\Big],\end{aligned}\tag{S23}$$

where the equality holds because each step in the IAF chain defined in Equation S22 is invertible and differentiable, and the Jacobian matrix $\frac{dz^{T}}{dz^{0}}$ is invertible with the following determinant everywhere:

\small\begin{split}\det\left|\frac{dz^{T}}{dz^{0}}\right|=\prod_{t=1}^{T}\det\left|\frac{dz^{t}}{dz^{t-1}}\right|=\prod_{t=1}^{T}\prod_{d=1}^{D}\delta^{t}_{d}.\end{split} (S24)

Meanwhile, noting that each $\delta^{t}_{d}<1$, we have:

\small\begin{split}H(q(z^{0}|x))>H(q(z^{T}|x)).\end{split} (S25)

Then, by taking the expectation of both sides with respect to $x\sim p_{\mathcal{D}}(x)$, we can derive that $H_{q_{\phi}}(z^{0}|x)>H_{q_{\phi}}(z^{T}|x)$. ∎
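As a quick sanity check of Equations S23 and S25, the following snippet considers the special case where the gate $\delta$ is constant (independent of $z$), so that the transformed variable stays Gaussian and its differential entropy is available in closed form; all values and shapes below are illustrative.

import torch
from torch.distributions import Normal, Independent

torch.manual_seed(0)
D = 4
mu0, sigma0 = torch.randn(D), torch.rand(D) + 0.5
q_z0 = Independent(Normal(mu0, sigma0), 1)

# One gated affine step with a constant gate delta in (0, 1):
# z^1 = delta * z^0 + (1 - delta) * m, so z^1 is still Gaussian.
delta, m = torch.sigmoid(torch.randn(D)), torch.randn(D)
q_z1 = Independent(Normal(delta * mu0 + (1 - delta) * m, delta * sigma0), 1)

# The entropy changes exactly by sum(log delta) < 0, matching Equation S23.
print(q_z1.entropy() - q_z0.entropy())   # approximately equal to:
print(torch.log(delta).sum())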

Algorithm S1 Training Procedure of DU-IAF
1:  Initialize $\phi$, $\theta$, $\gamma_{\mu}=\gamma$, and $\beta_{\mu}=0$
2:  while not convergence do
3:     Sample a mini-batch $x$
4:     $\mu^{0}_{x},(\delta_{x}^{0})^{2},h=f_{\phi}(x)$.
5:     $\hat{\mu}^{0}_{x}=BN_{\gamma_{\mu},\beta_{\mu}}(\mu^{0}_{x})$, $(\hat{\delta}^{0}_{x})^{2}=Dropout_{p}((\delta_{x}^{0})^{2})$.
6:     Sample $z^{0}\sim\mathcal{N}(\hat{\mu}^{0}_{x},(\hat{\delta}_{x}^{0})^{2})$.
7:     Transform $z^{0}$ into $z^{T}$ with $T$ IAF blocks in Equation S22.
8:     Generate $x$ from $f_{\theta}(z^{T})$.
9:     Compute gradients $g_{\phi,\theta}\leftarrow\nabla_{\phi,\theta}\mathcal{L}_{ELBO}(x;\phi,\theta)$.
10:     Update $\phi,\theta$, $\gamma_{\mu}$, $\beta_{\mu}$ according to $g_{\phi,\theta}$.
11:     $\gamma_{\mu}=\frac{\gamma}{\sqrt{E_{d}[\gamma^{2}_{\mu_{d}}]}}\odot\gamma_{\mu}$
12:  end while
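The following is a condensed PyTorch-style sketch of one training step in Algorithm S1 (lines 3-11). The modules encoder, decoder, iaf_chain, and the routine elbo are assumed interfaces rather than the released implementation; drop_p follows torch's drop-probability convention, while the paper's $p$ may denote the keep rate.

import torch
import torch.nn.functional as F

def du_iaf_step(x, encoder, iaf_chain, decoder, bn_mu, drop_p, gamma, elbo, optimizer):
    """One DU-IAF training step (sketch of Algorithm S1, lines 3-11)."""
    mu0, logvar0, h = encoder(x)                                  # line 4
    mu0_hat = bn_mu(mu0)                                          # BN on the means (line 5)
    # Dropout on the variances (line 5): F.dropout zeros entries with
    # probability drop_p and rescales the kept entries by 1/(1 - drop_p).
    var0_hat = F.dropout(logvar0.exp(), p=drop_p, training=True)
    z = mu0_hat + (var0_hat + 1e-8).sqrt() * torch.randn_like(mu0_hat)  # line 6
    log_det = 0.0
    for block in iaf_chain:                                       # line 7: T IAF blocks (Eq. S22)
        z, ld = block(z, h)
        log_det = log_det + ld
    recon = decoder(z)                                            # line 8
    loss = -elbo(x, recon, mu0_hat, var0_hat, log_det)            # line 9: negative ELBO
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                              # line 10
    with torch.no_grad():                                         # line 11: rescale BN gains to RMS gamma
        g = bn_mu.weight
        g.mul_(gamma / g.pow(2).mean().sqrt())
    return loss.item()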

$MPD_{p_{\mathcal{D}}(x)}[z^{0}]$ vs. $MPD_{p_{\mathcal{D}}(x)}[z^{T}]$. Here, we turn to further explore the possible connection between $MPD_{p_{\mathcal{D}}(x)}[z^{0}]$ and $MPD_{p_{\mathcal{D}}(x)}[z^{T}]$. Noting that $MPD_{p_{\mathcal{D}}(x)}[z^{T}]$ is intractable due to the interconnection among $h$, $z^{0}$, and each $\delta^{t}$, it is difficult to compare $MPD_{p_{\mathcal{D}}(x)}[z^{0}]$ with $MPD_{p_{\mathcal{D}}(x)}[z^{T}]$ directly. However, we find that when the AutoregressiveNeuralNet in Equation S22 ignores the context information $h$, i.e., the IAF chain is independent of the input data, we can prove that $MPD_{p_{\mathcal{D}}(x)}[z^{0}]=MPD_{p_{\mathcal{D}}(x)}[z^{T}]$, which also demonstrates the strong relation between $z^{0}$ and $z^{T}$ to some degree.

Proof.

First, we recall the following lemma from Liese and Vajda [2006]:

Lemma 1.

Given two distributions $p_{i}(x)$ and $p_{j}(x)$ on a space $\mathcal{X}$ ($x\in\mathcal{X}$) and one differentiable and invertible transformation $h:\mathcal{X}\rightarrow\mathcal{Y}$ that converts $x$ into $y$, we have:

\small\begin{split}D_{KL}(p_{i}(x)||p_{j}(x))=D_{KL}(p^{\prime}_{i}(y)||p^{\prime}_{j}(y)),\end{split} (S26)

where $y=h(x)$ and $p^{\prime}(y)$ denotes the distribution of $y$.

This lemma tells us that the KL-divergence between two distributions is invariant under the same differentiable and invertible transformation. Meanwhile, noting that the IAF chains for different input data are all invertible and differentiable, and identical to each other when AutoregressiveNeuralNet ignores the context information $h$, we have:

\small\begin{split}D_{KL}(q(z^{t_{1}}|x_{1})||q(z^{t_{1}}|x_{2}))=D_{KL}(q(z^{t_{2}}|x_{1})||q(z^{t_{2}}|x_{2})),\end{split} (S27)

where $t_{1},t_{2}=0,1,\ldots,T$ and $x_{1},x_{2}\in\mathcal{X}$. When we set $t_{1}=0$ and $t_{2}=T$, we can derive:

\small\begin{split}MPD[z^{0}]&=E_{p_{\mathcal{D}}(x)}[D_{SKL}[q_{\phi}(z^{0}|x_{1})||q_{\phi}(z^{0}|x_{2})]]\\ &=\frac{1}{2}E_{p_{\mathcal{D}}(x)}[D_{KL}[q_{\phi}(z^{0}|x_{1})||q_{\phi}(z^{0}|x_{2})]]\\ &~{}~{}~{}~{}+\frac{1}{2}E_{p_{\mathcal{D}}(x)}[D_{KL}[q_{\phi}(z^{0}|x_{2})||q_{\phi}(z^{0}|x_{1})]]\\ &=\frac{1}{2}E_{p_{\mathcal{D}}(x)}[D_{KL}[q_{\phi}(z^{T}|x_{1})||q_{\phi}(z^{T}|x_{2})]]\\ &~{}~{}~{}~{}+\frac{1}{2}E_{p_{\mathcal{D}}(x)}[D_{KL}[q_{\phi}(z^{T}|x_{2})||q_{\phi}(z^{T}|x_{1})]]\\ &=E_{p_{\mathcal{D}}(x)}[D_{SKL}[q_{\phi}(z^{T}|x_{1})||q_{\phi}(z^{T}|x_{2})]]=MPD[z^{T}].\end{split} (S28)
∎
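The invariance in Lemma 1 can also be checked numerically in a simple case. The snippet below applies the same invertible affine map to two Gaussians and verifies that their KL divergence is unchanged; an affine map stands in for the (data-independent) IAF chain only to keep the pushforward available in closed form, and all values are illustrative.

import torch
from torch.distributions import MultivariateNormal, kl_divergence

torch.manual_seed(0)
D = 3
# Two distinct Gaussian "posteriors" q(z|x1) and q(z|x2) (illustrative values).
q1 = MultivariateNormal(torch.randn(D), torch.diag(torch.rand(D) + 0.5))
q2 = MultivariateNormal(torch.randn(D), torch.diag(torch.rand(D) + 0.5))

# The same invertible affine map y = A z + b applied to both distributions.
A = torch.randn(D, D) + 3.0 * torch.eye(D)   # invertible (almost surely)
b = torch.randn(D)

def push(q):
    # Pushforward of N(mu, Sigma) under y = A z + b is N(A mu + b, A Sigma A^T).
    C = A @ q.covariance_matrix @ A.T
    return MultivariateNormal(A @ q.mean + b, 0.5 * (C + C.T))

print(kl_divergence(q1, q2))               # equals ...
print(kl_divergence(push(q1), push(q2)))   # ... this, up to numerical error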

Appendix E Detailed Experimental Setup

Our experiments are designed with the guidance of Kim et al. [2018] and He et al. [2018]. Here, we first introduce the detailed experimental settings for the experiments on the real-world datasets and the synthetic dataset, respectively. Finally, we introduce more details of hyper-parameter selection for each baseline where needed.

Text and Image Datasets. For both the Yahoo and Yelp datasets, we used the same train/val/test splits as provided by He et al. [2018], which randomly downsample 100K/10K/10K sentences for training/validation/testing, respectively. We utilized a single-layer LSTM with a hidden size of 1024 as both the encoder network and the decoder network. The size of the word embeddings is 512. Uniform distributions on $[-0.01,0.01]$ and $[-0.1,0.1]$ were applied to initialize the LSTM layers and embedding layers, respectively. In addition, a Dropout of 0.5 was applied on both the input word embeddings and the last dense layer to enhance performance. During the training process, the learning rate was initialized at 1.0 and decayed by 0.5 if the validation loss had not improved in the past 5 epochs. Meanwhile, training would stop early after 5 learning rate decays, with the maximum number of epochs set to 120. As for the OMNIGLOT dataset, we used the same train/val/test splits as provided by Kim et al. [2018]. During training, the inputs were binarized dynamically by sampling each pixel from a Bernoulli distribution with the original pixel value as the parameter. Meanwhile, the fixed binarization was utilized for validation and testing, and the decoder uses a binary likelihood. The 3-layer ResNet and 13-layer PixelCNN used in our model are the same as those in He et al. [2018]. The Adam optimizer was used to optimize our model, starting with a learning rate of 0.001 and decaying it by 0.5 if the validation loss had not improved in the past 20 epochs. Meanwhile, training would stop early after 5 learning rate decays. In addition, following Kingma et al. [2016], for all IAF-based models, we used a 2-layer MADE Germain et al. [2015] with hidden size 1920 to implement each IAF block. Two IAF blocks were stacked as one IAF chain, with the ordering reversed between them.
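For concreteness, the following sketch mirrors the learning-rate schedule described above for the text datasets (decay by 0.5 after 5 stagnant epochs, early stop after 5 decays, at most 120 epochs); the function names, arguments, and the exact stopping condition are illustrative assumptions rather than our released implementation.

def train_with_decay(model, train_epoch, validate, lr=1.0, max_epochs=120,
                     patience=5, decay_factor=0.5, max_decays=5):
    """Sketch of the SGD schedule used for the text datasets (see above)."""
    best_val, stale, decays = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_epoch(model, lr)
        val_loss = validate(model)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
        if stale >= patience:
            lr *= decay_factor
            stale, decays = 0, decays + 1
            if decays >= max_decays:   # stop once 5 decays have occurred (one plausible reading)
                break
    return model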

Synthetic Dataset. To generate a synthetic dataset with better visualization, we sampled the 2-dimensional latent variable $z$ from a mixture of Gaussian distributions with 5 mixture components. These Gaussian components have the mean parameters (0.0, 0.0), (-2.0, -2.0), (-2.0, 2.0), (2.0, -2.0), and (2.0, 2.0), respectively, and unit variance. Then, following Kim et al. [2018], one LSTM layer with 100 hidden units and 100-dimensional input embeddings was used to generate the synthetic text, where the hidden state was initialized by an affine transformation of $z$. The output of the LSTM and $z$ were then concatenated as the input of one MLP that maps them into the vocabulary space. The LSTM parameters were initialized with a uniform distribution on [-1,1], and the parameters of the MLP networks were initialized with a uniform distribution on [-5,5]. We fixed the length of each text sample at 10 and the vocabulary size at 1000. 16000/2000/2000 examples were generated for training/validation/testing, respectively. In the training procedure of the different models, we utilized one single-layer LSTM with 50 hidden units and 50-dimensional latent embeddings as the encoder and decoder. The other experimental setup for the synthetic experiments was the same as that for the text datasets mentioned above. In particular, to ensure meaningful representations, we selected the suitable parameters for BN-VAE and DU-VAE with minimal NLL when the corresponding MI is larger than 1, among the hyper-parameter sets $\gamma\in$ {0.5, 0.7, 1.0, 1.2} and $p\in$ {0.1, 0.3, 0.5, 0.7}. Finally, we set $\gamma=1.0$ for BN-VAE, and $\gamma=1.0$ and $p=0.5$ for DU-VAE, in the synthetic experiments.
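A minimal sketch of the latent sampling step for the synthetic dataset is given below; equal component weights are assumed, since the mixing proportions are not specified above.

import torch

torch.manual_seed(0)
means = torch.tensor([[0.0, 0.0], [-2.0, -2.0], [-2.0, 2.0], [2.0, -2.0], [2.0, 2.0]])

def sample_latents(n):
    """Draw n 2-D latents from the 5-component Gaussian mixture described above
    (equal component weights assumed; unit variance per component)."""
    comp = torch.randint(0, means.size(0), (n,))
    return means[comp] + torch.randn(n, 2)

z_train = sample_latents(16000)                  # 16000/2000/2000 splits as in the text
z_val, z_test = sample_latents(2000), sample_latents(2000)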

Model                      NLL    KL    MI   AU
BN-VAE with $\beta=0.1$    328.2  0.2   0.0  0.0
BN-VAE with $\beta=0.3$    328.5  1.4   0.0  0.0
BN-VAE with $\beta=0.5$    329.5  4.0   0.0  0.0
BN-VAE with $\beta=0.7$    331.7  8.0   0.1  3.0
BN-VAE with $\beta=1.0$    339.7  20.1  3.9  7.0
Table S1: The performance of likelihood estimation for BN-VAE with fixed non-zero $\beta$ on Yahoo.
#label                     100    500    1k     2k     10k
BN-VAE with $\beta=0.1$    60.46  60.46  60.46  60.46  62.9
BN-VAE with $\beta=0.3$    60.41  60.46  60.46  60.46  62.41
BN-VAE with $\beta=0.5$    60.46  60.46  60.46  60.46  60.57
BN-VAE with $\beta=0.7$    60.46  60.46  60.49  60.53  62.78
BN-VAE with $\beta=1.0$    60.46  60.46  60.45  60.45  62.45
Table S2: The accuracy of classification for BN-VAE with fixed non-zero $\beta$ on Yelp.

More Details for Hyper-parameter Selection. For better reproducibility, we report the specific ranges we considered when selecting the best hyper-parameters for each baseline and for our models where needed. To be specific, for FB and IAF+FB, we varied the parameter $\lambda$ in {0.1, 0.15, 0.2, 0.25, 0.3}. For $\delta$-VAE, we selected $\delta$ from {0.1, 0.15, 0.2, 0.25}. For BN-VAE and IAF+BN, the parameter $\gamma$ was selected from {0.3, 0.4, 0.5, 0.6, 0.7}. For MAE, we selected the parameters $\gamma$ and $\eta$ from {0.5, 1.0, 2.0} $\times$ {0.2, 0.5, 1.0}. In addition, for DU-VAE, we varied the parameter $\gamma$ in {0.4, 0.5, 0.6, 0.7} and $p$ in {1.0, 0.9, 0.8, 0.7, 0.6}. For DU-IAF, we determined the parameters $\gamma$ and $p$ from {0.5, 0.6, 0.7, 0.8} $\times$ {0.9, 0.8, 0.7} for the text datasets and from {0.4, 0.5, 0.6, 0.7} $\times$ {0.9, 0.85, 0.8, 0.75} for the image dataset, due to the different performance of IAF-based models on text and image data. All hyper-parameters were determined based on the NLL metric.
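The selection procedure above amounts to a simple grid search, as in the sketch below; train_and_eval is an assumed routine that trains one configuration and returns its validation NLL, and the default grids shown are those used for DU-VAE.

import itertools

def select_hyperparameters(train_and_eval, gammas=(0.4, 0.5, 0.6, 0.7),
                           ps=(1.0, 0.9, 0.8, 0.7, 0.6)):
    """Grid search over (gamma, p), keeping the configuration with the lowest
    validation NLL, as described above."""
    best = (None, float("inf"))
    for gamma, p in itertools.product(gammas, ps):
        nll = train_and_eval(gamma=gamma, p=p)
        if nll < best[1]:
            best = ((gamma, p), nll)
    return best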

In addition, our code is implemented in PyTorch on a Linux server, and all experiments were conducted on one NVIDIA Tesla V100 GPU with 32GB of memory.

Appendix F Fixing Non-zero $\beta_{\mu}$ in BN

Here, we aim to show one abnormal case to prove that keeping a positive lower bound on the KL term is not sufficient for the second term in Equation 10. To be specific, we explore the possibility of applying one BN on the mean parameters $\mu$ of the posteriors with a fixed non-zero shift parameter $\beta_{\mu}$ and a learnable scale parameter $\gamma_{\mu}$. We can find that this strategy can also ensure a positive lower bound on the KL term, like BN-VAE, but cannot avoid posterior collapse. To be specific, following the same experimental setting in Experimental Setup, we evaluated the performance of likelihood estimation for this approach on the Yahoo dataset and of classification on the downsampled Yelp dataset, varying $\beta$ from 0.1 to 1.0. The results are summarized in Tables S1 and S2. From Tables S1 and S2, we find that the BN strategy with a fixed non-zero $\beta$ fails to achieve good performance on both the likelihood estimation and classification tasks, and even keeps the MI and AU metrics vanishing to 0, which implies that the approximated posteriors of different input samples are the same as each other. In other words, keeping a positive KL term is not sufficient for avoiding posterior collapse.
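For clarity, the variant studied here can be implemented as in the following sketch, which fixes the BN shift to a non-zero beta while keeping the scale learnable; the latent dimension and module names are illustrative assumptions.

import torch
import torch.nn as nn

class BNFixedBeta(nn.Module):
    """BatchNorm on the posterior means with a fixed shift beta_mu and a
    learnable scale gamma_mu, i.e., the variant studied in this appendix."""
    def __init__(self, latent_dim=32, beta=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(latent_dim, affine=True)
        with torch.no_grad():
            self.bn.bias.fill_(beta)
        self.bn.bias.requires_grad_(False)   # beta_mu stays fixed during training

    def forward(self, mu):
        return self.bn(mu)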

Appendix G Generation and Reconstruction

Here, we show some examples of generation and reconstruction for DU-VAE, which provide an illustrative perspective on its generative capacity. Specifically, we first generate examples guided by random samples drawn from the prior distribution. Meanwhile, we also followed Bowman et al. [2015b] and utilized linear interpolation between latent variables to evaluate the smoothness of the latent space. In particular, we took two samples from the test dataset and learned their latent representations $z_{x_{1}}$ and $z_{x_{2}}$ with the trained models. Then, we greedily decoded each linear interpolation point between the two with evenly divided intervals. Tables S3 and S4 show the results on the downsampled version of the Yelp dataset, while the results for OMNIGLOT are summarized in Figure S1. We can find that the texts and images generated by DU-VAE are grammatically plausible and semantically rich. In addition, we also note that the examples decoded from each interpolation point are semantically consistent between the source sample and the target sample.
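The interpolation procedure can be sketched as follows; encode and greedy_decode are assumed model interfaces, and the number of interpolation steps is illustrative.

def interpolate_and_decode(x1, x2, encode, greedy_decode, steps=5):
    """Linearly interpolate between the latent codes of two test samples and
    greedily decode each interpolation point, as described above."""
    z1, z2 = encode(x1), encode(x2)          # e.g., the posterior means
    outputs = []
    for k in range(steps + 1):
        alpha = k / steps                    # evenly divided intervals
        z = (1.0 - alpha) * z1 + alpha * z2
        outputs.append(greedy_decode(z))
    return outputs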

This is a hidden gem. There was no excuse for the service , but no complaints.
They have the best pizza in town. I got a new car and it was a complete waste of time.
The staff was very friendly and helpful. What a waste of time and money.
The food is delicious , and the service is very friendly. We ’ll definitely be back!
I love the food and the service here. No thanks for the job.
We will be back! The service is always good.
Food is good. The staff was very nice, and very helpful.
This place is great. The lady who helped me was very rude.
This place is a great place to go! It was a great experience.
No one ever came back to this place. Thanks for the great service!
Table S3: Samples generated from the prior distribution on Yelp
Source: I won’t be back. Source: I was very disappointed with this place.
Target: I highly recommend this place. Target: I love the atmosphere.
I will not be back. I was very disappointed with this place.
I would not recommend this place to anyone. I was very disappointed with the service.
I would recommend this place to anyone. I am very disappointed with the service.
I highly recommend this place. I love the atmosphere.
I highly recommend this place. I love the atmosphere.
Source: I got the chicken fajitas , which was a little dry. Source: Service was great , and the food was excellent.
Target: The food is always fresh and delicious. Target: My family and I love this place.
I got the chicken fajitas , it was a little dry. Service was great , and the food was great.
I ordered the chicken sandwich , which was very good. Service was great and the food was great.
I love the food here. My husband and I were here for a few years ago.
The food is always fresh and delicious. My husband and I love this place.
The food is always fresh and delicious. My family and I love this place.
Table S4: Interpolation between posterior samples on Yelp.
(a) Samples generated from the prior distribution
(b) Interpolation between posterior samples
Figure S1: Generation and reconstruction on OMNIGLOT. The source and target images in Figure (b) lie in the first and last columns, respectively.