
Improve Variational AutoEncoder with
Auxiliary SOFTMAX MultiClassifier

Yao Li
Institute of Big Data
Fudan University
18110980007@fudan.edu.cn
Abstract

As a general-purpose generative model architecture, the variational autoencoder (VAE) has been widely used in image and natural language processing. A VAE maps high-dimensional sample data to continuous latent variables through unsupervised learning, and new image or text data can be constructed by sampling in the latent space. However, the vanilla VAE does not fit all data sets and network structures equally well. Because training must balance reconstruction accuracy against the convenience of sampling latent variables, VAEs often suffer from the problem known as “posterior collapse”, and the images they reconstruct are often blurred. In this paper, we analyze the main cause of these problems, which is the lack of control over the mutual information between the sample variable and the latent feature variable during training. To maintain mutual information during training, we propose an auxiliary softmax multi-classification network structure that improves the training of VAE, named VAE-AS. We test the VAE-AS model on the MNIST and Omniglot data sets. The test results show that VAE-AS has a clear effect on adjusting mutual information and on solving the posterior collapse problem.

1 Introduction

In recent years, generative models have played an important role in machine learning fields such as natural language processing (NLP) and image processing. Whereas supervised learning is used mainly for pattern recognition and classification, unsupervised and semi-supervised generative models are used more for image synthesis, automatic text response, and similar tasks. The Variational Autoencoder (VAE) (Kingma & Welling, 2013) is a typical generative model: an encoder encodes the observed samples, and a decoder then restores the observations. The feature space of a VAE has good properties. Compared with the observed variables, the compressed latent encoding provides dimensionality reduction and noise immunity, and the latent variables capture and summarize the features of the data better. At the same time, a VAE can generate new images or text by sampling in the latent space and decoding the samples. VAE thus provides a good, scalable computing framework for tasks such as transfer learning.

In VAE, the posterior distribution is intractable, so variational inference is used to approximate it. VAE assumes a simple prior distribution over the latent variables and uses the KL-Divergence to measure the difference between the approximate posterior and the prior distribution of the latent variable. From a machine learning perspective, the requirement on the prior can be seen as a regularization term, which simplifies calculations and shapes the coding space. The learning goals of VAE contain two contradictory aspects. On the one hand, we require the code generated by the encoder to reflect the distribution of the actual data faithfully. On the other hand, we also require the distribution of the code to be as simple as possible so that new samples are easy to generate. (Bowman et al., 2015; Chen et al., 2016) describe the problem of posterior collapse in VAE: the decoder tends to ignore the latent variable $\mathbf{z}$ and generate $\mathbf{x}$ directly. As mentioned in (Dosovitskiy & Brox, 2016), VAE tends to produce blurred images on complex natural image datasets.

Current methods for solving these problems fall into three types. The first replaces the simple normal prior of VAE with a more complex and accurate distribution for the latent code. When more is known about the actual data distribution, a more explicit prior can be used; for example, (Xu & Durrett, 2018) uses the von Mises-Fisher distribution instead of the standard normal prior. MCMC is more accurate than variational inference but requires many iterations, and how to combine the advantages of both in machine learning is studied in (Salimans et al., 2015; Hoffman, 2017). The second type of approach modifies the training process. In the early stages of training, the encoder has not yet established the association between the latent variable $\mathbf{z}$ and the observed variable $\mathbf{x}$, which leads to posterior collapse. In (Sønderby et al., 2016; Higgins et al., 2017; Burgess et al., 2018), a weight coefficient is placed on the KL-Divergence term of the VAE loss function, and an annealing mechanism makes the KL-Divergence $D_{KL}[q_{\phi}(\mathbf{z}\mid\mathbf{x})\,||\,p_{\theta}(\mathbf{z})]$ play a small role at the beginning. Once the encoder has established an association between $\mathbf{z}$ and $\mathbf{x}$, the alignment requirement between the approximate posterior and the prior is gradually increased. (He et al., 2019) replaces the simultaneous training of encoder and decoder with sequential updates of the decoder and encoder parameters. The third method adds a compensation term to the original optimization goal of VAE. (Zhao et al., 2017a; Dieng et al., 2018; Ma et al., 2019) add a regularization term based on the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ to the loss function of VAE. This method is highly targeted, but the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ is difficult to calculate and is generally estimated by Monte Carlo sampling. (Tolstikhin et al., 2017; Zhao et al., 2017a) use an MMD-based penalty to measure the difference between the two distributions. (Dieng et al., 2018) proposes a method based on skip models to optimize mutual information. (Ma et al., 2019) estimates mutual information based on the mutual posterior-divergence (MPD).

In this article, (1) based on the calculation process of VAE, we analyze in detail the causes of encoding collapse and blurry reconstructions. (2) Within the framework of the vanilla VAE, we propose a novel method, denoted VAE-AS, that estimates the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ and the marginal KL-Divergence $D_{KL}(q_{\phi}(\mathbf{z})\,||\,p_{\theta}(\mathbf{z}))$ with an auxiliary softmax multi-classifier. (3) We test the validity of VAE-AS on the MNIST and Omniglot data sets and evaluate the quality of the latent variable space from various aspects.

2 Background

2.1 VAE

VAE models the joint distribution of the latent variable $\mathbf{z}$ and the observed variable $\mathbf{x}$ and obtains the marginal distribution of $\mathbf{x}$ from it. Consider a data set $\bm{X}=\{\bm{x}^{(i)}\}_{i=1}^{N}$ consisting of $N$ i.i.d. samples. After introducing the latent variable $\mathbf{z}$, the joint distribution of $\mathbf{x}$ and $\mathbf{z}$ is $p_{\theta}(\mathbf{x},\mathbf{z})$. The true distribution of $\mathbf{x}$ can be obtained by computing the marginal distribution $p_{\theta}(\mathbf{x})=\int p_{\theta}(\mathbf{x}\mid\mathbf{z})\,p_{\theta}(\mathbf{z})\,\mathrm{d}\mathbf{z}$. However, this integral is often intractable, so VAE uses the approximate distribution $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ and solves the problem by variational inference.

The marginal log-likelihood can be written as the sum of the log-likelihoods of the individual sample points, $\log p_{\theta}(\bm{x}^{(1)},\cdots,\bm{x}^{(N)})=\sum_{i=1}^{N}\log p_{\theta}(\bm{x}^{(i)})$. Combining this with Jensen's inequality, for each sample $\bm{x}^{(i)}$,

$$
\begin{aligned}
\log p_{\theta}(\bm{x}^{(i)}) &= \log\int p_{\theta}(\bm{x}^{(i)},\mathbf{z})\,\mathrm{d}\mathbf{z} = \log\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})}\left[\frac{p_{\theta}(\bm{x}^{(i)},\mathbf{z})}{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})}\right] \\
&\geq \mathbb{E}_{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})}\left[\log p_{\theta}(\bm{x}^{(i)},\mathbf{z})-\log q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\right] = \mathcal{L}_{ELBO}(\theta,\phi;\bm{x}^{(i)})
\end{aligned}
\tag{1}
$$

where the joint distribution is $p_{\theta}(\mathbf{x},\mathbf{z})=p_{\theta}(\mathbf{x}\mid\mathbf{z})\,p_{\theta}(\mathbf{z})$. The evidence lower bound can then be written as

$$
\mathcal{L}_{ELBO}(\theta,\phi;\bm{X}) = \sum_{i=1}^{N}\left[\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})}\left[\log p_{\theta}(\bm{x}^{(i)}\mid\mathbf{z})\right]-D_{KL}\left[q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\ ||\ p_{\theta}(\mathbf{z})\right]\right]
\tag{2}
$$

It can be seen from equation 2 that the loss function of VAE consists of two parts. One part is the reconstruction error of the observation $\bm{x}^{(i)}$, and the other is the difference between the posterior distribution $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ and the prior distribution $p_{\theta}(\mathbf{z})$.

The prior distribution of the latent variable $\mathbf{z}$ chosen by VAE is denoted $p_{\theta}(\mathbf{z})$. It is generally assumed to be a multidimensional standard normal distribution $\mathcal{N}(\mathbf{z};\bm{0},\bm{I})$, which makes sampling from the model easy. The complete calculation of VAE includes two processes, encoding and decoding. During encoding, the encoder is trained to fit the conditional posterior distribution of $\mathbf{z}$ after $\mathbf{x}$ is observed, denoted $q_{\phi}(\mathbf{z}\mid\mathbf{x})$. The KL-Divergence $D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x})\ ||\ p_{\theta}(\mathbf{z}))$ measures the similarity between the prior and the conditional posterior distribution of the latent variable. That is, the posterior distribution generated by the encoder is required to be as close as possible to a standard normal distribution. During decoding in training, a sample $\bm{z}$ is drawn from the approximate posterior $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ (when new data are generated, $\bm{z}\sim p_{\theta}(\mathbf{z})$ is drawn from the prior instead), and the decoder $p_{\theta}(\mathbf{x}\mid\bm{z})$ produces a reconstruction $\hat{\bm{x}}$ of $\bm{x}$. The difference between $\bm{x}$ and $\hat{\bm{x}}$ enters the loss function, whose gradient is used to update the parameters $\theta,\phi$.

The calculation of $\log p_{\theta}(\bm{x}^{(i)},\mathbf{z})$ in VAE depends on sampling. Specifically, for each sample $\bm{x}^{(i)}$, the VAE encoder outputs $\mu(\bm{x}^{(i)})$ and $\sigma(\bm{x}^{(i)})$, which define the posterior distribution $q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})=\mathcal{N}(\mathbf{z};\mu(\bm{x}^{(i)}),\sigma(\bm{x}^{(i)}))$ of the latent variable $\mathbf{z}$. VAE then draws $L$ samples and calculates

$$
\tilde{\mathcal{L}}(\theta,\phi;\bm{X})\simeq\frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{L}\sum_{l=1}^{L}\left(\log p_{\theta}(\bm{x}^{(i)}\mid\bm{z}^{(l)})\right)-D_{KL}\left[q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\ ||\ p_{\theta}(\mathbf{z})\right]\right]
$$

To keep the loss calculation differentiable and enable training by gradient backpropagation, VAE uses the reparameterization trick during sampling. First sample the random variable $\varepsilon^{(l)}$ from $\mathcal{N}(0,1)$, then let $\bm{z}^{(l)}=\mu(\bm{x}^{(i)})+\sigma(\bm{x}^{(i)})\odot\varepsilon^{(l)}$.
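The sampling step and the per-sample KL term can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, $\sigma$ is treated as a per-dimension standard deviation, and the closed-form KL expression is the standard result for a diagonal Gaussian against $\mathcal{N}(\bm{0},\bm{I})$, which the text does not spell out.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))


# Hypothetical encoder outputs for one sample x^(i) with a 2-dimensional latent space.
rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 2.0]
draws = [reparameterize(mu, sigma, rng) for _ in range(20000)]
```

Because the randomness enters only through `eps`, the draw is a deterministic, differentiable function of `mu` and `sigma`, which is what lets gradients reach the encoder in an automatic-differentiation framework.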

2.2 Disadvantages of VAE

For VAE, the latent variable space needs sufficient feature coding ability to express complex real distributions. The sample space is often discrete, but the encoder needs to smooth and fill the gaps between discrete samples in order to generate new pictures or text. At the same time, the coding space must be conducive to sampling and likelihood calculations, which requires the prior distribution $p_{\theta}(\mathbf{z})$ to be simple.

The VAE encoder encodes each discrete sample $\bm{x}^{(i)}$ of $\bm{X}$ into a continuous latent random variable $\mathbf{z}$. Each $\bm{x}^{(i)}$ is thus mapped to a continuous region of the distribution defined by $q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})=\mathcal{N}(\mathbf{z};\mu(\bm{x}^{(i)}),\sigma(\bm{x}^{(i)}))$. Intuitively, in equation 2, VAE wants the distributions of $\mathbf{z}$ produced by encoding different $\bm{x}^{(i)}$ to differ strongly, in order to reduce the reconstruction error. But minimizing the KL-Divergence term pushes every $q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})$ toward the same standard normal distribution. These two optimization directions are contradictory. When VAE tries to reduce the reconstruction error, it necessarily increases the differences among the code distributions $q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})$, and the KL-Divergence increases. When the achievable reduction in reconstruction error is smaller than the accompanying increase in KL-Divergence, VAE simply reduces the KL-Divergence and forgoes the error reduction.

Consider an experiment: when a set of completely random images is fed to VAE for learning, the KL-Divergence quickly tends to 0. Because the images contain no regularities, the achievable reduction in reconstruction error is so small that its optimization never starts at all.

There are two cases of coding collapse. The first occurs in the early stage of model training: because the decoder parameters have not yet been optimized, or because the training samples are too random, a strong KL-Divergence constraint sets a high threshold for optimization, and it is easier for the model to optimize only the KL-Divergence. From a more general view, (He et al., 2019) rewrites the ELBO as

$$
\mathcal{L}_{ELBO}(\theta,\phi;\bm{X})=\sum_{i=1}^{N}\left[\log p_{\theta}(\bm{x}^{(i)})-D_{KL}\left[q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\ ||\ p_{\theta}(\mathbf{z}\mid\bm{x}^{(i)})\right]\right]
$$

The first term on the right-hand side is the log-likelihood of the decoder's distribution, which is to be maximized. The KL-Divergence in the second term characterizes the similarity between $p_{\theta}(\mathbf{z}\mid\mathbf{x})$ and $q_{\phi}(\mathbf{z}\mid\mathbf{x})$. At the beginning of training, the encoder is not yet good, the correlation between $\mathbf{z}$ and $\mathbf{x}$ is weak, and the two can be regarded as independent. When $p_{\theta}(\mathbf{z})$ is simple, minimizing the KL-Divergence locks $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ directly onto $p_{\theta}(\mathbf{z})$. That is, $\mathbf{z}$ is independent of $\mathbf{x}$, $q_{\phi}(\mathbf{z}_{d}\mid\bm{x}^{(i)})=q_{\phi}(\mathbf{z}_{d})=\mathcal{N}(\mathbf{z}_{d};0,1)$, and similarly $p_{\theta}(\mathbf{z}_{d}\mid\bm{x}^{(i)})=p_{\theta}(\mathbf{z}_{d})=\mathcal{N}(\mathbf{z}_{d};0,1)$, so the $D_{KL}$ term vanishes on this dimension.

The other type of collapse arises when the encoding lacks correlation with the data, or when reconstruction needs little of that correlation. As a result, the dimension of the feature space exceeds the number of active latent variables. Image and language data are often autocorrelated, for example the value of a pixel in an image and the values of the surrounding pixels. With a strong decoder that exploits this autocorrelation, $\mathbf{x}$ can be recovered without $\mathbf{z}$. As Table 1 shows, the decoding ability of the decoder grows as layers are added to the decoder network. The encoder $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ constructs $\mu(\bm{x})$ from $\bm{x}$, and the number of dimensions of $\mu(\bm{x})$ with large variance keeps declining. This shows that the more complex the decoder, the less of the information contained in $\mathbf{z}$ is used.

Table 1: Number of active units of $\mathbf{z}$ as a function of the number of decoder network layers

| Decoder layers | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Active units, $\text{Cov}_{p(\mathbf{x})}\left(\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}[z_{d}]\right)\geq 0.01$ | 23 | 21 | 17 | 12 | 8 |

From the above, to prevent posterior collapse we need to increase the correlation between $\mathbf{z}$ and $\mathbf{x}$ in the joint distribution so that $\mathbf{z}$ and $\mathbf{x}$ do not become independent. This correlation is mainly controlled by the mutual information. The problem of blurred images is also related to mutual information. As mentioned in (Zhao et al., 2017b), the optimal solution of the decoder's reconstruction loss for a given $q_{\phi}$ is $\mathbb{E}_{q_{\phi}(\mathbf{x}\mid\bm{z})}\left[\mathbf{x}\right]$. That is, the image reconstructed by VAE is the expectation of a set of images, which is related to the smoothness of the latent variable distribution. The mutual information is directly related to the entropy of $q_{\phi}(\mathbf{x}\mid\bm{z})$. We prefer to regard blurriness as a trade-off rather than a problem. According to the work on information bottlenecks (Shwartz-Ziv & Tishby, 2017; Alemi et al., 2016), the mutual information $I_{q_{\phi}}(Z,X)$ of a neural network decreases in the later stage of training, which is in fact an optimization that reduces overfitting. When the mutual information is reduced, the model is smoother, more robust to noise, and interpolates better between samples. As the mutual information increases, the reconstruction error becomes smaller, but the generalization performance becomes worse. Therefore, what we need is to control the mutual information according to the model task, rather than simply to increase or decrease it.

3 Method

3.1 Mutual Information

According to (Hoffman & Johnson, 2016; Dieng et al., 2018), the KL-Divergence term in the ELBO of the VAE can be expressed as follows (see the appendix for details):

$$
\frac{1}{N}\sum_{i=1}^{N}D_{KL}(q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\ ||\ p_{\theta}(\mathbf{z}))=\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})+D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))
$$

where $\mathcal{I}_{q_{\phi}}$ is the mutual information between $\mathbf{x}$ and $\mathbf{z}$, which measures the correlation between $\mathbf{x}$ and $\mathbf{z}$, and $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$ is the KL-Divergence between the marginal code distribution $q_{\phi}(\mathbf{z})$ and the prior distribution $p_{\theta}(\mathbf{z})$.
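The decomposition can be checked numerically on a toy problem. The sketch below assumes a one-dimensional latent variable, three hypothetical encoder outputs $(\mu,\sigma)$ with $\sigma$ read as a standard deviation, and simple midpoint-rule integration; none of these choices comes from the paper.

```python
import math


def normal_pdf(z, mu, sigma):
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, lo=-12.0, hi=12.0, steps=4000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


def kl(p, q):
    """KL(p || q) for two densities given as callables."""
    def integrand(z):
        pz = p(z)
        return pz * math.log(pz / q(z)) if pz > 0.0 else 0.0
    return integrate(integrand)


# Hypothetical encoder outputs (mu, sigma) for N = 3 samples.
params = [(-1.5, 0.6), (0.0, 0.8), (2.0, 0.5)]
n = len(params)
posteriors = [lambda z, m=m, s=s: normal_pdf(z, m, s) for m, s in params]
prior = lambda z: normal_pdf(z, 0.0, 1.0)
aggregated = lambda z: sum(q(z) for q in posteriors) / n  # q_phi(z)

avg_kl = sum(kl(q, prior) for q in posteriors) / n           # left-hand side
mutual_info = sum(kl(q, aggregated) for q in posteriors) / n  # I_q(z, x)
marginal_kl = kl(aggregated, prior)                           # KL(q(z) || p(z))
```

Here `avg_kl` and `mutual_info + marginal_kl` agree up to floating-point error, and `mutual_info` lies between 0 and $\log N$.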

In order to control the mutual information and the marginal KL-Divergence more precisely, we rewrite the ELBO as

$$
\hat{\mathcal{L}}(\theta,\phi;\bm{X})\simeq\frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{L}\sum_{l=1}^{L}\left(\log p_{\theta}(\bm{x}^{(i)}\mid\bm{z}^{(l)})\right)\right]-\alpha\,\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})-\beta\,D_{KL}\left[q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z})\right]
\tag{3}
$$

where $\alpha$ and $\beta$ are Lagrange multipliers. By adjusting $\alpha$ and $\beta$, we can adjust the performance of VAE-AS to meet different optimization requirements.
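As a minimal sketch of how the three terms of equation 3 combine, assuming they have already been estimated as plain numbers (the function name and the example values are hypothetical):

```python
def vae_as_objective(recon_log_lik, mutual_info, marginal_kl, alpha=1.0, beta=1.0):
    """Equation 3: reconstruction term minus the weighted MI and marginal-KL terms."""
    return recon_log_lik - alpha * mutual_info - beta * marginal_kl


# Made-up term values, for illustration only.
plain = vae_as_objective(-90.0, 2.0, 0.5)                  # alpha = beta = 1
reweighted = vae_as_objective(-90.0, 2.0, 0.5, alpha=0.5)  # weaker penalty on MI
```

With $\alpha=\beta=1$ the two penalties add up to the average per-sample KL term, so the objective has the same form as the sampled ELBO $\tilde{\mathcal{L}}$. Choosing $\alpha<1$ penalizes mutual information less than the vanilla objective does.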

The calculation of mutual information and marginal KL-Divergence involves the marginal distribution $q_{\phi}(\mathbf{z})$. For research objects such as images and texts, the dimension is often large and the probability of each individual sample is small. Since one sample is selected at a time, the distribution over the samples $\{\bm{x}^{(1)},\cdots,\bm{x}^{(N)}\}$ can be treated as a discrete categorical distribution $Cat(c^{(1)},\cdots,c^{(N)})$, where $c^{(1)}=\cdots=c^{(N)}=\frac{1}{N}$. We denote this categorical distribution by $q_{\phi}(\mathbf{x})$. Since $q_{\phi}(\mathbf{x}=\bm{x}^{(i)})=\frac{1}{N}$, the empirical marginal can be derived as $q_{\phi}(\mathbf{z})=\sum_{i=1}^{N}q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\,q_{\phi}(\mathbf{x}=\bm{x}^{(i)})=\frac{1}{N}\sum_{i=1}^{N}q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})$. $q_{\phi}(\mathbf{z})$ is called the aggregated posterior distribution (Makhzani et al., 2015), and it is difficult to obtain in closed form. The mutual information and marginal KL-Divergence are therefore generally estimated by sampling (Hoffman & Johnson, 2016; Dieng et al., 2018).

We explore the mutual information $\mathcal{I}_{q_{\phi}}$ from the perspective of the empirical distribution. According to the definition of mutual information, $\mathcal{I}_{q_{\phi}}$ can be rewritten as:

$$
\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})=\mathbb{E}_{q_{\phi}(\mathbf{x})}\left[\int q_{\phi}(\mathbf{z}\mid\mathbf{x})\log\frac{q_{\phi}(\mathbf{z}\mid\mathbf{x})}{q_{\phi}(\mathbf{z})}\,\mathrm{d}\mathbf{z}\right]
\tag{4}
$$

For the integral term in equation 4, since $q_{\phi}(\mathbf{z})=\frac{1}{N}\sum_{i=1}^{N}q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})$, we can approximate it with $M$ samples in each training batch, $q_{\phi}(\mathbf{z})\simeq\frac{1}{M}\sum_{i=1}^{M}q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})$. Unfortunately, this estimate is biased, so we use another method to calculate the mutual information.

3.2 MINE

Theorem 3.1 (Donsker-Varadhan representation).

The KL divergence between any two distributions $\mathbb{P}$ and $\mathbb{Q}$, with $\mathbb{P}\ll\mathbb{Q}$, admits the following dual representation (Donsker & Varadhan, 1983):

$$
D_{KL}(\mathbb{P}\,||\,\mathbb{Q})=\sup_{T:\Omega\rightarrow\mathbb{R}}\mathbb{E}_{\mathbb{P}}[T]-\log\left(\mathbb{E}_{\mathbb{Q}}[\exp(T)]\right)
\tag{5}
$$

where the supremum is taken over all functions $T$ such that the two expectations are finite.
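For discrete distributions the representation is easy to check directly. The sketch below uses two made-up distributions on three outcomes. The choice $T^{*}=\log(p/q)$ attains the supremum, and any other $T$ gives a smaller or equal value.

```python
import math


def kl_discrete(p, q):
    """KL(P || Q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def dv_objective(t, p, q):
    """Donsker-Varadhan objective E_P[T] - log E_Q[exp(T)]."""
    e_p = sum(pi * ti for pi, ti in zip(p, t))
    e_q = sum(qi * math.exp(ti) for qi, ti in zip(q, t))
    return e_p - math.log(e_q)


# Illustrative distributions (not from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
t_star = [math.log(pi / qi) for pi, qi in zip(p, q)]  # maximizer of the objective

best = dv_objective(t_star, p, q)                 # equals KL(P || Q)
other = dv_objective([1.0, -0.5, 0.3], p, q)      # an arbitrary T, at most KL(P || Q)
```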

Building on Theorem 3.1, MINE (Belghazi et al., 2018) was proposed as a method for estimating mutual information with a neural network. If $p(\bm{x}),q(\bm{x})$ are the density functions of $\mathbb{P},\mathbb{Q}$ respectively, then for any subclass $\mathcal{F}$ of such functions $T$ we have

$$
\mathcal{I}(\mathbf{x},\mathbf{z})\geq\sup_{T\in\mathcal{F}}\Big\{\mathbb{E}_{p(\mathbf{x},\mathbf{z})}[T(\mathbf{x},\mathbf{z})]-\log\Big(\mathbb{E}_{p(\mathbf{x})p(\mathbf{z})}\big[\exp\big(T(\mathbf{x},\mathbf{z})\big)\big]\Big)\Big\}
\tag{6}
$$

where $T\in\mathcal{F}$ can be fitted with a neural network $T_{\psi}$ parameterized by $\psi\in\Psi$. The expectations in the above lower bound can then be estimated by Monte Carlo sampling. After sampling, we get

$$
\hat{\psi}=\arg\sup_{\psi\in\Psi}\Big\{\mathbb{E}_{q(\mathbf{x},\mathbf{z})}[T_{\psi}(\mathbf{x},\mathbf{z})]-\log\Big(\mathbb{E}_{q(\mathbf{x})q(\mathbf{z})}\big[\exp\big(T_{\psi}(\mathbf{x},\mathbf{z})\big)\big]\Big)\Big\}
\tag{7}
$$

We then obtain the Mutual Information Neural Estimator (MINE):

$$
\hat{\mathcal{I}}(\mathbf{x},\mathbf{z})=\mathbb{E}_{q(\mathbf{x},\mathbf{z})}[T_{\hat{\psi}}(\mathbf{x},\mathbf{z})]-\log\Big(\mathbb{E}_{q(\mathbf{x})q(\mathbf{z})}\big[\exp\big(T_{\hat{\psi}}(\mathbf{x},\mathbf{z})\big)\big]\Big)
\tag{8}
$$

3.3 Auxiliary SOFTMAX multiclassifier

In this paper, we add an auxiliary softmax multi-classifier to VAE to compute the mutual information $\mathcal{I}_{q_{\phi}}$; the resulting model is called VAE-AS. The mutual information can be written as

$$
\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})=\log N-\mathbb{E}_{q_{\phi}(\mathbf{z})}\left[\mathcal{H}\left(q_{\phi}(\mathbf{x}\mid\mathbf{z})\right)\right]
\tag{9}
$$

To calculate the mutual information, we need $q_{\phi}(\mathbf{x}\mid\mathbf{z})$ and its entropy. According to Bayes' theorem, $q_{\phi}(\mathbf{x}\mid\mathbf{z})=\frac{q_{\phi}(\mathbf{z}\mid\mathbf{x})\,q_{\phi}(\mathbf{x})}{q_{\phi}(\mathbf{z})}$. With $q_{\phi}(\mathbf{x}=\bm{x}^{(i)})=\frac{1}{N}$ and $q_{\phi}(\mathbf{z})=\frac{1}{N}\sum_{i=1}^{N}q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})$, we obtain

$$
q_{\phi}(\mathbf{x}=\bm{x}^{(i)}\mid\mathbf{z})=\frac{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})}{\sum_{j=1}^{N}q_{\phi}(\mathbf{z}\mid\bm{x}^{(j)})}
\tag{10}
$$
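When $N$ is tiny, equations 9 and 10 can be evaluated exactly, which gives a reference point for what the auxiliary classifier is meant to approximate. The sketch below assumes one-dimensional Gaussian encoders with hypothetical parameters and estimates the expectation over $q_{\phi}(\mathbf{z})$ by Monte Carlo. It is not the procedure used in VAE-AS itself.

```python
import math
import random


def log_normal_pdf(z, mu, sigma):
    return -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def index_posterior(z, params):
    """Equation 10: q(x = x_i | z) is proportional to q(z | x_i)."""
    logs = [log_normal_pdf(z, m, s) for m, s in params]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information_mc(params, draws_per_sample, rng):
    """Equation 9 by Monte Carlo: log N - E_{q(z)}[ H(q(x | z)) ]."""
    n = len(params)
    total_entropy = 0.0
    for m, s in params:  # x_i uniform, then z ~ q(z | x_i), so z ~ q(z)
        for _ in range(draws_per_sample):
            z = rng.gauss(m, s)
            total_entropy += entropy(index_posterior(z, params))
    return math.log(n) - total_entropy / (n * draws_per_sample)


rng = random.Random(0)
collapsed = mutual_information_mc([(0.0, 1.0)] * 4, 200, rng)                    # identical codes
separated = mutual_information_mc([(-50.0, 1.0), (0.0, 1.0), (50.0, 1.0)], 200, rng)
```

Identical encoder outputs give zero mutual information (the collapsed case), while well-separated ones approach the maximum $\log N$.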

As the discussion in the previous section shows, it is still difficult to calculate equation 10 directly. We use the auxiliary softmax multi-classifier to fit $q_{\phi}(\mathbf{x}\mid\mathbf{z})$. First number the samples, $\{\bm{x}^{(1)},\ldots,\bm{x}^{(N)}\}\mapsto\{1,\cdots,N\}$. This is equivalent to assigning a label $\bm{e}^{(i)}$ to each sample $\bm{x}^{(i)}$, where $\bm{e}^{(i)}$ is the one-hot vector $[0,\dots,0,1,0,\dots,0]$ with a 1 at position $i$, i.e.

$$
\bm{e}_{j}^{(i)}=\begin{cases}1 & \text{if } j=i\\ 0 & \text{if } j\neq i\end{cases}
$$
Figure 1: VAE-AS Network Structure

Based on the structure of VAE, we add a multi-layer neural network $s_{\omega}(\hat{\mathbf{e}}\mid\mathbf{z})$ to fit the distribution $q_{\phi}(\mathbf{x}\mid\mathbf{z})$. The network $s_{\omega}(\hat{\mathbf{e}}\mid\mathbf{z})$ is trained in a supervised manner with the one-hot variable $\bm{e}$ as the label, and its output activation is the softmax function.

Each sampled $\bm{z}^{(l)}$ is transformed by a multi-layer neural network: let $\bm{h}^{(l)}_{0}=\bm{z}^{(l)}$ and $\bm{h}^{(l)}_{t}=g(\omega_{t-1}^{T}\bm{h}^{(l)}_{t-1})$, where $g(\cdot)$ is an activation function such as $\tanh$ or ReLU. The output is $\bm{h}^{(l)}_{o}=\omega_{t}^{T}\bm{h}^{(l)}_{t}$, where $\bm{h}^{(l)}_{o}$ is an $N$-dimensional vector whose $i$-th element is $\bm{h}^{(l)}_{o,i}$. The auxiliary softmax multi-classifier can be written as:

$$
s_{\omega}(\hat{\mathbf{e}}_{i}=1\mid\bm{z}^{(l)})=\frac{\exp(\bm{h}^{(l)}_{o,i})}{\sum_{j=1}^{N}\exp(\bm{h}^{(l)}_{o,j})}
\tag{11}
$$

Since the serial number can be seen as a random tag $\bm{e}^{(i)}$ allocated to each sample $\bm{x}^{(i)}$, the loss function is the cross entropy of the softmax output, i.e.

$$
loss_{s}=-\frac{1}{L}\sum_{l=1}^{L}\sum_{j=1}^{N}\bm{e}_{j}^{(i)}\log s_{\omega}(\hat{\mathbf{e}}_{j}=1\mid\bm{z}^{(l)})
\tag{12}
$$
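Equations 11 and 12 can be sketched as follows. The logits stand in for the classifier output $\bm{h}^{(l)}_{o}$, and the numeric values are made up for illustration. Because the label is one-hot, the inner sum over $j$ in equation 12 reduces to the single term at the sample's own index.

```python
import math


def softmax(logits):
    """Equation 11, with the usual max-subtraction for numerical stability."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(logits_per_draw, label_index):
    """Equation 12 for one sample x^(i): average of -log s(e_i = 1 | z^(l)) over L draws."""
    losses = [-math.log(softmax(h)[label_index]) for h in logits_per_draw]
    return sum(losses) / len(losses)


# Hypothetical classifier outputs for L = 2 latent draws of sample i = 1 (0-based), N = 3.
logits_per_draw = [[0.2, 2.0, -1.0], [0.0, 1.5, 0.3]]
loss = cross_entropy_loss(logits_per_draw, label_index=1)
uninformative = cross_entropy_loss([[0.0, 0.0, 0.0]], label_index=1)  # equals log N
```

A classifier that cannot tell samples apart from $\mathbf{z}$ sits at loss $\log N$, which corresponds to zero mutual information in equation 9.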

By minimizing the cross-entropy loss, we expect $s_{\omega}(\hat{\mathbf{e}}\mid\mathbf{z})$ to gradually converge to $q_{\phi}(\mathbf{x}\mid\mathbf{z})$. After fitting the distribution $q_{\phi}(\mathbf{x}\mid\mathbf{z})$, we can calculate the entropy $\mathcal{H}\left(q_{\phi}(\mathbf{x}\mid\mathbf{z})\right)$:

$$
\mathcal{H}\left(q_{\phi}(\mathbf{x}\mid\mathbf{z})\right)=-\sum_{i=1}^{N}q_{\phi}(\mathbf{x}=\bm{x}^{(i)}\mid\mathbf{z})\log q_{\phi}(\mathbf{x}=\bm{x}^{(i)}\mid\mathbf{z})
\tag{13}
$$

We can then calculate the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ with equation 9.

To calculate $D_{KL}\left(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z})\right)$, we need $q_{\phi}(\mathbf{z})$, and the divergence is then estimated by Monte Carlo sampling. After using the auxiliary softmax to estimate $q_{\phi}(\mathbf{x}=\bm{x}^{(i)}\mid\mathbf{z})$, we can estimate $q_{\phi}(\mathbf{z})$ by Bayes' formula:

$$
q_{\phi}(\mathbf{z})=\frac{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})}{N\,q_{\phi}(\mathbf{x}=\bm{x}^{(i)}\mid\mathbf{z})}
\tag{14}
$$
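Equation 14 can be sanity-checked on a toy case where the exact posterior over sample indices is available. In VAE-AS that posterior would come from the softmax classifier. Here it is computed from equation 10, and the one-dimensional Gaussian encoders are hypothetical.

```python
import math


def normal_pdf(z, mu, sigma):
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def aggregated_from_bayes(z, i, params, index_probs):
    """Equation 14: q(z) = q(z | x_i) / (N * q(x = x_i | z))."""
    m, s = params[i]
    return normal_pdf(z, m, s) / (len(params) * index_probs[i])


# Hypothetical encoder outputs (mu, sigma) for N = 3 samples and one latent point z.
params = [(-1.0, 0.7), (0.5, 1.0), (2.0, 0.4)]
z = 0.8
densities = [normal_pdf(z, m, s) for m, s in params]
index_probs = [d / sum(densities) for d in densities]  # exact equation 10
mixture = sum(densities) / len(params)                 # direct definition of q(z)
estimates = [aggregated_from_bayes(z, i, params, index_probs) for i in range(len(params))]
```

With exact index probabilities every choice of $i$ returns the same value, equal to the mixture density. With a learned classifier the values would differ by its approximation error.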

After estimating the mutual information and the marginal divergence, we can calculate the ELBO as in equation 3.

3.4 VAE-AS and MINE

Theorem 3.2.

VAE-AS and MINE are equivalent in the optimization goal and estimation of mutual information.

Proof.

Step 1: we prove that the optimization goal of VAE-AS is equivalent to the optimization goal of MINE. The output $\bm{h}$ of the classifier $s_{\psi}$ is an $N$-dimensional vector whose $i$-th component corresponds to the observation $\bm{x}^{(i)}$. $\bm{h}^{(l)}$ can therefore be regarded as a function of $\bm{x}^{(i)}$ and $\bm{z}^{(l)}$, $\bm{h}^{(l)}_{i}=T_{\psi}(\bm{x}^{(i)},\bm{z}^{(l)})$, and from equation 12 the optimization goal can be written as

$$
\max\frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L}\log\frac{\exp(T_{\psi}(\bm{x}^{(i)},\bm{z}^{(l)}))}{\sum_{j=1}^{N}\exp(T_{\psi}(\bm{x}^{(j)},\bm{z}^{(l)}))}
$$

Expanding the logarithm changes the objective only by the constant $\log\frac{1}{N}$, which does not affect the optimization, so the optimization goal is equivalent to

$$
\max\frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L}T_{\psi}(\bm{x}^{(i)},\bm{z}^{(l)})-\frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L}\log\Big(\frac{1}{N}\sum_{j=1}^{N}\exp\big(T_{\psi}(\bm{x}^{(j)},\bm{z}^{(l)})\big)\Big)
$$

Since $\bm{z}^{(l)}$ is sampled from $q(\mathbf{z}\mid\bm{x}^{(i)})$, the first term converges to $\mathbb{E}_{q(\mathbf{x},\mathbf{z})}[T_{\psi}(\mathbf{x},\mathbf{z})]$. In the second term, $T_{\psi}$ is evaluated for every $\bm{x}^{(j)}$ given $\bm{z}^{(l)}$, so $\bm{z}^{(l)}$ is independent of $\bm{x}^{(j)}$, and the second term converges to $\log\left(\mathbb{E}_{q(\mathbf{x})q(\mathbf{z})}\left[\exp(T_{\psi}(\mathbf{x},\mathbf{z}))\right]\right)$.
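The expansion of the logarithm used in Step 1 can be written out for a single pair $(i,l)$. The shorthand $T_{j}=T_{\psi}(\bm{x}^{(j)},\bm{z}^{(l)})$ is introduced here only for brevity:

```latex
\begin{aligned}
\log\frac{\exp(T_{i})}{\sum_{j=1}^{N}\exp(T_{j})}
  &= T_{i}-\log\sum_{j=1}^{N}\exp(T_{j}) \\
  &= T_{i}-\log\Big(\frac{1}{N}\sum_{j=1}^{N}\exp(T_{j})\Big)-\log N .
\end{aligned}
```

Averaging over $i$ and $l$ gives the second form of the objective plus the constant $-\log N=\log\frac{1}{N}$, which does not depend on $\psi$.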

Step 2: When $\psi=\hat{\psi}$, $s_{\hat{\psi}}(\mathbf{z})$ can be seen as a fit of $q(\mathbf{x}\mid\mathbf{z})$. From equation 9,

$$
\hat{\mathcal{I}}_{q}(\mathbf{x},\mathbf{z})=\mathbb{E}_{q(\mathbf{z})}\Big[\sum_{i=1}^{N}\hat{q}(\bm{x}^{(i)}\mid\mathbf{z})\,T_{\hat{\psi}}(\bm{x}^{(i)},\mathbf{z})-\sum_{i=1}^{N}\hat{q}(\bm{x}^{(i)}\mid\mathbf{z})\log\Big(\frac{1}{N}\sum_{j=1}^{N}\exp\big(T_{\hat{\psi}}(\bm{x}^{(j)},\mathbf{z})\big)\Big)\Big]
$$

where the first term converges to $\mathbb{E}_{q(\mathbf{x},\mathbf{z})}[T_{\hat{\psi}}(\mathbf{x},\mathbf{z})]$ and the remainder converges to $\log\left(\mathbb{E}_{q(\mathbf{x})q(\mathbf{z})}\left[\exp(T_{\hat{\psi}}(\mathbf{x},\mathbf{z}))\right]\right)$.

3.5 VAE-AS optimization

Calculating mutual information by equation 13 requires traversing all samples, which is computationally burdensome. The following theorem gives an approximate estimate.

Theorem 3.3.

For each $\bm{x}^{(i)}$, denote the probability of a multi-classifier prediction error, $\mathbb{P}(\hat{\mathbf{e}}_{i}\neq\mathbf{e}_{i}\mid\mathbf{z})=s_{\omega}(\hat{\bm{e}}^{(i)}\neq\bm{e}^{(i)}\mid\mathbf{z})$, by $\mathbb{P}_{e}^{(i)}$, and let $\mathbb{P}_{e}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{P}_{e}^{(i)}$. For VAE-AS, let

$$\hat{\mathcal{I}}(\mathbf{z},\mathbf{x})=\log N+\mathbb{E}_{q_{\phi}(\mathbf{z})}\left[\mathbb{P}_{e}^{(i)}\log\mathbb{P}_{e}^{(i)}+(1-\mathbb{P}_{e}^{(i)})\log(1-\mathbb{P}_{e}^{(i)})\right]$$

be an estimate of $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$. Then

$$\hat{\mathcal{I}}(\mathbf{z},\mathbf{x})-\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})\leq\mathbb{P}_{e}\log N.$$

Proof.

According to the network structure, $\mathbf{x}\rightarrow\mathbf{z}\rightarrow\hat{\mathbf{e}}$ constitutes a Markov chain. From Fano's inequality we know that, for each $\bm{x}^{(i)}$,

$$\mathcal{H}(\mathbb{P}_{e}^{(i)})+\mathbb{P}_{e}^{(i)}\log N\geq\mathcal{H}\big(q_{\phi}(\mathbf{x}=\bm{x}^{(i)}\mid\mathbf{z})\big)$$

$$\log N-\mathcal{H}(\mathbb{P}_{e}^{(i)})-\mathbb{P}_{e}^{(i)}\log N\leq\log N-\mathcal{H}\big(q_{\phi}(\mathbf{x}=\bm{x}^{(i)}\mid\mathbf{z})\big)$$

where

$$\mathcal{H}(\mathbb{P}_{e}^{(i)})=-s_{\omega}(\hat{\mathbf{e}}^{(i)}=\mathbf{e}^{(i)}\mid\mathbf{z})\log s_{\omega}(\hat{\mathbf{e}}^{(i)}=\mathbf{e}^{(i)}\mid\mathbf{z})-s_{\omega}(\hat{\mathbf{e}}^{(i)}\neq\mathbf{e}^{(i)}\mid\mathbf{z})\log s_{\omega}(\hat{\mathbf{e}}^{(i)}\neq\mathbf{e}^{(i)}\mid\mathbf{z}).$$

From equation 9 we obtain

$$\hat{\mathcal{I}}(\mathbf{z},\mathbf{x})-\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})\leq\mathbb{P}_{e}\log N.$$

Intuitively, Theorem 3.3 simplifies the entropy calculation of a multinomial distribution by treating all of the error probability as a single mass. This introduces a bias of at most $\mathbb{P}_{e}\log N$. As optimization proceeds, the prediction error rate $\mathbb{P}_{e}$ gradually decreases, and the gap between $\hat{\mathcal{I}}(\mathbf{z},\mathbf{x})$ and $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ shrinks toward 0.
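As a rough illustration of the estimate in Theorem 3.3, the sketch below computes $\log N$ minus the mean binary entropy $\mathcal{H}(\mathbb{P}_{e}^{(i)})$ defined in the proof, together with the gap bound $\mathbb{P}_{e}\log N$. It assumes that the expectation over $q_{\phi}(\mathbf{z})$ is replaced by an average over a finite array of per-sample error probabilities; the function name and arguments are hypothetical and natural logarithms are used.

```python
import numpy as np

def mi_estimate_from_errors(error_probs, n_classes):
    """Return (log N - mean binary entropy of the error probabilities, P_e * log N)."""
    p = np.clip(np.asarray(error_probs, dtype=float), 0.0, 1.0)
    # Binary entropy in nats, using the convention 0 * log 0 = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = -(np.where(p > 0, p * np.log(p), 0.0)
                    + np.where(p < 1, (1 - p) * np.log(1 - p), 0.0))
    estimate = np.log(n_classes) - entropy.mean()
    gap_bound = p.mean() * np.log(n_classes)  # the P_e log N term of the theorem
    return estimate, gap_bound
```

When the classifier makes no errors, the estimate equals $\log N$ and the bound on the gap is 0, which is the limiting case described above.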

When the sample size is large, the softmax multi-classification is expensive to compute: estimating $q_{\phi}(\mathbf{z})$ requires the sum of the probabilities of all categories, i.e. the normalizing denominator of the softmax function. To reduce the computational load of the softmax function, we use a binary tree, similar to the Huffman tree method used in (Morin & Bengio, 2005).

For each $\bm{x}^{(i)}$ we define a tag $\bm{e}^{(i)}$, where $\bm{e}$ is a one-hot random vector. For a sample $\bm{z}^{(l)}$, the output vector of the classifier network $s_{\omega}$ is $\bm{h}^{(l)}$. We construct a binary tree with $V$ layers in which each leaf node represents one component of the one-hot vector; every two child nodes share one parent node, and the tree finally shrinks to the root node. The layers before the leaf nodes are denoted $\bm{t}^{(1)},\dots,\bm{t}^{(V)}$, with $\bm{t}^{(l,v)}=\omega^{(v)}\bm{h}^{(l)}$, where $\omega^{(v)}$ is a parameter matrix. The vector length increases layer by layer, with up to $2^{v}$ nodes in layer $v$. It follows that for $N$ samples, $V\geq\log_{2}\frac{N}{2}$. Except for leaf nodes, each node component $t^{(v)}_{j}$ has one left and one right child node (edge nodes may have only one child). The path from the parent node to the child node is denoted $d_{j}^{(v)}$: when $d_{j}^{(v)}=1$ the node $t_{j}^{(v)}$ selects the path to the left child, and when $d_{j}^{(v)}=0$ it selects the right child.

We use the sigmoid activation function $\sigma(\cdot)$ to calculate the probability of the two paths:

$$\mathbb{P}(d^{(v)}_{j}=1\mid\bm{z}^{(l)})=\sigma(t^{(l,v)}_{j})$$

Correspondingly, $\mathbb{P}(d_{j}^{(v)}=0\mid\bm{z}^{(l)})=1-\sigma(t^{(l,v)}_{j})$. For a one-hot vector $\bm{e}^{(i)}$ there is a unique path in the binary tree leading to its corresponding leaf node. The nodes on this path are $t_{j_{1}}^{(l,1)},\dots,t_{j_{V}}^{(l,V)}$, and the child selected at each node $t_{j_{v}}^{(l,v)}$ is $d_{j_{v}}^{(l,v)}$. The fitted probability of the classifier is then

$$s_{\omega}(\hat{\bm{e}}=\bm{e}^{(i)}\mid\bm{z}^{(l)})=\prod_{v=1}^{V}\mathbb{P}(d_{j_{v}}^{(l,v)}\mid\bm{z}^{(l)})$$

where $d_{j_{v}}^{(l,v)}$ is $0$ or $1$. The loss function is constructed by comparing $\hat{\bm{e}}$ with $\bm{e}$, and the parameters $\omega$ in the binary tree are optimized to fit the softmax function.

When the full softmax function is used for the approximation, the computational complexity is $O(hN)$, where $h$ is the number of classifier output features. With the multidimensional binary labels of the tree, the computational complexity is reduced to $O(h\log_{2}N)$.
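The path-product probability can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's code: the tree is taken to be complete with `depth` decision levels and $2^{\text{depth}}$ leaves, internal nodes are stored in heap order, and each internal node has its own weight row (the per-layer matrices $\omega^{(v)}$ would stack these rows). All names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def leaf_probability(h, node_weights, leaf, depth):
    """Probability of reaching `leaf` (0 .. 2**depth - 1) given classifier features h.

    node_weights has shape (2**depth - 1, len(h)): one row per internal node,
    stored in heap order (root = 0, children of node k are 2k+1 and 2k+2).
    """
    prob, node = 1.0, 0
    for level in range(depth):
        # Bit of the leaf index at this level: 0 -> left child, 1 -> right child.
        go_right = (leaf >> (depth - 1 - level)) & 1
        p_left = sigmoid(node_weights[node] @ h)  # P(d = 1 | z): take the left path
        prob *= (1.0 - p_left) if go_right else p_left
        node = 2 * node + 1 + go_right
    return prob
```

Each leaf probability needs only `depth` dot products of length $h$, which is where the $O(h\log_{2}N)$ cost comes from, and the leaf probabilities sum to 1 without an explicit normalizing denominator.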

In the structure of VAE, when the distributions to which two samples $\bm{x}^{(i)}$ and $\bm{x}^{(j)}$ are mapped lie far apart in the latent variable space, i.e. $\bm{x}^{(i)}$ and $\bm{x}^{(j)}$ are quite different, assigning both the same label does not affect the system's fit of $q_{\phi}(\mathbf{x}\mid\mathbf{z})$. In other words, the number of label categories we randomly assign to the samples can be smaller than the actual number of samples. Suppose we select $V$ tag categories (here $V$ denotes the number of categories, not the number of tree layers above). Referring to the entropy $\mathcal{H}(q_{\phi}(\mathbf{x}\mid\mathbf{z}))$ in equation 13, the maximum entropy of the fitted distribution is $\log V$, which is attained when the fitted distribution is completely indistinguishable from the uniform distribution over $\mathbf{x}$. Since $0<\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})<\log N$ (see Appendix for details), the maximum value of the mutual information $\mathcal{I}_{q_{\phi}}$ and the maximum value of $\mathbb{E}_{q_{\phi}(\mathbf{z})}\mathcal{H}(q_{\phi}(\mathbf{x}\mid\mathbf{z}))$ are the same, namely $\log N$. The maximum mutual information that the auxiliary softmax multi-classifier can provide with $V$ tag categories is $\log V$. Since the logarithm grows very slowly, selecting $V<N$ categories does not reduce the mutual information available to VAE-AS by much.
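To make the last point concrete, the short sketch below compares the two upper bounds for the label counts used in the MNIST experiments of Section 5.2 (55000 and 10000). It assumes natural logarithms; the helper name is hypothetical.

```python
import math

def mi_upper_bound(num_labels):
    """Largest mutual information (in nats) available with num_labels label categories."""
    return math.log(num_labels)

full = mi_upper_bound(55000)      # about 10.92 nats
reduced = mi_upper_bound(10000)   # about 9.21 nats
ratio = reduced / full            # fraction of the bound that is retained
```

Cutting the number of labels by more than a factor of five lowers the bound by only about 16 percent.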

4 Related Work

The method for reducing latent variable collapse given in (Dieng et al., 2018) is also built around increasing the model's mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{x},\mathbf{z})$. It has two steps. First, a Skip Network makes the latent variables participate in the computation of every hidden layer output of the decoder. Second, the authors add a mutual information constraint to the loss function. They argue that the mutual information of the Skip Network is at least that of the VAE, i.e. $\mathcal{I}_{q_{\phi}}^{SKIP-VAE}(\mathbf{z},\mathbf{x})\geq\mathcal{I}_{q_{\phi}}^{VAE}(\mathbf{z},\mathbf{x})$.

Ma et al. (2019) improve the learning of the latent features by adding regularization terms to the variational ELBO. The authors propose the mutual KL-Divergence between a pair of data points to measure the diversity of the posteriors:

$$MPD=\mathbb{E}_{\bm{x}^{(1)},\bm{x}^{(2)}\sim p(\mathbf{x})}\left[D_{KL}\left(q_{\phi}(\mathbf{z}\mid\bm{x}^{(1)})\ ||\ q_{\phi}(\mathbf{z}\mid\bm{x}^{(2)})\right)\right]$$

In addition to optimizing the ELBO of the VAE, the optimization goal also requires maximizing $MPD$. Ma et al. (2019) prove that the mutual posterior diversity (MPD) is in fact a symmetric version of the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$. The method of (Ma et al., 2019) is therefore similar to that of (Dieng et al., 2018): both add a mutual information regularization term to the VAE to improve the optimization.

(Tolstikhin et al., 2017; Zhao et al., 2017a) use the Maximum Mean Discrepancy (MMD) to estimate the distance between the distributions $q_{\phi}(\mathbf{z})$ and $p_{\theta}(\mathbf{z})$; the focus remains on estimating $q_{\phi}(\mathbf{z})$. MMD uses a kernel function to measure the distance between different sample points, which is similar in form to the softmax approach.

All of the above methods require $q_{\phi}(\mathbf{z})$ to be computed within the training batch. If the batch_size is too small, the bias of the estimated mutual information $\mathcal{I}_{q_{\phi}}$ is larger. When the batch is large, for example when estimating the distribution difference by MMD with all $N$ samples in one batch, $O(N^{2})$ kernel evaluations are required, and the computational cost is relatively large.

In this paper, VAE-AS removes the batch constraint by constructing a softmax classification network and uses the parameters of the classification network to store the differences between the codes of different samples. By using Hierarchical Softmax, we can effectively reduce the number of costly exponential operations.

5 Empirical Results

5.1 Dataset

To test the actual effect of VAE-AS, we selected two reference image data sets, MNIST (LeCun et al., 1998) and Omniglot (Lake et al., 2013; Burda et al., 2015). The MNIST data set includes 70,000 binary black-and-white handwritten digit images, of which 55,000 are used for training and 10,000 for testing. The Omniglot data set contains 1623 different handwritten characters from 50 different alphabets. The training set has 24345 samples and the test set has 8069 samples. Each character was drawn online by 20 different people. The image size in both data sets is $28\times 28$.

5.2 Estimate Mutual Information and Marginal Divergence

First, we use VAE-AS to estimate the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ and the marginal divergence $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$. For the MNIST data set (LeCun et al., 1998), the encoder, decoder and classifier are MLPs with 2, 3 and 2 layers respectively. Each hidden layer has 500 units, the dimension of $\mathbf{z}$ is 40, the mini-batch size is 100 and training runs for 120 epochs.

From the model, $\mu(\bm{x}^{(i)})$ and $\sigma(\bm{x}^{(i)})$ are generated for each sample $\bm{x}^{(i)}$. We use the Monte Carlo method to estimate $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ and $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$, following (Hoffman & Johnson, 2016). Denoting the number of Monte Carlo samples by $S$, we use $S=55000, 10000, 5000, 1000, 500, 100$ to estimate the mutual information and the marginal divergence.
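A Monte Carlo baseline of this kind can be sketched as below. This is an illustrative reading, not the exact procedure of (Hoffman & Johnson, 2016): it assumes diagonal Gaussian posteriors given by arrays `mu` and `sigma` of shape $(S, D)$, one latent draw per data point, a standard normal prior, and $q_{\phi}(\mathbf{z})$ approximated by the mixture of the $S$ posteriors. All function names are hypothetical.

```python
import numpy as np

def log_normal_diag(z, mu, sigma):
    """Log density of N(mu, diag(sigma^2)) at z; broadcasts over leading axes."""
    return -0.5 * np.sum(((z - mu) / sigma) ** 2 + 2 * np.log(sigma) + np.log(2 * np.pi), axis=-1)

def mc_mi_and_marginal_kl(mu, sigma, rng):
    """Estimate I(z, x) and KL(q(z) || p(z)) from S encoder outputs, one z per sample."""
    z = mu + sigma * rng.standard_normal(mu.shape)       # z_i ~ q(z | x_i)
    # log_q[i, j] = log q(z_i | x_j)
    log_q = log_normal_diag(z[:, None, :], mu[None, :, :], sigma[None, :, :])
    m = log_q.max(axis=1, keepdims=True)
    log_qz = m[:, 0] + np.log(np.mean(np.exp(log_q - m), axis=1))  # log q(z_i)
    log_pz = log_normal_diag(z, 0.0, 1.0)                 # standard normal prior (assumed)
    mi = np.mean(np.diag(log_q) - log_qz)
    marginal_kl = np.mean(log_qz - log_pz)
    return mi, marginal_kl
```

In this sketch the mutual information estimate can never exceed $\log S$, because the mixture always contains the posterior that generated each draw.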

We set the number of softmax labels to the full size of the MNIST training set, 55000, and use VAE-AS to estimate the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ and the marginal divergence $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$. This test is denoted $\text{VAE-AS}_{55000}$. For comparison, we set the number of softmax labels to 10000 and perform a second test, $\text{VAE-AS}_{10000}$. The test results are shown in Figure 2:

[Figure 2 plot. Panels: (a) $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$, (b) $D_{KL}(q(\mathbf{z})\ ||\ p(\mathbf{z}))$. x-axis: epochs; y-axis: Value. Curves: $\text{VAE-AS}_{55000}$, $\text{VAE-AS}_{10000}$, $S=55000$, $S=10000$, $S=5000$, $S=1000$, $S=500$, $S=100$.]
Figure 2: Estimate mutual information and marginal KL-Divergence.
[Figure 3 plot. Panel: (c) KL-Divergence. x-axis: epochs; y-axis: Value. Curves: $\text{SC}_{55000}$, $\text{SC}_{10000}$, $\mathcal{I}_{q}+D_{KL}(q(\mathbf{z})\ ||\ p(\mathbf{z}))$ for 55000 and for 10000 labels, $D_{KL}(q(\mathbf{z}\mid\bm{x})\ ||\ p(\mathbf{z}))$.]
Figure 3: Estimate KL-Divergence.

From Figure 2 we can see that, for the Monte Carlo method, the estimates of both the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ and the marginal divergence $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$ become strongly biased as the number of samples drops. With small samples the curves also fluctuate considerably, which shows that the Monte Carlo method introduces a large variance due to sampling. Since computing resources are generally limited, the mini-batch size cannot be made very large during training. VAE-AS instead fits $q_{\phi}(\mathbf{x}=\bm{x}^{(i)}\mid\mathbf{z})$ and then estimates the mutual information and the marginal divergence. The estimated results are close to those of the large-sample Monte Carlo method. Since VAE-AS only needs a large label vocabulary while the dimension of the classification label is small, its memory requirements can generally be satisfied.

Figure 3 compares the sum of the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ and the marginal divergence $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$ with the KL-Divergence $D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x})\ ||\ p_{\theta}(\mathbf{z}))$. The sum of mutual information and marginal divergence converges well to the mean of the KL-Divergence. In the figure, SC is the softmax cross-entropy.

5.3 Results

We tested the effect of different values of the parameters $\alpha$ and $\beta$ on VAE-AS optimization. The encoder inference network $q_{\phi}(\mathbf{z}\mid\mathbf{x})$, the decoder generation network $p_{\theta}(\mathbf{x}\mid\mathbf{z})$ and the classifier network $s_{\omega}(\mathbf{e}\mid\mathbf{x})$ are all MLPs with multiple hidden layers; the numbers of layers are 2, 5 and 2 respectively. Each hidden layer has 500 units, the dimension of $\mathbf{z}$ is 40, the mini-batch size is 128 and training runs for 300 epochs. No other regularization constraints are used.

Among the test metrics, NLLtest is the negative log-likelihood evaluated on the test set using the method described in (Burda et al., 2015) with 4096 samples. KLtest is the KL-Divergence $D_{KL}\left[q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\ ||\ p_{\theta}(\mathbf{z})\right]$ on the test set. AU is the number of active units of the latent variable $\mathbf{z}$, for which we use the definition in (Burda et al., 2015):

$$AU=\sum_{d=1}^{D}\mathbb{I}\left\{\text{Cov}_{p(\mathbf{x})}\left(\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}[\mathbf{z}_{d}]\right)\geq\varepsilon\right\}$$

where $\mathbb{I}\{\cdot\}$ is the indicator function, $D$ is the dimension of the latent variable $\mathbf{z}$, $\mathbf{z}_{d}$ is element $d$ of $\mathbf{z}$, and $\varepsilon$ is the decision threshold, set to $0.01$. NLLtrain and KLtrain are the reconstruction error and the KL-Divergence on the training set. MI is the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ and MD is the marginal divergence $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$, both estimated by VAE-AS during training. SC is the softmax cross-entropy.
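The AU metric can be computed along the following lines. The sketch assumes that the encoder outputs a posterior mean for every data point, so that $\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}[\mathbf{z}_{d}]$ is simply column $d$ of an array of means, and it uses the population variance over the data set; the function name is hypothetical.

```python
import numpy as np

def active_units(posterior_means, eps=0.01):
    """Count latent dimensions whose posterior mean varies across the data by at least eps."""
    posterior_means = np.asarray(posterior_means, dtype=float)  # shape (num_samples, D)
    variance_per_dim = posterior_means.var(axis=0)
    return int(np.sum(variance_per_dim >= eps))
```

A dimension whose posterior mean is the same for every input carries no information about $\mathbf{x}$ and is not counted, which is why a low AU indicates posterior collapse.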

Table 2: Effect of $\alpha$ and $\beta$ on VAE-AS optimization
Method   $\alpha$  $\beta$  NLLtest  KLtest  AU  NLLtrain  KLtrain  MI     MD     SC
VAE      -         -        100.99   18.39   8   73.45     18.57    -      -      -
VAE-AS   1.0       1.0      96.32    21.84   11  71.66     21.22    10.78  10.66  0.35
VAE-AS   1.0       0.1      116.61   64.70   36  58.39     65.70    10.86  54.89  0.11
VAE-AS   1.0       0.2      106.92   46.99   25  61.16     45.14    10.87  34.35  0.12
VAE-AS   1.0       0.5      96.63    29.52   18  65.64     29.40    10.83  18.71  0.22
VAE-AS   1.0       2.0      101.34   15.68   7   81.61     15.87    10.58  5.78   0.82
VAE-AS   1.0       5.0      141.38   5.35    2   145.04    5.56     3.34   2.27   7.62
VAE-AS   0.1       1.0      97.22    21.53   11  72.13     21.22    10.80  10.63  0.31
VAE-AS   0.2       1.0      97.51    21.23   10  71.84     20.92    10.81  10.32  0.31
VAE-AS   0.5       1.0      98.23    21.57   10  72.23     20.91    10.80  10.32  0.33
VAE-AS   2.0       1.0      96.56    21.13   11  71.85     20.99    10.75  10.48  0.41
VAE-AS   5.0       1.0      97.33    20.38   10  75.94     19.79    10.44  9.80   0.93
VAE-AS   10.0      1.0      104.69   21.84   10  93.05     20.28    9.00   12.60  3.23

The test results are shown in Table 2. The test is divided into two phases. In the first phase, we fix $\alpha=1$ and test the effect of $\beta$ on model training and testing. In the second phase, we fix $\beta=1$ and test the effect of $\alpha$.

In VAE, the marginal divergence $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$ controls the overall shape of the latent variable distribution. The larger $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$, the more scattered the distribution of the latent variable $\mathbf{z}$, i.e. the larger its variance, which is more conducive to training the model. From Table 2 we can see that, when the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ is similar, increasing $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$ is advantageous for reducing the reconstruction error. However, if the encoder marginal distribution is far from the prior distribution, sampling and computation based on the prior distribution give poor quality.

When $\beta$ is fixed to $1$, Table 2 shows that changes in $\alpha$ affect the generalization performance of the model. As can be seen from equation 9, the size of the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ depends on the entropy $\mathcal{H}\left(q_{\phi}(\mathbf{x}\mid\mathbf{z})\right)$, and $\mathcal{H}\left(q_{\phi}(\mathbf{x}\mid\mathbf{z})\right)$ depends on the degree of overlap between any two encoding conditional distributions $q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})$ and $q_{\phi}(\mathbf{z}\mid\bm{x}^{(j)})$. The overlap is most directly related to the encoder output variance $\sigma(\bm{x})$: the smaller $\sigma(\bm{x})$, the smaller the overlap and the greater the mutual information. Accordingly, in Table 2, when $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$ is similar, larger mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ goes with smaller training reconstruction error. However, the decrease in training error caused by increasing $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ applies only to individual training samples and is not reproduced on the test set, which is a case of overfitting. If $\alpha$ is small, the model is more inclined to memorize the appearance of each training sample than to capture the regularities shared across samples: large mutual information means little overlap between the latent variable distributions $q_{\phi}(\mathbf{z}\mid\bm{x})$, so the decoder cannot abstract and summarize these regularities. According to the information bottleneck theory (Alemi et al., 2016), we should minimize the mutual information to improve model generalization while still achieving the optimization goal.
However, the larger the value of $\alpha$, the smaller the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ contained in the latent variable space and the smaller the number of active units, so posterior collapse is more likely to occur. This is because the strong mutual information constraint limits how far the encoder can be optimized.

Figure 4: Comparison of image reconstruction for $\alpha=0.2$ and $\alpha=5.0$. (a) $\alpha=0.2$, reconstructed images from the test set; (b) original images from the test set; (c) $\alpha=5.0$, reconstructed images from the test set; (d) $\alpha=0.2$, reconstructed images from the training set; (e) original images from the training set; (f) $\alpha=5.0$, reconstructed images from the training set.

The effect of $\alpha$ on the model can also be seen in Figure 4. A smaller $\alpha$ places less restriction on mutual information, so for training set samples the model tends to reconstruct more accurate images. Compared with the original images (e), the training reconstruction errors for $\alpha=0.2$ and $\alpha=5.0$ are both small, but panel (d) preserves more of the subtle differences in handwriting than panel (f). On the test set, however, a smaller $\alpha$ causes more errors: panel (a) is sharper than panel (c) but contains some obvious mistakes. This is because, when the mutual information is large, the distributions $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ generated by the codes of different samples $\bm{x}$ overlap very little, and the model tends to restore the closest sample in the training set. The reconstruction details that can be expressed on the training set are not reproducible on the test set, which is an overfitting situation. Thus, a smaller $\alpha$ yields sharper pictures and avoids the blurriness of the VAE model, but it results in poorer generalization. This is a trade-off. If the ground-truth distribution is poorly continuous, the model should choose a smaller $\alpha$ to keep more mutual information in the latent variable space. If the ground-truth distribution is more continuous, or if we need to interpolate between discretely sampled training points, then the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ should be restricted more strongly.

In the training process of VAE-AS, the values of $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ and $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$ influence each other. As can be seen from Table 2, if $\beta>1$, the constraint requires the latent variable distribution $q_{\phi}(\mathbf{z})$ to be closer to $p_{\theta}(\mathbf{z})$, which is generally a standard normal distribution, and the overall shape of the data distribution is often more concentrated. The overlap of the latent variable distributions $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ then increases because the latent space of $\mathbf{z}$ is squeezed, and the mutual information is reduced. As mutual information declines, the accuracy of the decoder reconstruction also decreases. As $\beta$ decreases, the constraint on the marginal divergence $D_{KL}(q_{\phi}(\mathbf{z})\ ||\ p_{\theta}(\mathbf{z}))$ is gradually relaxed. The distribution $q_{\phi}(\mathbf{z})$ is then more likely to be optimized according to the decoder's need to reduce the reconstruction error, and the mutual information during training increases as well.

In Table 2, when $\alpha$ ranges from $0.1$ to $2.0$, the mutual information changes little because of its upper bound. When $\alpha$ is small, although the mutual information $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})$ is already close to the upper bound $\log N$, the looser constraint allows the encoder to easily reduce the output variance $\sigma(\bm{x})$. VAE-AS can then maintain mutual information while reducing marginal divergence. However, VAE-AS must be trained in a smaller encoding space, so training is more difficult and requires more iterations.

From the test results, we can see that the marginal KL-Divergence controls the overall shape of the code distribution, which determines the upper limit of the precision that the model can achieve. Mutual information determines the degree of overlap between the distributions of different latent variables, or equivalently how smoothly the distribution varies between them, and to a certain extent it controls the degree of overfitting.

6 Conclusion

This paper analyzes in detail why VAE suffers from posterior collapse during training. To control the mutual information of the model, we propose randomly assigning labels to samples and identifying these labels with an auxiliary softmax multi-classifier.

Compared with image processing, posterior collapse is more likely to appear in natural language processing (NLP), where decoding networks such as RNNs are more widely used. We will therefore further test the effect of VAE-AS on NLP tasks and RNN decoders. VAE-AS is able to estimate mutual information and marginal divergence accurately in training batches, which facilitates many other machine learning tasks. The information bottleneck theory discusses the optimization process of mutual information in detail, and VAE is often used for disentanglement, which requires fine control of mutual information. In subsequent work we will use VAE-AS to conduct further research in these areas.

References

  • Alemi et al. (2016) Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
  • Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Devon Hjelm, and Aaron Courville. Mutual information neural estimation. In International Conference on Machine Learning, pp. 530–539, 2018.
  • Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
  • Burda et al. (2015) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
  • Burgess et al. (2018) Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in $\beta$-vae. arXiv preprint arXiv:1804.03599, 2018.
  • Chen et al. (2016) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
  • Dieng et al. (2018) Adji B Dieng, Yoon Kim, Alexander M Rush, and David M Blei. Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863, 2018.
  • Donsker & Varadhan (1983) Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
  • Dosovitskiy & Brox (2016) Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in neural information processing systems, pp. 658–666, 2016.
  • He et al. (2019) Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534, 2019.
  • Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. $\beta$-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
  • Hoffman (2017) Matthew D Hoffman. Learning deep latent gaussian models with markov chain monte carlo. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  1510–1519. JMLR. org, 2017.
  • Hoffman & Johnson (2016) Matthew D Hoffman and Matthew J Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
  • Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Lake et al. (2013) Brenden M Lake, Ruslan R Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in neural information processing systems, pp. 2526–2534, 2013.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Ma et al. (2019) Xuezhe Ma, Chunting Zhou, and Eduard Hovy. Mae: Mutual posterior-divergence regularization for variational autoencoders. arXiv preprint arXiv:1901.01498, 2019.
  • Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
  • Morin & Bengio (2005) Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pp.  246–252. Citeseer, 2005.
  • Salimans et al. (2015) Tim Salimans, Diederik Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pp. 1218–1226, 2015.
  • Shwartz-Ziv & Tishby (2017) Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016), 2016.
  • Tolstikhin et al. (2017) Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
  • Xu & Durrett (2018) Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoencoders. arXiv preprint arXiv:1808.10805, 2018.
  • Zhao et al. (2017a) Shengjia Zhao, Jiaming Song, and Stefano Ermon. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017a.
  • Zhao et al. (2017b) Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017b.

7 Appendix

7.1 ELBO Rewrite

For the samples $\bm{x}^{(1)},\cdots,\bm{x}^{(N)}$, consider $q_{\phi}(\mathbf{x}=\bm{x}^{(i)})=\frac{1}{N}$.

$$
\begin{aligned}
&\quad\ \mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x}) \\
&= \sum_{i=1}^{N}\int q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\,q_{\phi}(\bm{x}^{(i)})\log\left[\frac{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\,q_{\phi}(\bm{x}^{(i)})}{q_{\phi}(\bm{x}^{(i)})\,q_{\phi}(\mathbf{z})}\right]\text{d}\mathbf{z} \\
&= \frac{1}{N}\sum_{i=1}^{N}\int q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\log\frac{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})}{q_{\phi}(\mathbf{z})}\,\text{d}\mathbf{z} \\
&\quad\ \mathcal{I}_{q_{\phi}}(\mathbf{x},\mathbf{z})+D_{KL}\left(q_{\phi}(\mathbf{z})\,||\,p_{\theta}(\mathbf{z})\right) \\
&= \mathcal{I}_{q_{\phi}}(\mathbf{x},\mathbf{z})+\int q_{\phi}(\mathbf{z})\log\frac{q_{\phi}(\mathbf{z})}{p_{\theta}(\mathbf{z})}\,\text{d}\mathbf{z} \\
&= \mathcal{I}_{q_{\phi}}(\mathbf{x},\mathbf{z})+\frac{1}{N}\sum_{i=1}^{N}\int q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\log\frac{q_{\phi}(\mathbf{z})}{p_{\theta}(\mathbf{z})}\,\text{d}\mathbf{z} \\
&= \frac{1}{N}\sum_{i=1}^{N}\int q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\log\frac{q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})}{p_{\theta}(\mathbf{z})}\,\text{d}\mathbf{z} \\
&= \frac{1}{N}\sum_{i=1}^{N}D_{KL}\left(q_{\phi}(\mathbf{z}\mid\bm{x}^{(i)})\,||\,p_{\theta}(\mathbf{z})\right)
\end{aligned}
$$
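The identity above can be checked numerically. The following is a minimal sketch, assuming a discrete latent variable with `K` states so that the integrals become sums; the names `q_z_given_x`, `p_z`, and `kl` are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 4  # N samples; K discrete latent states (an assumption for illustration)

# Row i holds q(z | x^(i)); p_z plays the role of the prior p(z).
q_z_given_x = rng.dirichlet(np.ones(K), size=N)
p_z = rng.dirichlet(np.ones(K))


def kl(a, b):
    """KL divergence between two strictly positive discrete distributions."""
    return float(np.sum(a * np.log(a / b)))


# Aggregate posterior q(z) = (1/N) * sum_i q(z | x^(i)), since q(x^(i)) = 1/N.
q_z = q_z_given_x.mean(axis=0)

# Mutual information I(x, z) = (1/N) * sum_i KL(q(z | x^(i)) || q(z)).
mutual_info = float(np.mean([kl(row, q_z) for row in q_z_given_x]))
# Marginal term KL(q(z) || p(z)).
marginal_kl = kl(q_z, p_z)
# Right-hand side: (1/N) * sum_i KL(q(z | x^(i)) || p(z)).
mean_kl = float(np.mean([kl(row, p_z) for row in q_z_given_x]))

print(mutual_info + marginal_kl, mean_kl)
```

The two printed values should agree up to floating-point error for any choice of the conditional distributions and the prior.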

7.2 Mutual information value range

$$
\begin{aligned}
\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x}) &= \mathbb{E}_{q_{\phi}(\mathbf{x},\mathbf{z})}\left[\log\frac{q_{\phi}(\mathbf{x},\mathbf{z})}{q_{\phi}(\mathbf{x})\,q_{\phi}(\mathbf{z})}\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z})}\left[\int q_{\phi}(\mathbf{x}\mid\mathbf{z})\log\frac{q_{\phi}(\mathbf{x}\mid\mathbf{z})}{q_{\phi}(\mathbf{x})}\,\text{d}\mathbf{x}\right] \\
&= \log N-\mathbb{E}_{q_{\phi}(\mathbf{z})}\left[\mathcal{H}\left(q_{\phi}(\mathbf{x}\mid\mathbf{z})\right)\right]
\end{aligned}
$$

When $\mathbf{x}$ is independent of $\mathbf{z}$, $\mathbb{E}_{q_{\phi}(\mathbf{z})}\left[\mathcal{H}\left(q_{\phi}(\mathbf{x}\mid\mathbf{z})\right)\right]=\mathcal{H}\left(q_{\phi}(\mathbf{x})\right)=\log N$, so $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})=0$. When $\mathbf{x}$ and $\mathbf{z}$ are in one-to-one correspondence, $q_{\phi}(\mathbf{x}\mid\mathbf{z})=1$, $\mathcal{H}_{q_{\phi}}(\mathbf{x}\mid\mathbf{z})=0$, and $\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})=\log N$. Thus,

$$
0\leq\mathcal{I}_{q_{\phi}}(\mathbf{z},\mathbf{x})=\log N-\mathbb{E}_{q_{\phi}(\mathbf{z})}\left[\mathcal{H}\left(q_{\phi}(\mathbf{x}\mid\mathbf{z})\right)\right]\leq\log N
$$
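The two boundary cases and the bound can be illustrated with a small numerical example. This is a sketch only, assuming a discrete latent variable and $q_{\phi}(\bm{x}^{(i)})=\frac{1}{N}$; the helper `mutual_information` is a hypothetical name introduced here.

```python
import numpy as np


def mutual_information(q_z_given_x):
    """I(z, x) for a discrete latent, with q(x^(i)) = 1/N over the N rows."""
    q_z = q_z_given_x.mean(axis=0)  # aggregate posterior q(z)
    # Compute q(z | x) / q(z) only where q(z | x) > 0, using 0 * log 0 = 0.
    ratio = np.ones_like(q_z_given_x)
    np.divide(q_z_given_x, q_z, out=ratio, where=q_z_given_x > 0)
    return float(np.mean(np.sum(q_z_given_x * np.log(ratio), axis=1)))


rng = np.random.default_rng(0)
N = 6

# Independent case: every sample maps to the same distribution over z.
shared = rng.dirichlet(np.ones(N))
mi_independent = mutual_information(np.tile(shared, (N, 1)))

# One-to-one case: sample i maps deterministically to latent state i.
mi_one_to_one = mutual_information(np.eye(N))

# Generic case: arbitrary conditionals fall between the two extremes.
mi_generic = mutual_information(rng.dirichlet(np.ones(N), size=N))

print(mi_independent, mi_generic, mi_one_to_one, np.log(N))
```

Up to floating-point error, the independent case gives zero, the one-to-one case gives $\log N$, and the generic case lies between them.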

7.3 Results for Omniglot

Table 3: Results on the Omniglot dataset
| Method | $\alpha$ | $\beta$ | NNL$_{test}$ | KL$_{test}$ | AU | NNL$_{train}$ | KL$_{train}$ | MI | MD | SC |
|---|---|---|---|---|---|---|---|---|---|---|
| VAE | - | - | 147.21 | 9.33 | 3 | 133.70 | 9.25 | - | - | - |
| VAE-AS | 1.0 | 0.1 | 158.54 | 81.14 | 40 | 61.31 | 82.70 | 10.10 | 72.61 | 0.00 |
| VAE-AS | 1.0 | 0.2 | 144.05 | 61.90 | 34 | 63.96 | 61.59 | 10.10 | 51.50 | 0.01 |
| VAE-AS | 1.0 | 0.5 | 133.44 | 37.22 | 21 | 70.27 | 37.59 | 10.09 | 27.52 | 0.02 |
| VAE-AS | 1.0 | 1.0 | 140.08 | 22.81 | 10 | 83.91 | 22.56 | 10.07 | 12.53 | 0.07 |
| VAE-AS | 1.0 | 2.0 | 146.47 | 8.38 | 3 | 135.03 | 8.56 | 6.17 | 2.56 | 4.10 |
| VAE-AS | 1.0 | 5.0 | 161.17 | 2.35 | 1 | 161.56 | 2.66 | 1.83 | 0.84 | 8.28 |
| VAE-AS | 1.0 | 10.0 | 174.49 | 1.46 | 0 | 177.22 | 2.09 | 0.59 | 1.55 | 9.56 |
| VAE-AS | 0.1 | 1.0 | 136.82 | 23.23 | 11 | 82.03 | 23.45 | 10.07 | 13.41 | 0.06 |
| VAE-AS | 0.2 | 1.0 | 140.38 | 21.17 | 9 | 86.22 | 21.49 | 10.07 | 11.45 | 0.06 |
| VAE-AS | 0.5 | 1.0 | 137.52 | 21.57 | 10 | 84.68 | 22.13 | 10.07 | 12.10 | 0.07 |
| VAE-AS | 2.0 | 1.0 | 140.60 | 21.16 | 9 | 86.02 | 21.27 | 10.06 | 11.25 | 0.08 |
| VAE-AS | 5.0 | 1.0 | 133.10 | 20.97 | 10 | 86.42 | 21.31 | 10.04 | 11.35 | 0.13 |
| VAE-AS | 10.0 | 1.0 | 175.63 | 41.15 | 7 | 145.96 | 39.76 | 4.22 | 36.84 | 7.18 |
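The reported columns can be compared against the decomposition in Section 7.1. The sketch below assumes that MD is an estimate of the marginal term $D_{KL}\left(q_{\phi}(\mathbf{z})\,||\,p_{\theta}(\mathbf{z})\right)$ and that MI is measured on the training set; this section does not define either column, so both readings are assumptions. Under them, KL$_{train}$ should be close to MI + MD. The numbers are copied from the VAE-AS rows of Table 3.

```python
# (alpha, beta, KL_train, MI, MD) copied from the VAE-AS rows of Table 3.
# Assumption: MD estimates KL(q(z) || p(z)); the column is not defined here.
rows = [
    (1.0, 0.1, 82.70, 10.10, 72.61),
    (1.0, 0.2, 61.59, 10.10, 51.50),
    (1.0, 0.5, 37.59, 10.09, 27.52),
    (1.0, 1.0, 22.56, 10.07, 12.53),
    (1.0, 2.0, 8.56, 6.17, 2.56),
    (1.0, 5.0, 2.66, 1.83, 0.84),
    (1.0, 10.0, 2.09, 0.59, 1.55),
    (0.1, 1.0, 23.45, 10.07, 13.41),
    (0.2, 1.0, 21.49, 10.07, 11.45),
    (0.5, 1.0, 22.13, 10.07, 12.10),
    (2.0, 1.0, 21.27, 10.06, 11.25),
    (5.0, 1.0, 21.31, 10.04, 11.35),
    (10.0, 1.0, 39.76, 4.22, 36.84),
]

# Residual of the identity KL_train = MI + MD for each configuration.
residuals = [kl_train - (mi + md) for _, _, kl_train, mi, md in rows]

for (alpha, beta, *_), residual in zip(rows, residuals):
    print(f"alpha={alpha:<5} beta={beta:<5} KL_train - (MI + MD) = {residual:+.2f}")
```

With these numbers, twelve of the thirteen configurations agree to within 0.2, while the $\alpha=10.0$, $\beta=1.0$ row differs by about 1.3. The section does not say how the columns were estimated, so the gap is not interpreted here.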
Figure 5: Comparison of image reconstruction on the Omniglot dataset for $\alpha=0.2$ and $\alpha=5.0$. (a) Reconstructed images from the test set, $\alpha=0.2$. (b) Original images from the test set. (c) Reconstructed images from the test set, $\alpha=5.0$. (d) Reconstructed images from the training set, $\alpha=0.2$. (e) Original images from the training set. (f) Reconstructed images from the training set, $\alpha=5.0$.