Discond-VAE: Disentangling Continuous Factors from the Discrete
Abstract
In real-world data, there are variations shared by all classes of a discrete factor (e.g. category label) and variations exclusive to each class. We propose a variant of VAE capable of disentangling both kinds of variation. To represent these generative factors of data, we introduce two sets of continuous latent variables, the private variable and the public variable. Our proposed framework models the private variable as a Mixture of Gaussians and the public variable as a Gaussian, respectively. Each mode of the private variable is responsible for a class of the discrete variable.
Most previous attempts to integrate discrete generative factors into disentanglement assume statistical independence between the continuous and discrete variables. Our proposed model, which we call Discond-VAE, DISentangles the class-dependent CONtinuous factors from the Discrete factors by introducing the private variables. The experiments show that Discond-VAE can discover the private and public factors from data. Moreover, even on a dataset with only public factors, Discond-VAE does not fail and adapts the private variables to represent the public factors.
1 Introduction
Learning a disentangled representation of data without supervision has been considered an important task for representation learning [1]. Although there are diverse quantitative measures of disentanglement [5, 8, 12, 16, 19, 30], most qualitative interpretations of disentanglement agree on statistical independence between the basic elements of the representation. In other words, each element of a disentangled representation corresponds to only one generative factor of data while being invariant to the others. Hence, a disentangled representation is naturally a concise and explainable feature of data. Various VAE-based models have been proposed to obtain more disentangled representations in an unsupervised way [5, 7, 8, 9, 12, 16, 17, 19].
In particular, JointVAE [7] introduced discrete latent variables as well as continuous variables to represent the generative factors of data. Real-world data have intrinsically discrete generative factors, such as digit-type in the MNIST dataset. Therefore, it is natural to adopt a discrete variable to obtain a disentangled representation of those generative factors. However, JointVAE has the limitation of assuming independence between the continuous and discrete variables.

The independence assumption of JointVAE is too restrictive for the real world. For example, consider the CelebA dataset [23]. CelebA has 40 attribute labels, including Male (Gender) and Mustache. In this case, a continuous generative factor representing the Mustache volume is not independent of a discrete factor for Gender. Hence, the class-independent continuous variable of JointVAE cannot properly represent the Mustache volume. From a similar perspective, [25] proposed a decomposition to generalize the independence assumption of disentanglement. [25] defines the decomposition of the latent variable as imposing a desired structure, such as sparsity or clustering, on the aggregate posterior.
In this paper, we propose a new VAE model called Discond-VAE. Instead of imposing independence between the continuous and discrete factors, we propose learning the independent and dependent continuous factors jointly. Discond-VAE splits the continuous latent variable into two groups, the private and public variables. We refer to each category of the discrete generative factor of data as a class. The private variable represents variation within each class, and the public variable encodes the generative factors common to all classes. Therefore, Discond-VAE is able to represent the intra-class variation while keeping the capacity to represent the class-independent generative factors as in JointVAE.
Following this intuitive interpretation, we assume the public variable is independent of the discrete and private variables. The public and private variables are modeled by a Gaussian distribution and a Mixture of Gaussians, respectively. Each mode of the private variable corresponds to a class of the discrete variable. The experiments demonstrate that Discond-VAE can extract the private and public variables from data qualitatively and quantitatively.
1.1 Contribution
- We propose a new VAE model called Discond-VAE. To the best of our knowledge, Discond-VAE is the first VAE model to represent the public and private continuous generative factors and the discrete generative factors at the same time.
- We propose a CondSprites dataset reassembled from the dSprites [26] to evaluate the disentanglement of private and public variables. The CondSprites dataset is designed to mimic the class-independent and class-dependent generative factors of the real world.
- The existing disentanglement metrics assume the continuity of the latent variables and the independence of the generative factors. To integrate the discrete latent variable and the class-dependent continuous variable into the disentanglement metrics, we propose a conditional disentanglement evaluation.
- We assess Discond-VAE on the CondSprites, dSprites, MNIST, and CelebA datasets. The experiments show that Discond-VAE can disentangle the public and private factors qualitatively and quantitatively.
2 Background
2.1 VAE
Variational Autoencoder (VAE, [18, 29]) is a probabilistic model that learns a joint distribution of the observed data $x$ and a continuous latent variable $z$. VAE models the joint distribution by

$p_\theta(x, z) = p(z)\, p_\theta(x \mid z)$  (1)

$p(z) = \mathcal{N}(z;\, 0, I)$  (2)

where $p_\theta(x \mid z)$ is a probabilistic distribution model with distribution parameters $\theta$. $p_\theta(x \mid z)$ is often referred to as the decoder.
Given a dataset $X = \{x^{(i)}\}_{i=1}^{N}$, VAE applies variational inference by introducing an inference network $q_\phi(z \mid x)$, which is often referred to as the encoder. The encoder approximates the true posterior $p_\theta(z \mid x)$ with a factorized Gaussian distribution whose parameters are encoded by the neural network. The encoder and the decoder are simultaneously optimized by maximizing the Evidence Lower Bound (ELBO) $\mathcal{L}$.

$\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$  (3)
The first term of the ELBO is the reconstruction loss, which encourages the VAE to encode latent variables informative enough to reconstruct the data $x$. The second term regularizes the posterior distribution by pushing the encoded latent variables toward the prior $p(z)$. From this point of view, $\beta$-VAE [12] scales the regularization term of the ELBO by $\beta$. The $\beta$-VAE induces more disentangled representations by matching the encoded latent variables to the factorized Gaussian prior under higher pressure [4, 12].

$\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \beta\, D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$  (4)
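For illustration, the $\beta$-VAE objective above can be computed in closed form for a diagonal Gaussian posterior and a standard Gaussian prior. The following is a minimal numpy sketch, not the authors' code; the function names `gaussian_kl` and `beta_vae_loss` are ours.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def beta_vae_loss(recon_log_likelihood, mu, log_var, beta=4.0):
    """Negative of the beta-VAE objective (Eq 4): reconstruction term plus
    the KL regularizer scaled by beta."""
    return -recon_log_likelihood + beta * gaussian_kl(mu, log_var)
```

When the posterior equals the prior ($\mu = 0$, $\sigma = 1$), the KL term vanishes and the loss reduces to the negative reconstruction log-likelihood.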
2.2 JointVAE
VAE and $\beta$-VAE employ only continuous latent variables, in particular a factorized Gaussian, to model the latent variable $z$. JointVAE [7] generalizes VAE and $\beta$-VAE by introducing discrete variables to disentangle the generative factors of the observed data.
Let $z$ be a continuous latent variable and $c$ be a discrete latent variable. By assuming conditional independence between the continuous and discrete latent variables, i.e. $q_\phi(z, c \mid x) = q_\phi(z \mid x)\, q_\phi(c \mid x)$, and an independent prior, i.e. $p(z, c) = p(z)\, p(c)$, JointVAE derived an optimization objective (Eq 5) from the $\beta$-VAE objective (Eq 4). To prevent the posterior collapse phenomenon of the discrete latent variable, JointVAE applied the capacity control [4] to the objective.

$\mathcal{L} = \mathbb{E}_{q_\phi(z, c \mid x)}[\log p_\theta(x \mid z, c)] - \gamma\, \big|\, D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) - C_z \,\big| - \gamma\, \big|\, D_{KL}\big(q_\phi(c \mid x) \,\|\, p(c)\big) - C_c \,\big|$  (5)
Since the sampling process from the categorical distribution is non-differentiable, the reparametrization trick cannot be applied directly to the discrete encoder $q_\phi(c \mid x)$. To address this problem, JointVAE employed a differentiable relaxation of discrete variable sampling called the Gumbel-Softmax distribution [11, 13, 24]. The Gumbel-Softmax provides a continuous approximation of the categorical distribution.
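The Gumbel-Softmax relaxation can be sketched as follows: perturb the logits with Gumbel(0, 1) noise, divide by a temperature, and take a softmax. This is an illustrative numpy sketch under our own naming, not the paper's implementation.

```python
import numpy as np

def sample_gumbel_softmax(logits, temperature=0.5, rng=None):
    """Draw a relaxed one-hot sample from a categorical distribution.
    Lower temperature makes the sample closer to a hard one-hot vector."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via the inverse-CDF transform of a uniform sample
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = y - y.max()          # subtract the max for numerical stability
    y = np.exp(y)
    return y / y.sum()       # softmax: entries are positive and sum to one
```

The output lies on the probability simplex, so gradients can flow through it, unlike a hard categorical sample.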
3 Discond-VAE
In this section, we describe the motivation and probabilistic formulation of Discond-VAE. Then, we describe how the probabilistic formulation is instantiated by a neural network.
3.1 Motivation
Although JointVAE [7] extends the capability of VAE to encode discrete factors, JointVAE has the limitation of assuming independence between continuous and discrete variables. This assumption usually does not hold for general datasets (e.g. CelebA [23] or ImageNet [31]). Therefore, we generalize the assumption.
Consider a generative modeling problem with the observed data $x$, the discrete generative factor $c$, and a set of continuous factors. Some of the continuous factors may be independent of the discrete factor $c$, but the others may not. We refer to the former, independent continuous factors as public generative factors and the latent variable representing them as the public variable $z_{pb}$. Likewise, we call the latter, dependent continuous factors private factors and the latent variable representing them the private variable $z_{pv}$.
For example, consider a synthetic dataset of 2D Square and Ellipse images. For each shape, the images vary in scale and orientation. Also, the Square images vary in x-position and the Ellipse images vary in y-position. In this dataset, a latent variable that encodes scale and orientation should be independent of the discrete variable which encodes shape. However, the latent variables representing x- and y-position should be dependent on the discrete variable. In short, this dataset has the public factors of scale and orientation, and the private factors of x- and y-position. We refer to this dataset as CondSprites and use it to evaluate Discond-VAE in Sec 4.3.
3.2 Model
3.2.1 Probabilistic Model
We propose a modification to JointVAE [7] whose latent variable is composed of the discrete, public, and private variables. Since the public variable represents generative factors shared by all classes and the private variable represents variation within each class, it is natural to assume that the prior factorizes into $p(z_{pb})$ and $p(z_{pv}, c)$.

$p(z_{pb}, z_{pv}, c) = p(z_{pb})\, p(z_{pv} \mid c)\, p(c)$  (6)
Likewise, the variational distribution is modeled as the following.

$q_\phi(z_{pb}, z_{pv}, c \mid x) = q_\phi(z_{pb} \mid x)\, q_\phi(z_{pv} \mid x, c)\, q_\phi(c \mid x)$  (7)
For our Discond-VAE model, the $\beta$-VAE objective (Eq 4) becomes

$\mathcal{L} = \mathbb{E}_{q_\phi(z_{pb}, z_{pv}, c \mid x)}[\log p_\theta(x \mid z_{pb}, z_{pv}, c)] - \beta\, D_{KL}\big(q_\phi(z_{pb}, z_{pv}, c \mid x) \,\|\, p(z_{pb}, z_{pv}, c)\big)$  (8)
The former log-likelihood term stands for the reconstruction error as in the previous VAE models. The latter KL divergence regularizer can be decomposed by the independence assumption.

$D_{KL}\big(q_\phi(z_{pb}, z_{pv}, c \mid x) \,\|\, p(z_{pb}, z_{pv}, c)\big) = D_{KL}\big(q_\phi(z_{pb} \mid x) \,\|\, p(z_{pb})\big) + D_{KL}\big(q_\phi(z_{pv}, c \mid x) \,\|\, p(z_{pv}, c)\big)$  (9)
Then, we can address the latter KL divergence as the following. (See appendix for proof)

$D_{KL}\big(q_\phi(z_{pv}, c \mid x) \,\|\, p(z_{pv}, c)\big) = \mathbb{E}_{q_\phi(c \mid x)}\big[ D_{KL}\big(q_\phi(z_{pv} \mid x, c) \,\|\, p(z_{pv} \mid c)\big) \big] + D_{KL}\big(q_\phi(c \mid x) \,\|\, p(c)\big)$  (10)
In brief, the learning objective of Discond-VAE (Eq 8) is expressed as

$\mathcal{L} = \mathbb{E}_{q_\phi}[\log p_\theta(x \mid z_{pb}, z_{pv}, c)] - \beta\, D_{KL}\big(q_\phi(z_{pb} \mid x) \,\|\, p(z_{pb})\big) - \beta\, \mathbb{E}_{q_\phi(c \mid x)}\big[ D_{KL}\big(q_\phi(z_{pv} \mid x, c) \,\|\, p(z_{pv} \mid c)\big) \big] - \beta\, D_{KL}\big(q_\phi(c \mid x) \,\|\, p(c)\big)$  (11)
Discond-VAE models $q_\phi(z_{pb} \mid x)$ by the factorized Gaussian and $q_\phi(c \mid x)$ by the Gumbel-Softmax as in JointVAE [7]. Moreover, Discond-VAE introduces the Gaussian Mixture encoder to model the joint distribution of the private and discrete variables. Each mode of the Mixture represents the generative factors within a class.

$q_\phi(z_{pb} \mid x) = \mathcal{N}\big(z_{pb};\, \mu_{pb}(x), \sigma^2_{pb}(x) I\big)$  (12)

$q_\phi(c \mid x) = \text{Gumbel-Softmax}\big(\pi(x)\big)$  (13)

$q_\phi(z_{pv} \mid x, c = e_k) = \mathcal{N}\big(z_{pv};\, \mu_k(x), \sigma^2_k(x) I\big)$  (14)

$p(z_{pv} \mid c = e_k) = \mathcal{N}\big(z_{pv};\, m_k, I\big)$  (15)
where $c$ denotes a one-hot sample from the $K$-dimensional categorical distribution and $e_k$ denotes the one-hot vector whose $k$-th component is one. Then, the KL divergence term of the private variable becomes

$\mathbb{E}_{q_\phi(c \mid x)}\big[ D_{KL}\big(q_\phi(z_{pv} \mid x, c) \,\|\, p(z_{pv} \mid c)\big) \big] = \sum_{k=1}^{K} q_\phi(c = e_k \mid x)\, D_{KL}\big(q_\phi(z_{pv} \mid x, e_k) \,\|\, p(z_{pv} \mid e_k)\big)$  (16)

where $q_\phi(c \mid x)$ denotes the variational distribution of the discrete variable. Eq 16 shows that Discond-VAE encourages the disentanglement of private variables by regularizing the KL divergence to the prior mode by mode.
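The mode-wise regularizer of Eq 16 can be illustrated as a weighted sum of per-mode Gaussian KL terms. The following is a numpy sketch under our own naming (`mode_kl`, `private_kl`), assuming diagonal posterior covariances and unit-variance prior modes.

```python
import numpy as np

def mode_kl(mu_q, log_var_q, mu_p):
    """KL between a diagonal Gaussian posterior N(mu_q, exp(log_var_q))
    and a unit-variance prior mode N(mu_p, I), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var_q) + (mu_q - mu_p) ** 2 - 1.0 - log_var_q)

def private_kl(q_c, mus_q, log_vars_q, prior_means):
    """Eq 16: expectation over q(c|x) of the per-mode KL divergences."""
    return sum(q_c[k] * mode_kl(mus_q[k], log_vars_q[k], prior_means[k])
               for k in range(len(q_c)))
```

Modes with negligible posterior class probability contribute almost nothing, so each class effectively regularizes only its own mode.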
3.2.2 Implementation
We propose two methods to implement the probabilistic model of Discond-VAE. These two methods differ in how they encode and reparametrize the private variable $z_{pv}$.
First, we can model the posterior distribution while keeping the discreteness of the categorical variables. For each class $k$, the private variable encoder takes a concatenation of features extracted from the data $x$ and the one-hot encoding $e_k$ of the class to infer the corresponding mode of the Gaussian Mixture.

$q_\phi(z_{pv} \mid x, c = e_k) = \mathcal{N}\big(z_{pv};\, \mu_k(x), \sigma^2_k(x) I\big)$  (17)

Note that the private variable encoder infers $K$ times, where $K$ denotes the number of classes. For the reparametrization trick, this method takes a sample from the mode of the most likely class under $q_\phi(c \mid x)$.

$z_{pv} = \mu_{k^*}(x) + \sigma_{k^*}(x) \odot \epsilon$  (18)

where $k^* = \arg\max_k q_\phi(c = e_k \mid x)$ and $\epsilon \sim \mathcal{N}(0, I)$. (Fig 2) We refer to this model as Discond-VAE-exact. Note that for the Discond-VAE-exact model with perfect classification, the continuous variable encoder is a combination of a vanilla encoder applied to the entire dataset and a class-specific vanilla encoder applied only to the corresponding class.

Instead, the private variable encoder can take the data and the discrete variable together to encode the private variable. We refer to this model as Discond-VAE-approx.

$q_\phi(z_{pv} \mid x, c)$  (19)

The Discond-VAE-approx model adopts a continuous approximation to sampling from the Gaussian Mixture by taking a linear combination of the samples from each mode. (Fig 2)

$z_{pv} = \sum_{k=1}^{K} \alpha_k \big( \mu_k(x) + \sigma_k(x) \odot \epsilon_k \big)$  (20)

where $\epsilon_k \sim \mathcal{N}(0, I)$ and $\alpha = (\alpha_1, \dots, \alpha_K)$ denotes a sample from the Gumbel-Softmax distribution. As in Discond-VAE-exact, the Discond-VAE-approx model with perfect, fully confident classification has a continuous variable encoder equivalent to a combination of the public vanilla encoder and the class-specific vanilla encoder.
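The two reparametrization strategies can be contrasted in a few lines. This is an illustrative numpy sketch (function names are ours): the exact variant samples only from the most likely mode, while the approx variant mixes per-mode samples with relaxed weights.

```python
import numpy as np

def reparam_exact(q_c, mus, sigmas, rng):
    """Discond-VAE-exact (Eq 18): sample from the mode of the most
    likely class under q(c|x)."""
    k = int(np.argmax(q_c))
    eps = rng.standard_normal(mus[k].shape)
    return mus[k] + sigmas[k] * eps

def reparam_approx(alpha, mus, sigmas, rng):
    """Discond-VAE-approx (Eq 20): convex combination of per-mode samples,
    where alpha is a Gumbel-Softmax sample."""
    samples = [mu + sig * rng.standard_normal(mu.shape)
               for mu, sig in zip(mus, sigmas)]
    return sum(a * s for a, s in zip(alpha, samples))
```

As the Gumbel-Softmax temperature goes to zero, `alpha` approaches a one-hot vector and the approx sample coincides with the exact one.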
Both Discond-VAE implementations are optimized with the capacity objective [4], as in JointVAE, to prevent the discrete variable from posterior collapse. Hence, the learning objective of Discond-VAE becomes

$\mathcal{L} = \mathbb{E}_{q_\phi}[\log p_\theta(x \mid z_{pb}, z_{pv}, c)] - \gamma\, \big|\, D_{KL}\big(q_\phi(z_{pb} \mid x) \,\|\, p(z_{pb})\big) - C_{pb} \,\big| - \gamma\, \big|\, \mathbb{E}_{q_\phi(c \mid x)}\big[ D_{KL}\big(q_\phi(z_{pv} \mid x, c) \,\|\, p(z_{pv} \mid c)\big) \big] - C_{pv} \,\big| - \gamma\, \big|\, D_{KL}\big(q_\phi(c \mid x) \,\|\, p(c)\big) - C_c \,\big|$  (21)
The mean vectors $m_k$ of the modes in the Mixture prior are hyperparameters. The Discond-VAE-approx model sets the $m_k$ by random samples from the standard Gaussian $\mathcal{N}(0, I)$. Interestingly, the Discond-VAE-exact models show similar performance when the $m_k$ are sampled with a smaller variance. Therefore, we set the $m_k$ accordingly for the Discond-VAE-exact models in Sec 4. We also tried an EM algorithm and a warm-up approach to optimize the $m_k$. However, these approaches showed inferior performance and often unstable training dynamics. (The details of these approaches are provided in the appendix.)
4 Experiments
We evaluate the Discond-VAE model on the CondSprites, dSprites [26], MNIST [21], and CelebA [23] datasets. On CondSprites, we compare the disentangling ability on a dataset with both public and private generative factors. By contrast, the experiments on dSprites evaluate the disentangling ability on a dataset with public generative factors only. Lastly, we evaluate the Discond-VAE model on the real-world datasets, MNIST and CelebA, quantitatively and qualitatively.
For each dataset, we assess the Discond-VAE model in an unsupervised manner. Since Discond-VAE divides the continuous variables of JointVAE into the private and public variables, we compare against Discond-VAE models with the same total continuous dimension and models with the same public dimension for a fair evaluation. For each quantitative score, we evaluate each model over ten random runs and report the means, standard deviations, and best scores. (See appendix for the full architecture and training hyperparameters.)
Table 1: Generative factors of the CondSprites dataset by shape.

| Generative factors | Square | Ellipse |
|---|---|---|
| Scale | ✓ | ✓ |
| Orientation | ✓ | ✓ |
| Position X | ✓ | |
| Position Y | | ✓ |
4.1 Dataset
Since MNIST and CelebA are standard benchmark datasets, we skip a detailed description of them. For CelebA, we center-cropped and resized each image as in JointVAE [7], following the custom of [20].
The dSprites [26] is a synthetic dataset for evaluating the disentanglement property of a model. Each sample is a 2D shape image generated from five generative factors. The dSprites dataset has one discrete generative factor, shape (square, ellipse, heart), and four continuous generative factors: scale, orientation, and position along the x- and y-axes. Since dSprites assumes independent generative factors, each combination of generative factors corresponds to exactly one image. Thus, dSprites has 737,280 images in total.
To evaluate disentangling ability further, we constructed the CondSprites dataset from the dSprites [26]. CondSprites is designed to mimic the coexistence of class-independent and intra-class generative factors in the real world. (See Sec 3.1 and Table 1 for details) CondSprites has 15,360 two-dimensional images, consisting of 7,680 images each for Square and Ellipse.
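The image counts above follow directly from the factor cardinalities of dSprites (3 shapes, 6 scales, 40 orientations, 32 positions per axis). A quick arithmetic check, assuming these standard dSprites factor sizes:

```python
# dSprites: 3 shapes x 6 scales x 40 orientations x 32 x-positions x 32 y-positions
assert 3 * 6 * 40 * 32 * 32 == 737_280

# CondSprites keeps scale and orientation for both shapes but only one
# position axis per shape: 6 scales x 40 orientations x 32 positions
per_class = 6 * 40 * 32
assert per_class == 7_680
assert 2 * per_class == 15_360   # Square + Ellipse
```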
4.2 Quantitative Disentanglement Evaluation
As a quantitative evaluation, we assess the Discond-VAE models by the two disentanglement metrics (FactorVAE metric [16] and MIG [5]) and the accuracy.
Most disentanglement metrics assume that the latent variables are continuous. For example, the $\beta$-VAE metric [12] and the FactorVAE metric [16] measure the degree of disentanglement by the accuracy of a classifier predicting the generative factor from the variance of each axis of the representation. However, our Discond-VAE model and JointVAE adopt a discrete variable to represent the discrete generative factor. Therefore, we evaluate the disentanglement metrics on the continuous latent variables with respect to the continuous generative factors on CondSprites and dSprites.
Moreover, CondSprites has class-dependent generative factors. Because each mode of the private variable can represent different variations of the corresponding class, evaluating a disentanglement metric on the entire dataset gives an inappropriate evaluation of the private generative factors. Therefore, we propose a conditional disentanglement evaluation. We define the conditional disentanglement metric as an expectation over the discrete variable of the class-wise disentanglement metrics. By the conditional disentanglement metric, we can properly assess the disentanglement of the private factors as well as the public factors.

$\text{Cond-Metric}(X) = \mathbb{E}_{c}\big[\text{Metric}(X_c)\big] = \sum_{k=1}^{K} p(c = e_k)\, \text{Metric}(X_k)$  (22)

where $X_k$ denotes the examples from $X$ with the class $k$.
We evaluate the disentanglement of discrete factors by accuracy. The CondSprites and dSprites have a discrete factor of shape and MNIST has a discrete factor of digit-type. We consider the discrete variable encoder as an unsupervised majority-vote classifier and evaluate its accuracy.
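The conditional evaluation of Eq 22 and the majority-vote accuracy can both be sketched in a few lines of numpy. This is our illustrative code (`conditional_metric` and `majority_vote_accuracy` are our names), where `metric_fn` stands in for any class-wise disentanglement metric.

```python
import numpy as np

def conditional_metric(metric_fn, data, labels):
    """Eq 22: average a disentanglement metric over class-conditioned
    subsets, weighted by the empirical class frequency."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts / counts.sum()
    return sum(w * metric_fn(data[labels == k]) for k, w in zip(classes, weights))

def majority_vote_accuracy(cluster_ids, labels):
    """Treat the discrete encoder as an unsupervised classifier: assign each
    inferred cluster its majority ground-truth label, then score accuracy."""
    correct = 0
    for k in np.unique(cluster_ids):
        members = labels[cluster_ids == k]
        correct += np.bincount(members).max()  # majority label count
    return correct / len(labels)
```

Because the majority vote relabels clusters, the accuracy is invariant to permutations of the discrete code, which is the appropriate notion for an unsupervised encoder.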
Table 2: Disentanglement scores. The left block reports Cond-FactorVAE and Cond-MIG on CondSprites; the right block reports FactorVAE and MIG on dSprites. Pb and Prv denote the public and private variable dimensions.

| Method | Pb, Prv | Cond-FactorVAE(%) Mean (std) | Best | Cond-MIG Mean (std) | Best | Pb, Prv | FactorVAE(%) Mean (std) | Best | MIG Mean (std) | Best |
|---|---|---|---|---|---|---|---|---|---|---|
| Joint | 10, - | 74.3 (12.1) | 87.0 | 0.188 (0.075) | 0.284 | 6, - | 92.1 (0.2) | 92.5 | 0.336 (0.002) | 0.337 |
| Joint | 5, - | 73.4 (4.1) | 76.1 | 0.243 (0.041) | 0.305 | 4, - | 98.9 (0.4) | 99.1 | 0.223 (0.022) | 0.241 |
| Exact (ours) | 10, 3 | 96.0 (5.9) | 100 | 0.291 (0.065) | 0.322 | 6, 2 | 91.6 (0.0) | 91.6 | 0.309 (0.024) | 0.338 |
| Exact (ours) | 5, 3 | 98.5 (2.4) | 100 | 0.385 (0.124) | 0.466 | 4, 2 | 83.2 (3.7) | 88.8 | 0.355 (0.017) | 0.382 |
| Exact (ours) | 8, 2 | 99.8 (0.5) | 100 | 0.362 (0.028) | 0.388 | 2, 2 | 81.8 (2.2) | 82.5 | 0.329 (0.065) | 0.390 |
| Exact (ours) | 3, 2 | 95.5 (5.6) | 100 | 0.267 (0.079) | 0.363 | — | — | — | — | — |
| Approx (ours) | 10, 3 | 92.7 (8.5) | 100 | 0.201 (0.120) | 0.396 | 6, 2 | 90.1 (8.5) | 99.8 | 0.299 (0.065) | 0.376 |
| Approx (ours) | 5, 3 | 97.7 (2.3) | 100 | 0.228 (0.103) | 0.442 | 4, 2 | 92.1 (7.3) | 99.8 | 0.340 (0.039) | 0.419 |
| Approx (ours) | 8, 2 | 92.6 (6.6) | 100 | 0.208 (0.101) | 0.402 | 2, 2 | 89.4 (4.9) | 94.0 | 0.397 (0.044) | 0.454 |
| Approx (ours) | 3, 2 | 89.0 (10.1) | 99.8 | 0.206 (0.101) | 0.339 | — | — | — | — | — |
4.3 Experiment results on CondSprites
4.3.1 Quantitative Result
The Discond-VAE-exact model achieves much higher disentanglement scores than JointVAE in both metrics on CondSprites (Table 2). The Discond-VAE-approx model shows a higher FactorVAE metric and a comparable MIG to JointVAE. These disentanglement results demonstrate that Discond-VAE can disentangle the private variables. Furthermore, we report the classification accuracy on CondSprites in Table 3. Discond-VAE outperforms JointVAE by a significant margin on CondSprites. By disentangling the private variables from the discrete variable, Discond-VAE attains a more disentangled discrete representation, as shown by the higher classification accuracy in Table 3.
4.4 Experiment results on dSprites
Table 3: Unsupervised classification accuracy on CondSprites.

| Method | Pb | Prv | Mean (std) | Best |
|---|---|---|---|---|
| Joint | 10 | - | 0.617 (0.068) | 0.720 |
| Joint | 5 | - | 0.599 (0.064) | 0.704 |
| Exact (ours) | 10 | 3 | 0.630 (0.060) | 0.763 |
| Exact (ours) | 5 | 3 | 0.648 (0.083) | 0.805 |
| Exact (ours) | 8 | 2 | 0.641 (0.088) | 0.778 |
| Exact (ours) | 3 | 2 | 0.613 (0.103) | 0.853 |
| Approx (ours) | 10 | 3 | 0.679 (0.121) | 0.943 |
| Approx (ours) | 5 | 3 | 0.595 (0.088) | 0.825 |
| Approx (ours) | 8 | 2 | 0.677 (0.118) | 0.946 |
| Approx (ours) | 3 | 2 | 0.724 (0.146) | 0.962 |
Table 4: Unsupervised classification accuracy on dSprites.

| Method | Pb | Prv | Mean (std) | Best |
|---|---|---|---|---|
| Joint | 6 | - | | |
| Joint | 4 | - | | |
| Exact (ours) | 6 | 2 | 0.389 (0.040) | 0.444 |
| Exact (ours) | 4 | 2 | 0.369 (0.010) | 0.381 |
| Exact (ours) | 2 | 2 | 0.351 (0.005) | 0.361 |
| Approx (ours) | 6 | 2 | 0.426 (0.044) | 0.460 |
| Approx (ours) | 4 | 2 | 0.434 (0.032) | 0.449 |
| Approx (ours) | 2 | 2 | 0.458 (0.012) | 0.482 |
4.4.1 Quantitative Result
The dSprites is a synthetic dataset created with five independent generative factors. Hence, the probabilistic model of JointVAE is a more suitable assumption for representing dSprites than that of Discond-VAE. Nevertheless, the Discond-VAE-approx models show a comparable classification accuracy in Table 4 and similar disentanglement metrics for the continuous variables in Table 2.
For the case of the two-dimensional public variable, both Discond-VAE models show a relatively low FactorVAE score compared to the other models of the same type. In fact, a two-dimensional public variable is insufficient to model the four independent continuous factors of dSprites. Nevertheless, the FactorVAE scores are higher than the theoretical limit attainable with two public variables alone. This result implies that Discond-VAE can adapt the private variables to represent public generative factors. Considering the CondSprites and dSprites results together, Discond-VAE-exact disentangles the private factors better but has less flexibility to adapt the private variables.
4.5 Experiment results on MNIST
4.5.1 Accuracy and Negative Log-Likelihood (NLL)
Table 5 shows the unsupervised classification accuracy and NLL scores of each model on MNIST. The NLL of each model is evaluated by the standard importance-weighted sampling strategy [3] with 100 samples. Both types of Discond-VAE show similar or better accuracy and NLL scores compared to JointVAE while representing the additional private variables. This result indicates that introducing the private variable to learn the intra-class variation provides the capability to disentangle the discrete factors and to learn the data distribution better. Both types of Discond-VAE models with a two-dimensional public variable show worse NLL scores than the other cases. We suspect this is because a two-dimensional public variable is insufficient to model the major public variations of MNIST, such as Angle and Thickness in Fig 3.
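The importance-weighted NLL estimate [3] reduces to a log-mean-exp over per-sample importance weights. A minimal numpy sketch, with our own function name and assuming the log-weights $\log w_i = \log p(x, z_i) - \log q(z_i \mid x)$ have already been computed for $K$ posterior samples:

```python
import numpy as np

def iwae_nll(log_weights):
    """Importance-weighted NLL estimate: -log( (1/K) * sum_i exp(log w_i) ),
    computed stably via the log-sum-exp trick."""
    log_weights = np.asarray(log_weights, dtype=float)
    m = log_weights.max()                      # shift for numerical stability
    return -(m + np.log(np.mean(np.exp(log_weights - m))))
```

With K = 100 samples per data point, as used above, averaging this quantity over the test set gives the reported NLL.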
Table 5: Unsupervised classification accuracy and NLL on MNIST.

| Method | Pb | Prv | Mean (std) | Best | NLL |
|---|---|---|---|---|---|
| JointVAE | 10 | - | 0.686 (0.092) | 0.809 | 152.5 |
| JointVAE | 4 | - | 0.708 (0.059) | 0.792 | 153.3 |
| Exact (ours) | 10 | 3 | 0.686 (0.078) | 0.807 | 145.4 |
| Exact (ours) | 4 | 3 | 0.722 (0.099) | 0.876 | 144.7 |
| Exact (ours) | 8 | 2 | 0.704 (0.064) | 0.828 | 147.3 |
| Exact (ours) | 2 | 2 | 0.712 (0.038) | 0.765 | 152.6 |
| Approx (ours) | 10 | 3 | 0.723 (0.069) | 0.804 | 153.4 |
| Approx (ours) | 4 | 3 | 0.718 (0.052) | 0.792 | 153.0 |
| Approx (ours) | 8 | 2 | 0.755 (0.076) | 0.830 | 150.3 |
| Approx (ours) | 2 | 2 | 0.716 (0.075) | 0.832 | 163.0 |




4.5.2 Latent Traversal
We observed the latent traversals of Discond-VAE to evaluate the disentanglement property qualitatively. For the continuous latent variables, each row corresponds to the latent traversal of a single axis for a given example. For the discrete variable, each row shows the one-hot traversal of the discrete variable.
The Discond-VAE shows a smooth variation in angle, slant, thickness, and width of the decoded images as we traverse the public variable in Fig 3. In the discrete variable traversal (Fig 4(c)), we can observe a transition in digit-type of given examples. These results demonstrate that the Discond-VAE can disentangle the public and discrete generative factors from the MNIST.
Moreover, Discond-VAE discovers the class-specific variations of digit-types 2 and 7. Figs 4(a) and 4(b) show the private variable traversals of Discond-VAE. The private variables in Figs 4(a) and 4(b) represent the ring of digit-type 2 and the center-stroke of digit-type 7, respectively. Since these two variations are exclusive to each class, the latent traversals of these two variables show relatively minor or irrelevant variations on images of the other class.



4.6 Experiment results on CelebA
4.6.1 Latent Traversal
To further test the generalizability of Discond-VAE, we observed the latent traversals on CelebA. Fig 5 shows that Discond-VAE can disentangle the public and private generative factors on the more challenging domain of CelebA. The top two rows and the bottom two rows in Fig 5 represent latent traversals of two examples from two different classes. The traversal of the public variable (Fig 5(a)) generates variation in brightness for both classes. For the private variable (Fig 5(b)), the Gender traversal is observed for class 1, but no such change is observed in class 0. (See appendix for more public traversals representing the background and face-width.)


5 Related Works
Extracting disentangled features from data without supervision is an important task for representation learning [1]. Several VAE variants adopting continuous latent variables have been proposed to obtain more disentangled representations. For example, $\beta$-VAE [12] increases the disentangling pressure of VAE by increasing the weight of the KL divergence between the variational posterior and the prior. The KL divergence regularizer of $\beta$-VAE penalizes not only the total correlation (TC) of the aggregate posterior, which induces a factorized posterior, but also the mutual information between the data and the latent variables. Since penalizing the mutual information is detrimental to extracting meaningful features, several works proposed penalizing the TC only, in various ways (e.g. an auxiliary discriminator in FactorVAE [16], mini-batch weighted sampling in $\beta$-TCVAE [5], and covariance matching in DIP-VAE [19]). In addition, Bayes-Factor-VAE [17] suggests dividing the continuous variables into relevant variables and nuisance variables. Bayes-Factor-VAE promotes disentangled features by introducing hyper-priors on the variances of the Gaussian prior. HFVAE [9] uses a two-level hierarchical objective to control the independence between groups of variables and the independence between variables in the same group.
Recently, a number of works modeling the intrinsic discreteness of real-world data have been proposed. Some of these works represent the discrete variable by modeling the continuous variable as a multimodal distribution or a tree-structured model. GMVAE [6] represents continuous variables as a Gaussian Mixture and infers the discrete variable by Bayes rule. CascadeVAE [15] proposed an iterative optimization method to minimize the TC of continuous variables and an alternating optimization method between discrete and continuous variables to train the model. CascadeVAE infers the discrete variable via an inner maximization step over the discrete variables [14]. Moreover, [10] and LTVAE [22] encode the latent variable as a tree-structured model and learn the tree structure from the data themselves. [10] employs the nested Chinese Restaurant Process [2] to accommodate a hierarchical prior on the data. LTVAE adjusts the tree structure of the latent variable via the EM algorithm. Furthermore, VQ-VAE [32] and VQ-VAE-2 [28] proposed a discrete code representation of the continuous variable by introducing a nearest-neighbor look-up.
By contrast, JointVAE [7] and InfoCatVAE [27] have an explicit encoder for the discrete variable. JointVAE [7] proposed a method of jointly training continuous and discrete variables. However, JointVAE has the limitation of assuming that the discrete and continuous variables are independent of each other. By introducing the private latent variables, our Discond-VAE can represent a dependent discrete-continuous variable structure. InfoCatVAE [27] proposed encoding a conditional distribution of the continuous variable given the class. In this respect, the private variable of our proposed Discond-VAE and InfoCatVAE have a similar probabilistic formulation. However, InfoCatVAE adopts an axis-division of the Gaussian to separate the meaningful variables of each class. The axis-division strategy requires each subdivision to encode a certain latent variable even for the irrelevant classes, while the mode-division strategy of our Discond-VAE does not. Therefore, Discond-VAE can promote a more disentangled representation compared to InfoCatVAE, even with the private variable alone.
6 Conclusion
We proposed Discond-VAE for learning the public and private continuous generative factors and the discrete generative factor from data. We developed a probabilistic framework and the learning objective for Discond-VAE, and suggested two implementations of the framework according to how the private variable is addressed. Also, we proposed the CondSprites dataset to evaluate a model's disentanglement capacity for class-dependent generative factors. Then, we evaluated both types of the Discond-VAE model on CondSprites, dSprites, MNIST, and CelebA. The experiment results show that the Discond-VAE model can disentangle the class-dependent and class-independent factors in an unsupervised manner. Moreover, Discond-VAE shows a moderate degree of disentanglement even on dSprites, which has only independent generative factors.
References
- [1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- [2] David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM), 57(2):1–30, 2010.
- [3] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
- [4] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
- [5] Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
- [6] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
- [7] Emilien Dupont. Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pages 710–720, 2018.
- [8] Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.
- [9] Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, Narayanaswamy Siddharth, Brooks Paige, Dana H Brooks, Jennifer Dy, and Jan-Willem Meent. Structured disentangled representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2525–2534, 2019.
- [10] Prasoon Goyal, Zhiting Hu, Xiaodan Liang, Chenyu Wang, and Eric P Xing. Nonparametric variational auto-encoders for hierarchical representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5094–5102, 2017.
- [11] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. NBS Applied Mathematics Series, 33, 1954.
- [12] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
- [13] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- [14] Yeonwoo Jeong and Hyun Oh Song. Efficient end-to-end learning for quantizable representations. In ICML, 2018.
- [15] Yeonwoo Jeong and Hyun Oh Song. Learning discrete and continuous factors of data via alternating disentanglement. In International Conference on Machine Learning (ICML), 2019.
- [16] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2649–2658, 2018.
- [17] Minyoung Kim, Yuting Wang, Pritish Sahu, and Vladimir Pavlovic. Bayes-factor-vae: Hierarchical bayesian deep auto-encoder models for factor disentanglement. In Proceedings of the IEEE International Conference on Computer Vision, pages 2979–2987, 2019.
- [18] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [19] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
- [20] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pages 1558–1566. PMLR, 2016.
- [21] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
- [22] Xiaopeng Li, Zhourong Chen, Leonard KM Poon, and Nevin L Zhang. Learning latent superstructures in variational autoencoders for deep multidimensional clustering. arXiv preprint arXiv:1803.05206, 2018.
- [23] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
- [24] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- [25] Emile Mathieu, Tom Rainforth, N Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pages 4402–4412, 2019.
- [26] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
- [27] Edouard Pineau and Marc Lelarge. Infocatvae: representation learning with categorical variational autoencoders. arXiv preprint arXiv:1806.08240, 2018.
- [28] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [29] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- [30] Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, pages 185–194, 2018.
- [31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- [32] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, volume 30, 2017.
Appendix A KL divergence of Private and Discrete Variables
The KL divergence regularizer of the private and discrete variables can be decomposed as follows (Eq 23–30). Eq 29 can be interpreted as the expectation of the mode-wise regularizer of the private variable, and Eq 30 represents the regularizer of the discrete variable.
Using the factorizations $q(z_{pr}, c \mid x) = q(z_{pr} \mid x, c)\, q(c \mid x)$ and $p(z_{pr}, c) = p(z_{pr} \mid c)\, p(c)$,

$$
\begin{aligned}
\mathrm{KL}\big(q(z_{pr}, c \mid x) \,\|\, p(z_{pr}, c)\big)
&= \sum_{c} \int q(z_{pr}, c \mid x) \log \frac{q(z_{pr}, c \mid x)}{p(z_{pr}, c)} \, dz_{pr} && (23) \\
&= \sum_{c} \int q(z_{pr} \mid x, c)\, q(c \mid x) \log \frac{q(z_{pr} \mid x, c)\, q(c \mid x)}{p(z_{pr} \mid c)\, p(c)} \, dz_{pr} && (24) \\
&= \sum_{c} \int q(z_{pr} \mid x, c)\, q(c \mid x) \left[ \log \frac{q(z_{pr} \mid x, c)}{p(z_{pr} \mid c)} + \log \frac{q(c \mid x)}{p(c)} \right] dz_{pr} && (25) \\
&= \sum_{c} q(c \mid x) \int q(z_{pr} \mid x, c) \log \frac{q(z_{pr} \mid x, c)}{p(z_{pr} \mid c)} \, dz_{pr} + \sum_{c} q(c \mid x) \log \frac{q(c \mid x)}{p(c)} \int q(z_{pr} \mid x, c)\, dz_{pr} && (26) \\
&= \sum_{c} q(c \mid x) \int q(z_{pr} \mid x, c) \log \frac{q(z_{pr} \mid x, c)}{p(z_{pr} \mid c)} \, dz_{pr} + \sum_{c} q(c \mid x) \log \frac{q(c \mid x)}{p(c)} && (27) \\
&= \sum_{c} q(c \mid x)\, \mathrm{KL}\big(q(z_{pr} \mid x, c) \,\|\, p(z_{pr} \mid c)\big) + \sum_{c} q(c \mid x) \log \frac{q(c \mid x)}{p(c)} && (28) \\
&= \mathbb{E}_{q(c \mid x)}\big[\mathrm{KL}\big(q(z_{pr} \mid x, c) \,\|\, p(z_{pr} \mid c)\big)\big] && (29) \\
&\quad + \mathrm{KL}\big(q(c \mid x) \,\|\, p(c)\big) && (30)
\end{aligned}
$$
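Both terms of this decomposition have closed forms when each mode prior is a unit-variance Gaussian (as in Appendix B). The following is a minimal numerical sketch; the function names are illustrative and a unit-variance prior is assumed:

```python
import numpy as np

def kl_diag_gaussian(mu_q, var_q, mu_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, I) ), summed over dimensions."""
    return 0.5 * np.sum(var_q + (mu_q - mu_p) ** 2 - 1.0 - np.log(var_q))

def kl_categorical(q, p):
    """KL( q(c|x) || p(c) ) for discrete distributions with full support."""
    return np.sum(q * np.log(q / p))

def private_discrete_kl(q_c, mu_q, var_q, mu_prior, p_c):
    """Decomposed regularizer:
    E_{q(c|x)}[ KL(q(z|x,c) || p(z|c)) ] + KL(q(c|x) || p(c))."""
    mode_kl = np.array([kl_diag_gaussian(mu_q[c], var_q[c], mu_prior[c])
                        for c in range(len(q_c))])
    return np.dot(q_c, mode_kl) + kl_categorical(q_c, p_c)

# Sanity checks: the regularizer vanishes iff the posterior matches the prior.
K, D = 3, 2
p_c = np.full(K, 1.0 / K)
mu_prior = np.zeros((K, D))
# Posterior equal to the prior -> zero regularizer.
zero_kl = private_discrete_kl(p_c, mu_prior, np.ones((K, D)), mu_prior, p_c)
# A shifted, non-uniform posterior -> strictly positive regularizer.
pos_kl = private_discrete_kl(np.array([0.7, 0.2, 0.1]),
                             mu_prior + 1.0, np.ones((K, D)), mu_prior, p_c)
```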
Appendix B Mixture prior Optimization
Discond-VAE models the private variable as follows.
$$
\begin{aligned}
p(z_{pr}) &= \sum_{c} p(c)\, p(z_{pr} \mid c) && (31) \\
p(c) &= \mathrm{Cat}(c;\, \pi) && (32) \\
p(z_{pr} \mid c) &= \mathcal{N}(z_{pr};\, \mu_c,\, I) && (33) \\
q(z_{pr} \mid x, c) &= \mathcal{N}\big(z_{pr};\, \mu_\phi(x, c),\, \mathrm{diag}\,\sigma^2_\phi(x, c)\big) && (34)
\end{aligned}
$$
Discond-VAE sets the mean vectors $\mu_c$ of the Gaussian mixture prior to zero for Discond-VAE-exact and to random samples for Discond-VAE-approx. These initializations do not reflect the semantic relationship among the discrete generative factors. Thus, we tried EM-like and Warm-up approaches to optimize the $\mu_c$.
The KL divergence term of the private variable is expressed as

$$
\mathbb{E}_{q(c \mid x)}\big[\mathrm{KL}\big(q(z_{pr} \mid x, c) \,\|\, p(z_{pr} \mid c)\big)\big] = \sum_{c} q(c \mid x)\, \mathrm{KL}\big(q(z_{pr} \mid x, c) \,\|\, \mathcal{N}(\mu_c, I)\big) \quad (35)
$$
Since $q(z_{pr} \mid x, c)$ is a factorized Gaussian distribution, the KL divergence of each mode becomes

$$
\mathrm{KL}\big(q(z_{pr} \mid x, c) \,\|\, \mathcal{N}(\mu_c, I)\big) = \frac{1}{2} \sum_{d} \left( \sigma_d^2(x, c) + \big(\mu_d(x, c) - \mu_{c, d}\big)^2 - 1 - \log \sigma_d^2(x, c) \right) \quad (36)
$$
where $d$ indexes the dimension of the private variable. Therefore, minimizing the data-averaged KL term over the $\mu_c$ admits the closed-form solution

$$
\mu_c = \frac{\sum_{i} q(c \mid x_i)\, \mu_\phi(x_i, c)}{\sum_{i} q(c \mid x_i)} \quad (37)
$$

where $i$ indexes the samples from the training data.
To exploit the semantic features extracted by the model, we updated the mean vectors via Eq 37 with EM-like or Warm-up schemes. The Warm-up approach updates the mean vectors once, after a fixed fraction of the total training epochs. The EM-like approach updates the mean vectors repeatedly, every fixed fraction of the total training epochs. However, both approaches performed worse than the initialize-and-fix method and often failed to converge.
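With unit-variance modes, the update in Eq 37 is a responsibility-weighted mean of the posterior means, analogous to the M-step of EM for a Gaussian mixture. A minimal sketch with illustrative names and synthetic posteriors:

```python
import numpy as np

def update_prior_means(q_c, mu_post):
    """Closed-form minimizer in the style of Eq 37:
    mu_c = sum_i q(c|x_i) mu_i,c / sum_i q(c|x_i)."""
    # q_c: (N, K) posterior class probabilities; mu_post: (N, K, D) posterior means.
    weights = q_c / q_c.sum(axis=0, keepdims=True)   # normalize responsibilities over samples
    return np.einsum('nk,nkd->kd', weights, mu_post)

rng = np.random.default_rng(0)
N, K, D = 100, 3, 2
q_c = rng.dirichlet(np.ones(K), size=N)
mu_post = rng.normal(size=(N, K, D))
mu_c = update_prior_means(q_c, mu_post)

def weighted_kl(mu_prior):
    # Sample-summed part of the KL term that depends on the prior means
    # (unit-variance modes, so only the squared-mean term varies with mu_prior).
    diff = mu_post - mu_prior[None]                  # (N, K, D)
    return np.sum(q_c * 0.5 * np.sum(diff ** 2, axis=-1))

# The closed-form update is the minimizer: perturbing it increases the objective.
base = weighted_kl(mu_c)
perturbed = weighted_kl(mu_c + 0.1)
```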
Discond-VAE architecture (some dimensions, marked ·, were lost in extraction):

Encoder:
- 1 → 32 Conv 4×4, stride 2
- 32 → 64 Conv 4×4, stride 2
- 64 → 64 Conv 4×4, stride 2
- Linear (64 × 4 × 4) → 256
- Output heads (Public / Private / Discrete): Linear 256 → Pb, Linear (256 + ·) → · × Pr, Linear 256 → ·

Decoder:
- Input branches (Public / Private): Linear Pb → 128, Linear (· × Pr + ·) → 128
- (64 → 64 Conv Transpose 4×4, stride 2)
- 64 → 32 Conv Transpose 4×4, stride 2
- 32 → 32 Conv Transpose 4×4, stride 2
- 32 → 1 Conv Transpose 4×4, stride 2
Baseline architecture with a single joint latent input (some dimensions, marked ·, were lost in extraction):

Encoder:
- 1 → 32 Conv 4×4, stride 2
- 32 → 64 Conv 4×4, stride 2
- 64 → 64 Conv 4×4, stride 2
- Linear (64 × 4 × 4) → 256

Decoder:
- Linear (Pb + Pr + ·) → ·
- (64 → 64 Conv Transpose 4×4, stride 2)
- 64 → 32 Conv Transpose 4×4, stride 2
- 32 → 32 Conv Transpose 4×4, stride 2
- 32 → 1 Conv Transpose 4×4, stride 2
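The stride-2 convolution stack above determines the flattened feature size feeding the Linear (64 × 4 × 4) → 256 layer. A minimal sketch of the shape arithmetic, assuming 32×32 inputs and padding 1 (the padding is not stated in the tables):

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a strided convolution (the standard Conv2d formula)."""
    return (size + 2 * padding - kernel) // stride + 1

# Three stride-2 4x4 convolutions halve a 32x32 input each time (32 -> 16 -> 8 -> 4),
# which is what makes the flattened feature match the Linear (64 * 4 * 4) layer.
size = 32
for _ in range(3):
    size = conv_out(size)
flat_dim = 64 * size * size
```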
Appendix C Implementation details
C.1 Network Architecture
The encoder and decoder architectures for the models are listed in the tables above.
C.2 Training details
As a reminder, the learning objective of Discond-VAE is expressed as follows.

$$
\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z_{pb}, z_{pr}, c \mid x)}\big[\log p_\theta(x \mid z_{pb}, z_{pr}, c)\big]
- \gamma_{pb}\,\big|\mathrm{KL}\big(q_\phi(z_{pb} \mid x) \,\|\, p(z_{pb})\big) - C_{pb}\big|
- \gamma_{pr}\,\big|\mathbb{E}_{q_\phi(c \mid x)}\big[\mathrm{KL}\big(q_\phi(z_{pr} \mid x, c) \,\|\, p(z_{pr} \mid c)\big)\big] - C_{pr}\big|
- \gamma_{c}\,\big|\mathrm{KL}\big(q_\phi(c \mid x) \,\|\, p(c)\big) - C_{c}\big| \quad (38)
$$
For both Discond-VAE models, we apply the linear scheduling of capacity [4] as in JointVAE. Each capacity is linearly increased from 0 to its maximum value over the number of iterations listed in the hyperparameter tables. For JointVAE, we applied the same hyperparameters as in [7] for MNIST and dSprites. All models in the paper are optimized with the Adam optimizer, using betas=(0.9, 0.999) and eps=1e-8 with no weight decay, which is the default setting in the PyTorch library.
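A minimal sketch of the linear capacity scheduling described above; the function name and the example values (a maximum capacity of 25 over 25000 iterations, matching the MNIST columns of the tables) are illustrative:

```python
def capacity(step, c_max, num_iters):
    """Linearly anneal a KL capacity from 0 to c_max over num_iters steps [4],
    then hold it constant, as in the JointVAE training scheme."""
    return min(c_max, c_max * step / num_iters)

# Capacity at the start, midpoint, end, and past the end of the schedule.
schedule = [capacity(t, c_max=25.0, num_iters=25000) for t in (0, 12500, 25000, 40000)]
```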
|  | CondSprites |  |  |  | dSprites |  |  | MNIST |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 10 | 8 | 5 | 3 | 6 | 4 | 2 | 10 | 8 | 4 | 2 |
|  | 3 | 2 | 3 | 2 | 2 | 2 | 2 | 3 | 2 | 3 | 2 |
|  | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 10 | 10 | 10 | 10 |
|  | 30 | 30 | 30 | 30 | 200 | 100 | 200 | 25 | 25 | 25 | 25 |
|  | 30 | 30 | 30 | 30 | 200 | 100 | 200 | 25 | 25 | 25 | 25 |
|  | 30 | 40 | 30 | 40 | 200 | 100 | 200 | 5 | 2.5 | 5 | 3 |
|  | 30 | 30 | 30 | 30 | 20 | 20 | 20 | 5 | 5 | 5 | 5 |
|  | 30 | 30 | 30 | 30 | 20 | 20 | 20 | 5 | 5 | 5 | 5 |
|  | 5 | 5 | 5 | 10 | 1.1 | 1.1 | 1.1 | 25 | 25 | 25 | 25 |
| iteration | 25000 | 25000 | 25000 | 25000 | 300000 | 300000 | 300000 | 25000 | 25000 | 25000 | 25000 |
|  | CondSprites |  |  |  | dSprites |  |  | MNIST |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 10 | 8 | 5 | 3 | 6 | 4 | 2 | 10 | 8 | 4 | 2 |
|  | 3 | 2 | 3 | 2 | 2 | 2 | 2 | 3 | 2 | 3 | 2 |
|  | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 10 | 10 | 10 | 10 |
|  | 10 | 10 | 10 | 20 | 20 | 20 | 20 | 30 | 10 | 20 | 10 |
|  | 20 | 20 | 20 | 40 | 40 | 40 | 40 | 60 | 20 | 40 | 20 |
|  | 20 | 20 | 20 | 40 | 40 | 40 | 40 | 60 | 20 | 40 | 20 |
|  | 20 | 20 | 20 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
|  | 20 | 20 | 20 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
|  | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 10 | 10 | 5 | 10 |
| iteration | 25000 | 25000 | 25000 | 25000 | 300000 | 300000 | 300000 | 25000 | 25000 | 25000 | 25000 |
|  | CondSprites |  | dSprites |  | MNIST |  |
| --- | --- | --- | --- | --- | --- | --- |
|  | 10 | 5 | 6 | 4 | 10 | 4 |
|  | 2 | 2 | 3 | 3 | 10 | 10 |
|  | 30 | 30 | 150 | 150 | 30 | 30 |
|  | 30 | 30 | 150 | 150 | 30 | 30 |
|  | 30 | 30 | 40 | 40 | 5 | 5 |
|  | 5 | 5 | 1.1 | 1.1 | 5 | 5 |
| iteration | 25000 | 25000 | 300000 | 300000 | 25000 | 25000 |
C.3 Discond-VAE-exact Hyperparameters
C.3.1 MNIST
- Epochs: 100
- Batch size: 64
- Optimizer: Adam with learning rate 5e-4
C.3.2 dSprites
- Epochs: 30
- Batch size: 64
- Optimizer: Adam with learning rate 5e-4
C.3.3 CondSprites
- Epochs: 200
- Batch size: 64
- Optimizer: Adam with learning rate 5e-4
All other hyperparameters are listed in Table 8.
C.3.4 Hyperparameters Search range
- Learning rate: {5e-4}
- {5, 10, 15, 20, 25, 30, 50, 100, 200}
- {1, 3, 5, 10, 20, 30, 50, 100, 200}
- {5, 10, 20, 30, 50}
- {1, 1.1, 5, 10, 25, 50}
C.4 Discond-VAE-approx Hyperparameters
C.4.1 MNIST
- Epochs: 100
- Batch size: 64
- Optimizer: Adam with learning rate 2e-3
C.4.2 dSprites
- Epochs: 20
- Batch size: 64
- Optimizer: Adam with learning rate 1e-3
C.4.3 CondSprites
- Epochs: 300
- Batch size: 64
- Optimizer: Adam with learning rate 1e-3
All other hyperparameters are listed in Table 9.
C.4.4 Hyperparameters Search range
- Learning rate: {1e-3, 2e-3}
- {10, 20, 30, 50}
- {2:1, 1:1, 1:2}
- {5, 10, 20}
- {1, 5, 10}
C.5 JointVAE Hyperparameters
C.5.1 MNIST
- Epochs: 100
- Batch size: 64
- Optimizer: Adam with learning rate 5e-4
C.5.2 dSprites
- Epochs: 30
- Batch size: 64
- Optimizer: Adam with learning rate 5e-4
C.5.3 CondSprites
- Epochs: 200
- Batch size: 64
- Optimizer: Adam with learning rate 5e-4
All other hyperparameters are listed in Table 10.
Appendix D CondSprites Details
CondSprites is a subset of dSprites [26] designed to model the dependence between the continuous and discrete generative factors. We removed the Heart shape to maintain a reasonable number of examples. Since the Square images do not vary in the y-position in Table 1, we fix the Square images at the center of the y-axis. In other words, the generative factor for the y-position of Square images in CondSprites is fixed to 16 (the center of range(0, 32)). Likewise, the generative factor for the x-position of every Ellipse image is fixed to 16. The total number of CondSprites examples is .
| Generative factors | Square | Ellipse |
| --- | --- | --- |
| Scale | ✓ | ✓ |
| Orientation | ✓ | ✓ |
| Position X | ✓ |  |
| Position Y |  | ✓ |
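As a sketch of how such a subset can be constructed, the following filters a synthetic latent grid laid out like the dSprites latents (shape, scale, orientation, posX, posY). The shape codes and the grid itself are stand-ins for the released dataset, not the actual .npz loading, and the choice of which axis is fixed per shape follows the table above:

```python
import numpy as np
from itertools import product

# Synthetic latent grid mimicking the dSprites latent layout
# (shape, scale, orientation, posX, posY); the real dataset stores
# these as latents_classes in the released .npz file.
SQUARE, ELLIPSE, HEART = 0, 1, 2
latents = np.array(list(product(range(3), range(6), range(40), range(32), range(32))))
shape, pos_x, pos_y = latents[:, 0], latents[:, 3], latents[:, 4]

# CondSprites: drop Heart; fix one position axis per remaining shape at the
# center of range(0, 32) (value 16), so position becomes a class-dependent factor.
keep = ((shape == SQUARE) & (pos_y == 16)) | ((shape == ELLIPSE) & (pos_x == 16))
cond_sprites = latents[keep]
```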
Appendix E Traversal
Fig 6 shows the private-variable traversals of all classes in MNIST [21]. Since the ring of the digit 2 and the center stroke of the digit 7 are intra-class variations exclusive to digit types 2 and 7, latent traversals on the other classes show relatively minor or irrelevant variations. Note that the private-variable traversal of an image traverses the latent space of the corresponding mode of the Mixture of Gaussians.
Fig 7 shows additional public-variable traversals in CelebA [23]. The top two rows and the bottom two rows show latent traversals of two examples, one from each of the two classes. Since the public variable is independent of the class, each public variable encodes the same variation across the classes. In Fig 7(a), as we traverse the public variable, the background changes from "left dark, right bright" to "left bright, right dark". In Fig 7(b), as we traverse the public variable, the face changes from wide to narrow.
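A latent traversal of this kind can be sketched as follows; `traverse` is a hypothetical helper that sweeps a single coordinate of a latent code while holding the rest fixed, and its rows would then be passed through the decoder to render a figure like Fig 6 or Fig 7:

```python
import numpy as np

def traverse(z, dim, values):
    """Copy a base latent code z and sweep coordinate `dim` over `values`,
    keeping every other coordinate fixed."""
    batch = np.tile(z, (len(values), 1))
    batch[:, dim] = values
    return batch

z = np.zeros(10)  # e.g. a sampled private or public code
batch = traverse(z, dim=3, values=np.linspace(-2.0, 2.0, 7))
```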