
Discond-VAE: Disentangling Continuous Factors from the Discrete

Jaewoong Choi*   Geonho Hwang*   Myungjoo Kang
Seoul National University
{chjw1475, hgh2134, mkang}@snu.ac.kr
*Equal contribution
Abstract

In real-world data, there are variations shared by all classes of a discrete factor (e.g., the category label) and variations exclusive to each class. We propose a variant of VAE capable of disentangling both of these variations. To represent these generative factors, we introduce two sets of continuous latent variables, the private variable and the public variable. Our proposed framework models the private variable as a Mixture of Gaussians and the public variable as a Gaussian. Each mode of the private variable is responsible for a class of the discrete variable.

Most previous attempts to integrate discrete generative factors into disentanglement assume statistical independence between the continuous and discrete variables. Our proposed model, which we call Discond-VAE, DISentangles the class-dependent CONtinuous factors from the Discrete factors by introducing the private variables. The experiments show that Discond-VAE can discover the private and public factors from data. Moreover, even on a dataset with only public factors, Discond-VAE does not fail; it adapts the private variables to represent the public factors.

1 Introduction

Learning disentangled representations of data without supervision has been considered an important task for representation learning [1]. Although there are diverse quantitative measures of disentanglement [5, 8, 12, 16, 19, 30], most qualitative interpretations of disentanglement agree on statistical independence between the basic elements of the representation. In other words, each element of a disentangled representation corresponds to only one generative factor of the data while being invariant to the others. Hence, a disentangled representation is naturally a concise and explainable feature of the data. Various VAE-based models have been proposed to obtain more disentangled representations in an unsupervised way [5, 7, 8, 9, 12, 16, 17, 19].

In particular, JointVAE [7] introduced discrete latent variables as well as continuous variables to represent the generative factors of data. Real-world data contain intrinsically discrete generative factors, such as the digit type in the MNIST dataset. Therefore, it is natural to adopt a discrete variable to obtain a disentangled representation of those factors. However, JointVAE has the limitation of assuming independence between the continuous and discrete variables.

Figure 1: Overview of Discond-VAE. Discond-VAE introduces two continuous latent variables (the public and private variables) and one discrete variable to represent the data $\mathbf{x}$. The public continuous latent variable is assumed to be independent of the private and discrete variables.

The independence assumption of JointVAE is too restrictive for real-world data. For example, consider the CelebA dataset [23], which has 40 attribute labels, including Male (Gender) and Mustache. A continuous generative factor representing the Mustache volume is not independent of the discrete Gender factor. Hence, the class-independent continuous variable of JointVAE cannot properly represent the Mustache volume. From a similar perspective, [25] proposed a decomposition framework that generalizes the independence assumption of disentanglement: the decomposition of the latent variable imposes a desired structure, such as sparsity or clustering, on the aggregate posterior $q_{\phi}(\mathbf{z})$.

In this paper, we propose a new VAE model called Discond-VAE. Instead of imposing independence between the continuous and discrete factors, we propose learning the independent and dependent continuous factors jointly. Discond-VAE splits the continuous latent variable into two groups, the private and public variables. We refer to each category of the discrete generative factor as a class. The private variable represents variation within each class, and the public variable encodes generative factors common to all classes. Therefore, Discond-VAE can represent intra-class variation while keeping the capacity to represent class-independent generative factors, as in JointVAE.

Following this intuitive interpretation, we assume the public variable is independent of the discrete and private variables. The public and private variables are modeled by a Gaussian distribution and a Mixture of Gaussians, respectively. Each mode of the private variable corresponds to a class of the discrete variable. The experiments demonstrate, both qualitatively and quantitatively, that Discond-VAE can extract the private and public variables from data.

1.1 Contribution

  • We propose a new VAE model called Discond-VAE. To the best of our knowledge, Discond-VAE is the first VAE model to represent the public and private continuous generative factors and the discrete generative factors at the same time.

  • We propose the CondSprites dataset, reassembled from dSprites [26], to evaluate the disentanglement of private and public variables. The CondSprites dataset is designed to mimic the class-independent and class-dependent generative factors of real-world data.

  • The existing disentanglement metrics assume continuous latent variables and independent generative factors. To integrate the discrete latent variable and the class-dependent continuous variable into these metrics, we propose a conditional disentanglement evaluation.

  • We assess Discond-VAE on the CondSprites, dSprites, MNIST, and CelebA datasets. The experiments show that Discond-VAE can disentangle the public and private factors qualitatively and quantitatively.

2 Background

2.1 VAE

Variational Autoencoder (VAE, [18, 29]) is a probabilistic model that learns a joint distribution $p(\mathbf{x},\mathbf{z})$ of the observed data $\mathbf{x}$ and a continuous latent variable $\mathbf{z}\in\mathbb{R}^{d}$. VAE models the joint distribution by

\begin{align}
p(\mathbf{z}) &= \mathcal{N}(0, I_{d\times d}) \tag{1}\\
p_{\theta}(\mathbf{x}\mid\mathbf{z}) &= p(\mathbf{x};\mu_{\theta}(\mathbf{z})) \tag{2}
\end{align}

where $p(\mathbf{x};\mu_{\theta}(\mathbf{z}))$ is a probability distribution with distribution parameters $\mu_{\theta}(\mathbf{z})$, often referred to as the decoder.

Given a dataset $\mathbf{X}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}\}$, VAE applies variational inference by introducing an inference network $q_{\phi}(\mathbf{z}\mid\mathbf{x})$, often referred to as the encoder. The encoder approximates the true posterior $p_{\theta}(\mathbf{z}\mid\mathbf{x})$ with a factorized Gaussian whose parameters are produced by a neural network. The encoder $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ and the decoder $p(\mathbf{x};\mu_{\theta}(\mathbf{z}))$ are optimized simultaneously by maximizing the Evidence Lower Bound (ELBO) $\mathcal{L}_{\textrm{VAE}}(\theta,\phi)$.

\begin{equation}
\mathcal{L}_{\textrm{VAE}}(\theta,\phi)=\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z})\right]-D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x})\mid\mid p(\mathbf{z}))\tag{3}
\end{equation}

The first term of the ELBO is the reconstruction loss, which encourages the VAE to encode latent variables $\mathbf{z}$ informative enough to reconstruct the data $\mathbf{x}$. The second term regularizes the posterior by encouraging the encoded distribution $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ to match the prior $p(\mathbf{z})$. From this point of view, $\beta$-VAE [12] scales the regularization term of the ELBO by $\beta>1$, inducing more disentangled representations by matching the encoded latent variables to the factorized Gaussian prior $p(\mathbf{z})$ with higher pressure [4, 12].

\begin{equation}
\mathcal{L}_{\beta\textrm{-VAE}}(\theta,\phi)=\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z})\right]-\beta\,D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x})\mid\mid p(\mathbf{z}))\tag{4}
\end{equation}
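The KL regularizer in Eqs 3 and 4 has a closed form when the posterior and the prior are both factorized Gaussians. As a minimal NumPy sketch (not the authors' code; variable names are ours), the per-example term that $\beta$-VAE scales by $\beta$ can be computed as:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, diag(exp(log_var))) || N(0, I)),
    summed over latent dimensions for each example."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for two examples with a 2-dimensional latent.
mu = np.array([[0.0, 0.0], [1.0, -1.0]])
log_var = np.zeros((2, 2))

kl = kl_to_standard_normal(mu, log_var)
# kl[0] == 0: the first posterior is exactly the prior N(0, I)
```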

2.2 JointVAE

VAE and $\beta$-VAE employ only continuous latent variables, typically a factorized Gaussian, to model the latent variable $\mathbf{z}$. JointVAE [7] generalizes both by introducing discrete variables to disentangle the generative factors of the observed data.

Let $\mathbf{z}$ be a continuous latent variable and $\mathbf{c}$ a discrete latent variable. By assuming conditional independence between the continuous and discrete latent variables, i.e., $q_{\phi}(\mathbf{z},\mathbf{c}\mid\mathbf{x})=q_{\phi}(\mathbf{z}\mid\mathbf{x})\,q_{\phi}(\mathbf{c}\mid\mathbf{x})$, and an independent prior, i.e., $p(\mathbf{z},\mathbf{c})=p(\mathbf{z})\,p(\mathbf{c})$, JointVAE derives its optimization objective (Eq 5) from the $\beta$-VAE objective (Eq 4). To prevent posterior collapse of the discrete latent variable, JointVAE applies capacity control [4] to the objective.

\begin{equation}
\begin{split}
\mathcal{L}_{\textrm{Joint}}(\theta,\phi)=\,&\mathbb{E}_{q_{\phi}(\mathbf{z},\mathbf{c}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z},\mathbf{c})\right]\\
&-\beta_{\mathbf{z}}\left|D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x})\mid\mid p(\mathbf{z}))-C_{\mathbf{z}}\right|\\
&-\beta_{\mathbf{c}}\left|D_{KL}(q_{\phi}(\mathbf{c}\mid\mathbf{x})\mid\mid p(\mathbf{c}))-C_{\mathbf{c}}\right|
\end{split}\tag{5}
\end{equation}

Since sampling from a categorical distribution is non-differentiable, the reparametrization trick cannot be applied directly to the discrete encoder $q_{\phi}(\mathbf{c}\mid\mathbf{x})$. To address this problem, JointVAE employs the Gumbel-Softmax distribution [11, 13, 24], a differentiable relaxation of discrete sampling that provides a continuous approximation of the categorical distribution.
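A minimal NumPy sketch of the Gumbel-Softmax relaxation described above (following Jang et al.; function and variable names are ours):

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1)
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))  # numerically stable softmax
    return y / y.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Relaxed sample over 3 classes; as tau -> 0 it approaches a one-hot vector.
pi = gumbel_softmax_sample(np.log(np.array([0.7, 0.2, 0.1])), tau=0.5, rng=rng)
```

At low temperature `tau` the samples concentrate near the vertices of the simplex, while the mapping from logits to samples stays differentiable.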

3 Discond-VAE

In this section, we describe the motivation and probabilistic formulation of Discond-VAE. Then, we describe how the probabilistic formulation is instantiated by a neural network.

3.1 Motivation

Although JointVAE [7] extends the capability of VAE to encode discrete factors, it assumes independence between the continuous and discrete variables. This assumption usually does not hold for general datasets (e.g., CelebA [23] or ImageNet [31]). Therefore, we generalize this assumption.

Consider a generative modeling problem with observed data $\mathbf{x}$, a discrete generative factor $\mathbf{c}$, and a set of continuous factors. Some of the continuous factors may be independent of the discrete factor $\mathbf{c}$, while others may not. We refer to an independent continuous factor as a public generative factor and the latent variable representing it as a public variable $\mathbf{z}$. Likewise, we call a dependent continuous factor a private factor and the latent variable representing it a private variable $\mathbf{w}$.

For example, consider a synthetic dataset of 2D Square and Ellipse images. For each shape, the images vary in scale and orientation. In addition, the Square images vary in $x$-position and the Ellipse images vary in $y$-position. In this dataset, a latent variable encoding scale and orientation should be independent of the discrete variable encoding shape, whereas the latent variables representing $x$- and $y$-position should depend on it. In short, this dataset has the public factors of scale and orientation and the private factors of $x$- and $y$-position. We refer to this dataset as CondSprites and use it to evaluate Discond-VAE in Sec 4.3.

3.2 Model

3.2.1 Probabilistic Model

We propose a modification of JointVAE [7] whose latent variable is composed of discrete, public, and private variables. Since the public variable represents generative factors shared by all classes and the private variable represents variation within each class, it is natural to assume that the prior $p(\mathbf{z},\mathbf{w},\mathbf{c})$ factorizes into $p(\mathbf{z})$ and $p(\mathbf{w},\mathbf{c})$.

\begin{equation}
p(\mathbf{z},\mathbf{w},\mathbf{c})=p(\mathbf{z})\cdot p(\mathbf{w},\mathbf{c})=p(\mathbf{z})\cdot p(\mathbf{c})\cdot p(\mathbf{w}\mid\mathbf{c})\tag{6}
\end{equation}

Likewise, the variational distribution $q_{\phi}(\mathbf{z},\mathbf{w},\mathbf{c}\mid\mathbf{x})$ is modeled as follows.

\begin{equation}
q_{\phi}(\mathbf{z},\mathbf{w},\mathbf{c}\mid\mathbf{x})=q_{\phi}(\mathbf{z}\mid\mathbf{x})\cdot q_{\phi}(\mathbf{w},\mathbf{c}\mid\mathbf{x})=q_{\phi}(\mathbf{z}\mid\mathbf{x})\cdot q_{\phi}(\mathbf{c}\mid\mathbf{x})\cdot q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})\tag{7}
\end{equation}

For our Discond-VAE model, the β\beta-VAE objective (Eq 4) becomes

\begin{equation}
\begin{split}
\mathcal{L}_{\textrm{Cond}}(\theta,\phi)=\,&\mathbb{E}_{q_{\phi}(\mathbf{z},\mathbf{w},\mathbf{c}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z},\mathbf{w},\mathbf{c})\right]\\
&-\beta\,D_{KL}(q_{\phi}(\mathbf{z},\mathbf{w},\mathbf{c}\mid\mathbf{x})\mid\mid p(\mathbf{z},\mathbf{w},\mathbf{c}))
\end{split}\tag{8}
\end{equation}

The log-likelihood term is the reconstruction error, as in previous VAE models. The KL divergence regularizer can be decomposed using the independence assumption.

\begin{equation}
\begin{split}
&D_{KL}(q_{\phi}(\mathbf{z},\mathbf{w},\mathbf{c}\mid\mathbf{x})\mid\mid p(\mathbf{z},\mathbf{w},\mathbf{c}))\\
&=D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x})\mid\mid p(\mathbf{z}))+D_{KL}(q_{\phi}(\mathbf{w},\mathbf{c}\mid\mathbf{x})\mid\mid p(\mathbf{w},\mathbf{c}))
\end{split}\tag{9}
\end{equation}

Then, the latter KL divergence can be expressed as follows (see the appendix for a proof).

\begin{equation}
\begin{split}
&D_{KL}(q_{\phi}(\mathbf{w},\mathbf{c}\mid\mathbf{x})\mid\mid p(\mathbf{w},\mathbf{c}))\\
&=\mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})\mid\mid p(\mathbf{w}\mid\mathbf{c}))\right]+D_{KL}(q_{\phi}(\mathbf{c}\mid\mathbf{x})\mid\mid p(\mathbf{c}))
\end{split}\tag{10}
\end{equation}

In brief, the learning objective of Discond-VAE (Eq 8) is expressed as

\begin{equation}
\begin{split}
\max_{\theta,\phi}\,\mathcal{L}_{\textrm{Cond}}(\theta,\phi)=\,&\mathbb{E}_{q_{\phi}(\mathbf{z},\mathbf{w},\mathbf{c}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z},\mathbf{w},\mathbf{c})\right]\\
&-\beta_{\mathbf{z}}\cdot D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x})\mid\mid p(\mathbf{z}))\\
&-\beta_{\mathbf{w}}\cdot\mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})\mid\mid p(\mathbf{w}\mid\mathbf{c}))\right]\\
&-\beta_{\mathbf{c}}\cdot D_{KL}(q_{\phi}(\mathbf{c}\mid\mathbf{x})\mid\mid p(\mathbf{c}))
\end{split}\tag{11}
\end{equation}

Discond-VAE models $q_{\phi}(\mathbf{z}\mid\mathbf{x})$ and $q_{\phi}(\mathbf{c}\mid\mathbf{x})$ by a factorized Gaussian and a Gumbel-Softmax distribution, as in JointVAE [7]. Moreover, Discond-VAE introduces a Gaussian Mixture encoder to model the joint distribution of the private and discrete variables. Each mode of the Mixture represents the generative factors within a class.

\begin{align}
p(\mathbf{w}\mid\mathbf{c})&=\prod_{i}p(\mathbf{w}\mid\mathbf{c}=e_{i})^{c_{i}}\tag{12}\\
&=\prod_{i}\mathcal{N}\left(\mu_{i},I\right)^{c_{i}}\tag{13}\\
q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})&=\prod_{i}q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i})^{c_{i}}\tag{14}\\
&=\prod_{i}\mathcal{N}\left(\mu_{i}(\mathbf{x},\mathbf{c}),\Sigma_{i}(\mathbf{x},\mathbf{c})\right)^{c_{i}}\tag{15}
\end{align}

where $\mathbf{c}=(c_{1},c_{2},\cdots,c_{d})\in\{0,1\}^{d}$ denotes a one-hot sample from the $d$-dimensional categorical distribution and $e_{i}$ denotes the one-hot vector whose $i$th component is one. Then, the KL divergence term of the private variable $\mathbf{w}$ becomes

\begin{equation}
\begin{split}
&\mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})\mid\mid p(\mathbf{w}\mid\mathbf{c}))\right]\\
&=\sum_{i=1}^{d}\alpha_{i}\cdot D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i})\mid\mid p(\mathbf{w}\mid\mathbf{c}=e_{i}))
\end{split}\tag{16}
\end{equation}

where $q_{\phi}(\mathbf{c}\mid\mathbf{x})=(\alpha_{1},\alpha_{2},\cdots,\alpha_{d})$ denotes the variational distribution of the discrete variable. Eq 16 shows that Discond-VAE encourages the disentanglement of the private variables by regularizing the KL divergence to the prior mode by mode.
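With diagonal-Gaussian posterior modes and identity-covariance prior modes (Eqs 13 and 15), each term of Eq 16 is a Gaussian-to-Gaussian KL with a closed form. A sketch under those assumptions (array layout and names are ours):

```python
import numpy as np

def kl_gauss_to_prior_mode(mu_q, log_var_q, mu_p):
    """D_KL(N(mu_q, diag(exp(log_var_q))) || N(mu_p, I)), one value per mode."""
    var_q = np.exp(log_var_q)
    return 0.5 * np.sum(var_q + (mu_q - mu_p) ** 2 - 1.0 - log_var_q, axis=-1)

def private_kl(alphas, mu_q, log_var_q, mu_p):
    """Eq 16: expectation over q(c|x) of the per-mode KL divergences.
    alphas: (d,) class probabilities; mu_q, log_var_q, mu_p: (d, w_dim)."""
    return float(np.sum(alphas * kl_gauss_to_prior_mode(mu_q, log_var_q, mu_p)))

alphas = np.array([1.0, 0.0])
mu_p = np.zeros((2, 3))        # prior mode means (mu_i = 0, as in Sec 4)
mu_q = np.zeros((2, 3))
log_var_q = np.zeros((2, 3))
# The selected mode matches its prior, so the expected KL is 0 here.
```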

3.2.2 Implementation

We propose two methods to implement the probabilistic model of Discond-VAE. These two methods differ in how they encode and reparametrize the private variable $\mathbf{w}$.

First, we can model the posterior distribution $q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})$ while keeping the discreteness of the categorical variable. For each class $\mathbf{c}=e_{i}$, the private variable encoder takes a concatenation of features extracted from the data $\mathbf{x}$ and the one-hot encoding $e_{i}$ of the class to infer the corresponding mode of the Gaussian Mixture.

\begin{equation}
q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i})=\mathcal{N}\left(\mu_{i}(\mathbf{x},e_{i}),\Sigma_{i}(\mathbf{x},e_{i})\right)\tag{17}
\end{equation}

Note that the private variable encoder performs $d$ inferences, where $d$ denotes the number of classes. For the reparametrization trick, this method takes a sample from the mode of the most likely class under $q_{\phi}(\mathbf{c}\mid\mathbf{x})$.

\begin{equation}
\mathbf{w}=\mathbf{w}_{j}\tag{18}
\end{equation}

where $j=\arg\max_{i}q_{\phi}(\mathbf{c}=e_{i}\mid\mathbf{x})$ and $\mathbf{w}_{j}\sim q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{j})$ (Fig 2). We refer to this model as Discond-VAE-exact. Note that for the Discond-VAE-exact model with perfect classification, the continuous variable encoder is a combination of a vanilla encoder applied to the entire dataset and a class-specific vanilla encoder applied only to the corresponding class.
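This sampling rule (Eq 18) can be sketched with the reparametrization written out explicitly; the array layout is our assumption:

```python
import numpy as np

def sample_private_exact(alphas, mu_q, log_var_q, rng):
    """Discond-VAE-exact sampling: pick the most likely class under q(c|x)
    and reparametrize a sample from that mode of the Mixture."""
    j = int(np.argmax(alphas))                   # j = argmax_i q(c = e_i | x)
    eps = rng.standard_normal(mu_q[j].shape)
    return mu_q[j] + np.exp(0.5 * log_var_q[j]) * eps

rng = np.random.default_rng(0)
alphas = np.array([0.1, 0.9])                    # class 1 is most likely
mu_q = np.array([[5.0, 5.0], [-1.0, 2.0]])       # one mean per mode
log_var_q = np.full((2, 2), -40.0)               # near-zero variance
w = sample_private_exact(alphas, mu_q, log_var_q, rng)
# with near-zero variance, w is close to the mean of mode 1
```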

Figure 2: Encoder and sampling of Discond-VAE. The dashed lines denote sampling from each variational distribution. Discond-VAE-exact takes the private variable sample from the Mixture mode of the most likely class. Discond-VAE-approx takes a linear combination of the Gaussian samples from each mode as the private variable sample.

Alternatively, the private variable encoder $q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})$ can take the data $\mathbf{x}$ and the discrete variable $q_{\phi}(\mathbf{c}\mid\mathbf{x})=\boldsymbol{\alpha}$ to encode the private variable. We refer to this model as Discond-VAE-approx.

\begin{equation}
q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i})=\mathcal{N}\left(\mu_{i}(\mathbf{x},\boldsymbol{\alpha}),\Sigma_{i}(\mathbf{x},\boldsymbol{\alpha})\right)\tag{19}
\end{equation}

The Discond-VAE-approx model adopts a continuous approximation to sampling from the Gaussian Mixture $q_{\phi}(\mathbf{w},\mathbf{c}\mid\mathbf{x})$ by taking a linear combination of the samples from each mode $q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i})$ (Fig 2).

\begin{equation}
\mathbf{w}=\sum_{i=1}^{d}\pi_{i}\cdot\mathbf{w}_{i}\tag{20}
\end{equation}

where $\mathbf{w}_{i}\sim q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i})$ and $\boldsymbol{\pi}=(\pi_{1},\cdots,\pi_{d})$ denotes a sample from the Gumbel-Softmax distribution. As in Discond-VAE-exact, Discond-VAE-approx with perfect classification at $100\%$ confidence has a continuous variable encoder equivalent to a combination of a public vanilla encoder and a class-specific vanilla encoder.
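The relaxed sampling of Eq 20 can be sketched under the same assumed array layout; in practice $\boldsymbol{\pi}$ would be a Gumbel-Softmax sample:

```python
import numpy as np

def sample_private_approx(pi, mu_q, log_var_q, rng):
    """Discond-VAE-approx sampling: linear combination of one reparametrized
    Gaussian sample per mode, weighted by the relaxed one-hot vector pi."""
    eps = rng.standard_normal(mu_q.shape)
    per_mode = mu_q + np.exp(0.5 * log_var_q) * eps   # w_i ~ q(w | x, c = e_i)
    return per_mode.T @ pi                            # sum_i pi_i * w_i

rng = np.random.default_rng(0)
pi = np.array([0.0, 1.0])                  # a hard (one-hot) relaxation sample
mu_q = np.array([[5.0, 5.0], [-1.0, 2.0]])
log_var_q = np.full((2, 2), -40.0)         # near-zero variance
w = sample_private_approx(pi, mu_q, log_var_q, rng)
# with a one-hot pi and tiny variance, w is close to the mean of mode 1
```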

The two Discond-VAE implementations are optimized with the capacity objective [4], as in JointVAE, to prevent the discrete variable from posterior collapse. Hence, the learning objective of Discond-VAE becomes

\begin{equation}
\begin{split}
\max_{\theta,\phi}\,\mathcal{L}_{\textrm{Cond}}(\theta,\phi)=\,&\mathbb{E}_{q_{\phi}(\mathbf{z},\mathbf{w},\mathbf{c}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z},\mathbf{w},\mathbf{c})\right]\\
&-\beta_{\mathbf{z}}\left|D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x})\mid\mid p(\mathbf{z}))-C_{\mathbf{z}}\right|\\
&-\beta_{\mathbf{w}}\left|\mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})\mid\mid p(\mathbf{w}\mid\mathbf{c}))\right]-C_{\mathbf{w}}\right|\\
&-\beta_{\mathbf{c}}\left|D_{KL}(q_{\phi}(\mathbf{c}\mid\mathbf{x})\mid\mid p(\mathbf{c}))-C_{\mathbf{c}}\right|
\end{split}\tag{21}
\end{equation}

The mean vectors $\mu_{i}$ of the modes of the Mixture prior are hyperparameters. The Discond-VAE-approx model sets each $\mu_{i}$ by a random sample from the standard Gaussian $\mathcal{N}(0,1)$. Interestingly, the Discond-VAE-exact models show similar performance with $\mu_{i}=0$ and a smaller variance. Therefore, we set $\mu_{i}=0$ for the Discond-VAE-exact models in Sec 4. We also tried optimizing the $\mu_{i}$ with an EM algorithm or a warm-up approach, but these showed inferior performance and often unstable training dynamics. (The details of these approaches are provided in the appendix.)

4 Experiments

We evaluate the Discond-VAE model on the CondSprites, dSprites [26], MNIST [21], and CelebA [23] datasets. On CondSprites, we compare the models' ability to disentangle a dataset with both public and private generative factors. By contrast, the experiments on dSprites evaluate the disentangling ability for a dataset with public generative factors only. Lastly, we evaluate Discond-VAE on the real-world datasets, MNIST and CelebA, quantitatively and qualitatively.

For each dataset, we assess the Discond-VAE model in an unsupervised manner. Since Discond-VAE divides the continuous variables of JointVAE into private and public variables, for a fair evaluation we compare Discond-VAE models with the same total continuous dimension and models with the same public dimension. For each quantitative score, we evaluate the model over ten random runs and report the mean, standard deviation, and best score. (See the appendix for the full architecture and training hyperparameters.)

Generative factors    Square    Ellipse
Scale                    ✓         ✓
Orientation              ✓         ✓
Position X               ✓         –
Position Y               –         ✓

Table 1: Generative factors of the CondSprites.

4.1 Dataset

Since MNIST and CelebA are standard benchmark datasets, we skip a detailed description of them. For CelebA, we center-cropped and resized each image to $64\times 64$ resolution, as in JointVAE [7], following the convention of [20].

dSprites [26] is a synthetic dataset for evaluating the disentanglement properties of a model. Each sample is a 2D shape image generated from five generative factors: one discrete factor of shape (square, ellipse, heart) and four continuous factors of scale, orientation, and position along the $x$ and $y$ axes. Since dSprites assumes independent generative factors, each combination of factors corresponds to one image; thus, dSprites has 737,280 images in total.

To evaluate disentangling ability further, we constructed the CondSprites dataset from dSprites [26]. CondSprites is designed to mimic the coexistence of class-independent and intra-class generative factors in real-world data. (See Sec 3.1 and Table 1 for details.) CondSprites has 15,360 two-dimensional images, 7,680 each for Square and Ellipse.
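These counts follow from the factor grid of dSprites (3 shapes, 6 scales, 40 orientations, and 32 × 32 positions); under our reading of the CondSprites construction, each shape keeps scale, orientation, and one position axis. A quick arithmetic check:

```python
# dSprites factor grid: 3 shapes x 6 scales x 40 orientations x 32 x 32 positions.
dsprites_total = 3 * 6 * 40 * 32 * 32
assert dsprites_total == 737_280

# CondSprites: each shape varies in scale, orientation, and ONE position axis,
# which is consistent with 6 * 40 * 32 images per class (our assumption).
per_class = 6 * 40 * 32
assert per_class == 7_680
assert 2 * per_class == 15_360
```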

4.2 Quantitative Disentanglement Evaluation

As a quantitative evaluation, we assess the Discond-VAE models by two disentanglement metrics (the FactorVAE metric [16] and MIG [5]) and by classification accuracy.

Most disentanglement metrics assume that the latent variables are continuous. For example, the $\beta$-VAE metric [12] and the FactorVAE metric [16] measure the degree of disentanglement by the accuracy of a classifier predicting the generative factor from the variance of each axis of the representation. However, Discond-VAE and JointVAE adopt a discrete variable to represent the discrete generative factor. Therefore, on CondSprites and dSprites, we evaluate the disentanglement metric on the continuous latent variables with respect to the continuous generative factors.

Moreover, CondSprites has class-dependent generative factors. Because each mode of the private variable can represent a different variation of the corresponding class, evaluating a disentanglement metric on the entire dataset gives an inappropriate assessment of the private generative factors. Therefore, we propose a conditional disentanglement evaluation: the conditional disentanglement metric is the expectation over the discrete variable of the class-wise disentanglement metric. With this metric, we can properly assess the disentanglement of the private factors as well as the public factors.

\begin{equation}
\textrm{Conditional Metric}=\mathbb{E}_{p(\mathbf{c})}\left[\textrm{Metric}(\mathbf{X}_{\mathbf{c}})\right]\tag{22}
\end{equation}

where $\mathbf{X}_{\mathbf{c}}$ denotes the examples from $\mathbf{X}$ with class $\mathbf{c}$.
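A sketch of Eq 22; here `metric_fn` stands in for any disentanglement metric (e.g. the FactorVAE score) and is a placeholder argument, not an implementation of one:

```python
import numpy as np

def conditional_metric(metric_fn, X, classes):
    """Eq 22: expectation over p(c) of the metric computed on each class subset.
    p(c) is taken to be the empirical class frequency (our assumption)."""
    score = 0.0
    for c in np.unique(classes):
        mask = classes == c
        score += mask.mean() * metric_fn(X[mask])
    return score

# Toy check with a trivial "metric": the mean of the subset.
X = np.array([0.0, 0.0, 2.0, 2.0])
classes = np.array([0, 0, 1, 1])
value = conditional_metric(np.mean, X, classes)  # 0.5 * 0.0 + 0.5 * 2.0
```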

We evaluate the disentanglement of the discrete factor by classification accuracy. CondSprites and dSprites have a discrete factor of shape, and MNIST has a discrete factor of digit type. We treat the discrete variable encoder $q(\mathbf{c}\mid\mathbf{x})$ as an unsupervised majority-vote classifier and evaluate its accuracy.
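The majority-vote evaluation can be sketched as follows: each discrete code predicts the most frequent ground-truth label among the examples mapped to it (names are ours):

```python
import numpy as np

def majority_vote_accuracy(codes, labels):
    """Accuracy of an unsupervised classifier q(c|x): every discrete code
    predicts the majority ground-truth label of the examples it receives."""
    codes, labels = np.asarray(codes), np.asarray(labels)
    correct = 0
    for c in np.unique(codes):
        members = labels[codes == c]
        correct += np.bincount(members).max()   # majority label count
    return correct / len(labels)

# Perfect clustering up to a permutation of the code indices scores 1.0:
acc = majority_vote_accuracy([1, 1, 0, 0], [0, 0, 1, 1])
```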

CondSprites
Method         Pb, Prv   Cond-FactorVAE(%)          Cond-MIG
                         Mean (std)      Best       Mean (std)      Best
Joint          10, -     74.3 (12.1)     87.0       0.188 (0.075)   0.284
               5, -      73.4 (4.1)      76.1       0.243 (0.041)   0.305
Exact (ours)   10, 3     96.0 (5.9)      100        0.291 (0.065)   0.322
               5, 3      98.5 (2.4)      100        0.385 (0.124)   0.466
               8, 2      99.8 (0.5)      100        0.362 (0.028)   0.388
               3, 2      95.5 (5.6)      100        0.267 (0.079)   0.363
Approx (ours)  10, 3     92.7 (8.5)      100        0.201 (0.120)   0.396
               5, 3      97.7 (2.3)      100        0.228 (0.103)   0.442
               8, 2      92.6 (6.6)      100        0.208 (0.101)   0.402
               3, 2      89.0 (10.1)     99.8       0.206 (0.101)   0.339

dSprites
Method         Pb, Prv   FactorVAE(%)               MIG
                         Mean (std)      Best       Mean (std)      Best
Joint          6, -      92.1 (0.2)      92.5       0.336 (0.002)   0.337
               4, -      98.9 (0.4)      99.1       0.223 (0.022)   0.241
Exact (ours)   6, 2      91.6 (0.0)      91.6       0.309 (0.024)   0.338
               4, 2      83.2 (3.7)      88.8       0.355 (0.017)   0.382
               2, 2      81.8 (2.2)      82.5       0.329 (0.065)   0.390
Approx (ours)  6, 2      90.1 (8.5)      99.8       0.299 (0.065)   0.376
               4, 2      92.1 (7.3)      99.8       0.340 (0.039)   0.419
               2, 2      89.4 (4.9)      94.0       0.397 (0.044)   0.454

Table 2: Disentanglement scores on dSprites and CondSprites (higher is better). Pb and Prv denote the dimensions of the public and private variables. For each JointVAE model, we compare Discond-VAE models with the same public dimension (upper rows) and with the same total continuous dimension (lower rows). On CondSprites, which has class-dependent and class-independent factors, Discond-VAE outperforms JointVAE. Under the unfavorable conditions of dSprites, the results show that Discond-VAE can adjust the private variable to represent the public factors.

4.3 Experiment results on CondSprites

4.3.1 Quantitative Result

In Table 2, the Discond-VAE-exact model achieves much higher disentanglement scores than JointVAE on CondSprites in both metrics. The Discond-VAE-approx model shows a higher FactorVAE metric and a MIG comparable to JointVAE. These results demonstrate that Discond-VAE can disentangle the private variables. Furthermore, we report the classification accuracy on CondSprites in Table 3, where Discond-VAE outperforms JointVAE by a significant margin. By disentangling the private variables from the discrete variable, Discond-VAE attains a more disentangled discrete representation, as evidenced by the higher classification accuracy in Table 3.

4.4 Experiment results on dSprites

Method         Pb   Prv   Mean (std)      Best
Joint          10   -     0.617 (0.068)   0.720
               5    -     0.599 (0.064)   0.704
Exact (ours)   10   3     0.630 (0.060)   0.763
               5    3     0.648 (0.083)   0.805
               8    2     0.641 (0.088)   0.778
               3    2     0.613 (0.103)   0.853
Approx (ours)  10   3     0.679 (0.121)   0.943
               5    3     0.595 (0.088)   0.825
               8    2     0.677 (0.118)   0.946
               3    2     0.724 (0.146)   0.962

Table 3: Unsupervised classification accuracy on CondSprites.
Method         Pb   Prv   Mean (std)       Best
Joint          6    -     0.448* (0.039)   0.531*
               4    -     0.440* (0.039)   0.541*
Exact (ours)   6    2     0.389 (0.040)    0.444
               4    2     0.369 (0.010)    0.381
               2    2     0.351 (0.005)    0.361
Approx (ours)  6    2     0.426 (0.044)    0.460
               4    2     0.434 (0.032)    0.449
               2    2     0.458 (0.012)    0.482

Table 4: Unsupervised classification accuracy on dSprites. * indicates results from [15].

4.4.1 Quantitative Result

dSprites is a synthetic dataset created from five independent generative factors. Hence, the probabilistic model of JointVAE is a more suitable assumption for representing dSprites than that of Discond-VAE. Nevertheless, the Discond-VAE-approx models show comparable classification accuracy in Table 4 and similar disentanglement metrics on the continuous variables in Table 2.

For the two-dimensional public variable, both Discond-VAE models show a relatively low FactorVAE score compared to the other models of the same type. In fact, a two-dimensional public variable is insufficient to model the four independent continuous factors of dSprites. Nevertheless, the FactorVAE scores exceed the theoretical limit attainable with only two public variables. This result implies that Discond-VAE can adapt the private variables to represent public generative factors. Considering the CondSprites and dSprites results together, Discond-VAE-exact disentangles the private factors better but has less flexibility in adapting the private variables.

4.5 Experiment results on MNIST

4.5.1 Accuracy and Negative Log-likelihood (NLL)

Table 5 shows the unsupervised classification accuracy and NLL score of each model on MNIST. The NLL of each model is evaluated by the standard importance-weighted sampling strategy [3] with 100 samples. Both types of Discond-VAE show similar or better accuracy and NLL scores than JointVAE while representing the additional private variables. This indicates that introducing the private variable to learn the intra-class variation improves the ability to disentangle the discrete factors and to learn the data distribution. Both types of Discond-VAE with a two-dimensional public variable show worse (higher) NLL scores than the other configurations. We suspect this is because a two-dimensional public variable is insufficient to model the major public variations of MNIST, such as the angle and thickness in Fig 3.
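The importance-weighted estimate of $-\log p(\mathbf{x})$ from $K$ posterior samples [3] can be sketched as follows (log-sum-exp for numerical stability; inputs are per-sample log densities, names are ours):

```python
import numpy as np

def iw_nll(log_p_x_given_z, log_p_z, log_q_z_given_x):
    """-log p(x) estimated as -log (1/K) sum_k w_k with importance weights
    w_k = p(x|z_k) p(z_k) / q(z_k|x), computed in log space."""
    log_w = log_p_x_given_z + log_p_z - log_q_z_given_x
    m = log_w.max()
    return float(-(m + np.log(np.mean(np.exp(log_w - m)))))

# Degenerate check: if every weight equals exp(-2), the estimate is exactly 2.
k = 100
nll = iw_nll(np.full(k, -2.0), np.zeros(k), np.zeros(k))
```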

Method         Pb   Prv   Mean (std)      Best    NLL
JointVAE       10   -     0.686 (0.092)   0.809   152.5
               4    -     0.708 (0.059)   0.792   153.3
Exact (ours)   10   3     0.686 (0.078)   0.807   145.4
               4    3     0.722 (0.099)   0.876   144.7
               8    2     0.704 (0.064)   0.828   147.3
               2    2     0.712 (0.038)   0.765   152.6
Approx (ours)  10   3     0.723 (0.069)   0.804   153.4
               4    3     0.718 (0.052)   0.792   153.0
               8    2     0.755 (0.076)   0.830   150.3
               2    2     0.716 (0.075)   0.832   163.0

Table 5: Unsupervised classification accuracy and negative log-likelihood (NLL) on MNIST.
(a) Angle (b) Slant (c) Thickness (d) Width
Figure 3: Public variable traversal on MNIST. Each subfigure corresponds to a different public variable, and each row shows the latent traversal of a test example. Discond-VAE encodes public generative factors of MNIST such as Angle, Slant, Thickness, and Width.

4.5.2 Latent Traversal

We inspected the latent traversals of Discond-VAE to evaluate its disentanglement qualitatively. For the continuous latent variables, each row corresponds to the traversal of one latent axis for a given example. For the discrete variable, each row shows the one-hot traversal of the discrete variable.
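The traversal procedure described above can be sketched as follows; `decode` is a hypothetical stand-in for the trained decoder, and the traversal range and step count are illustrative choices rather than the paper's settings.

```python
import numpy as np

def decode(z):
    # hypothetical decoder stub: maps latent codes to flat 32x32 "images"
    return np.tanh(z @ np.ones((z.shape[1], 32 * 32)))

def traversal_row(z_base, axis, low=-3.0, high=3.0, steps=10):
    # vary one latent axis over [low, high] while keeping the others fixed
    zs = np.repeat(z_base[None, :], steps, axis=0)
    zs[:, axis] = np.linspace(low, high, steps)
    return decode(zs)  # (steps, 32*32): one row of the traversal figure

z = np.zeros(8)  # e.g. the posterior mean of an encoded test example
row = traversal_row(z, axis=2)
```

Stacking one such row per test example (or per latent axis) reproduces the grid layout of the traversal figures.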

Discond-VAE shows a smooth variation in the angle, slant, thickness, and width of the decoded images as the public variable is traversed (Fig 3). In the discrete variable traversal (Fig 4(c)), we observe a transition in the digit type of the given examples. These results demonstrate that Discond-VAE can disentangle the public and discrete generative factors of MNIST.

Moreover, Discond-VAE discovers the class-specific variations of digit types 2 and 7. Figs 4(a) and 4(b) show the private variable traversals: the two private variables represent the ring of digit 2 and the center-stroke of digit 7, respectively. Since these variations are exclusive to their classes, the latent traversals of these variables show only minor or irrelevant variations on images of the other classes.

(a) Ring of 2 (b) Center-stroke of 7 (c) Discrete variable traversal
Figure 4: Private and discrete variable traversal on MNIST. Figs 4(a) and 4(b) show private variable traversals. The two private variables encode private generative factors (the Ring of 2 and the Center-stroke of 7) for their corresponding classes and irrelevant variations for the non-corresponding classes.

4.6 Experiment results on CelebA

4.6.1 Latent Traversal

To further test the generalizability of Discond-VAE, we inspected the latent traversals on CelebA. Fig 5 shows that Discond-VAE can disentangle the public and private generative factors in the more challenging domain of CelebA. The top two rows and the bottom two rows in Fig 5 show latent traversals of two examples from the two different classes. The traversal of the public variable (Fig 5(a)) generates a variation in brightness for both classes. For the private variable (Fig 5(b)), a Gender traversal is observed for class 1, but no such change is observed for class 0. (See the appendix for additional public traversals representing the background and face-width.)

(a) Public - Brightness (b) Private - Gender (Class 1)
Figure 5: Public and private variable traversal on CelebA. The public variable represents Brightness for both classes. The private variable represents Gender for class 1 (Female → Male) but irrelevant variations for class 0.

5 Related Work

Extracting disentangled features from data without supervision is an important task for representation learning [1]. Several VAE variants adopting continuous latent variables have been proposed to obtain more disentangled representations. For example, β-VAE [12] increases the disentangling pressure of VAE by increasing the weight of the KL divergence between the variational posterior q(z|x) and the prior p(z). The KL divergence regularizer of β-VAE penalizes not only the total correlation (TC) of the aggregate posterior q(z), which induces a factorized posterior, but also the mutual information between the data and the latent variables. Since penalizing the mutual information is detrimental to extracting meaningful features, several works proposed penalizing only the TC, in various ways (e.g., an auxiliary discriminator in FactorVAE [16], mini-batch weighted sampling in β-TCVAE [5], and covariance matching in DIP-VAE [19]). In addition, Bayes-Factor-VAE [17] divides the continuous variables into relevant and nuisance variables and promotes disentangled features by introducing hyper-priors on the variances of the Gaussian prior. HFVAE [9] uses a two-level hierarchical objective to control the independence between groups of variables and between variables in the same group.

Recently, a number of works have been proposed to model the intrinsic discreteness of real-world data. Some of them represent the discrete variable by modeling the continuous variable as a multimodal or tree-structured distribution. GMVAE [6] represents the continuous variables as a Gaussian mixture and infers the discrete variable by Bayes' rule. CascadeVAE [15] minimizes the TC of the continuous variables by iterative optimization and trains the model by alternating optimization between the discrete and continuous variables; the discrete variable is inferred via an inner maximization step [14]. Moreover, [10] and LTVAE [22] encode the latent variable as a tree-structured model and learn the tree structure from the data themselves: [10] employs the nested Chinese Restaurant Process [2] to accommodate a hierarchical prior, and LTVAE adjusts the tree structure of the latent variable via the EM algorithm. Furthermore, VQ-VAE [32] and VQ-VAE-2 [28] represent the continuous variable by a discrete code via a nearest-neighbor look-up.

By contrast, JointVAE [7] and InfoCatVAE [27] have an explicit encoder for the discrete variable. JointVAE [7] trains the continuous and discrete variables jointly, but it assumes that the discrete and continuous variables are independent of each other. By introducing the private latent variables, our Discond-VAE can represent dependence between the discrete and continuous variables. InfoCatVAE [27] encodes a conditional distribution of the continuous variable q(w|x,c); in this respect, the private variable of Discond-VAE and InfoCatVAE have a similar probabilistic formulation. However, InfoCatVAE adopts an axis-division of the Gaussian to separate the meaningful variables of each class. The axis-division strategy requires each subdivision to encode a latent variable even for the irrelevant classes, whereas the mode-division strategy of Discond-VAE does not. Therefore, Discond-VAE can promote more disentangled representations than InfoCatVAE, even with the private variable alone.

6 Conclusion

We proposed Discond-VAE for learning the public and private continuous generative factors and the discrete generative factor from data. We developed a probabilistic framework and learning objective for Discond-VAE and suggested two implementations of the framework according to how the private variable is handled. We also proposed the CondSprites dataset to evaluate a model's capacity to disentangle class-dependent generative factors. We then evaluated both types of Discond-VAE on CondSprites, dSprites, MNIST, and CelebA. The experimental results show that Discond-VAE can disentangle the class-dependent and class-independent factors in an unsupervised manner. Moreover, Discond-VAE shows a moderate degree of disentanglement even on dSprites, which has only independent generative factors.

References

  • [1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [2] David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM), 57(2):1–30, 2010.
  • [3] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
  • [4] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
  • [5] Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
  • [6] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
  • [7] Emilien Dupont. Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pages 710–720, 2018.
  • [8] Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.
  • [9] Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, Narayanaswamy Siddharth, Brooks Paige, Dana H Brooks, Jennifer Dy, and Jan-Willem van de Meent. Structured disentangled representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2525–2534, 2019.
  • [10] Prasoon Goyal, Zhiting Hu, Xiaodan Liang, Chenyu Wang, and Eric P Xing. Nonparametric variational auto-encoders for hierarchical representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5094–5102, 2017.
  • [11] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. NBS Applied Mathematics Series, 33, 1954.
  • [12] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • [13] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • [14] Yeonwoo Jeong and Hyun Oh Song. Efficient end-to-end learning for quantizable representations. In ICML, 2018.
  • [15] Yeonwoo Jeong and Hyun Oh Song. Learning discrete and continuous factors of data via alternating disentanglement. In International Conference on Machine Learning (ICML), 2019.
  • [16] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2649–2658, 2018.
  • [17] Minyoung Kim, Yuting Wang, Pritish Sahu, and Vladimir Pavlovic. Bayes-factor-vae: Hierarchical bayesian deep auto-encoder models for factor disentanglement. In Proceedings of the IEEE International Conference on Computer Vision, pages 2979–2987, 2019.
  • [18] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [19] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
  • [20] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pages 1558–1566. PMLR, 2016.
  • [21] Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • [22] Xiaopeng Li, Zhourong Chen, Leonard KM Poon, and Nevin L Zhang. Learning latent superstructures in variational autoencoders for deep multidimensional clustering. arXiv preprint arXiv:1803.05206, 2018.
  • [23] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [24] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  • [25] Emile Mathieu, Tom Rainforth, N Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pages 4402–4412, 2019.
  • [26] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
  • [27] Edouard Pineau and Marc Lelarge. Infocatvae: representation learning with categorical variational autoencoders. arXiv preprint arXiv:1806.08240, 2018.
  • [28] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • [29] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • [30] Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, pages 185–194, 2018.
  • [31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [32] Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, volume 30, 2017.

Appendix A KL divergence of Private and Discrete Variables

The KL divergence regularizer of the private and discrete variables can be decomposed as in Eqs 23-30 below. Eq 29 is the expectation of the mode-wise regularizer of the private variable, and Eq 30 is the regularizer of the discrete variable.

\begin{align}
D_{KL}(q_{\phi}(\mathbf{w},\mathbf{c}\mid\mathbf{x}) \,\|\, p(\mathbf{w},\mathbf{c}))
&= \mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\,\mathbb{E}_{q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})}\left[\log\frac{q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})\, q_{\phi}(\mathbf{c}\mid\mathbf{x})}{p(\mathbf{w}\mid\mathbf{c})\, p(\mathbf{c})}\right] \tag{23--24}\\
&= \mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[\mathbb{E}_{q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})}\left[\log\frac{q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})}{p(\mathbf{w}\mid\mathbf{c})}\right]\right]
+ \mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[\log\frac{q_{\phi}(\mathbf{c}\mid\mathbf{x})}{p(\mathbf{c})}\right] \tag{25--26}\\
&= \mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[\mathbb{E}_{q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})}\left[\log\frac{q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c})}{p(\mathbf{w}\mid\mathbf{c})}\right]\right]
+ D_{KL}(q_{\phi}(\mathbf{c}\mid\mathbf{x}) \,\|\, p(\mathbf{c})) \tag{27--28}\\
&= \mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}) \,\|\, p(\mathbf{w}\mid\mathbf{c}))\right]
+ D_{KL}(q_{\phi}(\mathbf{c}\mid\mathbf{x}) \,\|\, p(\mathbf{c})) \tag{29--30}
\end{align}
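This decomposition can be checked numerically on a toy posterior: the right-hand side uses the closed-form Gaussian and categorical KL terms, and the left-hand side is a Monte Carlo estimate of the joint KL. All distributions and sizes below are arbitrary illustrative choices, not quantities from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 2                                 # classes, private dims (toy sizes)
q_c = np.array([0.5, 0.3, 0.2])             # q(c|x)
p_c = np.full(d, 1.0 / d)                   # uniform prior p(c)
mu_q = rng.normal(size=(d, D))              # posterior means per mode
var_q = rng.uniform(0.5, 2.0, size=(d, D))  # posterior variances per mode
mu_p = rng.normal(size=(d, D))              # prior mode means (unit variance)

def kl_gauss(mq, vq, mp):
    # KL(N(mq, diag(vq)) || N(mp, I)) in closed form
    return 0.5 * np.sum((mq - mp) ** 2 + vq - np.log(vq) - 1)

# right-hand side (Eqs 29-30): E_{q(c|x)}[KL_w] + KL(q(c|x) || p(c))
rhs = sum(q_c[i] * kl_gauss(mu_q[i], var_q[i], mu_p[i]) for i in range(d)) \
    + float(np.sum(q_c * np.log(q_c / p_c)))

# left-hand side (Eq 23): Monte Carlo estimate of the joint KL
n = 200_000
c = rng.choice(d, size=n, p=q_c)
w = mu_q[c] + np.sqrt(var_q[c]) * rng.normal(size=(n, D))
log_q = -0.5 * np.sum(np.log(2 * np.pi * var_q[c]) + (w - mu_q[c]) ** 2 / var_q[c],
                      axis=1) + np.log(q_c[c])
log_p = -0.5 * np.sum(np.log(2 * np.pi) + (w - mu_p[c]) ** 2,
                      axis=1) + np.log(p_c[c])
lhs = float(np.mean(log_q - log_p))
```

The two sides agree up to Monte Carlo error, as the algebraic identity predicts.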

Appendix B Mixture prior Optimization

Discond-VAE models the private variable as follows.

\begin{align}
p(\mathbf{w}\mid\mathbf{c}) &= \prod_{i} p(\mathbf{w}\mid\mathbf{c}=e_{i})^{c_{i}} = \prod_{i} \mathcal{N}\left(\mu_{i}, I\right)^{c_{i}} \tag{31--32}\\
q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}) &= \prod_{i} q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i})^{c_{i}} = \prod_{i} \mathcal{N}\left(\mu_{i}(\mathbf{x},\mathbf{c}), \Sigma_{i}(\mathbf{x},\mathbf{c})\right)^{c_{i}} \tag{33--34}
\end{align}

Discond-VAE sets the mean vectors $\mu_{i}$ of the Gaussian-mixture prior to zero for Discond-VAE-exact and to random samples for Discond-VAE-approx. These initializations do not reflect the semantic relationship of the discrete generative factors. Thus, we also tried EM-like and warm-up approaches to optimize the $\mu_{i}$.

The KL divergence term of private variable 𝐰\mathbf{w} is expressed as

\begin{equation}
\mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}) \,\|\, p(\mathbf{w}\mid\mathbf{c}))\right]
= \sum_{i=1}^{d}\alpha_{i}\, D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i}) \,\|\, p(\mathbf{w}\mid\mathbf{c}=e_{i})) \tag{35}
\end{equation}

where $\alpha_{i} = q_{\phi}(\mathbf{c}=e_{i}\mid\mathbf{x})$.

Since qϕ(𝐰𝐱,𝐜=ei)q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i}) is a factorized Gaussian distribution, the KL divergence of each mode becomes

\begin{equation}
D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}=e_{i}) \,\|\, p(\mathbf{w}\mid\mathbf{c}=e_{i}))
= \frac{1}{2}\sum_{j}\left(\left(\mu_{i,j}(\mathbf{x},\mathbf{c})-\mu_{i,j}\right)^{2}+\sigma_{i,j}^{2}(\mathbf{x},\mathbf{c})-\log\sigma_{i,j}^{2}(\mathbf{x},\mathbf{c})-1\right) \tag{36}
\end{equation}

where jj indexes the dimension of the private variable. Therefore, we can obtain a closed-form solution 𝝁\boldsymbol{\mu^{*}} of the optimization problem below over the 𝝁\boldsymbol{\mu}.

\begin{equation}
\mu_{i,j}^{*} = \operatorname*{arg\,min}_{\mu_{i,j}}\; \mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}) \,\|\, p(\mathbf{w}\mid\mathbf{c}))\right]
= \frac{\sum_{n}\alpha_{n,i}(x_{n})\,\mu_{i,j}(x_{n})}{\sum_{n}\alpha_{n,i}(x_{n})} \tag{37}
\end{equation}

where nn indexes the sample xnx_{n} from training data.

To exploit the semantic features extracted by the model, we updated the mean vectors $\boldsymbol{\mu}$ (Eq 37) by EM-like or warm-up approaches. The warm-up approach updates the mean vectors once, right after $10\%$ of the training epochs. The EM-like approach updates them every $10\%$ of the training epochs. However, both approaches performed worse than the initialize-and-fix methods and often did not converge.
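The closed-form update of Eq 37 is a responsibility-weighted mean, which can be verified on random toy data: the weighted squared distance it minimizes has zero gradient at the optimum and does not decrease under perturbation. The sizes and distributions below are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, D = 500, 3, 2                        # toy sizes: samples, classes, dims
alpha = rng.dirichlet(np.ones(d), size=N)  # alpha_{n,i} = q(c = e_i | x_n)
mu_enc = rng.normal(size=(N, d, D))        # mu_{i,j}(x_n): per-mode posterior means

# Eq 37: responsibility-weighted mean of the encoded means, per mode i
mu_star = (alpha[:, :, None] * mu_enc).sum(axis=0) / alpha.sum(axis=0)[:, None]

def objective(mu):
    # the mu-dependent part of E_x E_{q(c|x)}[KL]: a weighted squared distance
    return float((alpha[:, :, None] * (mu_enc - mu[None]) ** 2).sum())

base = objective(mu_star)
# (half the negative) gradient at mu_star; should vanish at the minimizer
grad = (alpha[:, :, None] * (mu_enc - mu_star[None])).sum(axis=0)
```

This is the same weighted-least-squares argument behind the M-step of EM for Gaussian-mixture means.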

Encoder:
  1×32 Conv 4×4, stride 2
  (32×32 Conv 4×4, stride 2)
  32×64 Conv 4×4, stride 2
  64×64 Conv 4×4, stride 2
  Linear (64*4*4) × 256
  Public head: Linear 256 × Pb
  Private head: Linear (256+d) × d*Pr
  Discrete head: Linear 256 × d

Decoder:
  Public embedding: Linear Pb × 128
  Private+Discrete embedding: Linear (d*Pr + d) × 128
  Linear 256 × (64*4*4)
  (64×64 Conv Transpose 4×4, stride 2)
  64×32 Conv Transpose 4×4, stride 2
  32×32 Conv Transpose 4×4, stride 2
  32×1 Conv Transpose 4×4, stride 2

Table 6: Discond-VAE-exact architecture. For each layer, a×b denotes a in-channels and b out-channels. The layers in parentheses are added for dSprites and CondSprites. d denotes the dimension of the discrete variable; Pr and Pb denote the dimensions of the private and public variables, respectively.
Encoder:
  1×32 Conv 4×4, stride 2
  (32×32 Conv 4×4, stride 2)
  32×64 Conv 4×4, stride 2
  64×64 Conv 4×4, stride 2
  Linear (64*4*4) × 256
  Public head: Linear 256 × Pb
  Private head: Linear (256+d) × d*Pr
  Discrete head: Linear 256 × d

Decoder:
  Linear (Pb + Pr + d) × 256
  Linear 256 × (64*4*4)
  (64×64 Conv Transpose 4×4, stride 2)
  64×32 Conv Transpose 4×4, stride 2
  32×32 Conv Transpose 4×4, stride 2
  32×1 Conv Transpose 4×4, stride 2

Table 7: Discond-VAE-approx architecture. The same notation as in Table 6 applies.

Appendix C Implementation details

C.1 Network Architecture

We use network architectures similar to those in JointVAE [7]. The only modification is that the Discond-VAE-exact model embeds the public variable and the private and discrete variables separately (see the decoder in Table 6).
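As a quick sanity check on the shapes in Tables 6 and 7 (assuming padding 1, which the tables do not state, and 32×32 inputs for MNIST), each conv layer halves the spatial size, so the flattened feature dimension matches the Linear (64*4*4) × 256 layer; the extra parenthesized layer accounts for the 64×64 dSprites images.

```python
def conv_out(size, k=4, s=2, p=1):
    # standard Conv2d output-size formula: floor((size + 2p - k) / s) + 1
    return (size + 2 * p - k) // s + 1

def flatten_dim(input_size, n_convs, out_channels=64):
    # spatial size after the conv stack, times the final channel count
    size = input_size
    for _ in range(n_convs):
        size = conv_out(size)
    return out_channels * size * size

mnist_dim = flatten_dim(32, n_convs=3)     # 32x32 inputs, three conv layers
dsprites_dim = flatten_dim(64, n_convs=4)  # 64x64 inputs, extra (...) layer
```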

C.2 Training details

As a reminder, the learning objective of Discond-VAE is expressed as the following.

\begin{equation}
\begin{split}
\max_{\theta,\phi}\;\mathcal{L}_{\textrm{Cond}}(\theta,\phi) = \;&\mathbb{E}_{q_{\phi}(\mathbf{z},\mathbf{w},\mathbf{c}\mid\mathbf{x})}[\log p_{\theta}(\mathbf{x}\mid\mathbf{z},\mathbf{w},\mathbf{c})]\\
&-\beta_{\mathbf{z}}\left|D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z}))-C_{\mathbf{z}}\right|\\
&-\beta_{\mathbf{w}}\left|\mathbb{E}_{q_{\phi}(\mathbf{c}\mid\mathbf{x})}\left[D_{KL}(q_{\phi}(\mathbf{w}\mid\mathbf{x},\mathbf{c}) \,\|\, p(\mathbf{w}\mid\mathbf{c}))\right]-C_{\mathbf{w}}\right|\\
&-\beta_{\mathbf{c}}\left|D_{KL}(q_{\phi}(\mathbf{c}\mid\mathbf{x}) \,\|\, p(\mathbf{c}))-C_{\mathbf{c}}\right|
\end{split} \tag{38}
\end{equation}

For both Discond-VAE models, we apply the linear scheduling of capacity [4] as in JointVAE. Each capacity $C_{\mathbf{z}}, C_{\mathbf{w}}, C_{\mathbf{c}}$ is linearly increased from 0 to $C$ over the iteration hyperparameter. For JointVAE, we applied the same hyperparameters as in [7] for MNIST and dSprites. All models in the paper are optimized by the Adam optimizer with betas=(0.9, 0.999), eps=1e-8, and no weight decay, which is the default setting in the PyTorch library.
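The capacity-scheduled objective can be sketched as follows. The scalar KL values and helper names are illustrative; the β and C values correspond to the MNIST column of Table 8 for Discond-VAE-exact.

```python
def capacity(step, c_max, anneal_steps):
    # linear capacity schedule: 0 -> c_max over anneal_steps, then constant
    return min(c_max, c_max * step / anneal_steps)

def discond_loss(recon_nll, kl_z, kl_w, kl_c, betas, caps):
    # Eq 38, negated for minimization: capacity-constrained KL penalties
    (bz, bw, bc), (cz, cw, cc) = betas, caps
    return (recon_nll
            + bz * abs(kl_z - cz)
            + bw * abs(kl_w - cw)
            + bc * abs(kl_c - cc))

# MNIST Discond-VAE-exact setting: beta = (25, 25, 5), C -> (5, 5, 25)
caps = tuple(capacity(25_000, c, 25_000) for c in (5.0, 5.0, 25.0))
loss = discond_loss(100.0, 6.0, 5.0, 24.0, (25.0, 25.0, 5.0), caps)
```

The absolute-value form penalizes a KL term for falling below its target capacity as well as for exceeding it, which is what forces information into each latent group as C grows.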

             CondSprites                  dSprites                  MNIST
Pb           10     8     5     3         6      4      2           10     8     4     2
Pr           3      2     3     2         2      2      2           3      2     3     2
d            2      2     2     2         3      3      3           10     10    10    10
β_z          30     30    30    30        200    100    200         25     25    25    25
β_w          30     30    30    30        200    100    200         25     25    25    25
β_c          30     40    30    40        200    100    200         5      2.5   5     3
C_z          30     30    30    30        20     20     20          5      5     5     5
C_w          30     30    30    30        20     20     20          5      5     5     5
C_c          5      5     5     10        1.1    1.1    1.1         25     25    25    25
iteration    25000  25000 25000 25000     300000 300000 300000      25000  25000 25000 25000

Table 8: Hyperparameters of Discond-VAE-exact. Each capacity is increased linearly from 0 to C over the listed iterations.
             CondSprites                  dSprites                  MNIST
Pb           10     8     5     3         6      4      2           10     8     4     2
Pr           3      2     3     2         2      2      2           3      2     3     2
d            2      2     2     2         3      3      3           10     10    10    10
β_z          10     10    10    20        20     20     20          30     10    20    10
β_w          20     20    20    40        40     40     40          60     20    40    20
β_c          20     20    20    40        40     40     40          60     20    40    20
C_z          20     20    20    10        10     10     10          10     10    10    10
C_w          20     20    20    10        10     10     10          10     10    10    10
C_c          5      5     5     5         5      5      5           10     10    5     10
iteration    25000  25000 25000 25000     300000 300000 300000      25000  25000 25000 25000

Table 9: Hyperparameters of Discond-VAE-approx. Each capacity is increased linearly from 0 to C over the listed iterations.
             CondSprites        dSprites            MNIST
Pb           10     5           6       4           10     4
d            2      2           3       3           10     10
β_z          30     30          150     150         30     30
β_c          30     30          150     150         30     30
C_z          30     30          40      40          5      5
C_c          5      5           1.1     1.1         5      5
iteration    25000  25000       300000  300000      25000  25000

Table 10: Hyperparameters of JointVAE. Each capacity is increased linearly from 0 to C over the listed iterations.

C.3 Discond-VAE-exact Hyperparameters

C.3.1 MNIST

  • Epochs: 100

  • Batch size: 64

  • Optimizer: Adam with learning rate 5e-4

C.3.2 dSprites

  • Epochs: 30

  • Batch size: 64

  • Optimizer: Adam with learning rate 5e-4

C.3.3 CondSprites

  • Epochs: 200

  • Batch size: 64

  • Optimizer: Adam with learning rate 5e-4

All other hyperparameters are listed in Table 8.

C.3.4 Hyperparameters Search range

  • Learning rate - {5e-4}

  • (β_z = β_w) - {5, 10, 15, 20, 25, 30, 50, 100, 200}

  • β_c - {1, 3, 5, 10, 20, 30, 50, 100, 200}

  • (C_z = C_w) - {5, 10, 20, 30, 50}

  • C_c - {1, 1.1, 5, 10, 25, 50}

C.4 Discond-VAE-approx Hyperparameters

C.4.1 MNIST

  • Epochs: 100

  • Batch size: 64

  • Optimizer: Adam with learning rate 2e-3

C.4.2 dSprites

  • Epochs: 20

  • Batch size: 64

  • Optimizer: Adam with learning rate 1e-3

C.4.3 CondSprites

  • Epochs: 300

  • Batch size: 64

  • Optimizer: Adam with learning rate 1e-3

All other hyperparameters are listed in Table 9.

C.4.4 Hyperparameters Search range

  • Learning rate - { 1e-3, 2e-3 }

  • β_z - {10, 20, 30, 50}

  • β_z : (β_w = β_c) - {2:1, 1:1, 1:2}

  • (C_z = C_w) - {5, 10, 20}

  • C_c - {1, 5, 10}

C.5 JointVAE Hyperparameters

C.5.1 MNIST

  • Epochs: 100

  • Batch size: 64

  • Optimizer: Adam with learning rate 5e-4

C.5.2 dSprites

  • Epochs: 30

  • Batch size: 64

  • Optimizer: Adam with learning rate 5e-4

C.5.3 CondSprites

  • Epochs: 200

  • Batch size: 64

  • Optimizer: Adam with learning rate 5e-4

All other hyperparameters are listed in Table 10.

Appendix D CondSprites Details

CondSprites is a subset of dSprites [26] designed to model the dependence between the continuous and discrete generative factors. We removed the Heart shape to maintain a reasonable number of examples. Since the Square images do not vary in y-position (Table 1), we fix the Square images at the center of the y-axis. In other words, the generative factor for the y-position of Square images in CondSprites is fixed to 16 (the center of the range (0, 32)). Likewise, the generative factor for the x-position of every Ellipse image is fixed to 16. The total number of CondSprites examples is 15360 = 6 (Scale) × 40 (Orientation) × (32 (x for Squares) + 32 (y for Ellipses)).
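The construction above can be sketched by filtering a dSprites-like factor grid; the index ranges below are hypothetical stand-ins for the actual dSprites factor arrays.

```python
from itertools import product

# factor grids mirroring dSprites (with the Heart shape removed):
# 6 scales x 40 orientations x 32 x-positions x 32 y-positions
scales, orientations, positions = range(6), range(40), range(32)

def keep(shape, x, y):
    # Squares are fixed at the y-center (index 16), Ellipses at the x-center
    return y == 16 if shape == "square" else x == 16

count = sum(1 for shape in ("square", "ellipse")
            for _, _, x, y in product(scales, orientations, positions, positions)
            if keep(shape, x, y))
```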

Generative factors   Square   Ellipse
Scale                  ✓        ✓
Orientation            ✓        ✓
Position X             ✓        –
Position Y             –        ✓

Table 11: Generative factors of the CondSprites. A check mark denotes that the factor varies for the corresponding shape.
(a) Ring of 2 (b) Center-stroke of 7
Figure 6: Private variable traversals of all classes on MNIST. The two private variables represent the private generative factors Ring of 2 and Center-stroke of 7. Each private variable shows only minor or irrelevant variations for the non-corresponding classes.
(a) Background (b) Face-width
Figure 7: Public variable traversals on CelebA. The public variables represent the Background and Face-width regardless of class.

Appendix E Traversal

Fig 6 shows the private variable traversals of all classes in MNIST [21]. Since the Ring of 2 and Center-stroke of 7 are exclusive intra-class variations of digit-type 2 and 7, latent traversals on the other classes show relatively minor or irrelevant variations. Note that the private variable traversal of an image shows the latent traversal of the corresponding mode in the Mixture of Gaussian.

Fig 7 shows additional public variable traversals on CelebA [23]. The top two rows and the bottom two rows show latent traversals of two examples from the two classes. Since the public variable is independent of the class, each public variable encodes the same variation for both classes. In Fig 7(a), traversing the public variable changes the background from dark-left/bright-right to bright-left/dark-right. In Fig 7(b), traversing the public variable changes the face from wide to narrow.