¹¹institutetext: Jiantao Wu ²²institutetext: Lin Wang ³³institutetext: Bo Yang ⁴⁴institutetext: Fanqi Li ⁵⁵institutetext: Chunxiuzi Liu ⁶⁶institutetext: Jin Zhou⁷⁷institutetext: Shandong Provincial Key Laboratory of Network Based Intelligent Computing,
University of Jinan, Jinan 250022, China,
⁷⁷email: wangplanet@gmail.com (L. Wang)

DEFT: Distilling Entangled Factors by Preventing Information Diffusion^†^†thanks: *Corresponding author

Jiantao Wu Lin Wang Bo Yang Fanqi Li Chunxiuzi Liu Jin Zhou

(Received: date / Accepted: date)

Abstract

Disentanglement is a highly desirable property of representation owing to its similarity to human understanding and reasoning. Many works achieve disentanglement upon information bottlenecks (IB). Despite their elegant mathematical foundations, the IB branch usually exhibits lower performance. In order to provide an insight into the problem, we develop an annealing test to calculate the information freezing point (IFP), which is a transition state to freeze information into the latent variables. We also explore these clues or inductive biases for separating the entangled factors according to the differences in the IFP distributions. We found the existing approaches suffer from the information diffusion problem, according to which the increased information diffuses in all latent variables.

Based on this insight, we propose a novel disentanglement framework, termed the distilling entangled factor (DEFT), to address the information diffusion problem by scaling backward information. DEFT applies a multistage training strategy, including multigroup encoders with different learning rates and piecewise disentanglement pressure, to disentangle the factors stage by stage. We evaluate DEFT on three variants of dSprite and SmallNORB, which show low-variance and high-level disentanglement scores. Furthermore, the experiment under the correlative factors shows incapable of TC-based approaches. DEFT also exhibits a competitive performance in the unsupervised setting.

Keywords:

Disentanglement Information Bottleneck VAE Representation Learning Information Diffusion

1 Introduction

An understanding and reasoning about the world based on a limited set of observations is important in the field of artificial intelligence. For instance, we can infer the movement of a ball in motion at a single glance, as the human brain is capable of disentangling positions from a set of images without supervision. Therefore, disentanglement learning is highly desirable to build intelligent applications. A disentangled representation has been proposed to be beneficial for a large variety of downstream tasks (Schölkopf et al., 2012; Peters et al., 2017). According to Kim and Mnih (2018), a disentangled representation promotes interpretable semantic information, resulting in substantial advancement, which includes but is not limited to reducing the performance gap between humans and AI approaches (Lake et al., 2017; Higgins et al., 2018b). Other instances of disentangled representations include semantic image understanding and generation (Lample et al., 2017), zero-shot learning (Zhu et al., 2019), and reinforcement learning (Higgins et al., 2017b).

As depicted in the seminal paper by Bengio et al. (2013), humans can understand and reason from a complex observation, after which they can reduce the explanatory factors. The observations are generated by explanatory ground-truth factors ${\bm{c}}$ , which are invisible from the observations. The task of disentanglement learning aims to obtain a disentangled representation that separates these factors from the observations. The notion of disentanglement remains an open topic (Higgins et al., 2018a; Do and Tran, 2020), and we follow a strict version of discourse that one and only one latent variable $z_{i}$ represents one corresponding factor, $c_{j}$ (Burgess et al., 2018).

Locatello et al. (2019) proved the impossibility of disentanglement learning without inductive biases on the model and data. One popular inductive bias on the model assumes that the latent variables are independent. These approaches, which are based on total correlation (TC), dominate visual disentanglement learning (Chen et al., 2018; D. S. Kim et al., 2018; Kumar et al., 2018). This assumption is correct when the factors are sampled uniformly; however, the independent factors show statistical relevance in reality (Träuble et al., 2021). For instance, we observe that men are more likely to have short hair, and based on the observations, there is a correlation between gender and hair length. However, a man who is not bald may grow long hair if desired. In other words, sex does not determine hair length, and they are two independent factors. Therefore, the exploration of disentanglement approaches beyond the independence assumption is vital to reality applications.

Another popular research approach is based on information theory (Insu Jeon et al., 2019; Chen et al., 2016). They hypothesize that the gradually increased information bottleneck (IB) leads to a better disentanglement (Burgess et al., 2018; Dupont, 2018). Unfortunately, in practice, the approaches based on IB usually exhibit lower performance than those penalizing the TC (Locatello et al., 2019). However, it is important to understand whether this means that the total correlation beats the IB. It is believed that the answer is negative. In this research, we investigate the reason for which IBs fall behind TC in practice. We found that the information diffusion (ID) problem is an invisible hurdle that should be addressed in the IB community.

Information diffusion indicates that one factor’s information diffuses into two or more latent variables; thus, the disentanglement scores fluctuate during training. Fig. 1 shows the disentanglement scores of three approaches with the best hyperparameter settings, and it is observed that a large number of trials have a high variance¹¹1We use the pretrained models in disentanglement lib by Locatello et al... We bridge the ID problem with the instability of the current approaches in Section 3.1.

Refer to caption — Figure 1: The distribution of beta VAE metric, MIG, and DCI disentanglement on dSprites. Models are abbreviated (0= $\beta$ -VAE, 1= $\beta$ -TCVAE, 2=AnnealedVAE), and 50 trials are run with different random seeds.

In this paper, we trace the ID problem by measuring the NMI1. The learned information may diffuse when AnnealedVAE and CascadeVAEC learn new information. We have developed the annealing test to measure information freezing point (IFP) that the critical value for learning information from inputs. The current IB-base approaches may prefer data with significant separable factors, because IB assumes the latent components have different contributions (Burgess et al., 2018).

Inspired by distillation²²2Distillation is the process of separating a mixture into its components by heating at an appropriate temperature, such that components boil and freeze into the target containers. in chemistry, we can divide the training process into several stages and extract one component at each stage. In particular, we propose a framework, called the distilling entangled factor (DEFT), to disentangle factors stage-by-stage. According to the IFP distribution, DEFT chooses selective pressure to enable some of the information to pass through the IB. In addition, DEFT reduces the backward information of the first $m$ encoders by reducing the learning rate to reduce the ID problem. We evaluate DEFT on three four datasets, which shows robust performances. We also exam DEFT on the dataset with correlative factors. We have published our codes and all experimental settings in dlib for PyTorch forked from disentanglement lib. Our contributions are summarized in the following:

•

We hypothesize that the ID problem is one reason for the low performances of IB-based approaches.
•

We propose DEFT, a multistage disentangling framework, to address the ID problem by scaling the backward information.

2 Preliminary

2.1 Disentanglement Approaches

Variational Autoencoder

In variational inference, posterior $p(z|x)$ is intractable. The variational autoencoder (VAE) (Kingma and Welling, 2014) uses a neural network $q_{\phi}(z|x)$ (encoder) to approximate the posterior $p(z|x)$ . The other neural network $p_{\theta}(x|z)$ (decoder) rebuilds the observations. The objective of the VAE is to optimize the evidence lower bound (ELBO):

\displaystyle\mathcal{L}(\theta,\phi)=\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log{p_{\theta}(x|z)}]-D_{\mathrm{KL}}(q_{\phi}(z|x)||p(z)).

(1)

$\beta$ -VAE

Higgins et al. discovered the relationship between the disentanglement and the Kullback–Liebler (KL) divergence penalty strength. They proposed the $\beta$ -VAE to introduce additional pressure on the KL term:

\displaystyle\mathcal{L}^{1}(\theta,\phi;\beta)=\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log{p_{\theta}(x|z)}]-\beta D_{\mathrm{KL}}(q_{\phi}(z|x)||p(z)).

(2)

$\beta$ controls the pressure for the posterior, $q_{\phi}(z|x)$ , to match the factorized unit Gaussian prior, $p(z)$ . However, there is a trade-off between the quality of the reconstructed images and disentanglement.

AnnealedVAE

Burgess et al. (2018) proposed the AnnealedVAE, which progressively increases the information capacity of the latent variables while training:

\displaystyle\mathcal{L}^{2}(\theta,\phi;C)=\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log{p_{\theta}(x|z)}]-\gamma\left|D_{\mathrm{KL}}(q_{\phi}(z|x)||p(z))-C\right|,

(3)

where $\gamma$ is a sufficiently large constant (usually 1000) to constrain the latent information, and $C$ is a value that gradually increases from zero to a large number.

$\beta$ -TCVAE

The TC Watanabe (1960) quantifies the dependency among variables. $\beta$ -TCVAE (Chen et al., 2018) decomposed the KL term into three parts: mutual information (MI), total correlation (TC), and dimensional-wise KL (DWKL). They proposed that the TC be penalized to achieve high reconstruction quality and disentanglement:

	$\displaystyle\mathcal{L}^{3}(\theta,\phi;\beta)=$	$\displaystyle\mathbb{E}_{q_{\phi}(\mathbf{z}\|\mathbf{x})}[\log{p_{\theta}(x\|z)}]-\mathbb{E}_{q(z,n)}\left[\log\frac{q_{\phi}(z\mid n)p(n)}{q_{\phi}(z)p(n)}\right]-$		(4)
		$\displaystyle\beta\mathbb{E}_{q_{\phi}(z)}\left[\log\frac{q_{\phi}(z)}{\prod_{j}q_{\phi}\left(z_{j}\right)}\right]-\sum_{j}\mathbb{E}_{q_{\phi}\left(z_{j}\right)}\left[\log\frac{q_{\phi}\left(z_{j}\right)}{p\left(z_{j}\right)}\right].$		(4)

CascadeVAEC

Jeong and Song provided another total correlation penalization through information cascading. They proved that $TC(z)=\sum_{i=2}^{d}I(z_{1:i-1};z_{i})$ . CascadeVAEC, the continuous version, sequentially relieves one latent variable at a time, encouraging the model to disentangle one factor during the $i$ -th stage:

	$\displaystyle\mathcal{L}^{4}(\theta,\phi;\beta_{l},\beta_{h})=$	$\displaystyle\mathbb{E}_{q_{\phi}(\mathbf{z}\|\mathbf{x})}[\log{p_{\theta}(x\|z)}]-$		(5)
		$\displaystyle\beta_{l}D_{\mathrm{KL}}(q_{\phi}(z_{1:i}\|x)\|\|p(z_{1:i}))-\beta_{h}D_{\mathrm{KL}}(q_{\phi}(z_{i+1:d}\|x)\|\|p(z_{i+1:d})),$		(5)

where $\beta_{l}$ is a small value for opening the information flow, $\beta_{h}$ is a large beta coefficient, and $d$ is the number of dimensions.

2.2 Disentanglement Evaluation

Supervised Disentanglement Metric

Artificial datasets usually have ground-truth labels, such as dSprites (Matthey et al., 2017). The supervised metrics use this label information to measure the disentanglement responsibly. Several metrics have been proposed to evaluate the disentanglement. These include the BetaVAE metric (Higgins et al., 2017a), FactorVAE metric (Kim and Mnih, 2018), MI gap (Chen et al., 2018), modularity (Ridgeway and Mozer, 2018), DCI (Eastwood and Williams, 2018), and SAP score (Kumar et al., 2018). Shannon MI is an information-theoretic quantity that measures the amount of information shared between two variables. The MIG measures the gap between the top two latent variables with the highest MI (Chen et al., 2018). $j^{1}=\operatorname*{arg\,max}_{j}I(z_{j};c_{k})$ is the index of latent variables that maximize the largest MI for $c_{k}$ . $j^{2}=\operatorname*{arg\,max}_{j}I(z_{j};c_{k}),j\neq j^{1}$ is the second largest MI. We define the $m$ -th largest normalized MI (NMI) between $z_{j}$ and $c_{k}$ as $\text{NMI}(c_{k},m)$ :

\displaystyle\text{NMI}(c_{k},m)=\frac{1}{H\left(c_{k}\right)}I(z_{j^{m}};c_{k}).

(6)

$\text{NMI}(c_{k},2)$ is the largest $\frac{I(z_{j};c_{k})}{H(c_{k})}$ except $\text{NMI}(c_{k},1)$ . The MIG can be calculated as

\text{MIG}=\frac{1}{\|{\bm{c}}\|}\sum_{i=1}^{\|{\bm{c}}\|}\text{NMI}(c_{i},1)-\text{NMI}(c_{i},2).

(7)

3 Motivation

3.1 Disentanglement Fluctuation

Locatello et al. conducted a survey of current disentanglement approaches, and the results show that these approaches have high variance of disentanglement scores. They concluded that “tuning hyperparameters matters more than the choice of the objective function” (See Figure 7 in their paper). A reliable and robust approach should therefore have a consistently high performance and low variance. We investigated the performance of $\beta$ -VAE ( $\beta=4$ ), $\beta$ -TCVAE ( $\beta=6$ ), and AnnealedVAE ( $C=25$ ) on dSprites, and traced the disentanglement scores through the training processes. Fig. 2 shows the curves of three metrics (beta VAE metric, MIG, and DCI disentanglement) for four models ( $\beta$ -VAE, $\beta$ -TCVAE, CascadeVAEC, and AnnealedVAE). AnnealedVAE, CascadeVAEC, and $\beta$ -TCVAE show significant improvements in the very first iteration. However, CascadeVAEC has a sharp decrement in the 1e4 iteration, and AnnealedVAE shows a downward trend after 1e4 iteration. The training process did not consistently enhance the model being disentangled, resulting in poor performance.

3.2 Information Bottleneck

One solution to address fluctuation is to block some of the information by using a narrow bottleneck and then assign the increased information to a new latent variable by increasing the bottleneck. AnnealedVAE and CascadeVAEC follow this concept; however, differ in terms of expanding the IB. AnnealedVAE directly controls the capacity of the latent variables by an annealed increasing parameter, $C$ . CascadeVAEC increases the capacity by relieving the pressure on the $i$ -th latent variable at the $i$ -th stage, opening the information flow. Ideally, these approaches that are based on IB should have a steady growth of disentanglement; however, they also show fluctuation.

3.3 Information Diffusion

A perfect disentangled representation should project one factor into one latent variable. In other words, the largest NMI, $\text{NMI}({\bm{c}},1)$ , reaches the maximum $H({\bm{c}})$ , and the second largest NMI, $\text{NMI}({\bm{c}},2)$ , is close to zero. Therefore, the decrement of $\text{NMI}({\bm{c}},1)$ implies that the information of one factor diffuses into another latent variable, which we define as ID. The representation can be said to re-entangle in the case of an ID.

We monitored $\text{NMI}({\bm{c}},1)$ and $\text{NMI}({\bm{c}},2)$ during training with AnnealedVAE on dSprites (training details in Section 5.1), as shown in Fig. 3. We computed the NMIs for the five factors every 1e4 iterations and presented them in one row. Ideally, the expanded capacity would promote the model learn new information. Oppositely, $\text{NMI}({\bm{c}},1)$ (scale) decreased after 5e4 iterations. AnnealedVAE suffered ID, which caused the low performance.

4 Method

4.1 Information Freezing

Burgess et al. proposed that the value of beta in $\beta$ -VAE controls the IB between inputs and latent variables, similar to the role of temperature in distillation; a low value of beta encourages the MI $I(x;z)$ , and more information condenses on the latent space. The IFP is a critical point at which the model starts to learn information from observations. It is an intrinsic property of a dataset and almost invariant. Thus, different factors can be identified using IFP. Intuitively, an approach will work if it progressively disentangles factors when the IFP distributions for these factors are separable.

Definition 1

The IFP is the maximum value of $\beta$ , such that $I(x;z)>0$ for the $\beta$ -VAE objective.

We introduce the annealing test to determine the IFP for a given dataset. The objective of the annealing test is the same as that of $\beta$ -VAE, except that it uses an annealing $\beta$ from a high value to 1 (i.e., it starts with value 200 and ends with value 1). While the pressure of the KL term decays, there exists a critical point where $I(x;z)$ increases and the reconstruction error decreases. For example, we trained the model with an annealing $\beta$ from 200 to 0 in 100,000 iterations. As shown in Fig. 4, the IFP is approximately 32 after 7400 iterations. Roughly, we regard the IFP as the value of beta, where the model learns information ( $I(x;z)$ is over 0.1).

4.2 DEFT

Inspired by distillation in chemistry, this paper proposes a novel disentanglement approach that is based on $\beta$ -VAE, according to IB theory. It splits the latent variables into G groups, and has K independent encoders for each of these groups. The decoder takes the concatenation of latent variables of all groups. DEFT also divides the training process into G stages, such that during the i-th stage, the parameters of all encoders, except the i-th encoder, train with a smaller learning rate that is $\gamma$ times smaller than the i-th encoder. In addition, each stage has a decayed coefficient beta of $\beta$ -VAE, which controls the IB from inputs to latent variables. The architecture of the DEFT ( $\text{G}=2,\text{K}=2$ ) is shown in Fig. 5. The left and right figures show the forward and backward processes of the DEFT, respectively. We show the algorithm of DEFT in Algorithm 1, where $q_{\phi_{i}}^{i}(z|x)$ denotes one group of encoders, $p_{\theta}(x|z)$ denotes the decoder, and $\mathcal{L}^{1}$ denotes the $\beta$ -VAE objective.

DEFT chooses a suitable value of beta to separate factors that act as the temperature, such that the desired factor’s information passes the bottleneck and freezes into the latent variables. Each stage disentangles one factor in the ideal situation. However, the information can be assigned to another variable if the pressure decays (reduces beta), which causes ID. Therefore, backward information scaling is performed for these variables to prevent the diffusion of information into others.

Input: pressure

\beta_{j}

, stages G,

\gamma

Initialize

\theta,\left\{\phi_{i}\right\}

for

p_{\theta}(x|z),\{q_{\phi_{i}}^{i}(z|x)\}

for

j=1

to G do

g_{\theta},\left\{g_{\phi_{i}}\right\}=\nabla_{\theta,\left\{\phi_{i}\right\}}\mathcal{L}^{1}(\theta,\left\{\phi_{i}\right\};\beta_{j})

for

i=1

to G do

i=j

then

\phi_{i}=\phi_{i}-g_{\phi_{i}}\times\mathrm{lr}

else

\phi_{i}=\phi_{i}-g_{\phi_{i}}\times\mathrm{lr}\times\gamma

end if

end for

\theta=\theta-g_{\theta}\times\mathrm{lr}

end for

Algorithm 1 The algorithm for DEFT.

5 Experiment

5.1 Settings

In this study, there are two types (standard and lite) of encoders and one decoder architecture, as shown in Table 1. One group encoder of DEFT uses the lite architecture and has K latent variables—the dimension of z is $2\text{K}$ in total; the other approaches use the standard architecture. All models use the same decoder architecture. All layers are activated by ReLU. The optimizer is Adam with a learning rate of 5e-4. The batch size is 256, which accelerates the training process.

Table 1: Lite encoder, standard encoder, and decoder architecture for all experiments. For dSprites and SmallNORB,

c=1

. For Color and Scream,

c=3

Lite Encoder	Standard Encoder	Decoder
$4\times 4$ conv. Eight Stride 2	$4\times 4$ conv. 32 stride 2	FC. 256
$4\times 4$ conv. Eight Stride 2	$4\times 4$ conv. 32 stride 2	FC. $4\times 4\times 64$
$4\times 4$ conv. Sixteen stride 2	$4\times 4$ conv. 64 stride 2	$4\times 4$ upconv. 64 stride 2
$4\times 4$ conv. Sixteen stride 2	$4\times 4$ conv. 64 stride 2	$4\times 4$ upconv. 32 stride 2
FC. 64.	FC. 256.	$4\times 4$ upconv. 32 stride 2
FC. $2\times\text{K}$ .	FC. $2\times 10$ .	$4\times 4$ upconv. $c$ . stride 2

5.2 Backward Information Scaling

Selection of Gamma

We trained the model in the first stage with $\beta=70$ and then trained the second stage with $\beta=30$ for different values of $\gamma$ . As shown in Fig. 6, all backward information is clipped when $\gamma=0$ , and $I(x;z)$ will not increase; $\text{NMI}({\bm{c}},2)$ and $\gamma$ are simultaneously increasing. In contrast, $\text{NMI}({\bm{c}},1)$ has a small increment. A small value of $\gamma$ is sufficient to learn the majority information, and it also prevents that information from diffusing into another variable. In practice, $\gamma=0.1$ achieves a good balance between learning new information and preventing ID.

Comparison

AnnealedVAE directly controls the capacity of the latent variables by an annealed increasing parameter, $C$ . CascadeVAEC increases the capacity by relieving the pressure on the $i$ -th latent variable at the $i$ -th stage, opening the information flow. We found that ID is an invisible hurdle for the IB-based approaches, which can be detected by the mean of $\text{NMI}({\bm{c}}_{i},1),\text{NMI}({\bm{c}}_{i},2)$ :

	$\displaystyle\text{NMI1}=\sum_{i=1}^{\\|{\bm{c}}\\|}\text{NMI}({\bm{c}}_{i},1),$		(8)
	$\displaystyle\text{NMI2}=\sum_{i=1}^{\\|{\bm{c}}\\|}\text{NMI}({\bm{c}}_{i},2).$		(8)

We conducted experiments involving AnnealedVAE, CascadeVAEC, and DEFT on dSprites. As depicted in Fig. 7 (a), AnnealedVAE expands the capacity by a controllable parameter $C$ ; however, the KL consists of MI, TC, and DWKL. There is an increasing gap between the MI and $C$ . At each stage, CascadeVAEC activates a new latent variable to expand the capacity. DEFT expands the capacity by relieving the pressure. One can see, DEFT has a steady increment of MI. Both AnnealedVAE and CascadeVAEC suffer from the ID problem (see in Fig. 7 (b)): AnnealedVAE tends to diminish NMI1; CascadeVAEC also has a sharp decrement of NMI1. Only DEFT remits the ID problem and promotes the disentanglement steadily.

5.3 Supervised Problem

Dataset Detail

We compared DEFT with others on dSprites (Matthey et al., 2017), color (color for short), screen (screen for short), and SmallNORB (LeCun et al., 2004). The images of dSprites are strictly generated by the five factors. It has three shapes: square, ellipse, and heart; six scale values: 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0, 40 orientation values in [0, 2 pi], 32 position X values, and 32 position Y values. Two variants of dSprites (color and screen), which introduce random noise, were closer to the true situation. SmallNORB is generated from 3D objects and is much more complex than 2D shapes. It contains five generic categories, namely, four-legged animals, human figures, airplanes, trucks, and cars; nine elevation values, i.e., 30, 35, 40, 45, 50, 55, 60, 65, and 70; eighteen azimuth values, 0, 20, 40,…, 340; and six lighting conditions. Fig. 8 shows the visualization of the datasets.

Information Freezing Point

The ideal situation is to find a set of $\beta$ to isolate IFPs into several parts without overlaps. To obtain the distribution of IFPs with respect to a factor ${\bm{c}}_{i}$ , we enumerate all possible values of factor ${\bm{c}}_{i}$ for a random sample, and calculate its ICP using the algorithm introduced in Section 4.1. Then, we repeated the above procedure 50 times to estimate the ICP distribution of ${\bm{c}}_{i}$ .

We measured the IFPs of the factors on the four datasets, as shown in Fig. 9. dSprites and Color had more separable IFPs than Scream and SmallNORB. Although the three variants of dSprites have the same factors, their IFPs are different. The difference in IFP distributions explains why current approaches fail to transfer hyperparameters across different problems in Locatello et al. (2019). Note that the IFP distributions of factors are almost separable for dSprites and Color; the ground-truth factors are independent of the four datasets. In summary, they are all independent; Scream and SmallNorb are inseparable. Based on the distribution of IFPs, we summarize the optimal training settings for the DEFT in Table 2. We tune the hyperparameters of compared approaches with the highest MIG and show theses setting in Table 3.

Table 2: Experimental settings for DEFT.

\gamma

is always

0.1

. The number of epochs is sufficiently large such that the objective converges. The number of latents per encoder (K) is not less than the size of the newly learned factors. The group number (G) is determined by the number of separable areas in Fig. 9.

	G	K	Epochs per Stage	$\beta_{i}$
Color	4	3	7	160,105,30,4
dSprites	4	3	7	70,30,12,4
SmallNORB	2	5	100	50,1
Scream	2	5	10	140,1

Table 3: Experimental settings for compared approaches.

	Color	dSprites	Scream	SmallNORB
AnnealedVAE ( $C$ )	10	5	25	5
$\beta$ -TCVAE ( $\beta$ )	10	10	6	1
$\beta$ -VAE ( $\beta$ )	16	16	6	1
CacadeVAE ( $\beta_{h}$ )	10	10	10	10

Quantitative Analysis

We trained each model 50 times and compared our model with the other four disentanglement approaches on dSprites, Color, Scream, and SmallNORB in Fig. 10. All approaches (not just ours) have a lower performance on Scream and SmallNorb (inseparable). It appears understandable that DEFT fails to handle the situation having inseparable IFP distributions. However, there is a question as to why these TC-based approaches also failed in an independent but inseparable situation (Scream). We also show the distributions of the reconstruction error in Fig. 11. In general, DEFT achieved both a high image quality and disentanglement.

The distribution of MIG scores at different stages are shown in Fig. 12. All experimental results on the four datasets reveal that DEFT obtains low scores in the first stage, and gradually improves disentanglement in the following stages.

Failure rate

We define the failure rate as the percentage of models that fail to learn a disentangled representation if the MIG score is lower than 0.1. Table 4 shows the failure rate. It can be seen that DEFT has the lowest average failure rates. Although AnnealedVAE success to disentangle factors on three datasets, it fails to disentangle factors on Scream for most cases. Note that, it is possible to reduce the failure rate for AnnealedVAE on Scream, but we have tried six settings. Despite all approaches show low performances on SmallNORB, the failure rates are lower than others. From the IFP distributions in Fig. 9, we can see that SmallNORB has a separable factor that is easy to be disentangled. That causes SmallNORB to have a high lower bound of disentanglement but gets a low overall score. Generally, DEFT significantly decreased the failure rate compared to the other approaches.

Table 4: Failure rate (%) for each approach (column) and dataset (row).

	DEFT	$\beta$ -VAE	$\beta$ -TCVAE	AnnealedVAE	CascadeVAEC
Color	0	24	0	0	8
dSprites	8	16	2	0	0
SmallNORB	0	0	0	0	10
Scream	12	12	26	80	25

Qualitative Analysis

Higgins et al. (2017a) introduced the latent traversal to visualize the generated images through the traversal of a single latent $z_{i}$ . Fig. 13 shows the latent traversal of the best model with the highest MIG score. One can see the intrinsic relationship between IFP and disentanglement. Orientation has the lowerest IFP among all factors; meanwhile, it is the hardest one to be disentangled for all approaches. For SmallNORB, the lighting condition (3) is separable with others, which is easy to be disentnagled. For Scream, three factors have similar IFP distributions, and it is also a hard problem for the disentanglement approaches.

5.4 Correlative but Separable

To demonstrate the superiority of the IB approaches, we built a dataset of a triangle with three factors (posX, posY, and orientation). PosX and posY were independent; however, the orientation pointed to the center of the canvas $\theta=\arctan(\text{posY}-16,\text{posX}-16)$ , as shown in Fig. 14 (a). We trained CascadeVAEC, $\beta$ -TCVAE ( $\beta=6$ ), and DEFT ( $\text{K}=2,\text{G}=2$ ) within 10,000 steps and repeated 10 times. Fig. 14 (a) illustrates this toy dataset. Each sample contains one triangle which has a unique position and points to the center. From Fig. 14 (b), all three approaches disentangle posX and posY successfully. However, only DEFT extracts orientation information ( $I(z_{4};\text{orientation})$ is high, $I(z_{4};\text{poxX})$ and $I(z_{4};\text{poxY})$ are low). DEFT has higher disentanglement scores for all three metrics, as shown in Fig. 14 (c). The latent traversal in Fig. 15 shows that DEFT has a high image quality and separated orientation information. The correlation made it difficult for $\beta$ -VAE to disentangle orientation.

5.5 Unsupervised Problem

3D Chairs (Aubry et al., 2014) is an unlabeled dataset containing 1394 3D models from the Internet.

Annealing Test without Supervision

The label information is unavailable for the common situations. Therefore, the factor’s IFP distribution is hard to be obtained. Alternatively, we calculate the upper bound of IFP distribution for the unsupervision setting. Intuitively, the rate of information increment changes if there is a new factor starting to freeze. We conducted an annealing test on dSprites and 3D chairs without labels and plotted the curves of beta vs. $\Delta I(x;z)$ in Fig. 16. This method is in agreement with the upper bound of the IFP distribution for position and scaling, as shown in Fig. 16 (a). One can recognize four points where the latent information suddenly increases: 36 and 16 from Fig. 16 (b). Though this method needs human participation, we only show the potency to develop a fully unsupervised procedure for the separations. Therefore, we set $\text{K=3},\text{G=3},\beta_{j}=\{36,16,1\}$ for 3D Chairs and trained the DEFT 20 epochs per stage. We compared the performance with $\beta$ -TCVAE and CascadeVAEC on 3D Chairs, as shown in Fig. 17. We notice that DEFT can learn one additional interpretable property compared with CascadeVAEC: leg orientation.

6 Conclusion

Based on existing studies involving IBs, we have developed new insights into the reason for which these approaches have lower performances than the TC-based. In particular, we identified the IFP distributions for each factor by performing an annealing test, and a dataset was easily disentangled if the IFP distributions were separable. Furthermore, we found that the ID problem is an invisible hurdle that prevents steady improvements in disentanglement. We proposed DEFT to retain the learned information by scaling the backward information. To do this, we proposed a learning rate decay $\gamma$ on these encoders, rather than delivering backward information equally. Our results show that approaches that are based on IBs are competitive and have the potential to solve problems with correlative factors.

We varified the ID problem that causes the low performance of IB-based approaches. However, as a plain solution, the DEFT method still needs to be further improved. In the future, an automatic way to adjust the best separation of IFP distribution is highly required.

Acknowledgement

This work was supported by National Natural Science Foundation of China under Grant No. 61872419, No. 62072213, No. 61873324. Taishan Scholars Program of Shandong Province, China, under Grant No. tsqn201812077.

References

Aubry et al. (2014) Aubry M, Maturana D, Efros AA, Russell BC, Sivic J (2014) Seeing 3d chairs: Exemplar part-based 2d-3d alignment using a large dataset of CAD models. In: CVPR, IEEE Computer Society, pp 3762–3769, DOI 10.1109/CVPR.2014.487
Bengio et al. (2013) Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828
Burgess et al. (2018) Burgess CP, Higgins I, Pal A, Matthey L, Watters N, Desjardins G, Lerchner A (2018) Understanding disentangling in $\beta$ -vae. In: ICML
Chen et al. (2018) Chen TQ, Li X, Grosse RB, Duvenaud D (2018) Isolating sources of disentanglement in variational autoencoders. In: NeurIPS, pp 2615–2625
Chen et al. (2016) Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P (2016) Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In: NeurIPS, pp 2172–2180
D. S. Kim et al. (2018) D S Kim, B Liu, A Elgammal, M Mazzone (2018) Finding principal semantics of style in art. In: ICSC, pp 156–163, DOI 10.1109/ICSC.2018.00030
Do and Tran (2020) Do K, Tran T (2020) Theory and evaluation metrics for learning disentangled representations. In: ICLR, OpenReview.net
Dupont (2018) Dupont E (2018) Learning disentangled joint continuous and discrete representations. In: Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) NeurIPS, pp 708–718
Eastwood and Williams (2018) Eastwood C, Williams CKI (2018) A framework for the quantitative evaluation of disentangled representations. In: ICLR, OpenReview.net
Higgins et al. (2017a) Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, Mohamed S, Lerchner A (2017a) beta-vae: Learning basic visual concepts with a constrained variational framework. In: ICLR, OpenReview.net
Higgins et al. (2017b) Higgins I, Pal A, Rusu AA, Matthey L, Burgess C, Pritzel A, Botvinick M, Blundell C, Lerchner A (2017b) DARLA: improving zero-shot transfer in reinforcement learning. In: Precup D, Teh YW (eds) ICML, PMLR, Proceedings of Machine Learning Research, vol 70, pp 1480–1490
Higgins et al. (2018a) Higgins I, Amos D, Pfau D, Racaniere S, Matthey L, Rezende D, Lerchner A (2018a) Towards a definition of disentangled representations. arXiv preprint arXiv:181202230 pp 1–29
Higgins et al. (2018b) Higgins I, Sonnerat N, Matthey L, Pal A, Burgess CP, Bosnjak M, Shanahan M, Botvinick M, Hassabis D, Lerchner A (2018b) SCAN: learning hierarchical compositional visual concepts. In: ICLR, OpenReview.net
Insu Jeon et al. (2019) Insu Jeon, Wonkwang Lee, Gunhee Kim (2019) Ib-gan: Disentangled representation learning with information bottleneck gan
Jeong and Song (2019) Jeong Y, Song HO (2019) Learning discrete and continuous factors of data via alternating disentanglement. In: Chaudhuri K, Salakhutdinov R (eds) ICML, PMLR, Proceedings of Machine Learning Research, vol 97, pp 3091–3099
Kim and Mnih (2018) Kim H, Mnih A (2018) Disentangling by factorising. In: Dy JG, Krause A (eds) ICML, PMLR, Proceedings of Machine Learning Research, vol 80, pp 2654–2663
Kingma and Welling (2014) Kingma DP, Welling M (2014) Auto-encoding variational bayes. In: Bengio Y, LeCun Y (eds) ICLR
Kumar et al. (2018) Kumar A, Sattigeri P, Balakrishnan A (2018) Variational inference of disentangled latent concepts from unlabeled observations. In: ICLR, OpenReview.net
Lake et al. (2017) Lake BM, Ullman TD, Tenenbaum JB, Gershman SJ (2017) Building machines that learn and think like people. Behavioral and brain sciences 40
Lample et al. (2017) Lample G, Zeghidour N, Usunier N, Bordes A, Denoyer L, Ranzato M (2017) Fader networks: Manipulating images by sliding attributes. In: Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R (eds) NeurIPS, pp 5967–5976
LeCun et al. (2004) LeCun Y, Huang FJ, Bottou L (2004) Learning methods for generic object recognition with invariance to pose and lighting. In: CVPR, IEEE Computer Society, pp 97–104, DOI 10.1109/CVPR.2004.144
Locatello et al. (2019) Locatello F, Bauer S, Lucic M, Rätsch G, Gelly S, Schölkopf B, Bachem O (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In: Chaudhuri K, Salakhutdinov R (eds) ICML, PMLR, Proceedings of Machine Learning Research, vol 97, pp 4114–4124
Matthey et al. (2017) Matthey L, Higgins I, Hassabis D, Lerchner A (2017) dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/
Peters et al. (2017) Peters J, Janzing D, Schölkopf B (eds) (2017) Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press
Ridgeway and Mozer (2018) Ridgeway K, Mozer MC (2018) Learning deep disentangled embeddings with the f-statistic loss. In: Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) NeurIPS, pp 185–194
Schölkopf et al. (2012) Schölkopf B, Janzing D, Peters J, Sgouritsa E, Zhang K, Mooij JM (2012) On causal and anticausal learning. In: ICML, icml.cc / Omnipress
Suter et al. (2019) Suter R, Miladinovic D, Schölkopf B, Bauer S (2019) Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In: International Conference on Machine Learning, PMLR, pp 6056–6065
Träuble et al. (2021) Träuble F, Creager E, Kilbertus N, Locatello F, Dittadi A, Goyal A, Schölkopf B, Bauer S (2021) On disentangled representations learned from correlated data. 2006.07886
Watanabe (1960) Watanabe S (1960) Information theoretical analysis of multivariate correlation. IBM Journal of research and development 4:66–82
Zhu et al. (2019) Zhu Y, Xie J, Liu B, Elgammal A (2019) Learning feature-to-feature translator by alternating back-propagation for zero-shot learning. arXiv preprint arXiv:190410056

DEFT: Distilling Entangled Factors by Preventing Information Diffusion††thanks: *Corresponding author