
Visualizing Information Bottleneck through Variational Inference

Cipta Herwana
New York University
ciptah@nyu.edu
Abhishek Kadian
New York University
abhishekk@nyu.edu
Abstract

The Information Bottleneck theory provides a theoretical and computational framework for finding approximate minimum sufficient statistics. Analysis of the Stochastic Gradient Descent (SGD) training of a neural network on a toy problem has shown the existence of two phases, fitting and compression. In this work, we analyze the SGD training process of a Deep Neural Network on MNIST classification and confirm the existence of two phases of SGD training. We also propose a setup for estimating the mutual information for a Deep Neural Network through Variational Inference.

1 Introduction

Deep Neural Networks (DNNs) have found wide application on large-scale tasks like visual object recognition (Krizhevsky et al. [2017]), machine translation (Wu et al. [2016]) and reinforcement learning (Silver et al. [2016]). The success of DNNs in many areas has led to a growing interest in trying to explain their performance. Tishby and Zaslavsky [2015] proposed to analyze DNNs through the Information Bottleneck lens. Shwartz-Ziv and Tishby [2017] analyzed the information plane of a small neural network on a toy problem and reported two phases of a neural network trained using SGD, fitting and compression. In our work we test the hypothesis of Information Bottleneck theory for Deep Learning on a tougher problem of image classification.

For our experiments we need a way to estimate the mutual information between the input and the output of a hidden layer. Alemi et al. [2016] proposed a variational approximation to the Information Bottleneck. We also build upon the Variational Autoencoder derivation (Kingma and Welling [2013]) to propose an alternative MI upper bound for a teacher-student model.

We use the variational estimates to test the Information Bottleneck hypothesis of two phases of DNN training, fitting and compression. We find that the two phases are present in the training of a 4-layer DNN on the MNIST classification task. Our results show that the claims of Information Bottleneck theory about Deep Learning hold true for a deep VIB network.

Our paper begins with a discussion of Information Bottleneck theory and its application to analyzing DNNs (Section 2). In Section 3 we define the classification problem, show how to leverage variational inference to calculate mutual information, and derive the mutual information upper bounds. In Section 4 we discuss the experimental settings and report results on the classification task. We deliver our concluding remarks in Section 5 and suggest future research directions.

2 Information Bottleneck and Deep Learning

2.1 Mutual Information

Given any two random variables, X and Y, with a joint distribution p(x,y), their Mutual Information is defined as:

I(X;Y) = D_{KL}[p(x,y)||p(x)p(y)] = \sum\limits_{x\in X,\,y\in Y} p(x,y)\log\left(\frac{p(x,y)}{p(x)p(y)}\right) (1)
= \sum\limits_{x\in X,\,y\in Y} p(x,y)\log\left(\frac{p(x|y)}{p(x)}\right) = H(X) - H(X|Y) (2)

where D_{KL}[p||q] is the Kullback-Leibler divergence of the distributions p and q, and H(X) and H(X|Y) are the entropy of X and the conditional entropy of X given Y, respectively.
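For discrete variables, the definition above can be evaluated directly from a joint probability table. The following is a minimal numpy sketch; the table values are invented purely for illustration.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats, computed from a joint probability table p_xy[i, j] = p(x_i, y_j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # zero-probability cells contribute nothing
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

# Example joint distribution of two correlated binary variables (values chosen for illustration)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(p_xy))  # ~0.19 nats
```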

2.2 Information Bottleneck

When analyzing representations of X w.r.t. Y, the classical notion of minimal sufficient statistics provides good candidates for an optimal representation. Sufficient statistics, in our context, are maps or partitions of X, S(X), that capture all the information that X has about Y; namely, I(S(X);Y) = I(X;Y).

Minimal sufficient statistics, T(X), are the simplest sufficient statistics and induce the coarsest sufficient partition on X. A simple way of formulating this is through the Markov chain Y \rightarrow X \rightarrow S(X) \rightarrow T(X), which should hold for a minimal sufficient statistic T(X) with any other sufficient statistic S(X). Since exact minimal sufficient statistics only exist for very special distributions (i.e., exponential families), Tishby et al. [2001] relaxed this optimization problem by first allowing the map to be stochastic, defined as an encoder P(T|X), and then by allowing the map to capture as much of I(X;Y) as possible, not necessarily all of it.

This leads to the Information Bottleneck (IB) tradeoff (Tishby et al. [2001]), which provides a computational framework for finding approximate minimal sufficient statistics, or the optimal tradeoff between compression of X and prediction of Y.

2.3 Information Plane of Neural Nets

Any representation variable, T, defined as a (possibly stochastic) map of the input X, is characterized by its joint distributions with X and Y, or by its encoder and decoder distributions, P(T|X) and P(Y|T), respectively. Given P(X;Y), T is uniquely mapped to a point in the Information Plane with coordinates (I(X;T), I(T;Y)).

Tishby and Zaslavsky [2015] proposed to analyze DNNs in the Information Plane and suggested that the goal of the neural network is to optimize the Information Bottleneck tradeoff between compression and prediction, successively, for each layer.

Building on top of this work, Shwartz-Ziv and Tishby [2017] analyzed the information plane of a network with 7 fully connected hidden layers of widths 12-10-7-5-4-3-2 neurons with hyperbolic tangent activations. The analysis showed that the Stochastic Gradient Descent (SGD) optimization has two main phases: in the first and shorter phase the layers increase the information on the input (fitting), while in the second, much longer phase the layers reduce the information on the input (compression). The tasks were chosen as binary decision rules which are invariant under O(3) rotations of the sphere, with 12 binary inputs that represent 12 uniformly distributed points on the sphere. As the network size was small, they calculated the mutual information exhaustively by binning the output activations into 30 buckets and computing the joint distribution. Figure 1 shows the information plane for the described setup.
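The binning estimator can be sketched roughly as follows. This is our own simplified numpy illustration of the idea, not the exact procedure of Shwartz-Ziv and Tishby [2017].

```python
import numpy as np
from collections import Counter

def bin_activations(t, n_bins=30):
    """Discretize hidden-layer activations of shape (n_samples, n_units) into tuples of bin indices."""
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    return [tuple(row) for row in np.digitize(t, edges)]

def discrete_mi(xs, ys):
    """Empirical I(X;Y) in nats for two equal-length sequences of hashable symbols."""
    n = len(xs)
    p_xy, p_x, p_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * np.log(c * n / (p_x[x] * p_y[y])) for (x, y), c in p_xy.items())

# Usage sketch: t holds one layer's activations and y the labels; since each input x is unique,
# I(X;T) reduces to the entropy of the binned representation T.
#   i_ty = discrete_mi(bin_activations(t), list(y))
#   i_xt = discrete_mi(range(len(t)), bin_activations(t))
```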

Figure 1: Results taken from Shwartz-Ziv and Tishby [2017]. The layers' information paths during SGD optimization for different architectures. Each panel is the information plane for a network with a different number of hidden layers. The width of the hidden layers starts at 12, and each additional layer has 2 fewer neurons. The final layer with 2 neurons is shown in all panels. The line colors correspond to the number of training epochs.

Anonymous [2018] showed that the results of Shwartz-Ziv and Tishby [2017] do not carry over to networks with ReLU activations, and claims that the shape of the information plane is an artifact of saturating hyperbolic tangent non-linearities; with ReLU activations the same shape is not observed. Anonymous [2018] follows a binning strategy to estimate the mutual information for ReLU activations. This method of estimating mutual information has been disputed by the authors of Shwartz-Ziv and Tishby [2017] on OpenReview for ICLR 2018 (review comments and discussion available at https://goo.gl/U24Kfp).

In Section 4 we discuss how we estimate mutual information using variational inference. Using these estimates we draw the information plane for a neural network trained to classify the MNIST dataset (http://yann.lecun.com/exdb/mnist). We use ReLU non-linearities and a deeper neural network for our task. In our results we observe the two distinct phases of SGD, fitting and compression, as originally observed by Shwartz-Ziv and Tishby [2017].

3 Problem Definition

Information plane analysis is usually limited to toy problems with simple distributions, otherwise the MI calculation quickly becomes intractable. We examine approaches to estimating the information plane position using variational inference.

3.1 Task: MNIST Classification

The MNIST classification task consists of images of handwritten digits, and the task is to predict the digit label. An example is shown in Figure 2. We consider X to be the input space (images) and Y to be the output label space (digit labels). We modify this task into a teacher-student setting, where the digit images are generated by a pre-trained teacher model.

Figure 2: MNIST classification

3.2 Variational Information Bottleneck Model

The IB setup views the data generating process as a Markov chain Y \rightarrow X \rightarrow Z \rightarrow \hat{Y}. Y is a signal we wish to extract from observations X. Z is an intermediate representation computed from X by the model on the way to computing the predictions \hat{Y}.

The VIB method [Alemi et al., 2016] is a generalization of the Variational Autoencoder [Kingma and Welling, 2013] to the supervised setting. Its objective is to maximize the rate-distortion tradeoff:

\mathcal{L}(\theta) = I(Y;Z) - \beta I(X;Z)

These two quantities correspond to the network's position in the information plane, so computing an approximation of the loss gives us an estimate of that position. The MI of Y (the target variable) and Z can be computed as the difference between the entropy and the conditional entropy (the remaining uncertainty after observing Z):

I(Y;Z) = H(Y) - H(Y|Z)
= H(Y) + \sum_{y,z} p(y,z)\log p(y|z)

Under our setup p(y|z) is difficult to compute, because we would need to invoke Bayes' rule on p(z|y), which is what we have (after marginalizing over all x). So we introduce an approximate distribution q(y|z):

I(Y;Z) = H(Y) + \sum_{y,z} p(y,z)\log\left(p(y|z)\frac{q(y|z)}{q(y|z)}\right)
= H(Y) + \sum_{y,z} p(y,z)\log q(y|z) + \sum_{y,z} p(y,z)\log\frac{p(y|z)}{q(y|z)}
= H(Y) + \mathbb{E}_{y,z}\log q(y|z) + D_{KL}(p(y|z)||q(y|z))
\geq H(Y) + \mathbb{E}_{y,z}\log q(y|z)

The second term is the average log-likelihood of y under q; in practice we use the learned decoder of the VIB model as the approximation q. Its negation can be thought of as a conditional cross-entropy, which is an upper bound on the conditional entropy H(Y|Z). To calculate I(X;Z) we decompose it in the same way:

I(X;Z) = H(Z) - H(Z|X)
= H(Z) - \mathbb{E}_{x} H[p(z|x)]

\mathbb{E}_{x} H[p(z|x)] is the average entropy of the conditional distribution p(z|x) for a fixed x. Intuitively, this makes sense: if the entropy of the encoding is high, then there is high uncertainty in choosing z from x, so the mutual information should be low. Because H(Z) is hard to compute, we substitute it with the cross-entropy of p(z) and some variational approximation r(z). Cross-entropy is always greater than entropy, so:

I(X;Z) \leq -\sum_{x,z} p(x,z)\log r(z) + \sum_{x,z} p(x,z)\log p(z|x)
= \sum_{x,z} p(x,z)\log\frac{p(z|x)}{r(z)}
= \mathbb{E}_{x} D_{KL}(p(z|x)||r(z))
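Putting the two bounds together yields the practical VIB training objective: a cross-entropy term that lower-bounds I(Y;Z) up to the constant H(Y), and a KL term against r(z) that upper-bounds I(X;Z). Below is a minimal PyTorch-style sketch, assuming a diagonal Gaussian encoder and a standard normal r(z); the function and variable names are our own illustration, not those of Alemi et al. [2016].

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def vib_loss(mu, sigma, logits, y, beta):
    """VIB objective: -(I(Y;Z) bound - beta * I(X;Z) bound), averaged over the batch.

    mu, sigma : parameters of the Gaussian encoder p(z|x), shape (batch, latent_dim)
    logits    : decoder q(y|z) outputs for a sampled z, shape (batch, n_classes)
    y         : integer class labels, shape (batch,)
    """
    p_z_given_x = Normal(mu, sigma)
    r_z = Normal(torch.zeros_like(mu), torch.ones_like(sigma))
    # E_x KL(p(z|x) || r(z)) upper-bounds I(X;Z)
    i_xz = kl_divergence(p_z_given_x, r_z).sum(dim=1).mean()
    # E log q(y|z) lower-bounds I(Y;Z) up to the constant H(Y)
    i_yz = -F.cross_entropy(logits, y)
    return -(i_yz - beta * i_xz), i_xz, i_yz
```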

3.3 Alternative Method

Suppose we assume that X can be sampled using the process Z_v \rightarrow X. In practice, this means pre-training an unsupervised latent variable model, which we denote p(x|z_v). We can think of this as an instance of the teacher-student setup, where the student is trained to mimic the teacher. Given this, we can use another approach to calculate H(Z), by estimating p(z) for a given z:

p(z) = \sum\limits_{x} p(x,z) = \sum\limits_{x} p(z|x)p(x) = \sum\limits_{x}\sum\limits_{z_{v}} p(z|x)\,p(z_{v},x)

To perform variational inference, we introduce a distribution q(z_v,x):

p(z) = \sum_{z_{v},x} q(z_{v},x)\,p(z|x)\frac{p(z_{v},x)}{q(z_{v},x)} = \mathbb{E}_{z_{v},x\sim Q}\left[p(z|x)\frac{p(z_{v},x)}{q(z_{v},x)}\right] (3)

We can use this, together with Jensen's inequality, to lower bound \log p(z):

\log p(z) = \log \mathbb{E}_{z_{v},x\sim Q}\left[p(z|x)\frac{p(z_{v},x)}{q(z_{v},x)}\right] (4)
\geq \mathbb{E}_{x\sim Q}\log p(z|x) - \mathbb{E}_{z_{v},x\sim Q}\log\frac{q(z_{v},x)}{p(z_{v},x)}
= \mathbb{E}_{x\sim Q}\log p(z|x) - D_{KL}(Q(z_{v},x)||P(z_{v},x))

If we define q(z_v,x) to use the teacher model:

q(z_{v},x) = q(z_{v})\,p(x|z_{v})

we can simplify the KL divergence:

D_{KL}[q(z_{v},x)||p(z_{v},x)] = \sum_{z_{v},x} q(z_{v})p(x|z_{v})\log\frac{q(z_{v})p(x|z_{v})}{p(z_{v})p(x|z_{v})}
= \sum_{z_{v}} q(z_{v})\sum_{x} p(x|z_{v})\log\frac{q(z_{v})}{p(z_{v})}
= \sum_{z_{v}} q(z_{v})\log\frac{q(z_{v})}{p(z_{v})}\left(\sum_{x} p(x|z_{v})\right)
= D_{KL}[q(z_{v})||p(z_{v})]

In other words, if we have access to the "real" data generating distribution p(x|z_v), we can use it during variational inference. The final equation is as follows:

I(X,Z) \leq -\mathbb{E}_{x}H[p(z|x)] - \mathbb{E}_{z,x}\left[\mathbb{E}_{z_{v},x^{\prime}\sim Q(z)}\log p(z|x^{\prime}) - D_{KL}(Q(z_{v}|x)||P(z_{v}))\right] (5)

The variational distribution Q is computed for a specific z. Algorithmically, the inference process is as follows (a code sketch is given after the list):

1. Sample z_v and x from the teacher model p(x|z_v).
2. Run the student encoder to get p(z|x).
3. Sample z from p(z|x).
4. Run the inference network to get q(z_v|z).
5. Sample z_v' from q, and re-run the teacher model to get p(x'|z_v').
6. Sample x' from this distribution to compute p(z|x').
7. Use all the samples to compute an MI upper bound via Equation 5.
8. Use the upper bound as a loss function to train the inference network q(z_v|z).
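The following PyTorch-style sketch implements this loop for a single batch, under the assumption that the teacher decoder, student encoder, and inference network all expose torch.distributions objects; the interfaces and names here are hypothetical illustrations rather than our exact implementation.

```python
from torch.distributions import kl_divergence

def mi_upper_bound(teacher_dec, student_enc, inference_net, x, prior_zv):
    """One Monte Carlo estimate of the Equation 5 upper bound on I(X;Z).

    Assumed (hypothetical) interfaces:
      teacher_dec(z_v) -> distribution p(x|z_v)
      student_enc(x)   -> diagonal Gaussian p(z|x)
      inference_net(z) -> diagonal Gaussian q(z_v|z)
      prior_zv         -> the teacher's prior p(z_v)
    x is a batch of images sampled from the teacher (step 1).
    """
    p_z_given_x = student_enc(x)                              # step 2
    z = p_z_given_x.rsample()                                 # step 3
    q_zv_given_z = inference_net(z)                           # step 4
    zv_prime = q_zv_given_z.rsample()                         # step 5
    x_prime = teacher_dec(zv_prime).sample()                  # steps 5-6
    log_p_z = student_enc(x_prime).log_prob(z).sum(dim=1)     # log p(z|x') term
    kl = kl_divergence(q_zv_given_z, prior_zv).sum(dim=1)     # KL term of Equation 5
    neg_cond_entropy = -p_z_given_x.entropy().sum(dim=1)      # -E_x H[p(z|x)]
    bound = (neg_cond_entropy - (log_p_z - kl)).mean()        # step 7: Equation 5 estimate
    return bound  # step 8: minimize this w.r.t. the inference network to tighten the bound
```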

3.4 Hypothesis

The classification model we analyze is trained to maximize the IB tradeoff directly. However, the objective does not specify how this optimization will take place. We want to see whether the two training phases observed by Shwartz-Ziv and Tishby [2017] also occur for a VIB model, or whether it fits and compresses in an interleaved manner [Anonymous, 2018].

Our secondary goal is to compare our two approaches for estimating the mutual information of X and Z. Both methods upper bound the true quantity I(X,Z), and we would like to see which method produces a lower (tighter) estimate. The VIB objective has the advantage of being fast to calculate, whereas our method is optimization-based and runs over multiple iterations. However, our method might produce a better result because the inference network has access to the data generating distribution.

4 Experiments and Results

We test our approach by classifying MNIST digits generated by a teacher model. The teacher is a VAE with a 20-dimensional latent code, trained for 100 epochs. The student model is a classifier trained using VIB. The hidden state z has 40 dimensions, and we use a Gaussian with zero mean and unit variance as r(z). The decoder p(y|z) is a 2-layer MLP. The encoder p(z|x) is either a 2-layer MLP or a 3-layer CNN.
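For concreteness, a minimal PyTorch sketch of the MLP variant of the student is given below; the hidden widths are illustrative assumptions rather than the exact values used in our experiments.

```python
import torch.nn as nn

LATENT_DIM = 40  # dimensionality of the hidden state z

class StudentEncoder(nn.Module):
    """2-layer MLP encoder producing the mean and log-variance of p(z|x)."""
    def __init__(self, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(784, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, LATENT_DIM)
        self.log_var = nn.Linear(hidden, LATENT_DIM)

    def forward(self, x):
        h = self.net(x.view(x.size(0), -1))
        return self.mu(h), self.log_var(h)

# 2-layer MLP decoder q(y|z) over the 10 digit classes
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                        nn.Linear(128, 10))
```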

To generate labeled training data, we reconstruct the training images using the teacher VAE and keep the original labels. After every epoch of training, we estimate the mutual information I(X,T) using both the VIB loss function and the inference network. The inference network has access to the original data-generating VAE.
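A minimal sketch of this data-generation step is shown below, under the (hypothetical) assumption that the teacher VAE's forward pass returns the reconstruction followed by the Gaussian parameters.

```python
import torch

@torch.no_grad()
def make_student_dataset(teacher_vae, images, labels):
    """Replace each training image with its teacher-VAE reconstruction, keeping the original label."""
    recon, _, _ = teacher_vae(images)  # assumed to return (reconstruction, mu, log_var)
    return recon, labels
```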

4.1 Zero Information Signals

To verify our implementation, we ran training on a dataset where the images are generated independently of the labels, so that I(X,Y)=0. We would expect I(X,Z) and I(Y,Z) to drop rapidly once the network has determined that the image contains no useful information for predicting the digit.
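One way to construct such a zero-information dataset, sketched below under hypothetical interfaces, is to sample images from the teacher's prior and pair them with independently drawn labels.

```python
import torch

@torch.no_grad()
def zero_information_batch(teacher_dec, batch_size, n_classes=10, zv_dim=20):
    """Images sampled from the teacher prior, labels drawn independently: I(X;Y) = 0 by construction."""
    z_v = torch.randn(batch_size, zv_dim)                  # sample from the prior p(z_v)
    images = teacher_dec(z_v).sample()                     # teacher decoder p(x|z_v), assumed to return a distribution
    labels = torch.randint(0, n_classes, (batch_size,))    # drawn independently of the images
    return images, labels
```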

Figure 3: Results of trying to learn from images with zero information content. Both methods converge to 0, but the direct method is more stable and converges faster.

4.2 Comparing MI Estimators

Our results show that the direct estimation method from Alemi et al. [2016] is better at bounding I(X,T) during the later stages of training. We suspect this is because at later stages of training the marginal p(z) is sufficiently close to the variational approximation r(z), which produces a better estimate than the more indirect teacher-based bound. In contrast, the inference network was able to reach a lower (tighter) estimate while the mutual information is still rising.

Figure 4: Comparing the two ways of measuring I(X,T). The inference network does better in the beginning but gets stuck in a local optimum, while the direct upper bound keeps going down.

In the following sections, we will take the minimum of both estimates.

4.3 Information Plane Dynamics

Figure 5: Information plane trajectory using different optimizers and different values of β. As the VIB regularization gets more aggressive, the model is more conservative about adding extra information.
Figure 6: Plotting the average determinant of the covariance for Adam (left) and SGD (right).

Our experiments reveal an inflection point in the mutual information I(X,T) as the VIB model trains (see Figures 5 and 6). We see that the regularization parameter strongly affects how much the model fits to the data before starting compression.

4.4 Model Uncertainty

VIB models can express their uncertainty about an encoding p(z|x) by enlarging its variance. Surprisingly, during the compression phase we still see the variance assigned to real samples go down.

Figure 7: Covariance determinant (left) and norm of the gradient (right) related to (X,Z). Here we see that the covariance of the samples is still getting tighter even as we enter the compression phase. On the right, we see that the magnitude of the gradient increases during fitting and fluctuates in the same range during compression.

5 Conclusion

In this work we analyzed the training of a DNN on an image classification task. We confirm the existence of the two phases of SGD in the information plane for classification models whose mutual information can be estimated through variational inference. We proposed a mutual information upper bound for a teacher-student training setting and compared its performance to the bounds formulated by Alemi et al. [2016].

Our next step is a baseline analysis of a linear model whose mutual information we can calculate exactly. Going in the other direction, we also wish to extend this technique to more difficult problems, such as harder image-related tasks, or perhaps to analyzing the discriminator of a GAN. We also want to find ways to improve our mutual information bound estimates, and to extend this analysis to deterministic neural networks, which do not train to maximize mutual information directly.

References

  • Alemi et al. [2016] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.
  • Anonymous [2018] Anonymous. On the information bottleneck theory of deep learning. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ry_WPG-A-.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, 2017. doi: 10.1145/3065386. URL http://doi.acm.org/10.1145/3065386.
  • Shwartz-Ziv and Tishby [2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL http://arxiv.org/abs/1703.00810.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. doi: 10.1038/nature16961. URL https://doi.org/10.1038/nature16961.
  • Tishby and Zaslavsky [2015] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. CoRR, abs/1503.02406, 2015. URL http://arxiv.org/abs/1503.02406.
  • Tishby et al. [2001] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control and Computation, volume 49, 07 2001.
  • Wu et al. [2016] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.