
Entropy, Free Energy, and Work of Restricted Boltzmann Machines

Sangchul Oh soh@hbku.edu.qa Qatar Environment and Energy Research Institute, Hamad Bin Khalifa University, Qatar Foundation, P.O.Box 5825, Doha, Qatar    Abdelkader Baggag abaggag@hbku.edu.qa Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar Foundation, P.O.Box 5825, Doha, Qatar    Hyunchul Nha hyunchul.nha@qatar.tamu.edu Department of Physics, Texas A&M University at Qatar, Education City, P.O.Box 23874, Doha, Qatar
Abstract

A restricted Boltzmann machine is a generative probabilistic graphical network. The probability of finding the network in a certain configuration is given by the Boltzmann distribution. Given training data, the network is trained by optimizing the parameters of its energy function. In this paper, we analyze the training process of the restricted Boltzmann machine in the context of statistical physics. As an illustration, for small-size bar-and-stripe patterns, we calculate thermodynamic quantities such as entropy, free energy, and internal energy as a function of the training epoch. We demonstrate the growth of the correlation between the visible and hidden layers via the subadditivity of entropies as the training proceeds. Using Monte Carlo simulation of trajectories of the visible and hidden vectors in configuration space, we also calculate the distribution of the work done on the restricted Boltzmann machine by switching the parameters of the energy function. We discuss the Jarzynski equality, which connects the path average of the exponential function of the work and the difference in free energies before and after training.

Restricted Boltzmann machines, Entropy, Subadditivity of Entropy, Jarzynski Equality

I Introduction

A restricted Boltzmann machine (RBM) Smolensky (1986) is a generative probabilistic neural network. RBMs and general Boltzmann machines are described by a parametrized probability distribution, namely the Boltzmann distribution. An RBM is an undirected Markov random field and is considered a basic building block of deep neural networks. RBMs have been applied widely, for example, to dimensionality reduction, classification, feature learning, pattern recognition, and topic modeling Hinton (2012); Fischer and Igel (2014); Melchior et al. (2016).

As its name implies, an RBM is closely connected to physics, and the two fields share important concepts such as entropy and free energy Mehta et al. (2019). Recently, RBMs have attracted renewed attention in physics since Carleo and Troyer Carleo and Troyer (2017) showed that a quantum many-body state could be efficiently represented by an RBM. Gabrié et al. and Tramel et al. Tramel et al. (2018) employed the Thouless-Anderson-Palmer mean-field approximation, originally developed for spin glasses, to replace the Gibbs sampling of contrastive-divergence training. Amin et al. Amin et al. (2018) proposed a quantum Boltzmann machine based on the quantum Boltzmann distribution of a quantum Hamiltonian. More interestingly, there is a deep connection between Boltzmann machines and tensor networks of quantum many-body systems Stoudenmire and Schwab (2016); Gao and Duan (2017); Chen et al. (2018); Das Sarma et al. (2019); Huggins et al. (2019). Xia and Kais combined the restricted Boltzmann machine with quantum algorithms to calculate the electronic energy of small molecules Xia and Kais (2018).

While the working principles of RBMs are well established, a better understanding of the RBM is still needed for further applications. In this paper, we investigate the RBM from the perspective of statistical physics. As an illustration, for bar-and-stripe pattern data, thermodynamic quantities such as the entropy, the internal energy, the free energy, and the work are calculated as a function of the training epoch. Since the RBM is a bipartite system composed of visible and hidden layers, it is interesting, and informative, to see how the correlation between the two layers grows as the training goes on. We show that the total entropy of the RBM is always less than the sum of the entropies of the visible and hidden layers, except at the initial time when the training begins. This is the so-called subadditivity of entropy, indicating that the visible layer becomes correlated with the hidden layer as the training proceeds. Training the RBM adjusts the parameters of the energy function, which can be considered, from a thermodynamic point of view, as work done on the RBM. Using Monte Carlo simulation of the trajectories of the visible and hidden vectors in configuration space, we calculate the work of a single trajectory and the statistics of the work over the ensemble of trajectories. We also examine the Jarzynski equality, which connects the ensemble average of the work done on the RBM and the difference in free energies before and after training of the RBM.

The paper is organized as follows. In Section II, a detailed analysis of the RBM from the statistical physics point of view is presented. In Section III, we summarize the results together with a discussion.

II Statistical Physics of Restricted Boltzmann Machines

II.1 Restricted Boltzmann machines

Figure 1: Graph structure of a restricted Boltzmann machine with the visible layer and the hidden layer.

Let us start with a brief introduction to the RBM Smolensky (1986); Hinton (2012); Fischer and Igel (2014). As shown in Fig. 1, the RBM is composed of two layers: the visible layer and the hidden layer. Possible configurations of the visible and hidden layers are represented by the random binary vectors \mathbf{v}=(v_{1},\dots,v_{N})\in\{0,1\}^{N} and \mathbf{h}=(h_{1},\dots,h_{M})\in\{0,1\}^{M}, respectively. The interaction between the visible and hidden layers is given by the so-called weight matrix w\in\mathbb{R}^{N\times M}, where the weight w_{ij} is the connection strength between a visible unit v_{i} and a hidden unit h_{j}. The biases b_{i}\in\mathbb{R} and c_{j}\in\mathbb{R} are applied to visible unit i and hidden unit j, respectively. Given random vectors \mathbf{v} and \mathbf{h}, the energy function of the RBM is written as an Ising-type Hamiltonian

E(\mathbf{v},\mathbf{h};\theta)=-\sum_{i=1}^{N}\sum_{j=1}^{M}w_{ij}v_{i}h_{j}-\sum_{i=1}^{N}b_{i}v_{i}-\sum_{j=1}^{M}c_{j}h_{j}\,, (1)

where the set of model parameters is denoted by \theta\equiv\{w_{ij},b_{i},c_{j}\}. The joint probability of finding the RBM in configuration (\mathbf{v},\mathbf{h}) is given by the Boltzmann distribution

p(\mathbf{v},\mathbf{h};\theta)=\frac{e^{-E(\mathbf{v},\mathbf{h};\theta)}}{Z(\theta)}\,, (2)

where the partition function, Z(\theta)\equiv\sum_{\mathbf{v},\mathbf{h}}e^{-E(\mathbf{v},\mathbf{h};\theta)}, is the sum over all possible configurations. The marginal probabilities p(\mathbf{v};\theta) and p(\mathbf{h};\theta) of the visible and hidden layers are obtained by summing over the hidden or visible variables, respectively,

p(\mathbf{v};\theta)=\sum_{\mathbf{h}}p(\mathbf{v},\mathbf{h};\theta)=\frac{1}{Z(\theta)}\sum_{\mathbf{h}}e^{-E(\mathbf{v},\mathbf{h};\theta)}\,, (3a)
p(\mathbf{h};\theta)=\sum_{\mathbf{v}}p(\mathbf{v},\mathbf{h};\theta)=\frac{1}{Z(\theta)}\sum_{\mathbf{v}}e^{-E(\mathbf{v},\mathbf{h};\theta)}\,. (3b)
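For the small systems considered below, these quantities can be evaluated by brute-force enumeration. The following Python sketch (a minimal illustration with our own function names, not the code used to produce the results of this paper) computes the partition function, the joint distribution of Eq. (2), and the marginals of Eqs. (3a)-(3b).

```python
import itertools
import numpy as np

def energy(v, h, w, b, c):
    """Ising-type energy of Eq. (1): E = -v.w.h - b.v - c.h."""
    return -(v @ w @ h + b @ v + c @ h)

def boltzmann_tables(w, b, c):
    """Enumerate all 2^(N+M) configurations to get Z, the joint
    distribution of Eq. (2), and the marginals of Eqs. (3a)-(3b).
    Feasible only for small N and M, as in the 2x2 bar-and-stripe example."""
    N, M = w.shape
    vs = np.array(list(itertools.product([0, 1], repeat=N)))
    hs = np.array(list(itertools.product([0, 1], repeat=M)))
    E = np.array([[energy(v, h, w, b, c) for h in hs] for v in vs])
    weights = np.exp(-E)            # temperature T = 1
    Z = weights.sum()
    p_joint = weights / Z           # p(v, h; theta)
    p_v = p_joint.sum(axis=1)       # p(v; theta)
    p_h = p_joint.sum(axis=0)       # p(h; theta)
    return Z, p_joint, p_v, p_h
```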

The training of the RBM adjusts the model parameters \theta such that the marginal probability of the visible layer, p(\mathbf{v};\theta), becomes as close as possible to the unknown probability p_{\rm data}(\mathbf{v}) that generates the training data. Given independently and identically sampled training data {\cal D}=\{\mathbf{v}^{(1)},\dots,\mathbf{v}^{(D)}\}, the optimal model parameters \theta can be obtained by maximizing the likelihood function of the parameters, {\cal L}(\theta|{\cal D})=\prod_{i=1}^{D}p(\mathbf{v}^{(i)};\theta), or equivalently by maximizing the log-likelihood function \ln{\cal L}(\theta|{\cal D})=\sum_{i=1}^{D}\ln p(\mathbf{v}^{(i)};\theta). Maximizing the likelihood function is equivalent to minimizing the Kullback-Leibler divergence, or relative entropy, between the data distribution q(\mathbf{v}) and the model distribution p(\mathbf{v};\theta) Kullback and Leibler (1951); Cover and Thomas (2006)

D_{\rm KL}(q\,||\,p)=\sum_{\mathbf{v}}q(\mathbf{v})\ln\frac{q(\mathbf{v})}{p(\mathbf{v};\theta)}\,, (4)

where q(\mathbf{v}) is the unknown probability distribution that generates the training data. Another way of monitoring the progress of training is the cross-entropy cost between an input visible vector \mathbf{v}^{(i)} and the corresponding reconstructed visible vector \bar{\mathbf{v}}^{(i)} of the RBM,

C=-\frac{1}{D}\sum_{i=1}^{D}\left[\mathbf{v}^{(i)}\ln\bar{\mathbf{v}}^{(i)}+(1-\mathbf{v}^{(i)})\ln(1-\bar{\mathbf{v}}^{(i)})\right]\,. (5)
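For reference, the two monitoring quantities of Eqs. (4) and (5) can be computed as follows (a sketch with our own function names; it assumes the reconstructed visible vector is given as probabilities in (0,1), e.g., the sigmoid activations after one Gibbs pass).

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """Kullback-Leibler divergence D_KL(q || p) of Eq. (4).
    q and p are probability vectors over all visible configurations."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0                      # terms with q(v) = 0 contribute nothing
    return np.sum(q[mask] * np.log(q[mask] / (p[mask] + eps)))

def cross_entropy_cost(V, V_bar, eps=1e-12):
    """Reconstruction cross entropy of Eq. (5).
    V is a (D, N) array of training vectors, V_bar the reconstructed
    visible probabilities returned by one Gibbs pass of the RBM."""
    V_bar = np.clip(V_bar, eps, 1.0 - eps)
    return -np.mean(np.sum(V * np.log(V_bar) + (1 - V) * np.log(1 - V_bar), axis=1))
```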

The stochastic gradient ascent method for the log-likelihood function is used to train the RBM. Estimating the gradient of the log-likelihood requires Monte Carlo sampling from the model distribution. Well-known sampling methods are contrastive divergence, denoted by CD-k, and persistent contrastive divergence, PCD-k. For the details of the RBM training algorithm, see Refs. Hinton (2012); Fischer and Igel (2014); Melchior et al. (2016). Here we employ the CD-k method.
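A minimal CD-k parameter update might look as follows in Python (a simplified sketch of ours, assuming binary units with the standard sigmoid conditionals; it is not a reproduction of the code used for the results reported here).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V, w, b, c, k=1, lr=0.15):
    """One CD-k parameter update on a mini-batch V of shape (D, N).
    The positive phase uses the data; the negative phase runs k steps of
    block Gibbs sampling starting from the data."""
    ph_data = sigmoid(V @ w + c)              # p(h_j = 1 | v) for the data
    v_model = V.copy()
    for _ in range(k):                        # k Gibbs steps
        h_sample = (rng.random(ph_data.shape) < sigmoid(v_model @ w + c)).astype(float)
        v_model = (rng.random(V.shape) < sigmoid(h_sample @ w.T + b)).astype(float)
    ph_model = sigmoid(v_model @ w + c)
    # gradient ascent on the log-likelihood, approximated by CD-k
    D = V.shape[0]
    w += lr * (V.T @ ph_data - v_model.T @ ph_model) / D
    b += lr * (V - v_model).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return w, b, c
```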

II.2 Free energy, entropy, and internal energy

From the physics point of view, the RBM is a finite classical system composed of two subsystems, similar to an Ising spin system. The training of the RBM can be considered as driving the system from an initial equilibrium state to a target equilibrium state by switching the model parameters. It is interesting to see how thermodynamic quantities such as the free energy, entropy, internal energy, and work change as the training progresses.

It is straightforward to write down various thermodynamic quantities for the total system. The free energy F is given by the logarithm of the partition function Z,

F(\theta)=-\ln Z(\theta)\,. (6)

The internal energy U is given by the expectation value of the energy function E(\mathbf{v},\mathbf{h};\theta),

U(\theta)=\sum_{\mathbf{v},\mathbf{h}}E(\mathbf{v},\mathbf{h};\theta)\,p(\mathbf{v},\mathbf{h};\theta)\,. (7)

The entropy S of the total system comprising the visible and hidden layers is given by

S(\theta)=-\sum_{\mathbf{v},\mathbf{h}}p(\mathbf{v},\mathbf{h};\theta)\ln p(\mathbf{v},\mathbf{h};\theta)\,. (8)

Here, the convention 0\ln 0=0 is employed when p(\mathbf{v},\mathbf{h})=0. The free energy (6) is related to the internal energy (7) and the entropy (8) by

F=U-TS\,, (9)

where the temperature T is set to 1.
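Given the brute-force tables above, the thermodynamic quantities of Eqs. (6)-(9) follow directly. The sketch below is again an illustration for the small benchmark only and reuses the hypothetical boltzmann_tables helper defined earlier.

```python
import numpy as np

def thermodynamics(w, b, c):
    """Free energy, internal energy, and entropy of Eqs. (6)-(8) at T = 1."""
    Z, p_joint, _, _ = boltzmann_tables(w, b, c)   # helper sketched above
    F = -np.log(Z)                                  # Eq. (6)
    # recover E(v, h) from p = exp(-E)/Z, valid since all p > 0
    E = -np.log(p_joint * Z)
    U = np.sum(E * p_joint)                         # Eq. (7)
    S = -np.sum(p_joint * np.log(p_joint))          # Eq. (8)
    assert np.isclose(F, U - S)                     # Eq. (9) with T = 1
    return F, U, S
```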

Figure 2: The 6 samples of 2\times 2 bar-and-stripe patterns used as the training data in this work. Each configuration is represented by a visible vector \mathbf{v}\in\{0,1\}^{2\times 2} or by a decimal number in row-major ordering: (0,0,0,0)=0, (0,0,1,1)=3, (0,1,0,1)=5, (1,0,1,0)=10, (1,1,0,0)=12, (1,1,1,1)=15.

Generally, it is very challenging to calculate these thermodynamic quantities, even numerically, because the number of possible configurations of N visible units and M hidden units grows exponentially as 2^{N+M}. Here, for a feasible benchmark test, the 2\times 2 bar-and-stripe data are considered Hinton and Sejnowski (1986); MacKay (2002). Fig. 2 shows the 6 bar-and-stripe patterns out of the 16 possible 2\times 2 configurations, which are used as the training data in this work. We take the sizes of the visible and hidden layers as N=4 and M=6, respectively. In order to understand better how the RBM is trained, the thermodynamic quantities are calculated numerically for this small benchmark system.
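The 2\times 2 bar-and-stripe training set of Fig. 2 can be generated programmatically. The sketch below (an illustration of ours, assuming the row-major flattening used in Fig. 2) enumerates the patterns whose rows or columns are constant.

```python
import itertools
import numpy as np

def bar_and_stripe(n=2):
    """All n x n patterns whose rows are constant or whose columns are
    constant, flattened in row-major order. For n = 2 this yields the
    6 patterns of Fig. 2: 0, 3, 5, 10, 12, 15 in decimal."""
    patterns = set()
    for bits in itertools.product([0, 1], repeat=n):
        patterns.add(tuple(np.repeat(bits, n)))        # constant rows
        patterns.add(tuple(np.tile(bits, n)))          # constant columns
    return np.array(sorted(patterns))

data = bar_and_stripe(2)
print([int("".join(map(str, v)), 2) for v in data])    # -> [0, 3, 5, 10, 12, 15]
```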

Figure 3: (a) Bias b_{i} on visible unit i and bias c_{j} on hidden unit j as a function of epoch. (b) Weights w_{ij} connecting visible unit i and hidden unit j as a function of epoch.
Figure 4: Marginal probabilities p(\mathbf{v}) of the visible layer and p(\mathbf{h}) of the hidden layer (a) before training and (b) after training. The binary vector \mathbf{v} or \mathbf{h} on the x-axis is represented by a decimal number as noted in the caption of Fig. 2. The visible and hidden layers have 2^{4}=16 and 2^{6}=64 configurations, respectively. The learning rate is 0.15, the number of training epochs is 20000, and CD-k is used with k=100.

Fig. 3 shows how the weights w_{ij}, the bias b_{i} on visible unit i, and the bias c_{j} on hidden unit j change as the training goes on. The weights w_{ij} are clustered into 3 classes. The evolution of the biases b_{i} on the visible layer is somewhat different from that of the biases c_{j} on the hidden layer: the changes in c_{j} are larger than those in b_{i}. Fig. 4 shows the marginal probabilities p(\mathbf{v}) of the visible layer and p(\mathbf{h}) of the hidden layer before and after training. Note that the marginal probability p(\mathbf{v}) after training is not distributed exclusively over the 6 possible outcomes of the training data set in Fig. 2.

Typically, the progress of learning of the RBM is monitored by a loss function. Here, the Kullback-Leibler divergence, Eq. (4), and the reconstruction cross entropy, Eq. (5), are used. Fig. 5 plots the reconstruction cross entropy C, the Kullback-Leibler divergence D_{\rm KL}, the entropy S, the free energy F, and the internal energy U as a function of epoch. As shown in Fig. 5(a), it is interesting that even after a large number of epochs, \sim 10{,}000, the cost function C continues approaching zero while the entropy S and the Kullback-Leibler divergence D_{\rm KL} become steady. On the other hand, the free energy F continues decreasing together with the internal energy U, as depicted in Fig. 5(b). The Kullback-Leibler divergence is a well-known indicator of the performance of RBMs. Our result thus implies that the entropy may be another good indicator for monitoring the progress of the RBM, while the other thermodynamic quantities may not be.

Figure 5: For the 2\times 2 bar-and-stripe data, (a) cost function C, entropy S, and Kullback-Leibler divergence D_{\rm KL}(q\,||\,p) as a function of epoch. (b) Free energy F, entropy S, and internal energy U of the RBM as a function of epoch.

In addition to the thermodynamic quantities of the total system of the RBM, Eqs. (6), (7), and (8), it is interesting to see how the two subsystems of the RBM evolve. Since the RBM has no intra-layer connections, the correlation between the visible layer and the hidden layer may increase as the training proceeds. This correlation can be measured by the difference between the total entropy and the sum of the entropies of the two subsystems. The entropies of the visible and hidden layers are given by

S_{V}=-\sum_{\mathbf{v}}p(\mathbf{v};\theta)\ln p(\mathbf{v};\theta)\,, (10a)
S_{H}=-\sum_{\mathbf{h}}p(\mathbf{h};\theta)\ln p(\mathbf{h};\theta)\,. (10b)

The entropy S_{V} of the visible layer is closely related to the Kullback-Leibler divergence between p(\mathbf{v};\theta) and the unknown probability q(\mathbf{v}) which produces the data. Eq. (4) can be expanded as

D_{\rm KL}(q\,||\,p)=\sum_{\mathbf{v}}q(\mathbf{v})\ln q(\mathbf{v})-\sum_{\mathbf{v}}q(\mathbf{v})\ln p(\mathbf{v};\theta)\,. (11)

The second term, -\sum_{\mathbf{v}}q(\mathbf{v})\ln p(\mathbf{v};\theta), depends on the parameters \theta. As the training proceeds, p(\mathbf{v};\theta) becomes close to q(\mathbf{v}), so the behavior of the second term is very similar to that of the entropy S_{V} of the visible layer. If the training is perfect, we have q(\mathbf{v})=p(\mathbf{v};\theta), which leads to D_{\rm KL}(q\,||\,p)=0 while S_{V} remains nonzero.

Figure 6: (a) Kullback-Leibler divergence D_{\rm KL}(q\,||\,p), entropy S_{V}, and their difference as a function of epoch. (b) Entropy S of the total system, entropy S_{V} of the visible layer, entropy S_{H} of the hidden layer, and the difference S-S_{V}-S_{H} as a function of epoch.

The difference between the total entropy and the sum of the entropies of subsystems is written as

S-(S_{V}+S_{H})=\sum_{\mathbf{v},\mathbf{h}}p(\mathbf{v},\mathbf{h})\ln\left[\frac{p(\mathbf{v})\,p(\mathbf{h})}{p(\mathbf{v},\mathbf{h})}\right]\,. (12)

Eq. (12) shows that if the visible random vector \mathbf{v} and the hidden random vector \mathbf{h} are independent, i.e., p(\mathbf{v},\mathbf{h};\theta)=p(\mathbf{v};\theta)\,p(\mathbf{h};\theta), then the entropy S of the total system is the sum of the entropies of the subsystems. In general, the entropy S of the total system is always less than or equal to the sum of the entropy S_{V} of the visible layer and the entropy S_{H} of the hidden layer Reif (1965),

S\leq S_{V}+S_{H}\,. (13)

This is the subadditivity of entropy, one of the basic properties of the Shannon entropy, which also holds for the von Neumann entropy Araki and Lieb (1970); Nielsen and Chuang (2000). It can be proved using the inequality -\ln x\geq-x+1. Alternatively, Eq. (13) may be proved using the log-sum inequality, which states that for two sets of nonnegative numbers a_{1},\dots,a_{n} and b_{1},\dots,b_{n},

\sum_{i}a_{i}\log\frac{a_{i}}{b_{i}}\geq\left(\sum_{i}a_{i}\right)\log\frac{\sum_{i}a_{i}}{\sum_{i}b_{i}}\,. (14)

In other words, Eq. (12) is the negative of the relative entropy, or Kullback-Leibler divergence, of the joint probability p(\mathbf{v},\mathbf{h}) with respect to the product probability p(\mathbf{v})\,p(\mathbf{h}), i.e., the mutual information between the two layers,

I\bigl(p(\mathbf{v},\mathbf{h})\,||\,p(\mathbf{v})p(\mathbf{h})\bigr)=\sum_{\mathbf{v},\mathbf{h}}p(\mathbf{v},\mathbf{h})\log\left[\frac{p(\mathbf{v},\mathbf{h})}{p(\mathbf{v})p(\mathbf{h})}\right]\,. (15)
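For the enumerable system considered here, the entropies of Eqs. (8) and (10) and the mutual information of Eq. (15) can be computed directly. The sketch below (reusing the hypothetical boltzmann_tables helper introduced earlier) also checks the subadditivity of Eq. (13).

```python
import numpy as np

def entropies_and_mutual_information(w, b, c):
    """Entropies of Eqs. (8) and (10) and the mutual information of Eq. (15),
    by exhaustive enumeration."""
    _, p_joint, p_v, p_h = boltzmann_tables(w, b, c)
    S   = -np.sum(p_joint * np.log(p_joint))
    S_V = -np.sum(p_v * np.log(p_v))
    S_H = -np.sum(p_h * np.log(p_h))
    I   = np.sum(p_joint * np.log(p_joint / np.outer(p_v, p_h)))
    # subadditivity, Eq. (13): S <= S_V + S_H, with I = S_V + S_H - S >= 0
    assert S <= S_V + S_H + 1e-12 and np.isclose(I, S_V + S_H - S)
    return S, S_V, S_H, I
```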

For the 2\times 2 bar-and-stripe patterns, the entropies S_{V} and S_{H} of the visible and hidden layers are calculated numerically. Fig. 6 plots the entropies S_{V}, S_{H}, and S, and the Kullback-Leibler divergence D_{\rm KL}(q\,||\,p) as a function of epoch. Fig. 6(a) shows that the Kullback-Leibler divergence D_{\rm KL}(q\,||\,p) becomes saturated, though above zero, as the training proceeds. Similarly, the entropy S_{V} of the visible layer saturates. This implies that the entropy of the visible layer, as well as the total entropy shown in Fig. 5, can be a better indicator of learning than the reconstruction cross entropy C, Eq. (5). The same can also be said about the entropy S_{H} of the hidden layer.

The difference between the total entropy and the sum of the entropies of the two subsystems, S-(S_{V}+S_{H}), becomes less than 0, as shown in Fig. 6(b). This demonstrates the subadditivity of entropy, i.e., the buildup of correlation between the visible and hidden layers as the training proceeds. Since it saturates, just as the total entropy and the entropies of the visible and hidden layers do after a large number of epochs, the correlation between the visible layer and the hidden layer can also be a good quantifier of the RBM's progress.

II.3 Work, free energy, and Jarzynski equality

The training of the RBM may be viewed as driving a finite classical spin system from an initial equilibrium state to a final equilibrium state by changing the system parameters \theta slowly. If the parameters \theta are switched infinitely slowly, the system remains in quasi-static equilibrium. In this case, the total work done on the system is equal to the Helmholtz free energy difference between before and after training, W_{\infty}=F_{1}-F_{0}. For switching \theta at a finite rate, the system may not relax immediately to an equilibrium state, and the work done on the system depends on the specific path taken in configuration space. Jarzynski Jarzynski (1997, 2011) proved that for any switching rate the free energy difference \Delta F is related to the average of the exponential function of the work W over the paths,

\langle e^{-W}\rangle_{\rm path}=e^{-\Delta F}\,. (16)

The RBM is trained by changing the parameters \theta through a sequence \{\theta_{0},\theta_{1},\dots,\theta_{\tau}\}, as shown in Fig. 3. To calculate the work done during the training, we perform a Monte Carlo simulation of the trajectory of a state (\mathbf{v},\mathbf{h}) of the RBM in configuration space. Starting from an initial configuration (\mathbf{v}_{0},\mathbf{h}_{0}) sampled from the initial Boltzmann distribution, Eq. (2), the trajectory (\mathbf{v}_{0},\mathbf{h}_{0})\to(\mathbf{v}_{1},\mathbf{h}_{1})\to\cdots\to(\mathbf{v}_{\tau},\mathbf{h}_{\tau}) is generated using the Metropolis-Hastings algorithm of the Markov chain Monte Carlo method Metropolis et al. (1953); Hastings (1970). Assuming the evolution is Markovian, the probability of taking a specific trajectory is the product of the transition probabilities at each step,

p(\mathbf{v}_{0},\mathbf{h}_{0}\stackrel{\theta_{1}}{\longrightarrow}\mathbf{v}_{1},\mathbf{h}_{1})\,p(\mathbf{v}_{1},\mathbf{h}_{1}\stackrel{\theta_{2}}{\longrightarrow}\mathbf{v}_{2},\mathbf{h}_{2})\cdots p(\mathbf{v}_{\tau-1},\mathbf{h}_{\tau-1}\stackrel{\theta_{\tau}}{\longrightarrow}\mathbf{v}_{\tau},\mathbf{h}_{\tau})\,. (17)

The transition (\mathbf{v},\mathbf{h})\to(\mathbf{v}^{\prime},\mathbf{h}^{\prime}) is implemented by the Metropolis-Hastings algorithm based on the detailed balance condition at fixed parameters \theta,

\frac{p(\mathbf{v},\mathbf{h}\stackrel{\theta}{\longrightarrow}\mathbf{v}^{\prime},\mathbf{h}^{\prime})}{p(\mathbf{v},\mathbf{h}\stackrel{\theta}{\longleftarrow}\mathbf{v}^{\prime},\mathbf{h}^{\prime})}=\frac{e^{-E(\mathbf{v}^{\prime},\mathbf{h}^{\prime};\theta)}}{e^{-E(\mathbf{v},\mathbf{h};\theta)}}\,. (18)
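A minimal Metropolis-Hastings update consistent with the detailed balance condition of Eq. (18) may look as follows (our own single-spin-flip sketch, reusing the energy function sketched earlier; it is not necessarily the proposal scheme used to produce the figures).

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_step(v, h, w, b, c):
    """One Metropolis-Hastings update at fixed theta = (w, b, c).
    Proposes flipping a single randomly chosen visible or hidden unit and
    accepts with probability min(1, exp(-Delta E)), which satisfies Eq. (18)."""
    N, M = len(v), len(h)
    v_new, h_new = v.copy(), h.copy()
    k = rng.integers(N + M)
    if k < N:
        v_new[k] = 1 - v_new[k]
    else:
        h_new[k - N] = 1 - h_new[k - N]
    dE = energy(v_new, h_new, w, b, c) - energy(v, h, w, b, c)
    if rng.random() < np.exp(-dE):        # accept or reject the flip
        return v_new, h_new
    return v, h
```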

The work done on the RBM at epoch i is given by

\delta W_{i}=E(\mathbf{v}_{i},\mathbf{h}_{i};\theta_{i+1})-E(\mathbf{v}_{i},\mathbf{h}_{i};\theta_{i})\,. (19)

The total work W=\sum_{i}\delta W_{i} performed on the system is then Crooks (1998)

W=\sum_{i=0}^{\tau-1}\left[E(\mathbf{v}_{i},\mathbf{h}_{i};\theta_{i+1})-E(\mathbf{v}_{i},\mathbf{h}_{i};\theta_{i})\right]\,. (20)
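The accumulation of the work of Eqs. (19)-(20) along a single trajectory can then be sketched as follows (again an illustration of ours, alternating a parameter switch with Metropolis-Hastings relaxation steps; thetas denotes the recorded parameter sequence, and energy and metropolis_step are the helpers sketched above).

```python
def work_along_trajectory(v0, h0, thetas, n_sweeps=1):
    """Accumulate the work of Eq. (20) along one trajectory.
    thetas is the recorded sequence [(w_0, b_0, c_0), ..., (w_tau, b_tau, c_tau)];
    between parameter switches the state relaxes via Metropolis-Hastings steps."""
    v, h = v0.copy(), h0.copy()
    W = 0.0
    for (w0, b0, c0), (w1, b1, c1) in zip(thetas[:-1], thetas[1:]):
        # work of switching theta_i -> theta_{i+1} at the frozen state, Eq. (19)
        W += energy(v, h, w1, b1, c1) - energy(v, h, w0, b0, c0)
        # let the state evolve under the new parameters, Eq. (18)
        for _ in range(n_sweeps):
            v, h = metropolis_step(v, h, w1, b1, c1)
    return W
```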
Figure 7: Heat map of the energy function E(\mathbf{v},\mathbf{h};\theta), representing the energy of each configuration, after training on the 2\times 2 bar-and-stripe patterns for 50000 epochs. The sizes of the visible and hidden layers are N=4 and M=6, respectively. The learning rate is r=0.15 and the CD-k parameter is k=100. The vertical and horizontal axes label the configurations of the visible and hidden layers, respectively. The black tiles mark the lowest-energy configurations, for which the probability of finding the configuration is highest.
Figure 8: Markov chain Monte Carlo trajectories of the visible vector \mathbf{v}_{i} as a function of epoch. The visible vector jumps frequently in the early stage of training and becomes trapped in one of the target states as the training proceeds.

Given the sequence of model parameters \{\theta_{0},\theta_{1},\dots,\theta_{\tau}\}, the Markov evolution of the visible and hidden vectors (\mathbf{v},\mathbf{h})\in\{0,1\}^{N+M} may be considered a discrete random walk in which the walkers move toward points of low energy in configuration space. Fig. 7 shows the heat map of the energy function E(\mathbf{v},\mathbf{h};\theta) of the RBM after training on the 2\times 2 bar-and-stripe patterns. The energy function has deep minima at the visible vectors corresponding to the bar-and-stripe patterns of the training data set in Fig. 2, indicating that the probability of generating the trained patterns is high. Note also that the energy function has many local minima. Fig. 8 plots a few Monte Carlo trajectories of the visible vector \mathbf{v} as a function of epoch. Before training, the visible vector \mathbf{v} is distributed over all possible configurations, labeled by the numbers (0,\dots,15). As the training progresses, the visible vector \mathbf{v} becomes trapped in one of the six target outcomes (0,3,5,10,12,15).

Figure 9: Distribution of the work done on the RBM during training. The number of Monte Carlo samples is 50000. The red curve is a Gaussian distribution with the mean and standard deviation obtained from the Monte Carlo simulation.
Figure 10: Average work, with its standard deviation, and the free energy difference \Delta F=F({\rm epoch})-F({\rm epoch}=0) as a function of epoch. The error bars on the work represent the standard deviation of the Gaussian distribution.

In order to examine the relation between the work done on the RBM during training and the free energy difference, a Monte Carlo simulation is performed to calculate the average of the work over paths generated by the Metropolis-Hastings algorithm. Each path starts from an initial state sampled from the uniform distribution over configuration space, as shown in Fig. 4(a). Since the work done on the system depends on the path, the distribution of the work is obtained by generating many trajectories. Fig. 9 shows the distribution of the work over 50000 paths at 5000 training epochs. The Monte Carlo average of the work is \langle W\rangle\approx-5.481, and its standard deviation is \sigma_{W}\approx 3.358. The distribution of the work generated by the Monte Carlo simulation is well fitted by a Gaussian distribution, as depicted by the red curve in Fig. 9. This agrees with the statement in Ref. Jarzynski (2011) that for slow switching of the model parameters the probability distribution of the work is approximately Gaussian.

We perform a Monte Carlo calculation of the exponential average of the work, \langle e^{-W}\rangle_{\rm path}, to check the Jarzynski equality, Eq. (16). The free energy difference can be estimated as

e^{-\Delta F}=\langle e^{-W}\rangle_{\rm path}\approx\frac{1}{N_{\rm mc}}\sum_{n=1}^{N_{\rm mc}}e^{-W_{n}}\,, (21)

where N_{\rm mc} is the number of Monte Carlo samples. At small epoch numbers, the Monte Carlo estimate of the free energy difference is close to \Delta F calculated from the partition function. However, the Monte Carlo calculation gives a poor estimate of the free energy difference once the epoch number exceeds about 5000. This numerical error can be explained by the fact that the exponential average of the work is dominated by rare realizations Jarzynski (2006); Zuckerman and Woolf (2002); Lechner et al. (2006); Lechner and Dellago (2007); Halpern and Jarzynski (2016). As shown in Fig. 9, the distribution of the work is a Gaussian \rho(W) with mean \langle W\rangle and standard deviation \sigma_{W}. As \sigma_{W} becomes larger, the peak of \rho(W)e^{-W} moves into the long tail of the Gaussian distribution, so the main contribution to the integral defining \langle e^{-W}\rangle comes from rare realizations. Fig. 10 shows that the standard deviation \sigma_{W} grows with epoch, so the error of the Monte Carlo estimate of the exponential average of the work grows quickly.
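In practice, the estimator of Eq. (21), together with the second-cumulant approximation of Eq. (22) below, can be computed from an array of sampled work values, as in the following sketch (our own illustration, with k_{B}T=1 and work values collected, e.g., from the trajectory routine sketched above).

```python
import numpy as np

def jarzynski_delta_F(works):
    """Estimate Delta F from sampled work values via Eq. (21):
    exp(-Delta F) ~ (1/N_mc) sum_n exp(-W_n).
    The log-sum-exp form avoids overflow for strongly negative work values."""
    works = np.asarray(works, dtype=float)
    n = len(works)
    return -(np.logaddexp.reduce(-works) - np.log(n))

def cumulant_delta_F(works):
    """Second-cumulant (Gaussian) approximation, cf. Eq. (22) below, with k_B T = 1:
    Delta F ~ <W> - sigma_W^2 / 2. Reliable only when the work fluctuations are small."""
    works = np.asarray(works, dtype=float)
    return works.mean() - works.var() / 2.0
```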

If \sigma_{W}^{2}\ll k_{B}T, the free energy difference is related to the average of the work and its variance by

\Delta F=\langle W\rangle_{\rm path}-\frac{\sigma_{W}^{2}}{2k_{B}T}\,. (22)

Here, the situation is the opposite: the spread of the work values is large, i.e., \sigma_{W}^{2}\gg k_{B}T\,(=1), so the central limit theorem does not apply and the above equation cannot be used Hendrix and Jarzynski (2001). Fig. 10 shows how the average work \langle W\rangle_{\rm path} over the Markov chain Monte Carlo paths changes as a function of epoch. The standard deviation of the Gaussian distribution of the work also grows with the training epoch. The free energy difference between before and after training is called the reversible work, W_{r}=\Delta F. The difference between the actual work and the reversible work is called the dissipative work, W_{d}=W-W_{r} Crooks (1998). As depicted in Fig. 10, the magnitude of the dissipative work grows with the training epoch.

III Summary

In summary, we analyzed the training process of the RBM in the context of statistical physics. In addition to the typical loss function, i.e., the reconstruction cross entropy, thermodynamic quantities such as the free energy F, the internal energy U, and the entropy S were calculated as a function of epoch. While the free energy and the internal energy keep decreasing with epoch, the total entropy and the entropies of the visible and hidden layers saturate together with the Kullback-Leibler divergence after a sufficient number of epochs. This result suggests that the entropy of the system may be a good indicator of the RBM's progress, along with the Kullback-Leibler divergence. It seems worth investigating the entropy for larger data sets, for example, the MNIST handwritten digits LeCun et al. (2010), in future work.

We further demonstrated the subadditivity of the entropy, i.e., the entropy of the total system is less than the sum of the entropies of the two layers. This manifests the correlation between the visible and hidden layers growing as the training progresses. Just as the entropies saturate together with the Kullback-Leibler divergence, so does the correlation determined by the total and subsystem entropies. In this sense, the correlation between the visible and hidden layers may serve as another good indicator of RBM performance.

We also investigated the work done on the RBM by switching the parameters of the energy function. The trajectories of the visible and hidden vectors in configuration space were generated using Markov chain Monte Carlo simulation. The distribution of the work is well described by a Gaussian whose standard deviation grows with the training epoch. We discussed the Jarzynski equality, which connects the free energy difference and the average of the exponential function of the work over the trajectories.

A more detailed analysis from a full thermodynamic or statistical physics point of view can bring useful insights into the performance of RBMs. In the long run, this course of study may suggest methods for improving the performance of RBMs in many different applications. Therefore, it may be worthwhile to pursue our study further, e.g., a rigorous assessment of the scaling behavior of the thermodynamic quantities with epoch as the sizes of the visible and hidden layers increase. We also expect that a similar analysis of a quantum Boltzmann machine would be valuable.

References

  • Smolensky (1986) P. Smolensky, “Information processing in dynamical systems: foundations of harmony theory,” in Parallel distributed processing: Explorations in the microstructure of cognition, edited by D. Rumelhart and J. McLelland (MIT Press, Cambridge, 1986) pp. 194–281.
  • Hinton (2012) G. E. Hinton, “A practical guide to training restricted Boltzmann machines,” in Neural Networks: Tricks of the Trade: Second Edition, edited by G. Montavon, G. B. Orr,  and K.-R. Müller (Springer Berlin Heidelberg, Berlin, Heidelberg, 2012) pp. 599–619.
  • Fischer and Igel (2014) A. Fischer and C. Igel, “Training restricted Boltzmann machines: An introduction,” Pattern Recognition 47, 25–39 (2014).
  • Melchior et al. (2016) J. Melchior, A. Fischer,  and L. Wiskott, “How to center deep Boltzmann machines,” Journal of Machine Learning Research 17, 1–61 (2016).
  • Mehta et al. (2019) P. Mehta, M. Bukov, C.-H. Wang, A. G. R. Day, C. Richardson, C. K. Fisher,  and D. J. Schwab, “A high-bias, low-variance introduction to machine learning for physicists,” Physics Reports 810, 1–124 (2019).
  • Carleo and Troyer (2017) G. Carleo and M. Troyer, “Solving the quantum many-body problem with artificial neural networks,” Science 355, 602–606 (2017).
  • Tramel et al. (2018) E. W. Tramel, M. Gabrié, A. Manoel, F. Caltagirone,  and F. Krzakala, “Deterministic and generalized framework for unsupervised learning with restricted Boltzmann machines,” Phys. Rev. X 8, 041006 (2018).
  • Amin et al. (2018) M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy,  and R. Melko, “Quantum Boltzmann machine,” Phys. Rev. X 8, 021050 (2018).
  • Stoudenmire and Schwab (2016) E. Stoudenmire and D. J. Schwab, “Supervised learning with tensor networks,” in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon,  and R. Garnett (Curran Associates, Inc., 2016) pp. 4799–4807.
  • Gao and Duan (2017) X. Gao and L.-M. Duan, “Efficient representation of quantum many-body states with deep neural networks,” Nature Communications 8, 662 (2017).
  • Chen et al. (2018) J. Chen, S. Cheng, H. Xie, L. Wang,  and T. Xiang, “Equivalence of restricted Boltzmann machines and tensor network states,” Phys. Rev. B 97, 085104 (2018).
  • Das Sarma et al. (2019) S. Das Sarma, D.-L. Deng,  and L.-M. Duan, “Machine learning meets quantum physics,” Physics Today 72, 48–54 (2019).
  • Huggins et al. (2019) W. Huggins, P. Patil, B. Mitchell, K. B. Whaley,  and E. M. Stoudenmire, “Towards quantum machine learning with tensor networks,” Quantum Science and Technology 4, 024001 (2019).
  • Xia and Kais (2018) R. Xia and S. Kais, “Quantum machine learning for electronic structure calculations,” Nature Communications 9, 4195 (2018).
  • Kullback and Leibler (1951) S. Kullback and R. A. Leibler, “On information and sufficiency,” Ann. Math. Statist. 22, 79–86 (1951).
  • Cover and Thomas (2006) T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley, New York, 2006).
  • Hinton and Sejnowski (1986) G. E. Hinton and T. J. Sejnowski, “Learning and relearning in Boltzmann machines,” in Parallel distributed processing: Explorations in the microstructure of cognition, edited by D. E. Rumelhart and J. L. McLelland (MIT Press, Cambridge, 1986) pp. 282–317.
  • MacKay (2002) D. J. C. MacKay, Information Theory, Inference & Learning Algorithms (Cambridge University Press,New York, 2002).
  • Reif (1965) F. Reif, Fundamentals of Statistical and Thermal Physics (McGraw Hill, New York, 1965).
  • Araki and Lieb (1970) H. Araki and E. H. Lieb, “Entropy inequalities,” Communications in Mathematical Physics 18, 160–170 (1970).
  • Nielsen and Chuang (2000) Michael A. Nielsen and Isaac L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, 2000).
  • Jarzynski (1997) C. Jarzynski, “Nonequilibrium equality for free energy differences,” Phys. Rev. Lett. 78, 2690–2693 (1997).
  • Jarzynski (2011) C. Jarzynski, “Equalities and inequalities: Irreversibility and the second law of thermodynamics at the nanoscale,” Annual Review of Condensed Matter Physics 2, 329–351 (2011).
  • Metropolis et al. (1953) N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller,  and E. Teller, “Equation of state calculations by fast computing machines,” The Journal of Chemical Physics 21, 1087–1092 (1953).
  • Hastings (1970) W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika 57, 97–109 (1970).
  • Crooks (1998) G. E. Crooks, “Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems,” Journal of Statistical Physics 90, 1481–1487 (1998).
  • Jarzynski (2006) C. Jarzynski, “Rare events and the convergence of exponentially averaged work values,” Phys. Rev. E 73, 046105 (2006).
  • Zuckerman and Woolf (2002) D. M. Zuckerman and T. B. Woolf, “Theory of a systematic computational error in free energy differences,” Phys. Rev. Lett. 89, 180602 (2002).
  • Lechner et al. (2006) W. Lechner, H. Oberhofer, C. Dellago,  and P. L. Geissler, “Equilibrium free energies from fast-switching trajectories with large time steps,” The Journal of Chemical Physics 124, 044113 (2006).
  • Lechner and Dellago (2007) W. Lechner and C. Dellago, “On the efficiency of path sampling methods for the calculation of free energies from non-equilibrium simulations,” Journal of Statistical Mechanics: Theory and Experiment 2007, P04001–P04001 (2007).
  • Halpern and Jarzynski (2016) Y. N. Halpern and C. Jarzynski, “Number of trials required to estimate a free-energy difference, using fluctuation relations,” Phys. Rev. E 93, 052144 (2016).
  • Hendrix and Jarzynski (2001) D. A. Hendrix and C. Jarzynski, “A fast growth method of computing free energy differences,” The Journal of Chemical Physics 114, 5974–5981 (2001).
  • LeCun et al. (2010) Y. LeCun, C. Cortes,  and C. J. Burges, “MNIST handwritten digit database,” ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist (2010).