
Continuous-mixture Autoregressive Networks for efficient variational calculation of many-body systems

Lingxiao Wang (lwang@fias.uni-frankfurt.de), Frankfurt Institute for Advanced Studies, Ruth-Moufang-Str. 1, 60438 Frankfurt am Main, Germany, and Department of Physics, Tsinghua University, Beijing 100084, China
Yin Jiang (jiang_y@buaa.edu.cn), Department of Physics, Beihang University, Beijing 100191, China
Lianyi He (lianyi@mail.tsinghua.edu.cn), Department of Physics, Tsinghua University, Beijing 100084, China
Kai Zhou (zhou@fias.uni-frankfurt.de), Frankfurt Institute for Advanced Studies, Ruth-Moufang-Str. 1, 60438 Frankfurt am Main, Germany
Abstract

We develop deep autoregressive networks with multiple channels to directly compute many-body systems with continuous spin degrees of freedom. As a concrete example, we embed the two-dimensional XY model into the continuous-mixture networks and rediscover the Kosterlitz-Thouless (KT) phase transition on a periodic square lattice. Vortices characterizing the quasi-long-range order are accurately detected by the autoregressive neural networks. By learning the microscopic probability distributions from the macroscopic thermal distribution, the neural networks compute the free energy directly and find that free vortices and anti-vortices emerge in the high-temperature regime. For a more precise evaluation, we compute the helicity modulus to determine the KT transition temperature. Although the training process becomes more time-consuming for larger lattice sizes, the training time remains unchanged around the KT transition temperature. The continuous-mixture autoregressive networks developed here can thus potentially be used to study other many-body systems with continuous degrees of freedom.

Introduction.—Machine learning techniques are attracting widespread interest in different fields of science and technology because of their power to extract and express structures inside complex data. In particular, physicists have employed them for tasks including classification, regression, and pattern generation Buchanan (2019); Carleo et al. (2019). It was found that neural networks can classify the phase structures of many-body systems Wang (2016); Carrasquilla and Melko (2017); Pang et al. (2018); Fujimoto et al. (2018, 2020). As for regression, it was successfully applied to event selection in large data sets (e.g., from the LHCb) Metodiev and Thaler (2018); Kasieczka et al. (2019), to the spinodal decomposition in heavy-ion collisions Steinheimer et al. (2019), and to molecular structure prediction Smith et al. (2017). Furthermore, it also sheds light on new methods for first-principle calculations of many-body systems Carleo and Troyer (2017); Nagy and Savona (2019); Hartmann and Carleo (2019); Pfau et al. (2019); Vicentini et al. (2019); Yoshioka and Hamazaki (2019). The restricted Boltzmann machine was applied to solve quantum many-body systems Carleo and Troyer (2017). A deep neural network was constructed to solve the many-electron Schrödinger equation Pfau et al. (2019). It was found that suitable neural networks can serve as an efficient Ansatz to characterize many-body systems. Another interesting direction is the modification of classical algorithms with machine learning Shen et al. (2018); Mori et al. (2018); Carleo et al. (2019), which may improve or assist conventional first-principle computations.

In addition, it is natural to apply neural networks to many-body systems on lattices, since they share similar discrete architectures. In some pioneering attempts Alexandru et al. (2017); Broecker et al. (2017); Urban and Pawlowski (2018); Mori et al. (2018); Zhou et al. (2019), the training data were generated by the classical Markov Chain Monte Carlo (MCMC) method. However, the scalability and efficiency of this approach are limited by critical slowing down near the critical point Urban and Pawlowski (2018) and by the sign problem Broecker et al. (2017). Recently, a new method based on an autoregressive neural network was proposed and applied to discrete spin systems (e.g., the Ising model) Wu et al. (2019); Sharir et al. (2020). This method uses a variational Ansatz that decomposes the macroscopic probability distribution into microscopic distributions on the lattice sites. It was demonstrated that a higher computational accuracy can be achieved in solving several Ising-type systems Ou (2019).

However, it remains challenging to solve a general many-body system with continuous degrees of freedom, e.g., a continuous spin system which may exhibit a topological phase transition. Previous attempts failed when methods that work well for discrete spin systems were applied to continuous cases Cristoforetti et al. (2017). Unlike classical phase transitions, topological phase transitions occur through the emergence of topological defects. This has attracted attention from various fields of physics Background (2016). Some related efforts using both supervised and unsupervised learning have been made to handle this problem Wang and Zhai (2017); Beach et al. (2018); Suchsland and Wessel (2018); Zhang et al. (2018); Carvalho et al. (2018); Hu et al. (2019); Fukushima et al. (2019). Winding numbers were recognized by a neural network with supervised training for a one-dimensional insulator model Zhang et al. (2018). By generating configurations with MCMC sampling and supplying feature-engineered vortex configurations as input, neural networks could detect the topological phase transition from well-preprocessed configurations Beach et al. (2018). While unsupervised learning has been applied to identify topological orders Cristoforetti et al. (2017); Rodriguez-Nieva and Scheurer (2019); Scheurer and Slager (2020), an efficient variational approach based on powerful autoregressive neural networks is still missing.

In this Letter, we propose Continuous-mixture Autoregressive Networks (CANs) to solve continuous spin systems efficiently, which can be further applied to continuous field systems. As a concise reference example, we study the two-dimensional (2D) XY model on a square lattice, which exhibits the so-called Kosterlitz-Thouless (KT) phase transition Gupta et al. (1988); Kosterlitz (1974); Weber and Minnhagen (1988). The CANs are introduced to recognize the topological phase transition with continuous variables in an unsupervised manner. In such autoregressive neural networks, the microscopic state at each lattice site is modeled by a conditional probability, which builds up a joint probability for the whole configuration Wu et al. (2019); Sharir et al. (2020). In this work, we introduce a generic framework for CANs and construct suitable neural networks to study the XY model. The vortices, serving as the signal of the KT transition, are automatically generated by the neural networks. Correspondingly, the KT transition temperature of the 2D XY model can be determined more accurately by calculating the helicity modulus Weber and Minnhagen (1988). As for the computational cost, the training process inevitably becomes more time-consuming for larger lattice sizes; however, the training time remains unchanged around the transition point. Considering the advantages of the CANs, we propose further potential applications to other many-body systems with continuous degrees of freedom in the final part of this work.

Continuous-mixture Autoregressive Networks.—Let us consider a many-body system on a lattice with continuous spin degrees of freedom at each lattice site. Because the orientation of each spin changes continuously, the conditional probability q(s_i) for spin s_i at site i must also be continuous. We propose a proper mixture of beta distributions Beta(a, b) as the prior probability, which ensures that the continuous variables are randomly distributed in a finite interval. The beta distribution Beta(a, b) is defined continuously on a finite interval with two positive shape parameters a and b. The output layers of the neural networks are therefore designed with two channels for each beta component, and the conditional probabilities are

f_{\theta}(s_{i}|s_{1},\ldots,s_{i-1})=\frac{\Gamma(a_{i}+b_{i})}{\Gamma(a_{i})\Gamma(b_{i})}\,s_{i}^{a_{i}-1}(1-s_{i})^{b_{i}-1}, (1)

where Γ(a) is the gamma function, {θ} is the set of network parameters, and s_i = φ_i/2π ∈ [0, 1]. The outputs of the hidden layers are a ≡ (a_1, a_2, ...) and b ≡ (b_1, b_2, ...), which can be realized by the autoregressive neural network shown in Fig. 1.
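To make Eq. (1) concrete, the sketch below shows how per-site network outputs could parametrize a K-component beta mixture and evaluate the site-wise log-probabilities. It is a minimal illustration in PyTorch (the framework used in this work); the tensor layout and the function name are our own assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def mixture_log_prob(s, a, b, logits):
    """Site-wise log-probability of a K-component beta mixture.

    s:      (batch, 1, L, L) spins rescaled strictly inside (0, 1), as in Eq. (1)
    a, b:   (batch, K, L, L) positive shape parameters from the two output channels
    logits: (batch, K, L, L) unnormalized mixture weights
    """
    log_w = F.log_softmax(logits, dim=1)             # normalized log-weights per site
    log_comp = Beta(a, b).log_prob(s)                # per-component beta log-density
    return torch.logsumexp(log_w + log_comp, dim=1)  # (batch, L, L) site log-probs
```

Summing the returned array over the lattice gives ln q_θ(s) for each configuration, consistent with the factorization in Eq. (5) below.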

Figure 1: The architecture of the Continuous-mixture Autoregressive Networks (CANs) for the computation of a continuous spin system. It is a continuous-mixture PixelCNN structure Van Den Oord et al. (2016), in which masked layers are added to establish the autoregressive property.

As a practical example, we consider the 2D XY model on the lattice. The Hamiltonian is expressed in terms of spins living at the lattice sites with nearest-neighbor interactions,

H=-J\sum_{<i,j>}\mathbf{s}_{i}\cdot\mathbf{s}_{j}=-J\sum_{<i,j>}\cos(\phi_{i}-\phi_{j}), (2)

where <i,j> indicates that the sum runs over all nearest-neighbor pairs and the angle φ_i ∈ [0, 2π) denotes the spin orientation at site i. The Mermin-Wagner theorem forbids long-range order (LRO) in 2D systems with continuous degrees of freedom, since strong fluctuations in 2D destroy the order Wagner and Schollwoeck (2010). Nevertheless, the formation of topological defects (i.e., vortices and anti-vortices) in the XY model distinguishes the phases with and without quasi-LRO, which characterizes the global properties of the many-body system.
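For configurations on a periodic L × L lattice, the energy in Eq. (2) can be evaluated in a vectorized way. The sketch below is a minimal PyTorch version under that assumption, with each nearest-neighbor bond counted once via shifts in the two lattice directions.

```python
import torch

def xy_energy(phi, J=1.0):
    """Energy of Eq. (2) for angles phi of shape (batch, L, L)
    with periodic boundary conditions; torch.roll supplies the
    right and down neighbors, so every bond enters exactly once."""
    right = torch.cos(phi - torch.roll(phi, shifts=-1, dims=2))
    down = torch.cos(phi - torch.roll(phi, shifts=-1, dims=1))
    return -J * (right + down).sum(dim=(1, 2))
```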

To detect the KT phase transition in the XY model, we study the free energy F = -(1/β) ln Z, where β ≡ 1/(k_B T), with T the temperature. The free energy follows from the partition function Z ≡ Σ_s exp(-βE(s)), which contains all information about the system; the summation runs over all possible configurations {s}. Monte Carlo algorithms can routinely generate configurations with the proper relative importance among them. However, the free energy cannot be computed directly from Monte Carlo algorithms. Variational approaches can instead be employed to obtain a variational free energy. In this work, the variational target is the joint probability of the configurations, given by the Boltzmann distribution p(s) = e^{-βE(s)}/Z. We use a variational Ansatz for the joint distribution, denoted by q_θ(s). It is parametrized by a set of variational parameters {θ}, which are tuned to approach the target distribution p(s). The Kullback-Leibler divergence MacKay and Kay (2003) between the variational and target distributions, D_KL(q_θ‖p) ≡ E_{s∼q_θ}(-log p + log q_θ), measures the closeness of q_θ to p. Since -ln p(s) = βE(s) + ln Z and F = -(1/β) ln Z, the corresponding variational free energy F_q follows from D_KL(q_θ‖p) = β(F_q - F). We obtain

F_{q}=\frac{1}{\beta}\sum_{\mathbf{s}}q_{\theta}(\mathbf{s})\left[\beta E(\mathbf{s})+\ln q_{\theta}(\mathbf{s})\right]. (3)

Since D_KL(q_θ‖p) is non-negative, the variational free energy F_q gives an upper bound on the true free energy F. Minimizing D_KL(q_θ‖p) is thus equivalent to minimizing the variational free energy F_q. Meanwhile, as pointed out in previous works Carleo and Troyer (2017); Wu et al. (2019); Sharir et al. (2020), it is straightforward to map the parameters {θ} onto the weights of an Artificial Neural Network (ANN). The variational free energy then becomes the loss function. Using the log-derivative trick Williams (1992) we obtain

\beta\nabla_{\theta}F_{q}=\mathbb{E}_{\mathbf{s}\sim q_{\theta}}\left\{\left[\beta E(\mathbf{s})+\ln q_{\theta}(\mathbf{s})\right]\nabla_{\theta}\ln q_{\theta}(\mathbf{s})\right\}, (4)

where the gradient ∇_θ ln q_θ(s) is weighted by the reward signal βE(s) + ln q_θ(s).
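In practice, Eq. (4) can be estimated over a batch of sampled configurations. The following sketch implements the score-function (REINFORCE) estimator, with the reward detached from the computation graph; the batch-mean baseline is a common variance-reduction device we add here, not something dictated by Eq. (4).

```python
import torch

def variational_loss(log_q, energy, beta):
    """Score-function estimator of Eq. (4).

    log_q:  (batch,) ln q_theta(s) for the sampled configurations
    energy: (batch,) E(s) for the same configurations
    """
    with torch.no_grad():
        reward = beta * energy + log_q   # weight in Eq. (4), held fixed
        reward = reward - reward.mean()  # baseline for variance reduction
    return (reward * log_q).mean()       # gradient matches Eq. (4) in expectation
```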

Once the parameters {θ} are mapped onto an ANN, the variational problem reduces to training the network. Using autoregressive networks such as CANs for the parametrization, we can decompose the variational distribution into a product of conditional probabilities,

q_{\theta}(\mathbf{s})=\prod_{i=1}^{N}f_{\theta}\left(s_{i}|s_{1},\ldots,s_{i-1}\right), (5)

which provides the variational Ansatz by parametrizing each conditional probability with neural networks. Considering the symmetry, we employ the PixelCNN Van Den Oord et al. (2016), which naturally preserves the locality and the translational symmetry on a square lattice. In addition, the autoregressive property is guaranteed by putting a mask on the convolution kernel Germain et al. (2015). As shown in Fig. 1, the input layer takes in the configurations s on the lattice. After passing them through several masked convolution layers, the parameters of the beta distribution at each site are obtained in the output layer. The configuration probability q_θ(s) can then be derived, and the variational free energy can be calculated via Eq. (3) after such a forward propagation for a batch of independent configurations. Specifying the channels of the convolution layers to represent the parameters of each beta component greatly reduces the training time and speeds up the sampling. A minimal masked convolution is sketched below.
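The sketch follows the standard PixelCNN recipe (mask type 'A' for the first layer, which hides the current site, and type 'B' for deeper layers); it is an assumption about the layer-level details rather than the exact layer used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is zeroed at and after the center in
    raster-scan order, so that site i only receives inputs from sites
    preceding it, enforcing the autoregressive property."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0.0  # center row, at/after center
        mask[kh // 2 + 1:, :] = 0.0                          # all rows below center
        self.register_buffer('mask', mask[None, None])

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```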

Training the neural networks is the key to performing the variational approach; we use a classical back-propagation algorithm. The nuts-and-bolts computation with CANs proceeds as follows: (i) with the randomly initialized network, sample a batch of independent configurations as the training set; (ii) pass the training set forward to evaluate the log-probability ln q_θ and the variational free energy F_q; (iii) estimate the gradient ∇_θ F_q and update the network weights via back-propagation; (iv) with the updated network, resample a batch of configurations as the new training set, which now follow the current joint probability q_θ; (v) repeat the above procedures until the loss function converges; (vi) sample an ensemble of independent configurations from q_θ, site by site, at once; (vii) calculate the thermodynamic observables. On a square lattice with N sites, convergence is reached when the change of the variational energy per site is much smaller than the upper bound, i.e., δF_q ≪ 4J/N Chung (1999). A schematic training loop implementing steps (i)-(v) is sketched below.
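In this sketch, `CAN`, `net.sample`, and `net.log_prob` are hypothetical interfaces standing in for the masked network, its site-by-site sampler, and the log-probability of Eq. (5); `xy_energy` and `variational_loss` are the sketches above.

```python
import math
import torch

net = CAN()                                          # hypothetical CAN model
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
beta, L, batch_size = 1.0, 16, 100

for step in range(10000):
    with torch.no_grad():
        s = net.sample(batch_size)                   # (i)/(iv): s ~ q_theta, in [0, 1]
    log_q = net.log_prob(s)                          # (ii): ln q_theta(s)
    energy = xy_energy(2 * math.pi * s.squeeze(1))   # rescale to angles in [0, 2*pi)
    loss = variational_loss(log_q, energy, beta)     # Eq. (4) estimator
    opt.zero_grad()
    loss.backward()                                  # (iii): update the weights
    opt.step()
    with torch.no_grad():                            # (v): monitor F_q per site, Eq. (3)
        f_q = ((energy + log_q / beta).mean() / L**2).item()
```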

Rediscovery of the KT Transition.—We now study the KT transition and calculate the thermodynamic quantities of the 2D XY model using CANs. In the calculations, the default width and depth of the network adopted in the CANs are (32, 3), with N = L² lattice sites as the input. The multi-channel feature is used to construct a mixture of beta distributions, which makes the networks more expressive Salimans et al. (2017). The Adam optimizer is applied to minimize the loss function in PyTorch. An implementation of the CANs is available on GitHub Wang (2020), and the corresponding hyper-parameters are consistent with previous works Wu et al. (2019).

Figure 2: The free energy per site of the 2D XY model on a square lattice from CANs.
Figure 3: The energy per site as a function of β for lattice size L = 16, from CANs with a 100-channel mixture of beta distributions. The shaded area denotes the statistical error from thermal fluctuations. The result from MCMC is shown for comparison. The inset shows the number of vortex pairs per site extracted from CANs.

The thermodynamic observables of the 2D XY model have been computed with the MCMC method Gupta et al. (1988); Hasenbusch (2005); Komura and Okabe (2012); Weber and Minnhagen (1988), so it is interesting to compare the results from CANs and MCMC. The advantage of CANs is that the free energy per site can be calculated directly. It is presented in Fig. 2 for three different lattice sizes, L = 4, 8, 16. The results for L = 8 and L = 16 indicate that the free energy converges rapidly with increasing lattice size, which ensures that finite-size effects are under control. In the following discussions we use the results for L = 16.

Fig. 3 shows the energy per site as a function of β obtained from CANs and MCMC with the same number of configurations. Here the MCMC calculation was implemented with a classical algorithm Tobochnik and Chester (1979); Teitel and Jayaprakash (1983). In the low-temperature regime (β > 1), the result from CANs agrees with that from MCMC within the statistical error from thermal fluctuations. In the high-temperature regime (β < 1), however, the two results do not match perfectly. This deviation can be understood from the comparison of the number of free vortices (or anti-vortices) shown in Table 1 (a vortex-counting sketch follows the table). Since a larger number of free vortices and anti-vortices indicates a larger entropy in the XY model, configurations with more vorticity lead to a lower free energy. This is why the energy per site from CANs is larger than that from MCMC at β < 0.7: the entropy from the free vortices and anti-vortices balances the free energy. The situation is reversed at 0.7 < β < 1 for the same reason. The rapid increase of the number of free vortices per site around β ∼ 1 indicates a topological phase transition, i.e., the KT transition.

Table 1: The energy and the number of free vortices (or anti-vortices) per site extracted from CANs and MCMC. The results from CANs are obtained from an ensemble average over 1000 configurations.

β               0.4      0.6      0.8      1.0      1.2
Energy   MCMC  -0.424   -0.682   -0.996   -1.336   -1.502
         CANs  -0.384   -0.588   -1.017   -1.346   -1.497
Vortices MCMC   0.114    0.080    0.042    0.010    0.002
         CANs   0.121    0.096    0.038    0.007    0.001
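The vortex numbers in Table 1 can be extracted by summing the phase differences, wrapped into [-π, π), around every elementary plaquette; a winding of ±2π marks a vortex or anti-vortex. The sketch below is our minimal version of this standard construction (function name and conventions are assumptions).

```python
import math
import torch

def vortex_density(phi):
    """Densities of vortices (+1) and anti-vortices (-1) per site for
    angle configurations phi of shape (batch, L, L) on a periodic lattice."""
    wrap = lambda d: (d + math.pi) % (2 * math.pi) - math.pi  # into [-pi, pi)
    p00 = phi
    p10 = torch.roll(phi, -1, dims=2)          # right neighbor
    p11 = torch.roll(p10, -1, dims=1)          # right-down neighbor
    p01 = torch.roll(phi, -1, dims=1)          # down neighbor
    w = (wrap(p10 - p00) + wrap(p11 - p10)
         + wrap(p01 - p11) + wrap(p00 - p01))  # winding around each plaquette
    q = torch.round(w / (2 * math.pi))         # topological charge -1, 0, +1
    return (q > 0).float().mean().item(), (q < 0).float().mean().item()
```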

The KT transition is a transition from bound vortex-antivortex pairs at low temperatures to unpaired vortices and anti-vortices at high temperatures. In previous works Gupta et al. (1988); Komura and Okabe (2012); Bighin et al. (2019), the KT transition temperature was reported as β_KT ≈ 1.12. In our method, the CANs capture the global properties of the 2D XY model, since the trained neural networks achieve a good evaluation of the free energy. Nevertheless, at high temperature, the disorder of the configurations due to thermal fluctuations results in a slight mismatch between CANs and MCMC. The temperature effect attenuates the long-range correlation exponentially Kosterlitz (1974), which also slightly weakens the expressive ability of CANs for a finite-size system. Additionally, we emphasize that the elementary conditional distribution at each site, f_θ(s_i|s_1, ..., s_{i-1}), should be chosen carefully, since the spin in the XY model is periodically valued. We find that, as a common test, choosing a normal distribution for f_θ(s_i|s_1, ..., s_{i-1}) can hardly make the loss function converge to a reasonable value.

KT Transition Point.—To locate the KT transition point quantitatively, we may introduce the spin stiffness ρ_s, which reflects the change of the free energy in response to an infinitesimally slow twist δφ of the spins. In the continuum limit, it is ρ_s = [∂²F(δφ)/∂(δφ)²]|_{δφ=0}. In practice, we consider a related quantity, the helicity modulus γ(L) Weber and Minnhagen (1988); Gupta et al. (1988); Komura and Okabe (2012); Hasenbusch (2005), which is equivalent to ρ_s in the limit δφ → 0. It can be expressed as

\gamma(L)=-\frac{E}{2L^{2}}-\frac{J\beta}{L^{2}}\left\langle\Big(\sum_{<i,j>}\sin(\phi_{i}-\phi_{j})\,\mathbf{e}_{ij}\cdot\mathbf{x}\Big)^{2}\right\rangle, (6)

where L is the size of the square lattice, e_ij is the unit vector pointing from site j to site i, and x is an arbitrary unit vector in the 2D plane. The Kosterlitz renormalization group Kosterlitz (1974) predicts that γ(L → ∞) jumps from the value 2k_B T_c/π to zero at the critical temperature; hence the helicity modulus gives a reliable prediction of the KT transition point. A measurement sketch is given below.
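Given an ensemble of configurations sampled from the trained network, Eq. (6) can be estimated directly. In the sketch below, x is taken along the horizontal lattice direction, so e_ij · x singles out the horizontal bonds; `xy_energy` is the earlier sketch, and the function itself is our assumption about how such a measurement could be coded.

```python
import torch

def helicity_modulus(phi, beta, J=1.0):
    """Ensemble estimate of Eq. (6) for angles phi of shape (batch, L, L),
    with the unit vector x chosen along the horizontal lattice axis."""
    L = phi.shape[-1]
    energy = xy_energy(phi, J)                                   # E for each sample
    s_x = torch.sin(torch.roll(phi, -1, dims=2) - phi).sum(dim=(1, 2))
    return (-energy.mean() / (2 * L**2)
            - J * beta * (s_x ** 2).mean() / L**2).item()
```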

Figure 4: The helicity modulus as a function of β for lattice size L = 16, obtained from CANs with a 100-channel mixture of beta distributions. The crossing point with the curve 2k_B T/π gives the transition point β_KT ≈ 1.10.

In Fig. 4, we show the helicity modulus for a large lattice size, L = 16. The markers are the numerical results evaluated from CANs with several multi-channel mixtures of beta distributions, and the dashed lines are the corresponding interpolation curves. With an increasing number of channels, the crossing point with the curve 2k_B T/π predicts the KT transition temperature. For a sufficiently large number of channels (100), we observe that the crossing point is located at β ≈ 1.10. Since the helicity modulus depends on a correlation function that requires higher-order statistics than the energy (i.e., the mean square of the energy), we only consider here a lattice size large enough to give a precise evaluation of γ. The result is consistent with that from standard Monte Carlo simulations Komura and Okabe (2012); Beach et al. (2018).

We find that the computational time cost is around 0.2 seconds per training step and is almost independent of the temperature. This suggests that Critical Slowing Down (CSD) is avoided. Even though the burden of the increasing auto-correlation time Goodman and Sokal (1989); Urban and Pawlowski (2018) in MCMC does not appear in CANs, the relation between the training time and the lattice size should be mentioned here. Recording the training time per step in CANs with the default network setup for different lattice sizes at the KT transition temperature, we find that as a function of the lattice size L it is well fitted by t_train(L) = a L^b, with a = 0.0026 and b = 1.708. The cost of bypassing the CSD is thus a training time with an approximately polynomial dependence on L. However, this cost is a one-off, and the subsequent configuration sampling can exploit the parallelism of GPUs for generating a large ensemble. Therefore, a more powerful GPU can reduce the time cost efficiently. As a reference, all results presented in this work were obtained on an NVIDIA RTX 2080 GPU.
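As a quick numerical check, the fitted scaling law can be evaluated at a few lattice sizes; the coefficients below are the ones quoted above, and the printed times are extrapolations of that fit, not new measurements.

```python
a, b = 0.0026, 1.708   # fit coefficients quoted in the text

def t_train(L):
    """Seconds per training step at lattice size L under t_train(L) = a * L**b."""
    return a * L ** b

for L in (4, 8, 16, 32):
    print(f"L = {L:2d}: ~{t_train(L):.2f} s per training step")
```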

Summary.—We have proposed continuous-mixture autoregressive networks (CANs) for the efficient variational calculation of many-body systems with continuous degrees of freedom. Specifically, a CAN can be designed as an Ansatz for the variational approach to the topological phase transition in the 2D XY model. The CANs learn to construct the microscopic states of the system, in which vortices and anti-vortices emerge automatically. We can evaluate the energy and the vorticity using configurations generated by the networks and compare them with the results from MCMC calculations. The autoregressive structure of the neural networks is beneficial for studying long-range correlations even beyond the phase transition point. It sheds light on the investigation of more latent topological structures, such as coupled XY models Bighin et al. (2019) and twisted bilayer graphene Julku et al. (2020), where novel long-range correlations may emerge. Moreover, a straightforward determination of the KT transition point with CANs is shown to be consistent with previous works using standard Monte Carlo methods.

Although the increasing time cost with the lattice size is unavoidable, searching for the critical point becomes more economical in the limit of a large ensemble of configurations because of the direct GPU usage. With the help of CANs, the CSD problem occurring in MCMC is expected to be remarkably alleviated, which may reduce the computational difficulties in more complicated many-body systems, e.g., lattice simulations for the possible critical end point in the QCD phase diagram. The φ⁴ model could be a practical next step Urban and Pawlowski (2018), where the computational accuracy and practicability could be rigorously examined. The deep learning approach, especially with well-designed neural networks, can therefore be applied to specific physical problems Mehta and Schwab (2014); Iten et al. (2020). This inspires us to explore these techniques from a more physical viewpoint, which will help open the black boxes of both deep learning and nature.

Acknowledgements.
We thank Giuseppe Carleo and Junwei Liu for useful discussions. This work is supported by the BMBF under the ErUM-Data project (K. Z.), the AI grant of SAMSON AG, Frankfurt (K. Z.), the National Natural Science Foundation of China with Grant No. 11875002 (Y. J.) and No. 11775123 (L. H. and L. W.), the Zhuobai Program of Beihang University (Y. J.), and the National Key R&D Program of China with Grant No. 2018YFA0306503 (L. H.). K. Z. also thanks NVIDIA Corporation for the donation of GPUs.

References