
$\alpha$-divergence improves the entropy production estimation via machine learning

Euijoon Kwon Department of Physics and Astronomy & Center for Theoretical Physics, Seoul National University, Seoul 08826, Republic of Korea    Yongjoo Baek y.baek@snu.ac.kr Department of Physics and Astronomy & Center for Theoretical Physics, Seoul National University, Seoul 08826, Republic of Korea
Abstract

Recent years have seen a surge of interest in the algorithmic estimation of stochastic entropy production (EP) from trajectory data via machine learning. A crucial element of such algorithms is the identification of a loss function whose minimization guarantees accurate EP estimation. In this study, we show that there exists a host of loss functions, namely those implementing a variational representation of the $\alpha$-divergence, which can be used for EP estimation. By fixing $\alpha$ to a value between $-1$ and $0$, the $\alpha$-NEEP (Neural Estimator for Entropy Production) exhibits a much more robust performance against strong nonequilibrium driving or slow dynamics, which adversely affects the existing method based on the Kullback-Leibler divergence ($\alpha = 0$). In particular, the choice $\alpha = -0.5$ tends to yield the optimal results. To corroborate our findings, we present an exactly solvable simplification of the EP estimation problem, whose loss function landscape and stochastic properties give deeper intuition into the robustness of the $\alpha$-NEEP.

I Introduction

How irreversible does a process look? One may pose this question for two distinct reasons. First, whether a biological process requires energy dissipation is often a subject of much debate [1, 2]. To resolve this issue, it is useful to note that irreversibility suggests energy dissipation. Various hallmarks of irreversibility, such as the breaking of the fluctuation-dissipation theorem [3] and the presence of nonequilibrium probability currents in the phase space [4, 5], have been used to determine whether energy is dissipated. Second, whether a nonequilibrium system allows for an effective equilibrium description is an important issue. For instance, in active matter, despite the energy dissipation at the microscopic level, it has been argued that the large-scale phenomena allow for an effective equilibrium description [6, 7, 8, 9, 10]. If we can quantify the irreversibility of an empirical process at various levels of coarse-graining [11, 12], it will provide us with helpful clues as to whether we should look for an effective equilibrium theory for the process.

Based on the framework of stochastic thermodynamics, modern thermodynamics assigns entropy production (EP) to each stochastic trajectory based on its irreversibility [13]. Thus, empirically measuring the irreversibility of a process is closely tied to the problem of estimating EP from sampled trajectories [14, 15, 16, 17, 18, 19, 20, 21]. A straightforward approach to the problem is to evaluate the relevant transition probabilities by directly counting the number of trajectory segments, which is called the plug-in method [14, 15]. The method, readily applicable to discrete systems, can also be applied to continuous systems through the use of kernel functions [16]. However, while this method is simple and intuitive, it requires a huge ensemble of lengthy trajectories for accurate estimations (curse of dimensionality). More recent studies proposed methods based on universal lower bounds of the average EP, such as the thermodynamic uncertainty relations [16, 17, 18, 19] and the entropic bound [20]. While these methods do not suffer from the curse of dimensionality and are applicable even to non-stationary processes [19, 20], their accuracy is impaired when the underlying bounds are not tight. Moreover, these methods are applicable only to the estimation of the average EP, not the EP of each trajectory.

Meanwhile, with the advent of machine learning techniques in physics, a novel method for EP estimation using artificial neural networks has been developed [21]. This method, called the Neural Estimator for Entropy Production (NEEP), minimizes the loss function based on a variational representation of the Kullback-Leibler (KL) divergence. Without any presupposed discretization of the phase space and using the rich expressivity of neural networks, the NEEP suffers far less from the complications of the sampling issues and is applicable to a diverse range of stochastic processes [19].

Still, the NEEP has its limits. Its accuracy deteriorates when the nonequilibrium driving is strong or when the dynamics slows down so that the phase space is poorly sampled. In this study, we show that the NEEP can be significantly improved by changing the loss function. Toward this purpose, we propose the $\alpha$-NEEP, which generalizes the NEEP. Instead of the KL divergence, the $\alpha$-NEEP utilizes the $\alpha$-divergence, which has mainly been used in the machine learning community [22, 23, 24, 25]. We demonstrate that the $\alpha$-NEEP with nonzero values of $\alpha$ shows much more robust performance over a broader range of nonequilibrium driving strengths and sampling qualities, with $\alpha = -0.5$ showing the optimal performance overall. This is corroborated by an analytically tractable simplification of the $\alpha$-NEEP that shows the optimality of $\alpha = -0.5$.

The rest of this paper is organized as follows. After reviewing the original NEEP and its limitations (Sec. II), we introduce the $\alpha$-NEEP (Sec. III) and demonstrate its enhanced performance for three different examples of nonequilibrium systems (Sec. IV). Then we investigate the rationale behind the observed results using a simplified model describing how the $\alpha$-NEEP works (Sec. V). Finally, we sum up the results and discuss their implications (Sec. VI).

Figure 1: Schematic illustration of the neural-network implementation of the $\alpha$-NEEP.

II Overview of the Original NEEP

We first give a brief overview of how the original NEEP [21] estimates EP at the trajectory level. Suppose our goal is to estimate the EP of a Markov process in discretized time, $\mathbf{x}_t$, in a $d$-dimensional space. For every ordered pair of states, denoted by $\mathbf{x} \equiv (\mathbf{x}_t, \mathbf{x}_{t+1})$, there is an EP associated with the transition between them, given by the logarithm of the ratio between the forward and the backward path probabilities

\Delta S(\mathbf{x}) = \log\frac{p(\mathbf{x})}{p(\tilde{\mathbf{x}})}, \quad (1)

where $\tilde{\mathbf{x}} \equiv (\mathbf{x}_{t+1}, \mathbf{x}_t)$. Note that, throughout this study, we use the unit system in which the Boltzmann constant is set to unity ($k_{\mathrm{B}} = 1$). Then it follows that the ensemble average of this EP is equal to the KL divergence, which satisfies the inequality

\langle\Delta S(\mathbf{x})\rangle = D_{\mathrm{KL}}[p(\mathbf{x}), p(\tilde{\mathbf{x}})] \geq \left\langle \log r(\mathbf{x}) - \frac{p(\tilde{\mathbf{x}})}{p(\mathbf{x})}\, r(\mathbf{x}) + 1 \right\rangle_{p(\mathbf{x})} \quad (2)

for any positive function $r(\mathbf{x})$, where $\langle\cdot\rangle_{p(\mathbf{x})}$ denotes the average with respect to the distribution $p(\mathbf{x})$. This inequality can be proven as follows: since $\log$ is a concave function, the line tangent to it at any point never falls below the function. Thus, $\frac{1}{r_0}(r - r_0) + \log r_0 \geq \log r$ for any $r$ and $r_0$. By putting $r_0 = r_0(\mathbf{x}) = p(\mathbf{x})/p(\tilde{\mathbf{x}})$ and taking the average with respect to $p(\mathbf{x})$, we obtain the inequality. From this derivation, we immediately note that the equality holds if and only if $r(\mathbf{x}) = r_0(\mathbf{x}) = p(\mathbf{x})/p(\tilde{\mathbf{x}})$. Hence, by varying $r(\mathbf{x})$ to maximize the right-hand side of Eq. (2), we accurately estimate the average EP $\langle\Delta S(\mathbf{x})\rangle$. For this reason, Eq. (2) is called the variational representation of the KL divergence. Moreover, as a byproduct, we also obtain the function $r_0(\mathbf{x})$, which yields an accurate estimate of the trajectory-level EP via $\Delta S(\mathbf{x}) = \log r_0(\mathbf{x})$.

Kim et al. [21] used these properties to construct the loss function of the NEEP. More specifically, they introduce $s_\theta(\mathbf{x})$, an estimator for the trajectory-level EP parametrized by $\theta$, and put $r(\mathbf{x}) = e^{s_\theta(\mathbf{x})}$. Then, Eq. (2) can be rewritten as

\langle\Delta S(\mathbf{x})\rangle \geq \left\langle s_\theta(\mathbf{x}) - e^{s_\theta(\tilde{\mathbf{x}})} + 1 \right\rangle_{p(\mathbf{x})}, \quad (3)

where $\langle r(\mathbf{x})\, p(\tilde{\mathbf{x}})/p(\mathbf{x})\rangle_{p(\mathbf{x})} = \langle r(\mathbf{x})\rangle_{p(\tilde{\mathbf{x}})} = \langle r(\tilde{\mathbf{x}})\rangle_{p(\mathbf{x})}$ has been used based on the one-to-one correspondence between $\mathbf{x}$ and $\tilde{\mathbf{x}}$. Furthermore, since EP is odd under time reversal, i.e., $\Delta S(\mathbf{x}) = -\Delta S(\tilde{\mathbf{x}})$, it is natural to impose the same condition on $s_\theta$. This leads to the inequality

\langle\Delta S(\mathbf{x})\rangle \geq \left\langle s_\theta(\mathbf{x}) - e^{-s_\theta(\mathbf{x})} + 1 \right\rangle_{p(\mathbf{x})}, \quad (4)

which motivates the loss function

\mathcal{L}(\theta) = \left\langle -s_\theta(\mathbf{x}) + e^{-s_\theta(\mathbf{x})} - 1 \right\rangle_{p(\mathbf{x})}, \quad (5)

so that the minimization of $\mathcal{L}(\theta)$ ensures the accurate EP estimation $s_\theta(\mathbf{x}) = \Delta S(\mathbf{x})$.
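For concreteness, the loss in Eq. (5) is straightforward to evaluate on a minibatch of transitions. The following is a minimal PyTorch sketch of our own (not the reference implementation of [21]), assuming a tensor s that holds the values $s_\theta(\mathbf{x})$ for the sampled transitions:

    import torch

    def neep_loss(s: torch.Tensor) -> torch.Tensor:
        """Empirical estimate of Eq. (5): <-s + exp(-s) - 1>_{p(x)}.

        `s` holds s_theta(x) for a minibatch of transitions (illustrative setup).
        Minimizing this loss is equivalent to maximizing the bound in Eq. (4).
        """
        return (-s + torch.exp(-s) - 1.0).mean()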

It is notable that $\mathcal{L}(\theta)$ defined above is a convex functional of $s_\theta$. Thus, as long as the $\theta$-dependence of $s_\theta$ is well behaved, any gradient-descent algorithm can reach the global minimum of $\mathcal{L}(\theta)$ without getting trapped in a local minimum. In this regard, a rugged loss function landscape is not a major issue for the NEEP.

However, the performance of the NEEP strongly depends on how well $p(\mathbf{x})$ is sampled. Since the second term of $\mathcal{L}(\theta)$ depends exponentially on $s_\theta(\mathbf{x})$, rare transitions with minute $p(\mathbf{x})$ can make non-negligible contributions to $\mathcal{L}(\theta)$ when $e^{-s_\theta(\mathbf{x})}$ is extremely large. Since the frequency of rare events is subject to considerable sampling noise, the performance of the original NEEP deteriorates in the presence of a strong nonequilibrium driving, which induces rare transitions with large negative EP. In the following section, we propose a loss function that remedies this weakness of the NEEP.

III Formulation of the $\alpha$-NEEP

Here we formulate a generalization of the NEEP loss function with the goal of mitigating its strong sampling-noise dependence. We note that the loss function need not be an estimator of the average EP $\langle\Delta S(\mathbf{x})\rangle$, for our goal is to estimate $\Delta S(\mathbf{x})$ at the level of each trajectory. Thus, while the original NEEP uses the variational representation of the KL divergence corresponding to $\langle\Delta S(\mathbf{x})\rangle$, we propose a different approach based on the variational representation of the $\alpha$-divergence, which quantifies the difference between a pair of probability distributions $p(\mathbf{x})$ and $q(\mathbf{x})$ as

D_\alpha[p\!:\!q] \equiv \left\langle \frac{[p(\mathbf{x})/q(\mathbf{x})]^\alpha - 1}{\alpha(1+\alpha)} \right\rangle_{p(\mathbf{x})}. \quad (6)

Since this reduces to the KL divergence in the limit $\alpha \to 0$, our approach generalizes the NEEP by introducing an extra parameter $\alpha$. To emphasize this aspect, we term our method the $\alpha$-NEEP.

The goal of the $\alpha$-NEEP is to find the $r(\mathbf{x})$ that minimizes the loss function

\mathcal{L}_\alpha[r] \equiv \left\langle -\frac{r(\mathbf{x})^\alpha - 1}{\alpha} + \frac{q(\mathbf{x})}{p(\mathbf{x})}\, \frac{r(\mathbf{x})^{1+\alpha} - 1}{1+\alpha} \right\rangle_{p(\mathbf{x})}, \quad (7)

where $p(\mathbf{x})$ and $q(\mathbf{x})$ are probability density functions, and $\alpha$ is a real number other than $0$ and $-1$. See Appendix B for discussions of these two exceptional cases. It can be rigorously shown (see Appendix B) that $\mathcal{L}_\alpha[r]$ satisfies the inequality

\mathcal{L}_\alpha[r] \geq -D_\alpha[p\!:\!q], \quad (8)

where the equality is achieved if and only if $r(\mathbf{x}) = p(\mathbf{x})/q(\mathbf{x})$ for all $\mathbf{x}$. In other words, by minimizing $\mathcal{L}_\alpha[r]$ to find $D_\alpha[p\!:\!q]$, we also obtain an estimate of the ratio $p(\mathbf{x})/q(\mathbf{x})$. We note that the properties of $\mathcal{L}_\alpha[r]$ used here are also valid for a much more general class of loss functions, as discussed in [22, 23] (see also Appendix B).

Based on Eq. (8), we can construct a loss function

\mathcal{L}_\alpha(\theta) = \left\langle -\frac{e^{\alpha s_\theta(\mathbf{x})} - 1}{\alpha} + \frac{e^{-(1+\alpha)s_\theta(\mathbf{x})} - 1}{1+\alpha} \right\rangle. \quad (9)

Note that this reduces to the loss function of the original NEEP shown in Eq. (5) in the limit $\alpha \to 0$. If $s_\theta$ is sufficiently well behaved, the minimization of $\mathcal{L}_\alpha(\theta)$ yields a minimizer $\theta^*$ which satisfies $\mathcal{L}_\alpha(\theta^*) = -D_\alpha[p(\mathbf{x}), p(\tilde{\mathbf{x}})]$ and $\Delta S(\mathbf{x}) = s_{\theta^*}(\mathbf{x})$. The former is generally not equal to the average EP $\langle\Delta S(\mathbf{x})\rangle$ (unless $\alpha \to 0$), but the latter ensures the accurate estimation of the trajectory-level EP $\Delta S(\mathbf{x})$.

Comparing Eqs. (5) and (9), one readily observes that the exponential dependence on $s_\theta(\mathbf{x})$ can be made much weaker in $\mathcal{L}_\alpha(\theta)$ by choosing a value of $\alpha$ between $-1$ and $0$. Since this mitigates the detrimental effects of the sampling error associated with rare trajectories with large negative $s_\theta(\mathbf{x})$, one can naturally expect the performance of the $\alpha$-NEEP to be much more robust against strong nonequilibrium driving. This is confirmed in the following sections.

Before proceeding, a few remarks are in order:

1. The loss function satisfies $\mathcal{L}_{-(1+\alpha)}(\theta) = \mathcal{L}_\alpha(\theta)$, so the $\alpha$-NEEP is symmetric under the exchange $\alpha \leftrightarrow -(1+\alpha)$. For this reason, in the rest of this paper, we focus on the regime $-0.5 \leq \alpha \leq 0$ (the regime $\alpha > 0$ leads to very poor performance and is left out).

2. From the antisymmetry $\Delta S(\mathbf{x}) = -\Delta S(\tilde{\mathbf{x}})$, we may set the estimator $s_\theta$ to be related to the feedforward neural network (FNN) output $h_\theta$ as

s_\theta(\mathbf{x}) = h_\theta(\mathbf{x}) - h_\theta(\tilde{\mathbf{x}}), \quad (10)

so that the neural network focuses on the estimators that satisfy the antisymmetry of EP for more efficient training. The method described so far is schematically illustrated in Fig. 1; a code sketch is given after this list.

3. We emphasize that the minimized $\mathcal{L}_\alpha$ is not directly related to the average EP. In all cases, we compute the average EP by averaging $s_\theta$ over the sampled transitions.
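To make the above remarks concrete, here is a minimal PyTorch sketch of the $\alpha$-NEEP: an antisymmetric estimator built from an FNN output as in Eq. (10), trained on the empirical version of the loss in Eq. (9). All names and hyperparameters here are illustrative assumptions, not the authors' reference code.

    import torch
    import torch.nn as nn

    class AlphaNEEP(nn.Module):
        """s_theta(x) = h_theta(x_t, x_{t+1}) - h_theta(x_{t+1}, x_t), Eq. (10)."""

        def __init__(self, dim: int, hidden: int = 512):
            super().__init__()
            self.h = nn.Sequential(
                nn.Linear(2 * dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x_t: torch.Tensor, x_next: torch.Tensor) -> torch.Tensor:
            fwd = self.h(torch.cat([x_t, x_next], dim=-1))
            bwd = self.h(torch.cat([x_next, x_t], dim=-1))
            return (fwd - bwd).squeeze(-1)  # antisymmetric by construction

    def alpha_neep_loss(s: torch.Tensor, alpha: float = -0.5) -> torch.Tensor:
        """Empirical version of Eq. (9); requires alpha != 0 and alpha != -1."""
        return (-(torch.exp(alpha * s) - 1.0) / alpha
                + (torch.exp(-(1.0 + alpha) * s) - 1.0) / (1.0 + alpha)).mean()

    # One training step on a minibatch (x_t, x_next) of transition pairs:
    # model = AlphaNEEP(dim=2); opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    # loss = alpha_neep_loss(model(x_t, x_next)); opt.zero_grad(); loss.backward(); opt.step()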

IV Examples

To assess the performance of the $\alpha$-NEEP for various values of $\alpha$, we apply the method to toy models of nonequilibrium systems, namely the two-bead model, the Brownian gyrator, and the driven Brownian particle.

Figure 2: (a) Illustration of the two-bead model. (b) Mean square error (MSE) of the EP estimate for various temperature differences. (c) Ratio between the estimated value $\sigma_{\mathrm{pred}}$ and the true value $\sigma$ of the average EP for the two-bead model. The temperature of the cold bath is fixed at $T_{\mathrm{c}} = 1$. Each data point and error bar are obtained from $40$ independent trainings.

(i) The two-bead model. This model has been used in a number of previous studies as a benchmark for testing EP estimators [4, 16, 18, 21]. The model consists of two one-dimensional (1D) overdamped beads which are connected to each other and to the walls on both sides by identical springs; see Fig. 2(a). The beads are in contact with heat baths at temperatures $T_{\mathrm{h}}$ and $T_{\mathrm{c}}$ with $T_{\mathrm{h}} > T_{\mathrm{c}}$. Denoting by $x_{\mathrm{h}}$ ($x_{\mathrm{c}}$) the position of the bead in contact with the hot (cold) bath, the stochastic equations of motion are given by

\gamma\dot{x}_{\mathrm{h}} = k(-2x_{\mathrm{h}} + x_{\mathrm{c}}) + \sqrt{2\gamma T_{\mathrm{h}}}\,\xi_{\mathrm{h}}(t), \quad (11a)
\gamma\dot{x}_{\mathrm{c}} = k(-2x_{\mathrm{c}} + x_{\mathrm{h}}) + \sqrt{2\gamma T_{\mathrm{c}}}\,\xi_{\mathrm{c}}(t). \quad (11b)

Here $k$ is the spring constant, $\gamma$ the friction coefficient, and $\xi_{\mathrm{h}}$, $\xi_{\mathrm{c}}$ the Gaussian thermal noises with zero mean and $\langle\xi_{\mathrm{h}}(t)\xi_{\mathrm{h}}(t')\rangle = \langle\xi_{\mathrm{c}}(t)\xi_{\mathrm{c}}(t')\rangle = \delta(t - t')$. For infinitesimal displacements $(dx_{\mathrm{h}}, dx_{\mathrm{c}})$, the associated EP is given by

\Delta S = \frac{k}{\gamma}\left[\frac{2x_{\mathrm{h}} - x_{\mathrm{c}}}{T_{\mathrm{h}}} \circ dx_{\mathrm{h}} + \frac{2x_{\mathrm{c}} - x_{\mathrm{h}}}{T_{\mathrm{c}}} \circ dx_{\mathrm{c}}\right] + \Delta S_{\mathrm{sys}}, \quad (12)

where $\circ$ denotes the Stratonovich product and $\Delta S_{\mathrm{sys}}$ the change of the system's Shannon entropy, namely

\Delta S_{\mathrm{sys}} = -\ln\frac{p_{\mathrm{s}}(x_{\mathrm{h}} + dx_{\mathrm{h}},\, x_{\mathrm{c}} + dx_{\mathrm{c}})}{p_{\mathrm{s}}(x_{\mathrm{h}}, x_{\mathrm{c}})} \quad (13)

for the steady-state distribution $p_{\mathrm{s}}(x_{\mathrm{h}}, x_{\mathrm{c}})$. Since the system is fully linear, $p_{\mathrm{s}}(x_{\mathrm{h}}, x_{\mathrm{c}})$ can be calculated analytically. Thus the EP of this model can be calculated exactly using Eq. (12) and compared with the $\alpha$-NEEP result.
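As an illustration of how trajectory data and the ground truth of Eq. (12) can be generated, the sketch below integrates Eqs. (11a) and (11b) with the Euler-Maruyama scheme and evaluates the bath contribution to the EP with the Stratonovich midpoint rule. The parameter values are assumptions for illustration; $\Delta S_{\mathrm{sys}}$ would be added using the analytically known Gaussian $p_{\mathrm{s}}$.

    import numpy as np

    rng = np.random.default_rng(0)
    k, gamma, Th, Tc = 1.0, 1.0, 10.0, 1.0   # assumed parameter values
    dt, n_steps = 1e-2, 100_000

    x = np.zeros(2)                           # (x_h, x_c)
    traj = np.empty((n_steps + 1, 2))
    traj[0] = x
    for t in range(n_steps):                  # Euler-Maruyama integration of Eqs. (11)
        drift = np.array([k * (-2 * x[0] + x[1]),
                          k * (-2 * x[1] + x[0])]) / gamma
        noise = np.sqrt(2 * np.array([Th, Tc]) * dt / gamma) * rng.standard_normal(2)
        x = x + drift * dt + noise
        traj[t + 1] = x

    # Bath part of Eq. (12), with the Stratonovich product evaluated at midpoints
    dx = traj[1:] - traj[:-1]
    mid = 0.5 * (traj[1:] + traj[:-1])
    dS_bath = (k / gamma) * ((2 * mid[:, 0] - mid[:, 1]) / Th * dx[:, 0]
                             + (2 * mid[:, 1] - mid[:, 0]) / Tc * dx[:, 1])
    # The transition pairs (traj[t], traj[t+1]) serve as training data for the alpha-NEEP.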

To see how the predicted EP differs from the true EP, we observe the behavior of the mean square error (MSE) $\langle(s_\theta - \Delta S)^2\rangle$. In Fig. 2(b), we observe that strengthening the nonequilibrium driving (by increasing $T_{\mathrm{h}}$ while keeping $T_{\mathrm{c}} = 1$) tends to impair the EP estimation. This is because a stronger driving makes the reverse trajectories of typical trajectories rarer, lowering the sample quality. The adverse effects of the nonequilibrium driving are strongest for the original NEEP ($\alpha = 0$) and are mitigated by choosing nonzero values of $\alpha$. Remarkably, choosing $\alpha = -0.5$ leads to the most robust performance against the driving.

As an alternative measure of the estimator's performance, we also observe the ratio between the predicted average EP $\sigma_{\mathrm{pred}}$ and the exact average EP $\sigma$. The results, shown in Fig. 2(c), exhibit two different regimes. As $T_{\mathrm{h}}$ increases, there is a regime where the estimator overestimates the average EP, followed by an underestimation regime. A detailed explanation of this behavior will be given in Sec. V using a simplified model. For the moment, we note that $\sigma_{\mathrm{pred}}/\sigma$ tends to deviate from $1$ most strongly for the original NEEP ($\alpha = 0$), while nonzero values of $\alpha$ keep the ratio closer to $1$. Again, the optimal value of $\alpha$ appears to be $-0.5$.

Figure 3: (a) Illustration of the Brownian gyrator. Circles represent the equipotential lines, and the dashed arrows indicate the directions of the nonconservative driving. (b) MSE of the EP estimate for the Brownian gyrator model as the magnitude of the nonconservative force, $\varepsilon = -\delta$, is varied. (c) Ratio between the estimated value $\sigma_{\mathrm{pred}}$ and the true value $\sigma$ of the average EP for the Brownian gyrator. Temperatures are fixed at $T_{\mathrm{h}} = 10$ and $T_{\mathrm{c}} = 1$. Each data point and error bar are obtained from $40$ independent trainings.

(ii) The Brownian gyrator. This simple model of a single-particle heat engine allows us to check the effects of a nonequilibrium driving distinct from the temperature difference $T_{\mathrm{h}} > T_{\mathrm{c}}$. The dynamics of the model is governed by

\gamma\dot{x}_{\mathrm{h}} = -\partial_{x_{\mathrm{h}}} U(x_{\mathrm{h}}, x_{\mathrm{c}}) + \varepsilon x_{\mathrm{c}} + \sqrt{2\gamma T_{\mathrm{h}}}\,\xi_{\mathrm{h}}(t), \quad (14a)
\gamma\dot{x}_{\mathrm{c}} = -\partial_{x_{\mathrm{c}}} U(x_{\mathrm{h}}, x_{\mathrm{c}}) + \delta x_{\mathrm{h}} + \sqrt{2\gamma T_{\mathrm{c}}}\,\xi_{\mathrm{c}}(t), \quad (14b)

where $U(x_{\mathrm{h}}, x_{\mathrm{c}}) = \frac{1}{2}k(x_{\mathrm{h}}^2 + x_{\mathrm{c}}^2)$ is the harmonic potential, and $(\varepsilon x_{\mathrm{c}}, \delta x_{\mathrm{h}})$ is a nonconservative force that drives the system out of equilibrium and enables work extraction. See Fig. 3(a) for an illustration of this system. For infinitesimal displacements $(dx_{\mathrm{h}}, dx_{\mathrm{c}})$, the associated EP is given by

\Delta S = -\frac{Q_{\mathrm{h}}}{T_{\mathrm{h}}} - \frac{Q_{\mathrm{c}}}{T_{\mathrm{c}}} + \Delta S_{\mathrm{sys}}, \quad (15)

where

Q_{\mathrm{h}} = (\partial_{x_{\mathrm{h}}} U - \varepsilon x_{\mathrm{c}}) \circ dx_{\mathrm{h}}, \quad (16a)
Q_{\mathrm{c}} = (\partial_{x_{\mathrm{c}}} U - \delta x_{\mathrm{h}}) \circ dx_{\mathrm{c}}, \quad (16b)

and $\Delta S_{\mathrm{sys}}$ the change of the system entropy. Again, the system is fully linear, and the steady-state distribution can be calculated analytically, allowing exact calculation of the EP at the trajectory level.

Setting $T_{\mathrm{h}}/T_{\mathrm{c}} = 10$ and $\varepsilon = -\delta$, we vary the magnitude of $\varepsilon$ to assess the robustness of the $\alpha$-NEEP in terms of the MSE and the ratio $\sigma_{\mathrm{pred}}/\sigma$, as shown in Figs. 3(b) and (c), respectively. The results are qualitatively similar to those for the two-bead model: as the nonconservative driving gets stronger, the performance of the original NEEP ($\alpha = 0$) deteriorates the most, while nonzero values of $\alpha$ yield more robust results. Again, $\alpha = -0.5$ appears to be the optimal choice.

(iii) The driven Brownian particle. While the two examples above are both linear systems, we also consider a nonlinear system: a 1D overdamped Brownian particle in a periodic potential $U(x) = A\sin x$ driven by a constant force $f$. The motion of the particle is described by the Langevin equation

\gamma\dot{x} = f - U'(x) + \sqrt{2\gamma T}\,\xi(t), \quad (17)

where $\xi(t)$ is a Gaussian white noise with unit variance. See Fig. 4(a) for an illustration of the model. For sufficiently large $A$, this model approximates the behavior of a Markov jump process on a discrete chain. For this model, the EP associated with an infinitesimal displacement $dx$ is given by

\Delta S = \frac{-f\,dx + U(x + dx) - U(x)}{T} + \Delta S_{\mathrm{sys}}, \quad (18)

where $\Delta S_{\mathrm{sys}} = -\ln[p_{\mathrm{s}}(x + dx)/p_{\mathrm{s}}(x)]$ again denotes the Shannon entropy change for the steady-state distribution $p_{\mathrm{s}}(x)$. Since the system is 1D, it is straightforward to obtain $p_{\mathrm{s}}(x)$ by numerical integration. Thus, the EP of this model can also be calculated exactly and compared with the $\alpha$-NEEP result.
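As a sketch of this numerical integration step, the stationary density of Eq. (17) with periodic boundary conditions can be obtained from the standard tilted-periodic-potential formula $p_{\mathrm{s}}(x) \propto e^{-\phi(x)/T}\int_x^{x+2\pi} dy\, e^{\phi(y)/T}$ with $\phi(x) = A\sin x - fx$ (a textbook result; see, e.g., Risken). The code below is our own illustrative implementation under this assumption, with assumed parameter values.

    import numpy as np

    A, f, T = 2.0, 32.0, 1.0                 # assumed parameter values
    L = 2.0 * np.pi
    x = np.linspace(0.0, L, 2000, endpoint=False)

    def phi(y):
        return A * np.sin(y) - f * y         # tilted periodic potential

    # p_s(x) ∝ ∫_x^{x+L} dy exp([phi(y) - phi(x)] / T); writing the exponent as a
    # difference keeps the integrand bounded even for a strong tilt f.
    p_s = np.empty_like(x)
    for i, xi in enumerate(x):
        y = np.linspace(xi, xi + L, 4000)
        p_s[i] = np.sum(np.exp((phi(y) - phi(xi)) / T)) * (y[1] - y[0])

    p_s /= np.sum(p_s) * (x[1] - x[0])       # normalize on [0, 2*pi)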

Figure 4: (a) Illustration of the driven Brownian particle. (b) MSE of the EP estimate for the driven Brownian particle as the potential depth $A$ is varied. (c) Ratio between the estimated value $\sigma_{\mathrm{pred}}$ and the true value $\sigma$ of the average EP for the driven Brownian particle. The strength of the nonequilibrium driving is fixed at $f = 32$ and the temperature at $T = 1$. Each data point and error bar are obtained from $40$ independent trainings.

Fixing $f = 32$, the performance of the $\alpha$-NEEP for this model is shown in Figs. 4(b) and (c) in terms of the MSE and the ratio $\sigma_{\mathrm{pred}}/\sigma$, respectively. Due to the presence of the strong background driving ($f = 32$), there are already considerable differences among the methods at $A = 0$. Still, it is worth noting that increasing the amplitude $A$ of the periodic potential $U(x)$ clearly increases the MSE and makes $\sigma_{\mathrm{pred}}/\sigma$ deviate farther from $1$ for the original NEEP ($\alpha = 0$). This is likely a consequence of rarer movements across the system (jumps from one potential well to the next) as the potential wells get deeper, which means rare trajectories are even more poorly sampled. The $\alpha$-NEEP with nonzero values of $\alpha$ is much more robust against the increase of $A$, with $\alpha = -0.5$ showing the best performance overall.

Figure 5: Performance of the exactly solvable one-parameter model. (a) Shift $\Delta\theta$ of the loss function minimum as a function of the truncation parameter $k$. Circles are results obtained by numerical minimization, and solid lines are from the small-$1/k$ expansion. (Inset) Loss function landscapes, with circles indicating the minima. We fixed $\mu = 3$, $\sigma = 1$, and $\alpha = 0$. (b) Ratio of the estimated minimum $\theta^*$ to the true minimum $\theta_0$ as the bias $\mu$ is varied. The optimal points are calculated using the criterion that the loss function gradient satisfies $|\partial_\theta\mathcal{L}_\alpha(\theta)| < 10^{-3}$ for the first time as $\theta$ increases from $0$. We fixed $k = 4$ and $\sigma = 1$. (Inset) Loss function landscape. Open diamonds indicate the true minima $\theta_0$, and filled diamonds the estimated minima $\theta^*$. The parameters $\alpha = -0.5$ and $k = 4$ are fixed. (c) MSE of $\theta$. The vertical dashed line shows that the error is minimized at $\alpha = -0.5$. (d) Distribution of the loss function gradient $\partial_\theta\mathcal{L}_\alpha$ at the minimum $\theta_0 = 2$ for $\mu = \sigma = 1$.

V Simple Gaussian Model

The results shown thus far clearly indicate that, by choosing a nonzero value of $\alpha$, the $\alpha$-NEEP can exhibit a much more robust performance against the adverse effects of the nonequilibrium driving. Moreover, $\alpha = -0.5$ seems to exhibit the best performance in many cases. To gain more intuition into these results, we simplify the EP estimation problem to a density-ratio estimation problem for a 1D random variable. To be specific, we estimate the log ratio $s(x) \equiv \log[p(x)/p(-x)]$ given samples drawn from the distribution $p(x)$. This problem is structurally equivalent to EP estimation, with the reflection $x \to -x$ playing the role of time reversal.

For further simplification, we set

p(x) = \begin{cases} \mathcal{N}\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] & \text{for } |x - \mu| \leq k\sigma, \\ 0 & \text{otherwise}. \end{cases} \quad (19)

Here $\mathcal{N}$ is a suitable normalization factor, $\mu$ the positive mean of the distribution, $\sigma$ the width of the distribution, and $k$ a positive number truncating the tails of the distribution. While $k = \infty$ corresponds to the perfect sampling of a Gaussian distribution, a finite $k$ corresponds to the case where the tails of the distribution are poorly sampled.

For $k = \infty$, the correct answer to the problem is the linear function $s(x) = \theta_0 x$, where $\theta_0 \equiv 2\mu/\sigma^2$. Thus, for further simplicity, we focus on the one-parameter model $s_\theta(x) = \theta x$, which estimates $s(x)$ using only a single parameter $\theta$. For this problem, the suitable loss function is obtained as an analog of Eq. (9):

\mathcal{L}_\alpha(\theta) = \left\langle -\frac{(1+\alpha)e^{\alpha\theta x} - 1}{\alpha} + e^{-(1+\alpha)\theta x} \right\rangle_{p(x)}. \quad (20)

If $k$ is large but finite, the minimum of this loss function shifts to $\theta_0 + \Delta\theta$, where $\Delta\theta$ can be expanded to leading order in $1/k$:

\Delta\theta \sim \exp\left[-\frac{k^2}{2} + \frac{2\mu}{\sigma}\left(\left|\alpha + \frac{1}{2}\right| + \frac{1}{2}\right)k\right]. \quad (21)

This clearly shows that $\alpha = -0.5$ gives the smallest shift $\Delta\theta$, as also illustrated by the various results shown in Fig. 5.
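The shift can be checked directly by minimizing Eq. (20) under the truncated Gaussian of Eq. (19). Below is a minimal numerical sketch of our own (quadrature for the average, bounded scalar minimization for $\theta^*$; all parameter values are assumptions):

    import numpy as np
    from scipy.optimize import minimize_scalar

    mu, sigma, k = 3.0, 1.0, 4.0
    theta0 = 2.0 * mu / sigma**2                   # true minimum for k = infinity

    def loss(theta, alpha, n=20001):
        # quadrature average of Eq. (20) over the truncated Gaussian, Eq. (19)
        x = np.linspace(mu - k * sigma, mu + k * sigma, n)
        w = np.exp(-(x - mu) ** 2 / (2.0 * sigma**2))
        w /= w.sum()
        val = (-((1.0 + alpha) * np.exp(alpha * theta * x) - 1.0) / alpha
               + np.exp(-(1.0 + alpha) * theta * x))
        return np.dot(w, val)

    for alpha in (-0.5, -0.25, -0.1):
        res = minimize_scalar(loss, bounds=(0.5 * theta0, 1.5 * theta0),
                              args=(alpha,), method="bounded")
        print(alpha, res.x - theta0)               # shift Delta(theta), cf. Eq. (21)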

In Fig. 5(a), we show that the shift of the minimum $\Delta\theta$ tends to increase as the tail sampling becomes poorer (i.e., as $k$ decreases). The landscapes of the loss function $\mathcal{L}_\alpha(\theta)$, shown in the inset of Fig. 5(a), also confirm this observation. The increase of the error with the potential depth $A$ seen in Figs. 4(b) and 4(c) may primarily be due to the same effect.

In Fig. 5(b), we plot the ratio between the estimated minimum $\theta^*$ and the true minimum $\theta_0$ as a function of the mean $\mu$, which is an analog of the nonequilibrium driving. We note that here $\theta^*$ is the lowest value of $\theta$ at which the slope of the loss function becomes less than $10^{-3}$. We observe that an overestimation regime ($\theta^*/\theta_0 > 1$) crosses over to an underestimation regime ($\theta^*/\theta_0 < 1$) as $\mu$ grows. This is in striking agreement with the trends shown in Fig. 2(c). The reason why $\theta^*$ underestimates $\theta_0$ for large $\mu$ can be understood from the flattened loss function landscapes shown in the inset of Fig. 5(b). In this regime, the dynamics of $\theta$ (starting from $\theta = 0$) slows down, ending up at a value (filled diamonds) even lower than $\theta_0$ (open diamonds). This effect is due to the samples with $x < 0$ vanishing when $\mu$ is too large. We expect that a similar mechanism is at play behind the observed behavior of $\sigma_{\mathrm{pred}}/\sigma$ shown in Fig. 2(c). If we had used a broader range of nonequilibrium driving, the same behavior might have been observed for the other models as well, although this remains to be checked.

The one-parameter model also allows us to examine the effects of a finite minibatch size $M$. While the ideal loss function is given in Eq. (20), the loss function used in the actual training is

\mathcal{L}_\alpha(\theta; M) = \frac{1}{M}\sum_{i=1}^{M}\left[-\frac{(1+\alpha)e^{\alpha\theta X_i} - 1}{\alpha} + e^{-(1+\alpha)\theta X_i}\right], \quad (22)

where $X_1, \ldots, X_M$ are i.i.d. Gaussian random variables of mean $\mu$ and variance $\sigma^2$. When $M$ is large but finite, using the central limit theorem (CLT), the gradient of this loss function can be approximated as [26, 27]

\frac{\partial\mathcal{L}_\alpha}{\partial\theta}\bigg|_{\theta=\theta_t} = \bar{K}(\theta_t - \theta_0) + \sqrt{\frac{\Lambda}{M}}\,N_t + o(M^{-1/2}), \quad (23)

where $N_t$ is a standard Gaussian random variable, $\theta_0 = \mathrm{argmin}\,\langle\mathcal{L}_\alpha(\theta)\rangle$, $\bar{K} = \partial_\theta^2\langle\mathcal{L}_\alpha\rangle\big|_{\theta=\theta_0}$, and $\Lambda = \mathrm{Var}\big[\partial_\theta\mathcal{L}_\alpha\big|_{\theta=\theta_0,\,M=1}\big]$. When the stochastic gradient descent $\theta_{t+1} = \theta_t - \lambda(\partial\mathcal{L}_\alpha/\partial\theta)\big|_{\theta=\theta_t}$ reaches its steady state, the MSE of $\theta$ is given by

\langle(\theta - \theta_0)^2\rangle_{\mathrm{s}} = \frac{\lambda\Lambda}{M\bar{K}(2 - \lambda\bar{K})} + o(M^{-1}). \quad (24)

This leading-order behavior is shown in Fig. 5(c) for various values of $\mu$. In all cases, the MSE of $\theta$ is minimized at $\alpha = -0.5$, which is consistent with the smallest error bars observed at $\alpha = -0.5$ in Figs. 2-4. Hence, $\alpha = -0.5$ yields the most consistent EP estimator.

Direct measurements of the loss function gradient at the minimum also confirm the above result. As shown in Fig. 5(d), the gradient $\partial_\theta\mathcal{L}_\alpha$ is far more broadly distributed for $\alpha = 0$ than for $\alpha = -0.5$. Moreover, due to the subleading (beyond-CLT) effects of finite $M$, the gradient for $\alpha = 0$ features a large skewness. These observations show that the training dynamics of the original NEEP ($\alpha = 0$) tends to be far more volatile and unstable than that of the $\alpha$-NEEP with $\alpha = -0.5$.
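The steady-state formula, Eq. (24), can also be checked against a direct simulation of the stochastic gradient descent on the one-parameter model. A minimal sketch follows; the learning rate, minibatch size, and iteration counts are assumed values.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, alpha = 1.0, 1.0, -0.5
    theta0 = 2.0 * mu / sigma**2
    lam, M, n_iter, burn_in = 1e-2, 256, 50_000, 10_000

    theta, samples = theta0, []
    for t in range(n_iter):
        X = rng.normal(mu, sigma, size=M)          # minibatch of Gaussian samples
        # gradient of the empirical loss, Eq. (22), with respect to theta
        g = np.mean(-(1.0 + alpha) * X * (np.exp(alpha * theta * X)
                                          + np.exp(-(1.0 + alpha) * theta * X)))
        theta -= lam * g                           # stochastic gradient descent step
        if t >= burn_in:
            samples.append(theta)

    mse = np.mean((np.asarray(samples) - theta0) ** 2)
    print(mse)                                     # compare with Eq. (24)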

VI Summary and outlook

We proposed the $\alpha$-NEEP, a generalization of the NEEP for estimating steady-state EP at the trajectory level. By choosing a value of $\alpha$ between $-1$ and $0$, the $\alpha$-NEEP weakens the exponential dependence of the loss function on the EP estimator, effectively mitigating the adverse effects induced by poor sampling of transitions associated with large negative EP in the presence of strong nonequilibrium driving and/or deep potential wells. We also observed that $\alpha = -0.5$ tends to exhibit the optimal performance, which can be understood via a simplification of the original EP estimation problem whose loss function landscape and relaxation properties are analytically tractable. The $\alpha$-NEEP thus provides a powerful method for estimating the EP over a much broader range of nonequilibrium driving forces and dynamical time scales. The identification of even better loss functions and the optimization of other hyperparameters (network size, number of iterations, etc.) are left for future works. It would also be interesting to apply the $\alpha$-NEEP to estimating the EP of Brownian movies [28] and of stochastic systems with odd-parity variables [29], which have been studied using the original NEEP method.

Acknowledgments. — This work was supported by the POSCO Science Fellowship of the POSCO TJ Park Foundation. E.K. and Y.B. also thank Junghyo Jo and Sangyun Lee for helpful comments.

Figure 6: Coefficient of determination $R^2$ for (a) the two-bead model with $T_{\mathrm{c}} = 1$ and (b) the driven Brownian particle with $f = 32$. $10^4$ trajectories are used for each minibatch, and error bars indicate the standard deviations obtained from $40$ independent trainings.
Figure 7: Effects of the minibatch size on the performance of the $\alpha$-NEEP for the two-bead model with $T_{\mathrm{h}} = 1000$ and $T_{\mathrm{c}} = 1$. Error bars indicate the standard deviations obtained from $40$ independent trainings.

Appendix A Training details

We always use a fully connected network (FCN) with three hidden layers, each composed of $512$ nodes, with the ReLU activation function. Each training dataset consists of $10^6$ trajectories. The network parameters are updated using the Adam optimizer, with the learning rate fixed at $10^{-5}$ and the weight decay at $5\times 10^{-5}$. We halt the training after $10000$ iterations, except for the results shown in Figs. 8 and 9 (see Appendix C), where we continue the training for longer to check the overfitting effects. All trainings are done with PyTorch on an NVIDIA GeForce RTX 3090.

In subfigure (b) of Figs. 2-4, each minibatch consists of $10^4$ trajectories. On the other hand, in subfigure (c) of Figs. 2-4, each minibatch consists of $10^5$ trajectories.

Appendix B Density ratio estimation via $f$-divergence

Here we show that the loss function $\mathcal{L}_\alpha[r]$ given in Eq. (7), whose minimization allows us to estimate the ratio between two probability density functions, can be generalized even further using the concept of the $f$-divergence. Consider a convex, twice-differentiable real-valued function $f(u)$. Then, the inequality

-p f'(u) + q\left[u f'(u) - f(u)\right] \geq -q f(p/q) \quad (25)

holds. We can verify this by differentiating the left-hand side (LHS) with respect to $u$, which yields $f''(u)(-p + qu)$. Thus, the LHS has a local minimum at $u = p/q$, and this is the only local minimum since $f$ is convex. In addition, the second derivative of the LHS at $u = p/q$ equals $q f''(p/q)$, which is positive by the convexity. Finally, evaluating the LHS at $u = p/q$ yields $-q f(p/q)$, which is exactly the RHS. This proves the inequality (25).

Using this result, we can design a loss function whose minimum is equal to the negative $f$-divergence between two probability distributions $p(\mathbf{x})$ and $q(\mathbf{x})$. To be specific, for any function $r(\mathbf{x})$, we define

\mathcal{L}_f[r] = \left\langle -f'(r(\mathbf{x})) + \frac{q(\mathbf{x})}{p(\mathbf{x})}\left[r(\mathbf{x})f'(r(\mathbf{x})) - f(r(\mathbf{x}))\right]\right\rangle_{p(\mathbf{x})} = \int d^{2d}\mathbf{x}\,\Big\{-p(\mathbf{x})\,f'(r(\mathbf{x})) + q(\mathbf{x})\left[r(\mathbf{x})f'(r(\mathbf{x})) - f(r(\mathbf{x}))\right]\Big\}. \quad (26)

Applying the inequality (25) to the integrand of Eq. (26), we conclude that

\mathcal{L}_f[r] \geq -\int d^{2d}\mathbf{x}\, q(\mathbf{x})\, f\!\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) = -D_f[p\!:\!q], \quad (27)

where $D_f[p\!:\!q]$ is the $f$-divergence between the distributions $p(\mathbf{x})$ and $q(\mathbf{x})$, and the equality holds if and only if $r(\mathbf{x}) = p(\mathbf{x})/q(\mathbf{x})$ for all $\mathbf{x}$. By minimizing $\mathcal{L}_f[r]$, we can estimate $p(\mathbf{x})/q(\mathbf{x})$ as well as $D_f[p\!:\!q]$.

The loss function $\mathcal{L}_\alpha$ and the associated $\alpha$-divergence discussed in the main text are obtained by choosing the function $f$ to be

f_\alpha(u) = \begin{cases} \dfrac{u^{1+\alpha} - (1+\alpha)u + \alpha}{\alpha(1+\alpha)} & \text{for } \alpha \neq 0, -1, \\ u\log u - u + 1 & \text{for } \alpha = 0, \\ u - 1 - \log u & \text{for } \alpha = -1. \end{cases} \quad (28)

Note that $f_0(u) = \lim_{\alpha\to 0} f_\alpha(u)$ and $f_{-1}(u) = \lim_{\alpha\to -1} f_\alpha(u)$. It is straightforward to obtain Eq. (9) and its extensions to the cases $\alpha = 0$ and $\alpha = -1$ from this choice.
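As a final illustration, the generalized loss of Eq. (26) admits a compact empirical form: using the identity $\langle (q/p)\,g(r)\rangle_{p} = \langle g(r)\rangle_{q}$, its second term can be averaged over samples from $q$ directly. Below is a minimal sketch of our own with the generator $f_\alpha$ of Eq. (28) plugged in:

    import numpy as np

    def f_divergence_loss(r_p, r_q, f, fprime):
        """Empirical L_f[r] of Eq. (26).

        r_p: values of r(x) on samples drawn from p(x)
        r_q: values of r(x) on samples drawn from q(x)
        """
        return np.mean(-fprime(r_p)) + np.mean(r_q * fprime(r_q) - f(r_q))

    # The alpha-divergence generator of Eq. (28) and its derivative (alpha != 0, -1)
    alpha = -0.5
    f = lambda u: (u**(1.0 + alpha) - (1.0 + alpha) * u + alpha) / (alpha * (1.0 + alpha))
    fprime = lambda u: (u**alpha - 1.0) / alpha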

Figure 8: Training dynamics of the $\alpha$-NEEP for the two-bead model at $T_{\mathrm{h}} = 10$. Each minibatch consists of $10^5$ trajectories. The first and third columns show the performance for the training set (blue/dark gray curves) and the test set (orange/light gray curves). The second and fourth columns show the difference in performance between the two datasets. The first (last) two columns correspond to $\alpha = -0.5$ ($\alpha = 0$).
Figure 9: Training dynamics of the $\alpha$-NEEP for the two-bead model at $T_{\mathrm{h}} = 3000$ and $T_{\mathrm{c}} = 1$. Each minibatch consists of $10^5$ trajectories. The first and third columns show the performance for the training set (blue/dark gray curves) and the test set (orange/light gray curves). The second and fourth columns show the difference in performance between the two datasets. The first (last) two columns correspond to $\alpha = -0.5$ ($\alpha = 0$).

Appendix C Extra numerical results

C.1 Coefficient of determination

In the literature, the extent of agreement between a prediction and the true value is often expressed by the coefficient of determination $R^2$. Here we check how the behavior of $R^2$ differs as the value of $\alpha$ changes, for the two-bead model and the driven 1D Brownian particle.

For the two-bead model, as shown in Fig. 6(a), $R^2$ exhibits a nonmonotonic behavior as a function of $T_{\mathrm{h}}$. The decrease of $R^2$ with increasing $T_{\mathrm{h}}$ reflects the deterioration of the $\alpha$-NEEP performance as the nonequilibrium driving gets stronger. Meanwhile, the decrease of $R^2$ as $T_{\mathrm{h}}$ decreases (approaching equilibrium, $T_{\mathrm{h}} = T_{\mathrm{c}} = 1$) is due to the overfitting phenomenon discussed in the next section, which disrupts the linear relationship between the predicted EP and the true EP.

For the driven Brownian particle, as shown in Fig. 6(b), $R^2$ always increases with $A$. This may seem to contradict the way the MSE tends to increase or stay constant with increasing $A$ in Fig. 4(b). Indeed, a higher $R^2$ only means that there is a good linear relationship between the EP estimate $s_\theta$ and the true EP $\Delta S$, not that $s_\theta$ and $\Delta S$ are close to each other. When $A$ is increased, due to the slower dynamics, we may have $s_\theta > \Delta S$ for transitions with positive EP and $s_\theta < \Delta S$ for transitions with negative EP, which can make the linear relationship between $s_\theta$ and $\Delta S$ appear stronger. This example clearly shows that $R^2$ is not an adequate measure of the performance of EP estimators.

C.2 Effects of the minibatch size

A minibatch is the group of samples used to compute the gradient of the loss function at each update. Smaller (larger) minibatches increase (decrease) the noisy component of the gradient, which in turn affects the performance of the $\alpha$-NEEP.

We explicitly check the effects of the minibatch size using the two-bead model with $T_{\mathrm{h}} = 1000$ and $T_{\mathrm{c}} = 1$, as shown in Fig. 7. We use the ratio $\sigma_{\mathrm{pred}}/\sigma$ and the MSE as two different measures of the $\alpha$-NEEP performance. For small minibatches, the highly skewed distribution of the stochastic gradient shown in Fig. 5(d) causes an underestimation of the EP. For large minibatches, the noisy component of the loss function gradient decreases, revealing the properties of the loss function landscape of the training dataset. As discussed using the Gaussian model in Sec. V, the loss function landscape at a moderately strong nonequilibrium driving leads to an overestimation of the EP. Thus, as the minibatch size is increased, $\sigma_{\mathrm{pred}}/\sigma$ grows beyond $1$.

The nonmonotonic behavior of the MSE also hints at the existence of an optimal minibatch size set by the tradeoff between the skewed noise in the gradient (which drives the neural network towards underestimation) and the loss function landscape tilted towards overestimation. For both measures, the superiority of $\alpha = -0.5$ over $\alpha = 0$ is manifest.

C.3 Effects of overfitting

In many cases, when the training continues for too many iterations, artificial neural networks are known to exhibit overfitting. As shown in Figs. 8 and 9, we checked whether the $\alpha$-NEEP is also subject to this phenomenon by continuing the training up to $20000$ iterations. Toward this end, we created two independent datasets of trajectories of the two-bead model, namely the training set and the test set. Only the former was used during the training of the $\alpha$-NEEP, and we measured the MSE and the ratio $\sigma_{\mathrm{pred}}/\sigma$ to assess the performance of the $\alpha$-NEEP on each dataset.

In Fig. 8, we show the results for weak nonequilibrium driving ($T_{\mathrm{h}} = 10$ and $T_{\mathrm{c}} = 1$). The first and third columns show the two measures of performance for the training and the test datasets. Meanwhile, the second and fourth columns show the difference between the corresponding measures for the two datasets. The overfitting is manifest from the increase of the MSE towards the end of the training. Interestingly, overfitting leads to an overestimation of the average EP only for the training dataset. We also note that the value of $\alpha$ is largely irrelevant to the extent of overfitting. This phenomenon can be explained as follows. Near equilibrium, the neural network swiftly reaches the loss function minimum. However, as the training continues, the neural network starts to fit the detailed fluctuations of the training dataset. This makes the functional form of the estimator $s_\theta$ very rough, increasing the MSE for both datasets. But while the neural network now believes all trajectories in the training dataset to be highly irreversible and assigns high EP to them, the EP assigned to the trajectories of the test dataset stays unbiased. Thus, $\sigma_{\mathrm{pred}}/\sigma$ grows larger only for the training dataset.

In Fig. 9, we show the results for strong nonequilibrium driving ($T_{\mathrm{h}} = 3000$ and $T_{\mathrm{c}} = 1$). The subfigures are organized in exactly the same way as in Fig. 8. In this case, the overfitting effects do exist, but they are not as pronounced as in the case of the weaker driving, and the differences between the training and test datasets stay small. Note that the curves for $\alpha = 0$ exhibit strong fluctuations, in agreement with the large fluctuations of the gradient shown in Fig. 5(d).

References