
DiffMD: A Geometric Diffusion Model for Molecular Dynamics Simulations

Fang Wu, Stan Z. Li (corresponding author)
Abstract

Molecular dynamics (MD) has long been the de facto choice for simulating complex atomistic systems from first principles. Recently, deep learning models have become a popular way to accelerate MD. Notwithstanding, existing models depend on intermediate variables such as the potential energy or force fields to update atomic positions, which requires additional computations to perform back-propagation. To waive this requirement, we propose a novel model called DiffMD that directly estimates the gradient of the log density of molecular conformations. DiffMD relies on a score-based denoising diffusion generative model that perturbs the molecular structure with a conditional noise depending on atomic accelerations, and treats conformations at previous timeframes as the prior distribution for sampling. Another challenge in modeling such a conformation generation process is that a molecule is kinetic rather than static, which no prior work has strictly studied. To address this challenge, we propose an equivariant geometric Transformer as the score function in the diffusion process to calculate the corresponding gradients. It incorporates the directions and velocities of atomic motions via 3D spherical Fourier-Bessel representations. With multiple architectural improvements, we outperform state-of-the-art baselines on the MD17 dataset and isomers of $\mathrm{C}_7\mathrm{O}_2\mathrm{H}_{10}$. This work contributes to accelerating material and drug discovery.

Introduction

Molecular dynamics (MD), an in silico tool that simulates complex atomic systems from first principles, has had a dramatic impact on scientific research. Unlike experimental approaches such as X-ray crystallography and cryo-EM, which yield an average structure, MD simulations capture the sequential behavior of molecules in full atomic detail at very fine temporal resolution, and thus allow researchers to quantify how much various regions of a molecule move at equilibrium and what types of structural fluctuations they undergo. In molecular biology and drug discovery, the most basic and intuitive application of MD is to assess the mobility or flexibility of various regions of a molecule. MD substantially accelerates studies that observe biomolecular processes in action, particularly important functional processes such as ligand binding (Shan et al. 2011), ligand- or voltage-induced conformational change (Dror et al. 2011), protein folding (Lindorff-Larsen et al. 2011), and membrane transport (Suomivuori et al. 2017).

Nevertheless, the computational cost of MD generally scales cubically with the number of electronic degrees of freedom. Moreover, important biomolecular processes such as conformational change often take place on timescales longer than those accessible by classical all-atom MD simulations. Although a wide variety of enhanced sampling techniques have been proposed to capture longer-timescale events (Schwantes, McGibbon, and Pande 2014), none is a panacea for timescale limitations, and they may additionally reduce accuracy. There is thus an urgent demand to fundamentally improve the efficiency of MD while preserving accuracy.

Recently, deep learning-based MD (DLMD) models have provided a new paradigm to meet this pressing demand. The accuracy of these models stems not only from the distinctive ability of neural networks to approximate high-dimensional functions, but also from the proper treatment of physical requirements such as symmetry constraints and the concurrent learning scheme that generates a compact training dataset (Jia et al. 2020). Despite their success, current DLMD models primarily suffer from three issues. First, most DLMD models still rely on intermediate variables (e.g., the potential energy) and multiple stages to generate subsequent biomolecular conformations. This substantially raises the computational expenditure and hinders inference efficiency, since the inverse Hessian scales cubically with the number of atom coordinates (Cranmer et al. 2020). Second, existing DLMD models treat the DL module as a black box that predicts atomic attributes and never integrate the neural architecture with the theory of thermodynamics. Last but not least, the majority of prevailing geometric methods (Gilmer et al. 2017; Schütt et al. 2018; Klicpera, Groß, and Günnemann 2020) are designed for immobile molecules and are not suitable for dynamic systems, where the directions and velocities of atomic motions matter.

This paper proposes DiffMD, which aims to address the above issues. First, DiffMD is a one-stage procedure and forecasts simulation trajectories without any dependency on the potential energy or forces. For the second issue, inspired by the consistency between diffusion processes in nonequilibrium thermodynamics and probabilistic generative models (Sohl-Dickstein et al. 2015; Song and Ermon 2019), DiffMD adopts a score-based denoising diffusion generative model (Song et al. 2020b) with an exploration of various stochastic differential equations (SDEs). It sequentially corrupts training data by slowly increasing noise and then learns to reverse this corruption. This generative process closely accords with the enhanced sampling mechanism in MD (Miao, Feher, and McCammon 2015), where a boost potential is added conditionally to smooth the biomolecular potential energy surface and decrease energy barriers. Besides, to make geometric models aware of atom mobility, we propose an equivariant geometric Transformer (EGT) as the score function of our DiffMD. It refines the self-attention mechanism (Vaswani et al. 2017) with 3D spherical Fourier-Bessel representations to incorporate both the intersection and dihedral angles between each pair of atoms and their associated velocities.

We conduct comprehensive experiments on multiple standard MD simulation datasets, including MD17 and the $\mathrm{C}_7\mathrm{O}_2\mathrm{H}_{10}$ isomers. Numerical results demonstrate that DiffMD consistently outperforms state-of-the-art DLMD models by a large margin. This significantly superior performance illustrates the high capability of DiffMD to accurately produce MD trajectories for microscopic systems.

Preliminaries

Background

We consider an MD trajectory of a molecule with $T$ timeframes. $\mathcal{M}^{(t)}=\left(\boldsymbol{x}^{(t)},\boldsymbol{h}^{(t)},\boldsymbol{v}^{(t)}\right)$ denotes the conformation at time $t\in[T]$ and is assumed to have $N$ atoms. Here $\boldsymbol{x}^{(t)}\in\mathbb{R}^{N\times 3}$ and $\boldsymbol{h}^{(t)}\in\mathbb{R}^{N\times\psi_h}$ denote the 3D coordinates and the $\psi_h$-dimensional roto-translationally invariant features (e.g., atom types) associated with each atom, respectively. $\boldsymbol{v}^{(t)}\in\mathbb{R}^{N\times 3}$ corresponds to the atomic velocities. We denote a vector norm by $x=\|\boldsymbol{x}\|_2$, its direction by $\hat{\boldsymbol{x}}=\boldsymbol{x}/x$, and the relative position by $\boldsymbol{x}_{ij}=\boldsymbol{x}_i-\boldsymbol{x}_j$.

Molecular Dynamics

MD with classical potentials.

The fundamental idea behind MD simulations is to study the time-dependent behavior of a microscopic system. MD generates atomic trajectories for a specific interatomic potential under certain initial and boundary conditions. This is obtained by solving the equations of motion given by Newton's second law:

$\boldsymbol{F}_i^{(t)}=m_i\boldsymbol{a}_i^{(t)}=-\dfrac{\partial U\left(\boldsymbol{x}^{(t)}\right)}{\partial\boldsymbol{x}_i^{(t)}},$ (1)

where $\boldsymbol{F}_i^{(t)}$ is the net force acting on the $i$-th atom of the system at the $t$-th timeframe, $\boldsymbol{a}_i^{(t)}$ is the corresponding acceleration, and $m_i$ is the mass. $U(\boldsymbol{x})$ is the potential energy function; the classical force field (FF) definition of this function is given in the Appendix. Numerical methods are then utilized to advance the trajectory over small time increments $\Delta t$ with the assistance of an integrator (see the Appendix for a fuller introduction to MD).

Enhanced sampling in MD.

Enhanced sampling methods have been developed to accelerate MD and retrieve useful thermodynamic and kinetic data (Rocchia, Masetti, and Cavalli 2012). These methods exploit the fact that the free energy is a state function; thus, differences in free energy are independent of the path between states (De Vivo et al. 2016). Several techniques such as free-energy perturbation, umbrella sampling, tempering, and metadynamics are invented to reduce the energy barrier and smooth the potential energy surface (Luo et al. 2020; Liao 2020).

Score-based Generative Model

Score-based generative models (Song et al. 2020b) encompass score matching with Langevin dynamics (Song and Ermon 2019) and denoising diffusion probabilistic modeling (Sohl-Dickstein et al. 2015). They have shown effectiveness in the generation of images (Ho, Jain, and Abbeel 2020) and molecular conformations (Shi et al. 2021).

Diffusion process.

Assume a diffusion process $\{\boldsymbol{x}(s)\}_{s=0}^{S}$ indexed by a continuous time variable $s\in[0,S]$, such that $\boldsymbol{x}(0)\sim p_0$, for which we have a dataset of i.i.d. samples, and $\boldsymbol{x}(S)\sim p_S$, for which we have a tractable form to generate samples efficiently. Let $p_s(\boldsymbol{x})$ be the probability density of $\boldsymbol{x}(s)$, and $p(\boldsymbol{x}(s_1)\mid\boldsymbol{x}(s_0))$ be the transition kernel from $\boldsymbol{x}(s_0)$ to $\boldsymbol{x}(s_1)$, where $0\leq s_0<s_1\leq S$. The diffusion process is then modeled as the solution to an Itô SDE (Song et al. 2020b):

$\mathrm{d}\boldsymbol{x}=f(\boldsymbol{x},s)\,\mathrm{d}s+g(s)\,\mathrm{d}\boldsymbol{w},$ (2)

where $\boldsymbol{w}$ is a standard Wiener process, $f(\cdot,s):\mathbb{R}^d\rightarrow\mathbb{R}^d$ is a vector-valued function called the drift coefficient of $\boldsymbol{x}(s)$, and $g(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ is a scalar function known as the diffusion coefficient of $\boldsymbol{x}(s)$.
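For concreteness, the forward dynamics of Eq. 2 can be simulated with a simple Euler-Maruyama discretization. The sketch below is illustrative rather than part of DiffMD: the drift `f`, diffusion `g`, and step count are placeholders to be supplied by the user.

```python
import torch

def euler_maruyama_forward(x0, f, g, n_steps=1000, S=1.0):
    """Simulate the forward SDE dx = f(x, s) ds + g(s) dw (Eq. 2)
    with the Euler-Maruyama scheme; f and g are user-supplied callables."""
    x = x0.clone()
    ds = S / n_steps
    for i in range(n_steps):
        s = i * ds
        # dx = f(x, s) ds + g(s) * sqrt(ds) * z,  z ~ N(0, I)
        x = x + f(x, s) * ds + g(s) * (ds ** 0.5) * torch.randn_like(x)
    return x

# Example: a variance-exploding SDE with zero drift, f = 0 and g(s) = sigma**s
sigma = 25.0
x0 = torch.randn(8, 3)  # e.g., 8 atoms in 3D
xS = euler_maruyama_forward(
    x0, f=lambda x, s: torch.zeros_like(x), g=lambda s: sigma ** s)
```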

Figure 1: The overall procedure of our DiffMD. Starting from the conformation of the last time step, atomic locations are sequentially updated with the gradient information from the score network.

Reverse process.

By starting from samples $\boldsymbol{x}(S)\sim p_S$ and reversing the diffusion process, we can obtain samples $\boldsymbol{x}(0)\sim p_0$. The reverse-time SDE follows from the result of Anderson (1982) that the reverse of a diffusion process is also a diffusion process:

$\mathrm{d}\boldsymbol{x}=\left[f(\boldsymbol{x},s)-g(s)^2\nabla_{\boldsymbol{x}}\log p_s(\boldsymbol{x})\right]\mathrm{d}s+g(s)\,\mathrm{d}\overline{\boldsymbol{w}},$ (3)

where $\overline{\boldsymbol{w}}$ is a standard Wiener process when time flows backwards from $S$ to $0$, and $\mathrm{d}s$ is an infinitesimal negative timestep. The score of a distribution can be estimated by training a score-based model on samples with score matching (Song and Ermon 2019). To estimate $\nabla_{\boldsymbol{x}}\log p_s(\boldsymbol{x})$, one can train a time-dependent score-based model $\boldsymbol{s}_{\boldsymbol{\vartheta}}(\boldsymbol{x},s)$ via a continuous generalization of the denoising score matching objective (Song et al. 2020b):

$\boldsymbol{\vartheta}^{*}=\underset{\boldsymbol{\vartheta}}{\arg\min}\;\mathbb{E}_{s}\Big\{\lambda(s)\,\mathbb{E}_{\boldsymbol{x}(0)}\mathbb{E}_{\boldsymbol{x}(s)\mid\boldsymbol{x}(0)}\Big[\big\|\boldsymbol{s}_{\boldsymbol{\vartheta}}(\boldsymbol{x}(s),s)-\nabla_{\boldsymbol{x}(s)}\log p_{0s}(\boldsymbol{x}(s)\mid\boldsymbol{x}(0))\big\|_{2}^{2}\Big]\Big\}.$ (4)

Here $\lambda:[0,S]\rightarrow\mathbb{R}^{+}$ is a positive weighting function, $s$ is uniformly sampled over $[0,S]$, $\boldsymbol{x}(0)\sim p_0(\boldsymbol{x})$, and $\boldsymbol{x}(s)\sim p_{0s}(\boldsymbol{x}(s)\mid\boldsymbol{x}(0))$. With sufficient data and model capacity, score matching ensures that the optimal solution to Eq. 4, denoted by $\boldsymbol{s}_{\boldsymbol{\vartheta}^{*}}(\boldsymbol{x},s)$, equals $\nabla_{\boldsymbol{x}}\log p_s(\boldsymbol{x})$ for almost all $\boldsymbol{x}$ and $s$. One typically chooses $\lambda\propto 1/\mathbb{E}\left[\left\|\nabla_{\boldsymbol{x}(s)}\log p_{0s}(\boldsymbol{x}(s)\mid\boldsymbol{x}(0))\right\|_2^2\right]$ (Song et al. 2020b).
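As an illustration of Eq. 4, the following sketch implements one denoising score-matching step for a zero-drift SDE whose transition kernel is $\mathcal{N}(\boldsymbol{x}(0),\sigma(s)^2\boldsymbol{I})$; the network `score_net`, the schedule `sigma_fn`, and the weighting $\lambda(s)=\sigma(s)^2$ are assumptions made for the example, not the paper's exact configuration.

```python
import torch

def dsm_loss(score_net, x0, sigma_fn, eps=1e-5):
    """One denoising score-matching step (cf. Eq. 4) for a zero-drift SDE
    whose transition kernel is N(x(0), sigma(s)^2 I)."""
    B = x0.shape[0]
    s = torch.rand(B, device=x0.device) * (1.0 - eps) + eps  # s ~ U(eps, 1]
    sigma = sigma_fn(s).view(B, *([1] * (x0.dim() - 1)))
    z = torch.randn_like(x0)
    x_s = x0 + sigma * z                 # sample from p_0s(x(s) | x(0))
    target = -z / sigma                  # closed-form conditional score
    per_sample = ((score_net(x_s, s) - target) ** 2).flatten(1).sum(-1)
    return (per_sample * sigma_fn(s) ** 2).mean()  # lambda(s) = sigma(s)^2
```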

DiffMD

Model Overview

Most prior DLMD studies, such as Zhang et al. (2018), rely on the potential energy $U$ as the intermediate variable to acquire atomic forces and update positions, which demands an additional backpropagation calculation and significantly increases the computational costs. Some recent works abandon this two-stage manner and choose the atom-level force $\boldsymbol{F}$ as the prediction target of deep networks (Park et al. 2021). However, they all rely on the integrator of external computational tools to renew the positions in accordance with pre-calculated energies or forces. None embraces a straightforward paradigm that immediately forecasts the 3D coordinates of a microscopic system based on previously available timeframes, i.e., $p\left(\boldsymbol{x}^{(t+1)}\mid\{\mathcal{M}^{(i)}\}_{i=0}^{t}\right)$. To bridge this gap, we seek to generate trajectories without any transitional integrator.

Several MD simulation frameworks assume the Markov property on biomolecular conformational dynamics (Chodera and Noé 2014; Malmstrom et al. 2014) for ease of representation, i.e., $p\left(\boldsymbol{x}^{(t+1)}\mid\{\mathcal{M}^{(i)}\}_{i=0}^{t}\right)=p\left(\boldsymbol{x}^{(t+1)}\mid\mathcal{M}^{(t)}\right)$. We also hold this assumption and aim to estimate the gradient field of the log density of atomic positions at each timeframe, i.e., $\nabla_{\boldsymbol{x}^{(t+1)}}\log p\left(\boldsymbol{x}^{(t+1)}\right)$. In this setting, we design a score network based on the Transformer architecture to learn the scores of the position distribution, i.e., $\boldsymbol{s}_{\boldsymbol{\vartheta}}\left(\mathcal{M}^{(t+1)}\right)=\nabla_{\boldsymbol{x}^{(t+1)}}\log p\left(\boldsymbol{x}^{(t+1)}\right)$. During inference, we regard the conformation of the previous frame $\mathcal{M}^{(t)}$ as the prior distribution, from which $\boldsymbol{x}^{(t+1)}$ is sampled. Since $\boldsymbol{s}_{\boldsymbol{\vartheta}}\left(\mathcal{M}^{(t+1)}\right)\in\mathbb{R}^{N\times 3}$, we formulate this as a node regression problem. The whole procedure of DiffMD is depicted in Fig. 1.

Figure 2: Solving a reverse-time SDE yields a score-based model to predict positions. The circles indicate the atomic locations, and the darkness represents the noise strengths.

Score-based Generative Models for MD

The motivation for extending denoising diffusion models to MD simulations is their resemblance to the enhanced sampling mechanism. Inspired by non-equilibrium statistical physics, these models first systematically and slowly destroy structure in the data distribution through an iterative forward diffusion process and then learn to reverse it, similar to perturbing the free energy of the system and then striving to minimize the overall energy.

Perturbing data conditionally with SDEs.

Our goal is to construct a diffusion process $\left\{\boldsymbol{x}^{(t+1)}(s)\right\}_{s=0}^{S}$ indexed by a continuous time variable $s\in[0,S]$, such that $\boldsymbol{x}^{(t+1)}(0)\sim p_0$ and $\boldsymbol{x}^{(t+1)}(S)\sim p_S$. Here, $p_0$ and $p_S$ are the data distribution and the prior distribution of atomic positions, respectively, as in Eq. 2.

How noise is incorporated remains critical to the success of the generation; it ensures that the resulting distribution does not collapse to a low-dimensional manifold (Song and Ermon 2019). Conventionally, $p_S$ is an unstructured prior distribution, such as a Gaussian with fixed mean and variance (Song et al. 2020b), which is uninformative about $p_0$. This construction of $p_S$ improves sample variety in image generation (Brock, Donahue, and Simonyan 2018), but it may not work well for MD. One reason is that corrupting molecular conformations unconditionally would trigger severe turbulence in the microscopic system; moreover, it ignores the fact that the molecular conformations of neighboring frames $\mathcal{M}^{(t)}$ and $\mathcal{M}^{(t+1)}$ are close to each other, with a divergence that depends on the status of the former. It is therefore necessary to formulate $p_S$ with prior knowledge of $\mathcal{M}^{(t)}$.

To be explicit, the noise does not grow monotonically with $s$ but depends on prior states. This strategy aligns with the Gaussian accelerated MD (GaMD) mechanism (details in the Appendix) and serves as a more practical way to inject turbulence into $p_0$. Driven by the foregoing analysis, we introduce a conditional noise in compliance with the accelerations at prior frames and choose the following SDE:

$\mathrm{d}\boldsymbol{x}^{(t+1)}=\sigma^{s}_{\boldsymbol{a}^{(t)}}\,\mathrm{d}\boldsymbol{w},\quad s\in[0,1],$ (5)

where the noise term $\sigma^{s}_{\boldsymbol{a}^{(t)}}$ is dynamically adjusted as:

$\sigma^{s}_{\boldsymbol{a}^{(t)}}=\sigma_{s}\,\eta_{\sigma}\left(\left\|\boldsymbol{a}^{(t-1)}\right\|_{2}^{2}-\bar{a}\right)^{2},\quad\left\|\boldsymbol{a}^{(t-1)}\right\|_{2}^{2}<\bar{a},$ (6)

where $\eta_{\sigma}$ is the harmonic acceleration constant and $\bar{a}$ represents the acceleration threshold. Once the system shows a slow variation trend of motion (i.e., the systemic energy is low), it is supplied with a large level of noise, and vice versa. Thus, the conditional noise is inversely related to $\boldsymbol{a}^{(t-1)}$ and inherits the merits of enhanced sampling.
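A minimal sketch of the conditional noise level of Eq. 6 follows; the base schedule value `sigma_s` and the constants `eta_sigma` ($\eta_\sigma$) and `a_bar` ($\bar{a}$) are hyperparameters assumed given, and the zero-noise branch above the threshold is our reading of the GaMD analogy (cf. Eq. 23), not an explicit statement in the text.

```python
import torch

def conditional_sigma(a_prev, sigma_s, eta_sigma, a_bar):
    """Acceleration-conditioned noise level of Eq. 6.
    a_prev: (N, 3) accelerations at the previous frame; the boost applies
    only when the squared acceleration norm falls below the threshold."""
    a_sq = (a_prev ** 2).sum()     # ||a^(t-1)||_2^2
    if a_sq >= a_bar:
        return torch.zeros(())     # fast-moving system: no extra noise (assumed)
    return sigma_s * eta_sigma * (a_sq - a_bar) ** 2
```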

Generating samples through reversing the SDE.

Following the reverse-time SDE (Anderson 1982), samples of the next timeframe $\boldsymbol{x}^{(t+1)}$ can be obtained by reversing the diffusion process:

$\mathrm{d}\boldsymbol{x}^{(t+1)}=\left[f\left(\boldsymbol{x}^{(t+1)},s\right)-g(s)^{2}\nabla_{\boldsymbol{x}^{(t+1)}}\log p_{s}\left(\boldsymbol{x}^{(t+1)}\right)\right]\mathrm{d}s+g(s)\,\mathrm{d}\overline{\boldsymbol{w}}.$ (7)

Once the score of each marginal distribution, $\nabla_{\boldsymbol{x}^{(t+1)}}\log p_{s}\left(\boldsymbol{x}^{(t+1)}\right)$, is known for all $s$, we can simulate the reverse diffusion process to sample from $p_0$. The workflow is summarized in Fig. 2.

Estimating scores for the SDE.

Intuitively, the optimal parameters $\boldsymbol{\vartheta}^{*}$ of the conditional score network $\boldsymbol{s}_{\boldsymbol{\vartheta}}\left(\tilde{\mathcal{M}}^{(t+1)}\right)$ can be trained directly by minimizing:

$\mathbb{E}_{s}\Big\{\lambda(s)\,\mathbb{E}_{\boldsymbol{x}^{(t+1)}(0)}\mathbb{E}_{\boldsymbol{x}^{(t+1)}(s)\mid\boldsymbol{x}^{(t+1)}(0)}\Big[\big\|\boldsymbol{s}_{\boldsymbol{\vartheta}}\big(\tilde{\mathcal{M}}^{(t+1)}(s),s\big)-\nabla_{\boldsymbol{x}^{(t+1)}(s)}\log p_{0s}\big(\boldsymbol{x}^{(t+1)}(s)\mid\boldsymbol{x}^{(t+1)}(0)\big)\big\|_{2}^{2}\Big]\Big\}.$ (8)

Here $\boldsymbol{x}^{(t+1)}(0)\sim p_{0}\left(\boldsymbol{x}^{(t+1)}\right)$ and $\boldsymbol{x}^{(t+1)}(s)\sim p_{0s}\left(\boldsymbol{x}^{(t+1)}(s)\mid\boldsymbol{x}^{(t+1)}(0)\right)$. $\tilde{\mathcal{M}}^{(t+1)}$ stands for the disturbed conformation with the noised geometric positions $\tilde{\boldsymbol{x}}^{(t+1)}$. Notably, other score matching objectives, such as sliced score matching (Song et al. 2020a) and finite-difference score matching (Pang et al. 2020), can also be applied here instead of the denoising score matching of Eq. 8.

To efficiently solve Eq. 8, the transition kernel $p_{0s}\left(\boldsymbol{x}^{(t+1)}(s)\mid\boldsymbol{x}^{(t+1)}(0)\right)$ must be known. When $f(\cdot,s)$ is affine, this transition kernel is always a Gaussian distribution, whose mean and variance are available in closed form via standard techniques (Särkkä and Solin 2019):

$p_{0s}\left(\boldsymbol{x}^{(t+1)}(s)\mid\boldsymbol{x}^{(t+1)}(0)\right)=\mathcal{N}\left(\boldsymbol{x}^{(t+1)}(s);\;\boldsymbol{x}^{(t+1)}(0),\;\dfrac{1}{2\log\sigma_{\boldsymbol{a}^{(t)}}}\left(\sigma_{\boldsymbol{a}^{(t)}}^{2s}-1\right)\boldsymbol{I}\right).$ (9)
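Because the kernel of Eq. 9 is Gaussian, perturbed samples can be drawn in closed form without simulating the forward SDE. A sketch, under the assumption $\sigma_{\boldsymbol{a}^{(t)}}>1$ so the variance is positive:

```python
import torch

def perturb(x0, s, sigma_a):
    """Draw x(s) from the closed-form kernel of Eq. 9:
    N(x(0), (sigma_a^{2s} - 1) / (2 log sigma_a) * I), assuming sigma_a > 1."""
    sigma_a = torch.as_tensor(sigma_a)
    var = (sigma_a ** (2.0 * s) - 1.0) / (2.0 * torch.log(sigma_a))
    return x0 + var.sqrt() * torch.randn_like(x0)
```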

Equivariant Geometric Score Network

Equivariance is a ubiquitous symmetry, reflecting the fact that physical laws hold regardless of the coordinate system. Integrating such an inductive bias into model parameterization has shown efficacy in modeling 3D geometry (Köhler, Klein, and Noé 2020; Klicpera, Becker, and Günnemann 2021). Hence, we build the score network $\boldsymbol{s}_{\boldsymbol{\vartheta}}$ to be equivariant to rotation and translation transformations.

Existing equivariant models for molecular representations are static rather than kinetic. In contrast, along MD trajectories each atom has a velocity and a corresponding orientation. Specifically, a pair of atoms $(a,b)$ with respective velocities $\left(\boldsymbol{v}_a^{(t)},\boldsymbol{v}_b^{(t)}\right)$ defines two intersecting planes (see Fig. 3). We denote the angles between the velocities and the line connecting the two atoms by $\varphi_{\boldsymbol{v}_a ab}^{(t)}=\angle\hat{\boldsymbol{v}}_a^{(t)}\hat{\boldsymbol{x}}_{ab}^{(t)}$ and $\varphi_{\boldsymbol{v}_b ba}^{(t)}=\angle\hat{\boldsymbol{v}}_b^{(t)}\hat{\boldsymbol{x}}_{ba}^{(t)}$, and the dihedral angle between the two half-planes by $\theta_{\boldsymbol{v}_a ab\boldsymbol{v}_b}^{(t)}=\angle\hat{\boldsymbol{v}}_a^{(t)}\hat{\boldsymbol{v}}_b^{(t)}\perp\hat{\boldsymbol{x}}_{ab}^{(t)}$. These three angles contain pivotal geometric information for predicting pairwise interactions as well as future positions, so it is necessary to incorporate them into our geometric modeling. Unfortunately, the directions and velocities of atomic motion, unique to dynamic systems, are seldom considered by prior models.

Figure 3: Angles leveraged in our EGT, including two intersection angles, $\varphi_{\boldsymbol{v}_a ab}^{(t)}$ and $\varphi_{\boldsymbol{v}_b ba}^{(t)}$, and one dihedral angle, $\theta_{\boldsymbol{v}_a ab\boldsymbol{v}_b}^{(t)}$.

To this end, we draw inspiration from Equivariant Graph Neural Networks (EGNN) (Satorras, Hoogeboom, and Welling 2021), GemNet (Klicpera, Becker, and Günnemann 2021), and Molformer (Wu et al. 2021), and propose an equivariant geometric Transformer (EGT) as $\boldsymbol{s}_{\boldsymbol{\vartheta}}$. Our EGT is roto-translation equivariant and leverages directional information. The $l$-th equivariant geometric layer (EGL) takes the set of atomic coordinates $\boldsymbol{x}^{(t),l}$, velocities $\boldsymbol{v}^{(t),l}$, and features $\boldsymbol{h}^{(t),l}$ as input, and outputs transformed $\boldsymbol{x}^{(t),l+1}$, $\boldsymbol{v}^{(t),l+1}$, and $\boldsymbol{h}^{(t),l+1}$. Concisely, $\boldsymbol{x}^{(t),l+1},\boldsymbol{v}^{(t),l+1},\boldsymbol{h}^{(t),l+1}=\text{EGL}\left(\boldsymbol{x}^{(t),l},\boldsymbol{v}^{(t),l},\boldsymbol{h}^{(t),l}\right)$.

We first calculate the spherical Fourier-Bessel bases to integrate all available geometric information:

$\tilde{e}_{\mathrm{SBF}_1,omn}^{l}\left(x_{ab}^{(t),l},\varphi_{\boldsymbol{v}_a ab}^{(t),l},\theta_{\boldsymbol{v}_a ab\boldsymbol{v}_b}^{(t),l}\right)=\sqrt{\dfrac{2}{c_{\mathrm{int}}^{3}\,j_{o+1}^{2}\left(z_{on}\right)}}\,j_{o}\!\left(\dfrac{z_{on}}{c_{\mathrm{int}}}x_{ab}^{(t),l}\right)Y_{om}\!\left(\varphi_{\boldsymbol{v}_a ab}^{(t),l},\theta_{\boldsymbol{v}_a ab\boldsymbol{v}_b}^{(t),l}\right),$ (10)
$\tilde{e}_{\mathrm{SBF}_2,omn}^{l}\left(x_{ab}^{(t),l},\varphi_{\boldsymbol{v}_b ba}^{(t),l},\theta_{\boldsymbol{v}_a ab\boldsymbol{v}_b}^{(t),l}\right)=\sqrt{\dfrac{2}{c_{\mathrm{int}}^{3}\,j_{o+1}^{2}\left(z_{on}\right)}}\,j_{o}\!\left(\dfrac{z_{on}}{c_{\mathrm{int}}}x_{ab}^{(t),l}\right)Y_{om}\!\left(\varphi_{\boldsymbol{v}_b ba}^{(t),l},\theta_{\boldsymbol{v}_a ab\boldsymbol{v}_b}^{(t),l}\right),$ (11)

where $o\in[N_{\mathrm{CBF}}]$, $n\in[N_{\mathrm{RBF}}]$, and $m\in[N_{\mathrm{SBF}}]$ control the degree, root, and order of the radial basis functions, respectively. $c_{\mathrm{int}}$ is the interaction cutoff, $j_o$ is the spherical Bessel function of degree $o$, $z_{on}$ is the $n$-th root of the $o$-degree Bessel function, and $Y_{om}$ is the real spherical harmonic of degree $o$ and order $m$. Remarkably, the 3D spherical Fourier-Bessel representations $\tilde{\boldsymbol{e}}_{\mathrm{SBF}_1}$ and $\tilde{\boldsymbol{e}}_{\mathrm{SBF}_2}$ enjoy roto-translation invariance because they exploit only the relative distance and the invariant angles. These directional vectors are then fed into the EGL as:

$\boldsymbol{q}_i=\left[f_q\left(\boldsymbol{h}_i^{(t),l}\right)\oplus\tilde{\boldsymbol{e}}_{\mathrm{SBF}_1}^{l}\right]\boldsymbol{W}_{\mathrm{SBF}_1},$ (12)
$\boldsymbol{k}_i=\left[f_k\left(\boldsymbol{h}_i^{(t),l}\right)\oplus\tilde{\boldsymbol{e}}_{\mathrm{SBF}_2}^{l}\right]\boldsymbol{W}_{\mathrm{SBF}_2},$ (13)
$\boldsymbol{m}_i=f_m\left(\boldsymbol{h}_i^{(t),l}\right),\quad a_{ij}=\boldsymbol{q}_i\boldsymbol{k}_j^{T}/\sqrt{\psi_{\mathrm{att}}},$ (14)
$\boldsymbol{v}_i^{(t),l+1}=f_v\left(\boldsymbol{h}_i^{(t),l}\right)\boldsymbol{v}_i^{(t),l}+\sum_{j=1}^{N}\phi\left(a_{ij}\right)\boldsymbol{x}_{ij}^{(t),l},$ (15)
$\boldsymbol{x}_i^{(t),l+1}=\boldsymbol{x}_i^{(t),l}+\dfrac{1}{L}\boldsymbol{v}_i^{(t),l+1},$ (16)
$\boldsymbol{h}_i^{(t),l+1}=f_h\left(\sum_{j=1}^{N}\phi\left(a_{ij}\right)\boldsymbol{m}_j\right).$ (17)

Here $\oplus$ denotes concatenation and $L$ is the total number of layers in the EGT. $f_q$, $f_k$, and $f_m$ are linear transformations; $f_v$ and $f_h$ are velocity and feature operations, commonly approximated by multi-layer perceptrons (MLPs). $\boldsymbol{q}_i$, $\boldsymbol{k}_i$, and $\boldsymbol{m}_i$ are, respectively, the query, key, and value vectors, all of dimension $\psi_{\mathrm{att}}$. The learnable weight matrices $\boldsymbol{W}_{\mathrm{SBF}_1}$ and $\boldsymbol{W}_{\mathrm{SBF}_2}$ project the concatenated vectors back to dimension $\psi_{\mathrm{att}}$. $a_{ij}$ is the attention that token $i$ pays to token $j$, and $\phi$ denotes the softmax function. Finally, $\boldsymbol{x}^{(t),L}$ at the last layer directly yields the gradient field of positions, i.e., $\nabla_{\boldsymbol{x}^{(t)}}\log p\left(\boldsymbol{x}^{(t)}\right)$.

Note that the EGL breaks the coordinate update into two stages: we first compute the velocity $\boldsymbol{v}_i^{(t),l+1}$ and then use it to update the position $\boldsymbol{x}_i^{(t),l}$. The initial velocity $\boldsymbol{v}_i^{(t),l}$ is scaled by $f_v:\mathbb{R}^{\psi_h}\rightarrow\mathbb{R}$, which maps the feature embedding $\boldsymbol{h}_i^{(t),l}$ to a scalar. After that, the velocity of each atom is updated as a vector field in a radial direction; in other words, $\boldsymbol{v}_i^{(t),l}$ is renewed by the weighted sum of all relative differences $\left\{\boldsymbol{x}_{ij}^{(t),l}\right\}_{j=1}^{N}$, with weights given by the attention scores $\left\{a_{ij}\right\}_{j=1}^{N}$. These attention scores are also used to obtain the new feature $\boldsymbol{h}_i^{(t),l+1}$.
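A condensed PyTorch sketch of one EGL (Eqs. 12-17) is given below. For brevity, the spherical Fourier-Bessel features `e_sbf1`/`e_sbf2` are treated as precomputed per-node tensors, although in the paper they are pairwise quantities, and the single-linear choices for $f_v$ and $f_h$ are illustrative.

```python
import torch
import torch.nn as nn

class EGL(nn.Module):
    """Sketch of one equivariant geometric layer (Eqs. 12-17)."""
    def __init__(self, dim_h, dim_sbf, dim_att, n_layers):
        super().__init__()
        self.L = n_layers
        self.dim_att = dim_att
        self.f_q = nn.Linear(dim_h, dim_att)
        self.f_k = nn.Linear(dim_h, dim_att)
        self.f_m = nn.Linear(dim_h, dim_att)
        self.W_sbf1 = nn.Linear(dim_att + dim_sbf, dim_att, bias=False)
        self.W_sbf2 = nn.Linear(dim_att + dim_sbf, dim_att, bias=False)
        self.f_v = nn.Linear(dim_h, 1)                      # velocity scaling
        self.f_h = nn.Sequential(nn.Linear(dim_att, dim_h), nn.SiLU())

    def forward(self, x, v, h, e_sbf1, e_sbf2):
        # x, v: (N, 3); h: (N, dim_h); e_sbf*: (N, dim_sbf) invariant features
        q = self.W_sbf1(torch.cat([self.f_q(h), e_sbf1], dim=-1))   # Eq. 12
        k = self.W_sbf2(torch.cat([self.f_k(h), e_sbf2], dim=-1))   # Eq. 13
        m = self.f_m(h)                                             # Eq. 14
        att = torch.softmax(q @ k.T / self.dim_att ** 0.5, dim=-1)
        x_rel = x[:, None, :] - x[None, :, :]                       # x_ij
        v_new = self.f_v(h) * v + (att[..., None] * x_rel).sum(1)   # Eq. 15
        x_new = x + v_new / self.L                                  # Eq. 16
        h_new = self.f_h(att @ m)                                   # Eq. 17
        return x_new, v_new, h_new
```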

Analysis on E($n$) equivariance.

We analyze the equivariance properties of our model with respect to E(3) symmetries. That is, the model should be translation equivariant on $\boldsymbol{x}$ for any translation vector, and rotation and reflection equivariant on $\boldsymbol{x}$ for any orthogonal matrix $Q\in\mathbb{R}^{3\times 3}$ and any translation $o\in\mathbb{R}^{1\times 3}$ (broadcast over the $N$ atoms). More formally, our model satisfies (see proof in the Appendix):

$Q\boldsymbol{x}^{(t),l+1}+o,\;Q\boldsymbol{v}^{(t),l+1},\;\boldsymbol{h}^{(t),l+1}=\text{EGL}\left(Q\boldsymbol{x}^{(t),l}+o,\,Q\boldsymbol{v}^{(t),l},\,\boldsymbol{h}^{(t),l}\right).$ (18)
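Eq. 18 can be checked numerically: applying a random rotation and translation to the inputs should commute with the layer. A sanity-test sketch, compatible with the EGL sketch above; the invariant SBF features are passed through unchanged since they do not depend on the global frame.

```python
import torch

def check_equivariance(layer, x, v, h, e1, e2, atol=1e-5):
    """Numerically verify Eq. 18 (a sanity test, not a proof)."""
    Q, _ = torch.linalg.qr(torch.randn(3, 3))   # random orthogonal matrix
    o = torch.randn(1, 3)                       # random translation
    x1, v1, h1 = layer(x @ Q.T + o, v @ Q.T, h, e1, e2)
    x2, v2, h2 = layer(x, v, h, e1, e2)
    assert torch.allclose(x1, x2 @ Q.T + o, atol=atol)  # positions: Q x + o
    assert torch.allclose(v1, v2 @ Q.T, atol=atol)      # velocities: Q v
    assert torch.allclose(h1, h2, atol=atol)            # features: invariant
```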

Trajectory Sampling

After training the time-dependent score-based model $\boldsymbol{s}_{\boldsymbol{\vartheta}}$, we can exploit it to construct the reverse-time SDE and simulate it with numerical methods to generate molecular conformations from $p_0$. As analyzed above, $\boldsymbol{x}^{(t)}$ and $\boldsymbol{x}^{(t+1)}$ are heavily correlated and their divergence is minor. Based on this relationship, instead of using a Gaussian distribution (Song et al. 2020b), we leverage $\boldsymbol{x}^{(t)}$ to approximate the unknown prior distribution $p_S\left(\boldsymbol{x}^{(t+1)}\right)$. That is, we regard $\boldsymbol{x}^{(t)}$ as a perturbed version of $\boldsymbol{x}^{(t+1)}$ and take it as the starting point of the trajectory sampling process.

Numerical solvers provide approximate solutions to SDEs. Many general-purpose numerical methods, such as Euler-Maruyama and stochastic Runge-Kutta methods, apply to the reverse-time SDE for sample generation. In addition to SDE solvers, we can also employ score-based Markov chain Monte Carlo (MCMC) approaches, such as Langevin MCMC or Hamiltonian Monte Carlo, to sample from $p_0$ directly and to correct the solution of a numerical SDE solver. We refer readers to Song et al. (2020b) for more details. The pseudo-code of the whole sampling process is given in Algorithm 1.

Algorithm 1 Sampling Algorithm with Predictor-Corrector.
$N_P$: number of discretization steps for the reverse-time SDE.
$N_C$: number of corrector steps.
Initialize $\boldsymbol{x}^{(t+1)}(N_P)\leftarrow\boldsymbol{x}^{(t)}$
for $i=N_P-1$ to $0$ do
     $\boldsymbol{x}^{(t+1)}(i)\leftarrow\text{Predictor}\left(\boldsymbol{x}^{(t+1)}(i+1)\right)$
     for $j=1$ to $N_C-1$ do
          $\boldsymbol{x}^{(t+1)}(i)\leftarrow\text{Corrector}\left(\boldsymbol{x}^{(t+1)}(i)\right)$
     end for
end for
return $\boldsymbol{x}^{(t+1)}(0)$
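A compact sketch of Algorithm 1 for a zero-drift (variance-exploding) SDE follows, with an Euler-Maruyama predictor and a Langevin corrector; the step sizes, signal-to-noise ratio `snr`, and schedule `sigma_fn` are illustrative choices in the style of Song et al. (2020b), not the paper's exact settings.

```python
import torch

def pc_sample(score_net, x_prev, sigma_fn, n_pred=100, n_corr=1, snr=0.16):
    """Sketch of Algorithm 1: reverse-diffusion predictor plus Langevin
    corrector, started from the previous-frame positions (the prior)."""
    x = x_prev.clone()
    for i in reversed(range(n_pred)):
        s = (i + 1) / n_pred
        sig, sig_prev = sigma_fn(s), sigma_fn(i / n_pred)
        # Predictor: one discretized reverse-SDE step of Eq. 7 with f = 0
        g2 = sig ** 2 - sig_prev ** 2
        x = x + g2 * score_net(x, s) + g2 ** 0.5 * torch.randn_like(x)
        # Corrector: Langevin MCMC at the current noise level
        for _ in range(n_corr):
            grad = score_net(x, s)
            step = 2 * (snr * sig) ** 2      # common step-size heuristic
            x = x + step * grad + (2 * step) ** 0.5 * torch.randn_like(x)
    return x
```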

Experiments

To verify the effectiveness of our DiffMD, we construct the following two tasks and empirically evaluate it on them:

Short-term-to-long-term (S2L) Trajectory Generation.

In this task setting, models are first trained on short-term trajectories and are then required to produce long-term trajectories of the same molecule given the starting conformation $\boldsymbol{x}^{(t_0)}$, i.e., $p\left(\boldsymbol{x}^{(t_n)},\ldots,\boldsymbol{x}^{(t_1)}\mid\boldsymbol{x}^{(t_0)}\right)$. This extrapolation over time examines the model's capacity to generalize from the temporal view.

One-to-others (O2O) Trajectory Generation.

In the O2O task, models are trained on the entire trajectories of some molecules and evaluated on other molecules from scratch. This tests the model's ability to generalize to conformations of different molecules, namely, the discrepancy with respect to molecular types.

Experiment Setup

Evaluation metric.

We adopt the accumulative root-mean-square error (ARMSE) over all snapshots in a given $n$-step time period $\{t_i\}_{i=1}^{n}$ as the evaluation metric: $\text{ARMSE}=\left(\frac{1}{n}\sum_{i=t_1}^{t_n}\left\|\tilde{\boldsymbol{x}}^{(i)}-\boldsymbol{x}^{(i)}\right\|^{2}\right)^{\frac{1}{2}}$, which is roto-translationally invariant.
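For reference, a direct implementation of this metric; alignment of generated and ground-truth frames (which realizes the roto-translational invariance) is assumed to be handled upstream.

```python
import torch

def armse(x_pred, x_true):
    """Accumulative RMSE over an n-step trajectory; shapes: (n, N, 3)."""
    sq = ((x_pred - x_true) ** 2).flatten(1).sum(-1)  # ||x~(i) - x(i)||^2
    return sq.mean().sqrt()
```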

Baselines.

We compare DiffMD with several state-of-the-art methods for MD trajectory prediction. Specifically, Tensor Field Network (TFN) (Thomas et al. 2018) adopts filters built from spherical harmonics to achieve equivariance. Radial Field (RF) is a GNN drawn from Equivariant Flows (Köhler, Klein, and Noé 2019). SE(3)-Transformer (Fuchs et al. 2020) is an equivariant variant of the self-attention module for 3D point clouds. EGNN (Satorras, Hoogeboom, and Welling 2021) learns GNNs equivariant to rotations, translations, reflections, and permutations. GMN (Huang et al. 2022) resorts to generalized coordinates to impose geometric constraints on graphs. SCFNN (Gao and Remsing 2022) is a self-consistent field NN for learning the long-range response of molecular systems. Full experimental details are given in the Appendix.

Short-term-to-long-term Trajectory Generation

Data.

MD17 (Chmiela et al. 2017) contains trajectories of eight thermalized molecules, all calculated at a temperature of 500 K with a resolution of 0.5 femtoseconds (fs). (Both the MD17 and $\mathrm{C}_7\mathrm{O}_2\mathrm{H}_{10}$ datasets are available at http://quantum-machine.org/datasets/.) We use the first 20K frame pairs as the training set and split the next 20K frame pairs equally into validation and test sets. Since MD17 does not include particle velocities, we use $\boldsymbol{v}^{(t)}=\boldsymbol{x}^{(t)}-\boldsymbol{x}^{(t-1)}$ as a substitute, similarly to GMN.

Results.

Table 1 documents the performance of the baselines and our DiffMD on S2L, where the best performance is marked in bold and the second best is underlined for clear comparison. Note that TFN encounters floating-point overflow (denoted as NA). For all eight organic molecules, DiffMD achieves the lowest ARMSE. Moreover, different organic molecules behave differently during MD; in particular, benzene moves more actively than the other molecules, which leads to the highest prediction errors.

Table 1: Extrapolation performance on MD17. Note the extrapolation errors for TFN are not available (NA) due to the floating number overflow.
Methods Aspirin Benzene Ethanol Malonaldehyde Naphthalene Salicylic Toluene Uracil
TFN NA NA NA NA NA NA NA NA
RF 3.707 19.724 5.963 18.532 13.791 2.071 4.052 2.382
SE(3)-Tr. 0.813 2.415 0.678 1.183 1.834 1.230 1.312 0.691
EGNN 0.868 2.518 0.719 0.889 0.484 0.632 1.034 0.464
GMN 0.814 2.528 0.751 0.880 0.832 0.895 1.018 0.494
SCFNN 1.151 2.832 1.084 1.096 0.923 0.918 1.229 0.857
DiffMD 0.648 2.365 0.637 0.784 0.298 0.471 0.820 0.393
Relative Impro. 20.2% 2.1% 5.9% 10.9% 38.4% 25.4% 19.5% 15.4%

One-to-others Trajectory Generation

Data.

$\mathrm{C}_7\mathrm{O}_2\mathrm{H}_{10}$ (Brockherde et al. 2017) is a dataset consisting of the trajectories of 113 randomly selected $\mathrm{C}_7\mathrm{O}_2\mathrm{H}_{10}$ isomers, calculated at a temperature of 100 K with a resolution of 1 fs using density functional theory with the PBE exchange-correlation potential. We select as validation targets the top-5 isomers with the largest ARMSE out of the 113 samples when $\boldsymbol{x}^{(t_0)}$ is used as the prediction for all subsequent timeframes, and take the rest as the training set. As in the MD17 case, we compute the distance vector between neighboring frames as the velocities.

Table 2: Performance on the five isomers in C7O2H10\textrm{C}_{7}\textrm{O}_{2}\textrm{H}_{10}.
Methods ISO_1004 ISO_2134 ISO_2126 ISO_3001 ISO_1007
TFN 7.390 10.990 10.412 4.697 10.677
RF 4.772 4.364 21.576 9.077 11.049
SE(3)-Tr. 5.253 6.186 4.334 5.304 7.514
EGNN 1.142 0.578 0.928 1.017 1.035
GMN 1.205 0.363 0.998 1.053 1.154
SCFNN 1.781 1.693 1.785 2.842 2.264
DiffMD 1.127 0.278 0.919 0.837 0.878
Relative Impro. 1.2% 23.4% 9.7% 9.8% 15.1%

Results.

Table 2 reports the ARMSE of the baselines and our DiffMD on the five isomers from $\mathrm{C}_7\mathrm{O}_2\mathrm{H}_{10}$. DiffMD exceeds all baselines by a large margin for all target molecules. We plot snapshots at different timeframes in the Appendix.

A closer inspection of the generated trajectories shows that several baselines have worse generation quality because their conformations are not geometrically and biologically constrained. On the contrary, conformations generated by models like EGNN and GMN are geometrically valid, but their variations are minute. Interestingly, we find that conformations generated by SCFNN remain unchanged after a few timeframes, which indicates that the network is stuck at a fixed point.

Related Work

Molecular Dynamics with Deep Learning

Recently, various DL models have become easy-to-use tools for facilitating MD with ab initio accuracy. The Behler-Parrinello network (Behler and Parrinello 2007) was one of the first models to learn potential surfaces from MD data. Deep-Potential net (Han et al. 2017) further developed this line by extending to more advanced functions involving two neighbors, while DTNN (Schütt et al. 2017) and SchNet (Schütt et al. 2018) achieve highly competitive prediction performance across the chemical compound space and the configuration space for simulating MD (Noé et al. 2020). However, they all follow the routine of multi-stage simulation and rely on forces or energy as the prediction target. Huang et al. (2022) propose an end-to-end GMN to characterize constrained systems of interacting objects, where molecules are defined as sets of particles rigidly connected with sticks and hinges. However, their experiments are not realistic, and the constraint strongly violates the nature of MD, since no distance between any pair of atoms is fixed.

Conformation Generation.

Researchers are increasingly interested in conformation generation. Some works start from 2D molecular graphs to obtain 3D structures via bi-level programming (Xu et al. 2021b) or continuous flows (Xu et al. 2021a). Others concentrate on inverse design, creating conformations of drug-like molecules or crystals with desired properties (Noh et al. 2019). Recently, Gao and Remsing (2022) proposed an SCFNN that perturbs the positions of Wannier function centers induced by external electric fields. Lately, diffusion models have become a favored choice in conformation generation (Shi et al. 2021). Xu et al. (2022) introduce GeoDiff, which progressively injects and eliminates small noise. However, its perturbations evolve over discrete times; since time is actually continuous, a better approach is to express the dynamics as a set of differential equations. Furthermore, these studies leverage diffusion models to recover conformations from molecular graphs rather than to generate sequential conformations. We fill this gap by applying them to yield MD trajectories.

Conclusion

We propose DiffMD, a novel principle for sequentially generating molecular conformations in MD simulations. DiffMD marries denoising diffusion models with an equivariant geometric Transformer, which enables self-attention to leverage directional information in addition to interatomic distances. Extensive experiments over multiple tasks demonstrate that DiffMD is superior to existing state-of-the-art models. This research may shed light on the acceleration of new drug and material discovery.

References

  • Allen and Tildesley (2012) Allen, M. P.; and Tildesley, D. J. 2012. Computer simulation in chemical physics, volume 397. Springer Science & Business Media.
  • Anderson (1982) Anderson, B. D. 1982. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3): 313–326.
  • Behler and Parrinello (2007) Behler, J.; and Parrinello, M. 2007. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical review letters, 98(14): 146401.
  • Brock, Donahue, and Simonyan (2018) Brock, A.; Donahue, J.; and Simonyan, K. 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
  • Brockherde et al. (2017) Brockherde, F.; Vogt, L.; Li, L.; Tuckerman, M. E.; Burke, K.; and Müller, K.-R. 2017. Bypassing the Kohn-Sham equations with machine learning. Nature communications, 8(1): 1–10.
  • Case et al. (2005) Case, D. A.; Cheatham III, T. E.; Darden, T.; Gohlke, H.; Luo, R.; Merz Jr, K. M.; Onufriev, A.; Simmerling, C.; Wang, B.; and Woods, R. J. 2005. The Amber biomolecular simulation programs. Journal of computational chemistry, 26(16): 1668–1688.
  • Chmiela et al. (2017) Chmiela, S.; Tkatchenko, A.; Sauceda, H. E.; Poltavsky, I.; Schütt, K. T.; and Müller, K.-R. 2017. Machine learning of accurate energy-conserving molecular force fields. Science advances, 3(5): e1603015.
  • Chodera and Noé (2014) Chodera, J. D.; and Noé, F. 2014. Markov state models of biomolecular conformational dynamics. Current opinion in structural biology, 25: 135–144.
  • Cranmer et al. (2020) Cranmer, M.; Greydanus, S.; Hoyer, S.; Battaglia, P.; Spergel, D.; and Ho, S. 2020. Lagrangian neural networks. arXiv preprint arXiv:2003.04630.
  • De Vivo et al. (2016) De Vivo, M.; Masetti, M.; Bottegoni, G.; and Cavalli, A. 2016. Role of molecular dynamics and related methods in drug discovery. Journal of medicinal chemistry, 59(9): 4035–4061.
  • Dror et al. (2011) Dror, R. O.; Arlow, D. H.; Maragakis, P.; Mildorf, T. J.; Pan, A. C.; Xu, H.; Borhani, D. W.; and Shaw, D. E. 2011. Activation mechanism of the β2-adrenergic receptor. Proceedings of the National Academy of Sciences, 108(46): 18684–18689.
  • Frenkel and Smit (2001) Frenkel, D.; and Smit, B. 2001. Understanding molecular simulation: from algorithms to applications, volume 1. Elsevier.
  • Fuchs et al. (2020) Fuchs, F.; Worrall, D.; Fischer, V.; and Welling, M. 2020. Se (3)-transformers: 3d roto-translation equivariant attention networks. Advances in Neural Information Processing Systems, 33: 1970–1981.
  • Gao and Remsing (2022) Gao, A.; and Remsing, R. C. 2022. Self-consistent determination of long-range electrostatics in neural network potentials. Nature Communications, 13(1): 1–11.
  • Gilmer et al. (2017) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In International conference on machine learning, 1263–1272. PMLR.
  • Han et al. (2017) Han, J.; Zhang, L.; Car, R.; et al. 2017. Deep potential: A general representation of a many-body potential energy surface. arXiv preprint arXiv:1707.01478.
  • Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840–6851.
  • Huang et al. (2022) Huang, W.; Han, J.; Rong, Y.; Xu, T.; Sun, F.; and Huang, J. 2022. Equivariant Graph Mechanics Networks with Constraints. In International Conference on Learning Representations.
  • Jia et al. (2020) Jia, W.; Wang, H.; Chen, M.; Lu, D.; Lin, L.; Car, R.; Weinan, E.; and Zhang, L. 2020. Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning. In SC20: International conference for high performance computing, networking, storage and analysis, 1–14. IEEE.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Klicpera, Becker, and Günnemann (2021) Klicpera, J.; Becker, F.; and Günnemann, S. 2021. Gemnet: Universal directional graph neural networks for molecules. Advances in Neural Information Processing Systems, 34.
  • Klicpera, Groß, and Günnemann (2020) Klicpera, J.; Groß, J.; and Günnemann, S. 2020. Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123.
  • Köhler, Klein, and Noé (2019) Köhler, J.; Klein, L.; and Noé, F. 2019. Equivariant flows: sampling configurations for multi-body systems with symmetric energies. arXiv preprint arXiv:1910.00753.
  • Köhler, Klein, and Noé (2020) Köhler, J.; Klein, L.; and Noé, F. 2020. Equivariant flows: exact likelihood generative learning for symmetric densities. In International Conference on Machine Learning, 5361–5370. PMLR.
  • Lambert (1991) Lambert, J. D. 1991. Numerical methods for ordinary differential systems: the initial value problem. John Wiley & Sons, Inc.
  • Liao (2020) Liao, Q. 2020. Enhanced sampling and free energy calculations for protein simulations. In Progress in molecular biology and translational science, volume 170, 177–213. Elsevier.
  • Lindorff-Larsen et al. (2011) Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; and Shaw, D. E. 2011. How fast-folding proteins fold. Science, 334(6055): 517–520.
  • Luo et al. (2020) Luo, R.; Zhang, Q.; Yang, Y.; and Wang, J. 2020. Replica-exchange Nosé-Hoover dynamics for Bayesian learning on large datasets. Advances in Neural Information Processing Systems, 33: 17874–17885.
  • Malmstrom et al. (2014) Malmstrom, R. D.; Lee, C. T.; Van Wart, A. T.; and Amaro, R. E. 2014. Application of molecular-dynamics based markov state models to functional proteins. Journal of chemical theory and computation, 10(7): 2648–2657.
  • Martys and Mountain (1999) Martys, N. S.; and Mountain, R. D. 1999. Velocity Verlet algorithm for dissipative-particle-dynamics-based models of suspensions. Physical Review E, 59(3): 3733.
  • Miao, Bhattarai, and Wang (2020) Miao, Y.; Bhattarai, A.; and Wang, J. 2020. Ligand Gaussian accelerated molecular dynamics (LiGaMD): Characterization of ligand binding thermodynamics and kinetics. Journal of chemical theory and computation, 16(9): 5526–5547.
  • Miao, Feher, and McCammon (2015) Miao, Y.; Feher, V. A.; and McCammon, J. A. 2015. Gaussian accelerated molecular dynamics: unconstrained enhanced sampling and free energy calculation. Journal of chemical theory and computation, 11(8): 3584–3595.
  • Miao et al. (2015) Miao, Y.; Feixas, F.; Eun, C.; and McCammon, J. A. 2015. Accelerated molecular dynamics simulations of protein folding. Journal of computational chemistry, 36(20): 1536–1549.
  • Miao and McCammon (2016) Miao, Y.; and McCammon, J. A. 2016. Graded activation and free energy landscapes of a muscarinic G-protein–coupled receptor. Proceedings of the National Academy of Sciences, 113(43): 12162–12167.
  • Miao and McCammon (2017) Miao, Y.; and McCammon, J. A. 2017. Gaussian accelerated molecular dynamics: Theory, implementation, and applications. In Annual reports in computational chemistry, volume 13, 231–278. Elsevier.
  • Noé et al. (2020) Noé, F.; Tkatchenko, A.; Müller, K.-R.; and Clementi, C. 2020. Machine learning for molecular simulation. Annual review of physical chemistry, 71: 361–390.
  • Noh et al. (2019) Noh, J.; Kim, J.; Stein, H. S.; Sanchez-Lengeling, B.; Gregoire, J. M.; Aspuru-Guzik, A.; and Jung, Y. 2019. Inverse design of solid-state materials via a continuous representation. Matter, 1(5): 1370–1384.
  • Pang et al. (2020) Pang, T.; Xu, K.; Li, C.; Song, Y.; Ermon, S.; and Zhu, J. 2020. Efficient learning of generative models via finite-difference score matching. Advances in Neural Information Processing Systems, 33: 19175–19188.
  • Pang et al. (2017) Pang, Y. T.; Miao, Y.; Wang, Y.; and McCammon, J. A. 2017. Gaussian accelerated molecular dynamics in NAMD. Journal of chemical theory and computation, 13(1): 9–19.
  • Park et al. (2021) Park, C. W.; Kornbluth, M.; Vandermause, J.; Wolverton, C.; Kozinsky, B.; and Mailoa, J. P. 2021. Accurate and scalable graph neural network force field and molecular dynamics with direct force architecture. npj Computational Materials, 7(1): 1–9.
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  • Rocchia, Masetti, and Cavalli (2012) Rocchia, W.; Masetti, M.; and Cavalli, A. 2012. Enhanced sampling methods in drug design. Physico-Chemical and Computational Approaches to Drug Discovery. The Royal Society of Chemistry, 273–301.
  • Särkkä and Solin (2019) Särkkä, S.; and Solin, A. 2019. Applied stochastic differential equations, volume 10. Cambridge University Press.
  • Satorras, Hoogeboom, and Welling (2021) Satorras, V. G.; Hoogeboom, E.; and Welling, M. 2021. E (n) equivariant graph neural networks. In International Conference on Machine Learning, 9323–9332. PMLR.
  • Schlick (2010) Schlick, T. 2010. Molecular modeling and simulation: an interdisciplinary guide, volume 2. Springer.
  • Schütt et al. (2017) Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; and Tkatchenko, A. 2017. Quantum-chemical insights from deep tensor neural networks. Nature communications, 8(1): 1–8.
  • Schütt et al. (2018) Schütt, K. T.; Sauceda, H. E.; Kindermans, P.-J.; Tkatchenko, A.; and Müller, K.-R. 2018. Schnet–a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24): 241722.
  • Schwantes, McGibbon, and Pande (2014) Schwantes, C. R.; McGibbon, R. T.; and Pande, V. S. 2014. Perspective: Markov models for long-timescale biomolecular dynamics. The Journal of chemical physics, 141(9): 09B201_1.
  • Shan et al. (2011) Shan, Y.; Kim, E. T.; Eastwood, M. P.; Dror, R. O.; Seeliger, M. A.; and Shaw, D. E. 2011. How does a drug molecule find its target binding site? Journal of the American Chemical Society, 133(24): 9181–9183.
  • Shi et al. (2021) Shi, C.; Luo, S.; Xu, M.; and Tang, J. 2021. Learning gradient fields for molecular conformation generation. In International Conference on Machine Learning, 9558–9568. PMLR.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2256–2265. PMLR.
  • Song and Ermon (2019) Song, Y.; and Ermon, S. 2019. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.
  • Song et al. (2020a) Song, Y.; Garg, S.; Shi, J.; and Ermon, S. 2020a. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, 574–584. PMLR.
  • Song et al. (2020b) Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; and Poole, B. 2020b. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
  • Suomivuori et al. (2017) Suomivuori, C.-M.; Gamiz-Hernandez, A. P.; Sundholm, D.; and Kaila, V. R. 2017. Energetics and dynamics of a light-driven sodium-pumping rhodopsin. Proceedings of the National Academy of Sciences, 114(27): 7043–7048.
  • Thomas et al. (2018) Thomas, N.; Smidt, T.; Kearnes, S.; Yang, L.; Li, L.; Kohlhoff, K.; and Riley, P. 2018. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219.
  • Tuckerman and Martyna (2000) Tuckerman, M. E.; and Martyna, G. J. 2000. Understanding modern molecular dynamics: Techniques and applications.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wang et al. (2021) Wang, J.; Arantes, P. R.; Bhattarai, A.; Hsu, R. V.; Pawnikar, S.; Huang, Y.-m. M.; Palermo, G.; and Miao, Y. 2021. Gaussian accelerated molecular dynamics: Principles and applications. Wiley Interdisciplinary Reviews: Computational Molecular Science, 11(5): e1521.
  • Wu et al. (2021) Wu, F.; Zhang, Q.; Radev, D.; Cui, J.; Zhang, W.; Xing, H.; Zhang, N.; and Chen, H. 2021. 3D-Transformer: Molecular Representation with Transformer in 3D Space. arXiv preprint arXiv:2110.01191.
  • Xu et al. (2021a) Xu, M.; Luo, S.; Bengio, Y.; Peng, J.; and Tang, J. 2021a. Learning neural generative dynamics for molecular conformation generation. arXiv preprint arXiv:2102.10240.
  • Xu et al. (2021b) Xu, M.; Wang, W.; Luo, S.; Shi, C.; Bengio, Y.; Gomez-Bombarelli, R.; and Tang, J. 2021b. An end-to-end framework for molecular conformation generation via bilevel programming. In International Conference on Machine Learning, 11537–11547. PMLR.
  • Xu et al. (2022) Xu, M.; Yu, L.; Song, Y.; Shi, C.; Ermon, S.; and Tang, J. 2022. GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation. arXiv preprint arXiv:2203.02923.
  • Zhang et al. (2018) Zhang, L.; Han, J.; Wang, H.; Car, R.; and Weinan, E. 2018. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Physical review letters, 120(14): 143001.

Appendix A Molecular Dynamics Simulations

Force Field for MD.

The classic force field for MD simulations is defined as:

$U=\underbrace{\sum_{i\in\mathcal{B}}\frac{c_{l,i}}{2}\left(l_i-l_{0,i}\right)^{2}+\sum_{i\in\mathcal{A}}\frac{c_{\alpha,i}}{2}\left(\alpha_i-\alpha_{0,i}\right)^{2}+\sum_{i\in\mathcal{T}}\left\{\sum_{k}^{M}\frac{U_{ik}}{2}\left[1+\cos\left(c_{ik}\theta_{ik}-\theta_{0,ik}\right)\right]\right\}}_{\text{bonded}}+\underbrace{\sum_{0<i,j\leq N,\,i\neq j}\varepsilon_{ij}\left[\left(\frac{x_{0,ij}}{x_{ij}}\right)^{12}-2\left(\frac{x_{0,ij}}{x_{ij}}\right)^{6}\right]+\sum_{0<i,j\leq N,\,i\neq j}\frac{q_i q_j}{4\pi\varepsilon_0\varepsilon_{\mathrm{r}}x_{ij}}}_{\text{non-bonded}},$ (19)

where $\mathcal{B}$, $\mathcal{A}$, and $\mathcal{T}$ denote the sets of all bonds, angles, and torsions, and $l$, $\alpha$, and $\theta$ correspond to the bond lengths, angles, and dihedral angles, respectively. $l_0$ and $\alpha_0$ are reference values; $c_l$ and $c_\alpha$ are force constants. $c_{ik}$ is a parameter describing the multiplicity of the $k$-th term of the series, $\theta_{0,ik}$ is the corresponding phase angle, and $U_{ik}$ is the energy barrier; the torsion terms are defined by a cosine series of $M$ terms for each dihedral angle $\theta_i$. $x_{ij}$ is the distance between the considered atoms, $\varepsilon_{ij}$ defines the depth of the energy well, and $x_{0,ij}$ is the minimum-energy distance, equal to the sum of the van der Waals radii of the two interacting atoms. $q_i$ and $q_j$ are the partial charges of a pair of atoms. $\varepsilon_0$ stands for the permittivity of free space, and $\varepsilon_{\mathrm{r}}$ is the relative permittivity, which equals one in vacuum.

The first three terms in Eq. 19 represent intramolecular interactions of the atoms. They describe variations in potential energy as a function of bond stretching, bending, and torsion between atoms directly involved in bonding relationships, and are represented as summations over bond lengths ($l$), angles ($\alpha$), and dihedral angles ($\theta$), respectively. The last two terms stand for van der Waals interactions, modeled with the Lennard-Jones potential, and electrostatic interactions, modeled with Coulomb's law. These are denoted as non-bonded terms because they arise from interactions between atoms that are not bonded.
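As a worked example of the non-bonded part of Eq. 19, the sketch below evaluates the Lennard-Jones and Coulomb sums for a small system; the pair parameters `eps_ij` and `x0_ij`, the charges `q`, and the lumped permittivity factor are illustrative inputs with units left abstract.

```python
import numpy as np

def nonbonded_energy(x, eps_ij, x0_ij, q, eps0_epsr=1.0):
    """Non-bonded terms of Eq. 19.  x: (N, 3) coordinates; eps_ij, x0_ij:
    (N, N) symmetric pair parameters; q: (N,) partial charges."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # distances
    np.fill_diagonal(d, np.inf)                 # exclude self-interactions
    lj = eps_ij * ((x0_ij / d) ** 12 - 2 * (x0_ij / d) ** 6)    # Lennard-Jones
    coul = np.outer(q, q) / (4 * np.pi * eps0_epsr * d)         # Coulomb
    return 0.5 * (lj + coul).sum()              # halve the double-counted sum
```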

Integrator.

Numerical methods are then utilized to advance the trajectory over small time increments $\Delta t$ with the assistance of an integrator:

\boldsymbol{x}^{(t_{0})}\rightarrow\boldsymbol{x}^{(t_{0}+\Delta t)}\rightarrow\boldsymbol{x}^{(t_{0}+2\Delta t)}\rightarrow\cdots\rightarrow\boldsymbol{x}^{(t_{0}+T\Delta t)}, (20)

where $T$ is usually taken around $10^{4}$ picoseconds (ps) to $10^{7}$ ps. Various strategies have been invented to iteratively update the atomic positions, including central-difference methods (Verlet, leap-frog, velocity Verlet, and the Beeman algorithm) (Frenkel and Smit 2001; Allen and Tildesley 2012), predictor-corrector methods (Lambert 1991), and symplectic integrators (Tuckerman and Martyna 2000; Schlick 2010).

MD trajectory iterations.

With sampled forces, we can now update the velocities of each atom and compute their coordinates at the next timeframe. As an example of an integrator, the velocity Verlet algorithm (Martys and Mountain 1999), shown in Eqs. 21 and 22, is simple and widely used in MD software such as AMBER (Case et al. 2005). The coordinates and velocities are updated simultaneously as:

\boldsymbol{x}_{i}^{(t+1)}=\boldsymbol{x}_{i}^{(t)}+\boldsymbol{v}_{i}^{(t)}\cdot\Delta t+\frac{1}{2}\frac{\boldsymbol{F}_{i}^{(t)}}{m_{i}}(\Delta t)^{2}, (21)

\boldsymbol{v}_{i}^{(t+1)}=\boldsymbol{v}_{i}^{(t)}+\frac{1}{2}\left[\frac{\boldsymbol{F}_{i}^{(t)}}{m_{i}}+\frac{\boldsymbol{F}_{i}^{(t+1)}}{m_{i}}\right]\Delta t, (22)

where $\boldsymbol{F}_{i}^{(t)}$ is the net force acting on atom $i$ at timeframe $t$ and $m_{i}$ is its mass.
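As an illustration, a single velocity-Verlet step of Eqs. 21–22 can be sketched in a few lines of NumPy; the `forces` callable is a placeholder for whatever force evaluation (classical or learned) the simulation uses.

```python
import numpy as np

def velocity_verlet_step(x, v, m, forces, dt):
    """One velocity-Verlet step (Eqs. 21-22). `x`, `v` are (N, 3) arrays,
    `m` is an (N,) array of masses, and `forces(x)` returns an (N, 3) array."""
    f_t = forces(x)
    x_next = x + v * dt + 0.5 * (f_t / m[:, None]) * dt ** 2   # Eq. 21
    f_next = forces(x_next)
    v_next = v + 0.5 * ((f_t + f_next) / m[:, None]) * dt      # Eq. 22
    return x_next, v_next
```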

Gaussian accelerated MD.

GaMD (Miao and McCammon 2017; Pang et al. 2017) is a robust computational method for simultaneous unconstrained enhanced sampling and free-energy calculations of biomolecules, accelerating MD by orders of magnitude (Wang et al. 2021). It works by adding a harmonic boost potential to reduce energy barriers. When the system potential $U(\boldsymbol{x})$ is lower than a threshold energy $\bar{U}$, a boost potential $\Delta U$ is injected as:

\Delta U=\frac{1}{2}\eta_{U}\left(\bar{U}-U(\boldsymbol{x})\right)^{2},\quad U(\boldsymbol{x})<\bar{U}, (23)

where $\eta_{U}$ is the harmonic force constant. The modified system potential is given by $U^{*}=U+\Delta U$. Otherwise, when the system potential is above the threshold energy, the boost potential is set to zero.

Remarkably, the boost potential $\Delta U$ exhibits a Gaussian distribution, allowing for accurate reweighting of the simulations using a cumulant expansion to the second order and recovery of the original free-energy landscapes, even for large biomolecules (Miao and McCammon 2016; Pang et al. 2017).
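For intuition, the second-order cumulant expansion used in this reweighting reduces to a one-line estimate of $\ln\langle e^{\beta\Delta U}\rangle$ from sampled boost potentials; the following is a hedged sketch, not the full GaMD reweighting pipeline.

```python
import numpy as np

def log_avg_boost_weight(dU, beta):
    """Second-order cumulant estimate of ln <exp(beta * dU)> from an array of
    sampled boost potentials dU: beta*C1 + beta^2*C2/2, with C1 the mean and
    C2 the variance. This is the quantity needed to reweight GaMD frames."""
    return beta * dU.mean() + 0.5 * beta ** 2 * dU.var()
```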

Moreover, three enhanced sampling principles are imposed on the boost potential in GaMD. Among them, to ensure accurate reweighting (Miao et al. 2015), the standard deviation of $\Delta U$ must be sufficiently small, i.e., the distribution must be narrow:

\begin{split}\sigma_{\Delta U}&=\sqrt{\left(\frac{\partial\Delta U}{\partial U}\Big|_{U=U_{\mathrm{avg}}}\right)^{2}\sigma_{U}^{2}}\\&=\eta_{U}\left(\bar{U}-U_{\mathrm{avg}}\right)\sigma_{U}\leq\sigma_{0},\end{split} (24)

where $U_{\mathrm{avg}}$ and $\sigma_{U}$ are the average and standard deviation of the system potential energies, and $\sigma_{\Delta U}$ is the standard deviation of $\Delta U$, with $\sigma_{0}$ a user-specified upper limit (e.g., $10\,k_{B}T$) for accurate reweighting (Miao, Bhattarai, and Wang 2020). This also indicates that our choice of $\eta_{\sigma}$ ought to be sufficiently small, i.e., bounded from above.
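The boost of Eq. 23 and the narrow-distribution criterion of Eq. 24 translate directly into code; the following NumPy sketch (the function names are ours, chosen for illustration) shows both.

```python
import numpy as np

def gamd_boost(U, U_bar, eta_U):
    """Boost potential of Eq. 23: harmonic below the threshold, zero above."""
    return np.where(U < U_bar, 0.5 * eta_U * (U_bar - U) ** 2, 0.0)

def narrow_enough(U_samples, U_bar, eta_U, sigma_0):
    """Narrow-distribution criterion of Eq. 24:
    eta_U * (U_bar - U_avg) * sigma_U must not exceed sigma_0."""
    return eta_U * (U_bar - U_samples.mean()) * U_samples.std() <= sigma_0
```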

Appendix B Equivariance Proof for EGT

In this part, we provide a rigorous proof that EGT achieves E($n$) equivariance on $\boldsymbol{x}^{(t)}$. More formally, for any orthogonal matrix $Q\in\mathbb{R}^{3\times3}$ and any translation $o\in\mathbb{R}^{3}$ (applied row-wise to all atoms), the model should satisfy:

Q\boldsymbol{x}^{(t),l+1}+o,\;Q\boldsymbol{v}^{(t),l+1},\;\boldsymbol{h}^{(t),l+1}=\text{EGL}\left(Q\boldsymbol{x}^{(t),l}+o,\;Q\boldsymbol{v}^{(t),l},\;\boldsymbol{h}^{(t),l}\right). (25)

As assumed in the preliminaries, $\boldsymbol{h}^{(t),0}$ is invariant to E($n$) transformations. In other words, we do not encode any information about the absolute position or orientation of $\boldsymbol{x}^{(t),0}$ into $\boldsymbol{h}^{(t),0}$. Moreover, the spherical representations $\tilde{\boldsymbol{e}}_{\mathrm{SBF}_{1}}^{l}$ and $\tilde{\boldsymbol{e}}_{\mathrm{SBF}_{2}}^{l}$ are invariant as well, because the distance between two particles and the three angles are invariant to translations, as

\left\|\boldsymbol{x}_{a}^{(t),l}+o-\left(\boldsymbol{x}_{b}^{(t),l}+o\right)\right\|^{2}=\left\|\boldsymbol{x}_{a}^{(t),l}-\boldsymbol{x}_{b}^{(t),l}\right\|^{2}, (26)

and they are invariant to rotations as

\begin{split}&\left\|Q\boldsymbol{x}_{a}^{(t),l}-Q\boldsymbol{x}_{b}^{(t),l}\right\|^{2}\\&=\left(\boldsymbol{x}_{a}^{(t),l}-\boldsymbol{x}_{b}^{(t),l}\right)^{\top}Q^{\top}Q\left(\boldsymbol{x}_{a}^{(t),l}-\boldsymbol{x}_{b}^{(t),l}\right)\\&=\left(\boldsymbol{x}_{a}^{(t),l}-\boldsymbol{x}_{b}^{(t),l}\right)^{\top}\boldsymbol{I}\left(\boldsymbol{x}_{a}^{(t),l}-\boldsymbol{x}_{b}^{(t),l}\right)\\&=\left\|\boldsymbol{x}_{a}^{(t),l}-\boldsymbol{x}_{b}^{(t),l}\right\|^{2}.\end{split} (27)

As for angles, the intersection angles take the form of

\varphi_{\boldsymbol{v}_{a}ab}^{(t),l}=\arccos\left(\frac{{\boldsymbol{v}_{a}^{(t),l}}^{\top}\boldsymbol{x}_{ab}^{(t),l}}{v_{a}x_{ab}}\right), (28)

\varphi_{\boldsymbol{v}_{b}ba}^{(t),l}=\arccos\left(\frac{{\boldsymbol{v}_{b}^{(t),l}}^{\top}\boldsymbol{x}_{ba}^{(t),l}}{v_{b}x_{ba}}\right). (29)

Similarly,

\left(Q\boldsymbol{v}_{a}^{(t),l}\right)^{\top}Q\boldsymbol{x}_{ab}^{(t),l}={\boldsymbol{v}_{a}^{(t),l}}^{\top}\boldsymbol{x}_{ab}^{(t),l}, (30)

\left(Q\boldsymbol{v}_{b}^{(t),l}\right)^{\top}Q\boldsymbol{x}_{ba}^{(t),l}={\boldsymbol{v}_{b}^{(t),l}}^{\top}\boldsymbol{x}_{ba}^{(t),l}, (31)

while $v_{a}$, $v_{b}$, and $x_{ab}$ are invariant as well. Thus, the intersection angles are invariant. Suppose $\boldsymbol{n}_{a}^{(t),l}$ and $\boldsymbol{n}_{b}^{(t),l}$ are the two normal vectors of the two planes; the dihedral angle between these planes is defined as $\arccos\left(\frac{{\boldsymbol{n}_{a}^{(t),l}}^{\top}\boldsymbol{n}_{b}^{(t),l}}{n_{a}n_{b}}\right)$. The invariance of the dihedral angle can be proven in the same way as that of the intersection angles. Consequently, all query, key, and value vectors, attention scores, and feature embeddings are invariant.
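These invariances are easy to verify numerically. The following sketch draws a random orthogonal matrix via QR decomposition and checks that distances (Eqs. 26–27) and intersection angles (Eqs. 28–31) are unchanged; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
o = rng.normal(size=3)                        # random translation
xa, xb, va = rng.normal(size=(3, 3))          # toy positions and a velocity

# Distance invariance (Eqs. 26-27).
assert np.isclose(np.linalg.norm((Q @ xa + o) - (Q @ xb + o)),
                  np.linalg.norm(xa - xb))

# Intersection-angle invariance (Eqs. 28-31): rotations preserve inner products.
def angle(u, w):
    return np.arccos(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))

x_ab = xa - xb
assert np.isclose(angle(va, x_ab), angle(Q @ va, Q @ x_ab))
```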

Now we aim to show:

\begin{split}Q\boldsymbol{v}_{i}^{(t),l+1}&=f_{v}\left(\boldsymbol{h}_{i}^{(t),l}\right)Q\boldsymbol{v}_{i}^{(t),l}\\&+\sum_{j=1}^{N}\phi\left(a_{ij}\right)\left(Q\boldsymbol{x}_{i}^{(t),l}+o-\left(Q\boldsymbol{x}_{j}^{(t),l}+o\right)\right).\end{split} (32)

The right-hand side of this equation can be written as:

\begin{split}&f_{v}\left(\boldsymbol{h}_{i}^{(t),l}\right)Q\boldsymbol{v}_{i}^{(t),l}+\sum_{j=1}^{N}\phi\left(a_{ij}\right)\left(Q\boldsymbol{x}_{i}^{(t),l}+o-\left(Q\boldsymbol{x}_{j}^{(t),l}+o\right)\right)\\&=Qf_{v}\left(\boldsymbol{h}_{i}^{(t),l}\right)\boldsymbol{v}_{i}^{(t),l}+Q\sum_{j=1}^{N}\phi\left(a_{ij}\right)\left(\boldsymbol{x}_{i}^{(t),l}-\boldsymbol{x}_{j}^{(t),l}\right)\\&=Q\left(f_{v}\left(\boldsymbol{h}_{i}^{(t),l}\right)\boldsymbol{v}_{i}^{(t),l}+\sum_{j=1}^{N}\phi\left(a_{ij}\right)\boldsymbol{x}_{ij}^{(t),l}\right)\\&=Q\boldsymbol{v}_{i}^{(t),l+1}.\end{split} (33)–(34)

Finally, it is straightforward to show that $Q\boldsymbol{x}_{i}^{(t),l+1}+o=Q\boldsymbol{x}_{i}^{(t),l}+o+Q\boldsymbol{v}_{i}^{(t),l+1}$, so we omit this derivation.
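The equivariance of the velocity update can likewise be checked numerically with scalar stand-ins for the invariant quantities $f_{v}(\boldsymbol{h}_{i})$ and $\phi(a_{ij})$; this toy sketch mirrors Eqs. 32–34 and is not the actual EGT implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
o = rng.normal(size=3)                        # random translation
N = 5
x = rng.normal(size=(N, 3))                   # toy coordinates
v = rng.normal(size=3)                        # velocity of atom i = 0
f_v = 0.7                                     # stand-in for invariant f_v(h_i)
phi = rng.random(N)                           # stand-ins for invariant phi(a_ij)

def velocity_update(v_i, xs):
    # Eq. 32 for atom i = 0: f_v(h_i) v_i + sum_j phi(a_ij) (x_i - x_j).
    return f_v * v_i + sum(phi[j] * (xs[0] - xs[j]) for j in range(N))

# Rotating and translating the inputs rotates the output (Eq. 34).
assert np.allclose(velocity_update(Q @ v, x @ Q.T + o),
                   Q @ velocity_update(v, x))
```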

Appendix C Experiment

Implementation Details

In this subsection, we detail the hyper-parameters used in our experiments.

Training details.

All models are implemented in PyTorch (Paszke et al. 2019) and optimized with the Adam (Kingma and Ba 2014) optimizer with an initial learning rate of 5e-4 and a weight decay of 1e-10 on a single A100 GPU. A ReduceLROnPlateau scheduler adjusts the learning rate according to the training ARMSE, with a lower bound of 1e-7 and a patience of 5 epochs. The random seed is 1 and the hidden dimension is 64. All baseline models are evaluated with 4 layers. For RF, TFN, EGNN, GMN, SCFNN, and DiffMD, the batch size is 200; it is reduced to 100 for SE(3)-Transformer due to out-of-memory errors. We train models for 200 epochs and test their performance every 5 epochs. Similar to Shi et al. (2021) and Huang et al. (2022), for graph-based baselines we augment the original molecular graphs with 2-hop neighbors, and concatenate the hop index with the atomic numbers of the connected atoms as well as the edge-type indicator as the edge feature. For GMN, we follow the configuration in its paper: bonds without commonly connected atoms are selected as sticks, and the remaining atoms are treated as isolated particles. Notably, unlike Huang et al. (2022), we keep all atoms, including hydrogens, for a more accurate description of the microscopic system.
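For reference, the optimization setup described above corresponds to the following PyTorch snippet; `model`, `train_one_epoch`, and `evaluate` are placeholders for our training loop.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-10)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", patience=5, min_lr=1e-7)

for epoch in range(200):
    train_armse = train_one_epoch(model, optimizer)  # placeholder training pass
    scheduler.step(train_armse)  # lower the LR when the training ARMSE plateaus
    if (epoch + 1) % 5 == 0:
        evaluate(model)          # placeholder: test every 5 epochs
```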

DiffMD architecture.

The score function, EGT, has 6 EGLs, each with 8 attention heads. We adopt ReLU as the activation function and a dropout rate of 0.1. The input embedding size is 128, and the hidden size of the feed-forward network (FFN) is 2048. The degree, root, and order of the radial basis are all 2, and the interaction cutoff is 1.6 Å. We use the 2-norm of the velocity concatenated with the atomic number as the node feature, which is invariant to geometric transformations. For the generative process, we adopt the ODE sampler instead of the predictor-corrector sampler because the former is much faster. The absolute and relative error tolerances are both 1e-5. We tune several key hyper-parameters via grid search on the validation dataset (see Table 3).
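As a sketch of the sampler's numerics, an ODE sampler with these tolerances can be driven by an off-the-shelf adaptive integrator such as SciPy's `solve_ivp`; here `ode_drift` (a probability-flow drift built from the trained score function), `x_init`, and the integration interval are assumptions for illustration, not our exact implementation.

```python
from scipy.integrate import solve_ivp

def ode_sample(ode_drift, x_init, epsilon):
    """Integrate the probability-flow ODE from t = 1 down to the smallest
    time step `epsilon` with the stated tolerances (rtol = atol = 1e-5)."""
    sol = solve_ivp(ode_drift, (1.0, epsilon), x_init.ravel(),
                    method="RK45", rtol=1e-5, atol=1e-5)
    return sol.y[:, -1].reshape(x_init.shape)
```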

Table 3: The hyper-parameters searched for our DiffMD.
Name Description Range
$\sigma_{s}$ The standard deviation when the acceleration is above the threshold. [1e-3, 1e-2, 1e-1, 1]
$\hat{a}$ The acceleration threshold. [1e-2, 1e-1, 1, 2, 5, 10]
$\eta_{\sigma}$ The harmonic acceleration constant. [1e-2, 1e-1, 1, 10]
$\epsilon$ The smallest time step for numerical stability in the ODE sampler. [1e-1, 0.2, 0.4, 0.8, 0.9, 0.99]
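The grid search over Table 3 amounts to exhaustively evaluating the Cartesian product of these ranges; below is a minimal sketch, assuming a `validate` function that returns the validation metric (lower is better) for one hyper-parameter setting.

```python
from itertools import product

grid = {
    "sigma_s":   [1e-3, 1e-2, 1e-1, 1],
    "a_hat":     [1e-2, 1e-1, 1, 2, 5, 10],
    "eta_sigma": [1e-2, 1e-1, 1, 10],
    "eps":       [1e-1, 0.2, 0.4, 0.8, 0.9, 0.99],
}

# Pick the setting with the best (lowest) validation metric.
best = min((dict(zip(grid, values)) for values in product(*grid.values())),
           key=validate)  # `validate` is a placeholder
```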

Additional Results

Generated MD Trajectory Samples

We present visualizations of samples from the trajectories generated by different approaches in Fig. 4. For each method, we sample every 800 fs. The first row shows the ground-truth trajectory. It can be seen that even though EGNN, GMN, and DiffMD achieve low ARMSEs, the variance between their frames is tiny. In particular, SCFNN is trapped at a fixed point and its prediction remains unchanged; that is, its output is identical to its input. On the other hand, models such as TFN, RF, and SE(3)-Transformer alter the conformations dramatically, and the generated conformations are mostly invalid; in other words, they produce molecular structures that violate the underlying laws of molecular geometry.

Figure 4: Examples of the generated MD trajectories for ISO_1004 in the $\textrm{C}_{7}\textrm{O}_{2}\textrm{H}_{10}$ dataset. The first row is the ground-truth trajectory of this compound.