
Direct Molecular Conformation Generation†

† This work was done when Jinhua Zhu and Yusong Wang were interns at Microsoft Research AI4Science. Correspondence to Yingce Xia and Chang Liu.

Jinhua Zhu1 teslazhu@mail.ustc.edu.cn
Yingce Xia2 yingce.xia@microsoft.com
Chang Liu2 changliu@microsoft.com
Lijun Wu2 lijuwu@microsoft.com
Shufang Xie3 shufangxie@ruc.edu.cn
Yusong Wang4 wangyusong2000@stu.xjtu.edu.cn
Tong Wang2 watong@microsoft.com
Tao Qin2 taoqin@microsoft.com
Wengang Zhou1 zhwg@ustc.edu.cn
Houqiang Li1 lihq@ustc.edu.cn
Haiguang Liu2 liuhaiguang@microsoft.com
Tie-Yan Liu2 tyliu@microsoft.com
1 University of Science and Technology of China  2 Microsoft Research AI4Science
3 Renmin University of China  4 Xi’an Jiaotong University
Abstract

Molecular conformation generation aims to generate the three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology. Previous methods usually first predict the interatomic distances, the gradients of interatomic distances, or the local structures (e.g., torsion angles) of a molecule, and then reconstruct its 3D conformation. How to directly generate the conformation without these intermediate values is not fully explored. In this work, we propose a method that directly predicts the coordinates of atoms: (1) the loss function is invariant to roto-translation of coordinates and to permutation of symmetric atoms; (2) the newly proposed model adaptively aggregates the bond and atom information and iteratively refines the coordinates of the generated conformation. Our method achieves the best results on the GEOM-QM9 and GEOM-Drugs datasets. Further analysis shows that our generated conformations have properties (e.g., the HOMO-LUMO gap) closer to those of the groundtruth conformations. In addition, our method improves molecular docking by providing better initial conformations. All the results demonstrate the effectiveness of our method and the great potential of the direct approach. The code is released at https://github.com/DirectMolecularConfGen/DMCG.

1 Introduction

Molecular conformation generation aims to generate the 3D atomic coordinates of a molecule, which can then be used in molecular property prediction (Axelrod & Gomez-Bombarelli, 2021), docking (Roy et al., 2015), structure-based virtual screening (Kontoyianni, 2017), etc. While molecular conformations can be obtained experimentally, such as via X-ray crystallography, this is prohibitively costly for industry-scale tasks (Mansimov et al., 2019). Ab initio methods, e.g., those based on density functional theory (DFT) (Parr, 1980; Baseden & Tye, 2014), can accurately predict molecular structures but take several hours per small molecule (Hu et al., 2021). To handle large molecules, practitioners turn to classical force fields, like UFF (Rappe et al., 1992) or MMFF (Halgren, 1996) and its extension (Cleves & Jain, 2017), to optimize conformations, which is efficient but comes at the cost of low accuracy (Kanal et al., 2018).

Recently, machine learning methods have attracted much attention for conformation generation due to their accuracy and efficiency. Most previous methods first predict some intermediate values, like interatomic distances (Simm & Hernández-Lobato, 2020; Shi et al., 2020; Xu et al., 2021a;b), the gradients w.r.t. interatomic distances (Shi et al., 2021; Luo et al., 2021b), or the torsion angles (Ganea et al., 2021), and then reconstruct the conformation based on them. While these methods improve molecular conformation generation, the intermediate values they use must satisfy additional hard constraints, which are unfortunately violated in many cases. For example, GraphDG (Simm & Hernández-Lobato, 2020) predicts the interatomic distances and then reconstructs the conformation from them. The real distances (e.g., among three atoms) must satisfy the triangle inequality, but the distances predicted by GraphDG violate the inequality in $8.65\%$ of cases according to our study. For another example, ConfGF (Shi et al., 2021) predicts the gradients w.r.t. interatomic distances, yet the rank of a squared distance matrix is at most five. This constraint makes the gradients ill-defined, because the other distances cannot all be held constant while making an infinitesimal change to a specific distance $d_{ij}$ (see Appendix C for more details). Directly generating the coordinates without those intermediate values is a more straightforward strategy but is not fully explored. AlphaFold 2 (Jumper et al., 2021) is such a direct approach and has achieved remarkable performance on protein structure prediction. Its success inspires us to explore directly generating coordinates for molecular conformation.

A challenge of this approach is to maintain roto-translation invariance and permutation invariance. Specifically, (1) rotating and translating the coordinates of all atoms as a group does not change the conformation of a molecule, which must be taken into account by a direct approach; (2) permutation invariance must be considered for symmetry-related atoms. For example, as shown in Figure 1, due to the symmetry of the pyrimidine part along the C-S bond (atoms 11 and 12), atoms 13, 14 and atoms 17, 16 are equivalent. Therefore, swapping the coordinates of 13 with 17 and 14 with 16 yields the same conformation. According to our statistics on a subset of 40K molecules from GEOM-Drugs (Axelrod & Gomez-Bombarelli, 2021), on average, each molecule has 5.9 atom mappings that result in the same conformation (more specifically, the average $|\mathcal{S}|$ in Eqn.(1) is 5.9). This number is non-negligible for loss function design.

To maintain roto-translation and permutation invariance, we design a loss function as the minimal distance between two sets of coordinates over all roto-translations and all permutations of symmetric atoms. Based on the new loss function, we design a model that iteratively refines atom coordinates. The model stacks multiple blocks, and each block outputs a conformation that is then refined by the following block. A block consists of several modules that encode the previous conformation as well as the representations of bonds, atoms and global information of the molecule. At the end of each block, we add a normalization layer that centers the coordinates at the origin. Since a molecule may have multiple conformations, inspired by the variational auto-encoder (VAE), we introduce a random variable $z$ and a regularization term on $z$, which enables diverse generation.

Figure 1: An example of a symmetric substructure of a molecule.

We conduct experiments on four benchmarks: GEOM-QM9 and GEOM-Drugs in the small-scale setting (Shi et al., 2021) and in the large-scale setting (Axelrod & Gomez-Bombarelli, 2021). The small-scale GEOM-QM9 and GEOM-Drugs each have 200K molecule-conformation pairs for training, and the large-scale GEOM-QM9 and GEOM-Drugs have 1.37M and 2.0M training pairs. Our method achieves state-of-the-art results on all of them, demonstrating its effectiveness. Specifically, on small-scale GEOM-QM9, our method improves the recall-based mean coverage score and mean matching score by 4.7% and 0.3%. On small-scale GEOM-Drugs, the improvements are 7.4% and 16.3%. In the large-scale settings, the improvements are more significant: 7.3% and 47.1% on GEOM-QM9, and 25.3% and 36.0% on GEOM-Drugs. To further verify the generation quality, we use Psi4 (Smith et al., 2020) to calculate the properties of the generated conformations and the groundtruth conformations (e.g., the HOMO-LUMO gap). Our conformations have properties closer to the groundtruth than those of other methods. We also find that our generated conformations can improve molecular docking by providing better initial conformations.

In summary, (1) we design a dedicated loss function that maintains both permutation invariance on symmetric atoms and roto-translation invariance on conformations; (2) we design a new model that iteratively refines the conformation; (3) our method, named Direct Molecular Conformation Generation (DMCG), outperforms strong baselines and achieves state-of-the-art results on all benchmarks we tested.

Problem Definition: Let $G=(V,E)$ denote a molecular graph, where $V$ and $E$ are the collections of atoms and bonds, respectively. Specifically, $V=\{v_{1},v_{2},\cdots,v_{|V|}\}$ with $v_i$ the $i$-th atom. Let $e_{ij}$ denote the bond between atoms $v_i$ and $v_j$. For ease of reference, we simply use $i\in V$ and $(i,j)\in E$ to denote the $i$-th atom in $V$ and the bond $e_{ij}$ in $E$. Let $N(i)$ denote the neighbors of atom $i$, i.e., $N(i)=\{j\mid(i,j)\in E\}$. We use $R\in\mathbb{R}^{|V|\times 3}$ to represent the conformation of $G$; the $i$-th row of $R$ (denoted as $R_i$) is the coordinate of atom $v_i$. Given a graph $G=(V,E)$, our task is to learn a mapping that outputs the coordinates $R\in\mathbb{R}^{|V|\times 3}$ of all atoms in $V$.

2 Framework

In this section, we first introduce the loss function. After that, we present the overall training and inference workflow of our method. Finally, we introduce our proposed model.

2.1 Loss function

Roto-translational and permutation invariance of conformations: Let $R\in\mathbb{R}^{|V|\times 3}$ and $\hat{R}\in\mathbb{R}^{|V|\times 3}$ denote the groundtruth conformation and the generated conformation. The roto-translation and permutation invariant loss is defined as follows:

$\ell_{\rm RTP}(R,\hat{R})=\min_{\rho;\,\sigma\in\mathcal{S}}\|R-\rho(\sigma(\hat{R}))\|^{2}_{F}.$   (1)

In Eqn.(1), (i) $\rho$ denotes a roto-translational operation, which rotates and translates a conformation rigidly; (ii) $\mathcal{S}$ denotes the collection of permutation operations on symmetric atoms. For example, in Figure 1, $\mathcal{S}$ contains two elements $\sigma_1$ and $\sigma_2$, where $\sigma_1$ is the identity mapping, i.e., $\sigma_1(i)=i$ for any $i\in\{1,2,\cdots,18\}$, and $\sigma_2$ is the mapping on the symmetric atoms of the pyrimidine: $\sigma_2(13)=17$, $\sigma_2(17)=13$, $\sigma_2(14)=16$, $\sigma_2(16)=14$ and $\sigma_2(i)=i$ for the remaining atoms $i$. (iii) $\|A\|_{F}^{2}$ is defined as $\sum_{i,j}|A_{i,j}|^{2}$. In all, Eqn.(1) defines the loss between $R$ and $\hat{R}$ as the minimal achievable distance under any roto-translation and any permutation of symmetric atoms, and is hence invariant to these operations. Eqn.(1) can be solved via quaternions (Karney, 2007; Hamilton, 1840) and graph isomorphism (Meli & Biggin, 2020).

To solve Eqn.(1), the optimization can be decomposed into two subproblems: (S1) $\ell_{\rm RT}=\min_{\rho}\|\rho(\hat{R})-R\|^{2}_{F}$; (S2) $\ell_{\rm P}=\min_{\sigma\in\mathcal{S}}\|\sigma(\hat{R})-R\|^{2}_{F}$.

Karney (2007) proposes to use quaternions (Hamilton, 1840) to solve (S1). A quaternion $q$ is an extension of complex numbers, $q=q_{0}\bm{u}+q_{1}\bm{i}+q_{2}\bm{j}+q_{3}\bm{k}$, where $q_0,q_1,q_2,q_3$ are real scalars and $\bm{u}$, $\bm{i}$, $\bm{j}$, $\bm{k}$ are orientation vectors. With quaternions, any rotation is specified by a $3\times 3$ matrix, where each element is a summation/multiplication of $q_0$ to $q_3$. The solution to (S1) is the minimal eigenvalue of a $4\times 4$ matrix obtained by algebraic operations on $R$ and $\hat{R}$. To stabilize training, we stop gradient back-propagation through $\rho$ (see Appendix B.3 for the ablation study).
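To make (S1) concrete, below is a minimal NumPy sketch of the quaternion-based (Kearsley-style) alignment; the function name and interface are our own illustration, not the paper's released code. After translation is removed by centering, the smallest eigenvalue of a symmetric $4\times 4$ matrix gives the minimal sum of squared deviations over all rotations.

```python
import numpy as np

def kearsley_min_sq_dev(R, R_hat):
    """Minimal sum of squared deviations between conformations R and R_hat
    (both (N, 3) arrays) over all rigid roto-translations, via the Kearsley
    quaternion method. Returns the optimal value of (S1); the corresponding
    RMSD is sqrt(value / N)."""
    # Translation is handled by centering both conformations at the origin.
    R = R - R.mean(axis=0)
    R_hat = R_hat - R_hat.mean(axis=0)
    m = R - R_hat   # per-atom coordinate differences
    p = R + R_hat   # per-atom coordinate sums
    xm, ym, zm = m[:, 0], m[:, 1], m[:, 2]
    xp, yp, zp = p[:, 0], p[:, 1], p[:, 2]
    # Symmetric 4x4 Kearsley matrix; its smallest eigenvalue equals the minimal
    # squared deviation, and the matching eigenvector is the rotation quaternion.
    K = np.empty((4, 4))
    K[0, 0] = (xm**2 + ym**2 + zm**2).sum()
    K[1, 1] = (yp**2 + zp**2 + xm**2).sum()
    K[2, 2] = (xp**2 + zp**2 + ym**2).sum()
    K[3, 3] = (xp**2 + yp**2 + zm**2).sum()
    K[0, 1] = K[1, 0] = (yp * zm - ym * zp).sum()
    K[0, 2] = K[2, 0] = (xm * zp - xp * zm).sum()
    K[0, 3] = K[3, 0] = (xp * ym - xm * yp).sum()
    K[1, 2] = K[2, 1] = (xm * ym - xp * yp).sum()
    K[1, 3] = K[3, 1] = (xm * zm - xp * zp).sum()
    K[2, 3] = K[3, 2] = (ym * zm - yp * zp).sum()
    eigvals = np.linalg.eigvalsh(K)  # ascending order
    return eigvals[0]
```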

To solve (S2), we need to find all elements in $\mathcal{S}$ and then enumerate them to obtain the minimal value. $\mathcal{S}$ can be described mathematically as follows: (1) $\forall i\in V$, atom $i$ and atom $\sigma(i)$ have the same label, which is defined as the union of the atom type itself and the types of all the bonds connected to it (for example, in Figure 1, the label of atom 11 is "S-2Single" and the label of atom 17 is "N-2Aromatic"); (2) there exists a bond between atoms $i$ and $j$ if and only if there exists a bond between atoms $\sigma(i)$ and $\sigma(j)$ in the same molecular graph. Therefore, finding $\mathcal{S}$ reduces to a graph isomorphism problem on molecular graphs. Inspired by Meli & Biggin (2020), we use the graph_tool toolkit (https://graph-tool.skewed.de) to find all permutations in $\mathcal{S}$. By combining the above two strategies, we are able to solve Eqn.(1). We provide several examples in the online supplementary material to show how our method works.
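The paper uses graph_tool for this step; as an illustration of the same idea, the sketch below enumerates label-preserving automorphisms with networkx's VF2 matcher. The toy molecule and the attribute names ("label", "bond") are hypothetical stand-ins for the labeling scheme described above.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def symmetric_permutations(G):
    """Enumerate S: all automorphisms of the molecular graph G that preserve
    atom labels (atom type plus incident bond types) and bond types."""
    nm = isomorphism.categorical_node_match("label", None)
    em = isomorphism.categorical_edge_match("bond", None)
    matcher = isomorphism.GraphMatcher(G, G, node_match=nm, edge_match=em)
    # Matching a graph against itself yields its automorphisms; each mapping
    # is a dict sigma with sigma[i] = permuted index of atom i.
    return list(matcher.isomorphisms_iter())

# Toy example: a 3-atom path A-B-A is symmetric, so S contains the identity
# and the swap of the two A atoms.
G = nx.Graph()
G.add_node(0, label="A"); G.add_node(1, label="B"); G.add_node(2, label="A")
G.add_edge(0, 1, bond="single"); G.add_edge(1, 2, bond="single")
print(symmetric_permutations(G))  # [{0: 0, 1: 1, 2: 2}, {0: 2, 1: 1, 2: 0}]
```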

Hopcroft & Wong (1974) proposed an algorithm that tests planar graphs for isomorphism with complexity $O(|E|)$, where $|E|$ is the number of edges in a graph. A planar graph is a graph that can be drawn so that no edges cross each other (see Wikipedia for a quick introduction). For the widely used GEOM-QM9 and GEOM-Drugs conformation generation datasets (Shi et al., 2021; Xu et al., 2022), all the molecules are planar graphs. We also randomly sampled 30M compounds from PubChem, and only 4.5k of them are not planar graphs (0.015%). This shows that although our method needs to test graph isomorphism, the time complexity can still be controlled. In addition, for 99.8% of the molecules in GEOM-Drugs, $|\mathcal{S}|$ is smaller than 100, so it is efficient to enumerate all permutations on GPU. A limitation is that when stepping from small molecules to proteins with long chains, $|\mathcal{S}|$ will increase significantly, resulting in a large computation cost for obtaining $\ell_{\rm RTP}$. We will improve this in the future.
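As a quick check of the planarity claim, networkx offers a linear-time planarity test (a small sketch; the cycle graph is a stand-in for an actual molecular graph):

```python
import networkx as nx

# check_planarity returns (is_planar, certificate). For planar molecular
# graphs, the O(|E|) isomorphism algorithm mentioned above applies.
G = nx.cycle_graph(6)  # stand-in molecular graph, e.g. a six-membered ring
is_planar, _ = nx.check_planarity(G)
print(is_planar)       # True
```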

One-to-many mapping of conformations: A molecule might correspond to multiple conformations. Thus, we introduce a random variable $z$ into our model for diverse conformation generation. Given a molecular graph $G$, different $z$ can result in different conformations (denoted as $\hat{R}(z,G)$). Inspired by the variational auto-encoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014; Sohn et al., 2015), we introduce a (conditional) inference model $q(z|R,G)$ to describe the posterior distribution of $z$, rewrite the reconstruction loss in a probabilistic style as $\mathbb{E}_{q(z|R,G)}\big[\ell_{\rm RTP}(R,\hat{R}(z,G))\big]$, and append a regularization term in the form of the Kullback-Leibler (KL) divergence w.r.t. a prior distribution $p(z)$, i.e., $D_{\mathrm{KL}}(q(z|R,G)\|p(z))$. In this way, the aggregated (i.e., averaged/marginalized) posterior $\int q(z|R,G)\,p_{\text{data}}(R)\,\mathrm{d}R$ is driven towards the prior $p(z)$, which in turn allows generating a new conformation from (approximately) $p_{\text{data}}(R)$ by passing a $p(z)$ sample through the decoder. Drawing a random variable $z$ from $p(z)$ is easy and encourages diversity.

By properly choosing $q(z|R,G)$, the loss becomes tractable to optimize. We specify $q(z|R,G):=\mathcal{N}(z|\mu_{R,G},\Sigma_{R,G})$ as a multivariate Gaussian with a diagonal covariance matrix, where $\mu_{R,G}$ and $\Sigma_{R,G}$ are outputs of an encoder. This enables tractable loss optimization via reparameterization (Kingma & Welling, 2014): sampling $z\sim q(z|R,G)$ is equivalent to $z^{(i)}=\mu_{R,G}^{(i)}+\sqrt{\Sigma_{R,G}^{(i,i)}}\,\epsilon$ for all $i$, where $\epsilon\sim\mathcal{N}(0,1)$, $z^{(i)}$ and $\mu^{(i)}_{R,G}$ are the $i$-th elements of $z$ and $\mu_{R,G}$, and $\Sigma^{(i,i)}_{R,G}$ denotes the $i$-th diagonal element. The KL divergence loss specializes to $D_{\mathrm{KL}}(\mathcal{N}(\mu_{R,G},\Sigma_{R,G})\|\mathcal{N}(0,\bm{I}))$, which has a closed-form solution.
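A minimal PyTorch sketch of the reparameterized sampling and the closed-form KL term, assuming the encoder outputs the mean and the log of the diagonal variances (the log-variance parameterization is our implementation choice for numerical stability, not stated in the paper):

```python
import torch

def sample_z(mu, log_var):
    """Reparameterized sample z ~ N(mu, diag(exp(log_var)))."""
    eps = torch.randn_like(mu)              # eps ~ N(0, I)
    return mu + torch.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    i.e. 0.5 * sum_i (mu_i^2 + sigma_i^2 - log sigma_i^2 - 1)."""
    return 0.5 * torch.sum(mu**2 + torch.exp(log_var) - log_var - 1.0, dim=-1)
```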

The overall training objective is defined as follows:

$\min\;\mathbb{E}_{\epsilon\sim\mathcal{N}(0,\bm{I})}\,\ell_{\rm RTP}(R,\hat{R}(z,G))+\beta\,D_{\mathrm{KL}}(\mathcal{N}(\mu_{R,G},\Sigma_{R,G})\|\mathcal{N}(0,\bm{I})),$   (2)

where $\beta>0$ is a hyperparameter. The minimization in Eqn.(2) is taken over all the network parameters (including the conformation generator and the auxiliary model $q$).

Figure 2: The workflow of our method: (a) training workflow; (b) inference workflow. Green and orange lines represent how to obtain $\hat{R}(z,G)$ and $q(z|R,G)$, respectively; solid and dashed lines represent the model components and outputs, respectively.

2.2 Training and inference flow

Now we describe the training and inference workflows. The training process involves three modules: $\varphi_{\rm 2D}$, $\varphi_{\rm 3D}$ and $\varphi_{\rm dec}$. The workflow is illustrated in Figure 2(a). Specifically,

(1) The encoder $\varphi_{\rm 2D}$ takes the molecular graph $G$ as input and outputs several representations: $H_V^{(0)}\in\mathbb{R}^{|V|\times d}$ for all atoms, $H_E^{(0)}\in\mathbb{R}^{|E|\times d}$ for all bonds, a global graph feature $U^{(0)}\in\mathbb{R}^{d}$, and an initial conformation $\hat{R}^{(0)}\in\mathbb{R}^{|V|\times 3}$, where $d$ is the dimension of the representations. Formally, $(H_V^{(0)},H_E^{(0)},U^{(0)},\hat{R}^{(0)})=\varphi_{\rm 2D}(G)$.

(2) The encoder $\varphi_{\rm 3D}$ extracts features of the conformation $R$ for constructing the conditional inference module $q(z|R,G)$. Following the above specification, $\varphi_{\rm 3D}$ only needs to output the mean and covariance of the Gaussian; formally, $(\mu_{R,G},\Sigma_{R,G})=\varphi_{\rm 3D}(R,G)$.

(3) We randomly sample a variable $z$ from the Gaussian distribution $\mathcal{N}(\mu_{R,G},\Sigma_{R,G})$ and feed $H_V^{(0)}$, $H_E^{(0)}$, $U^{(0)}$, $\hat{R}^{(0)}$, $z$ into the decoder $\varphi_{\rm dec}$ to obtain the conformation $\hat{R}(z,G)$. That is, $\hat{R}(z,G)=\varphi_{\rm dec}(\varphi_{\rm 2D}(G),z)=\varphi_{\rm dec}(H_V^{(0)},H_E^{(0)},U^{(0)},\hat{R}^{(0)},z)$. Note that sampling $z\sim\mathcal{N}(\mu_{R,G},\Sigma_{R,G})$ is equivalent to sampling $\epsilon\sim\mathcal{N}(0,\bm{I})$ and then setting $z^{(i)}=\mu_{R,G}^{(i)}+\sqrt{\Sigma_{R,G}^{(i,i)}}\,\epsilon$.

(4) After obtaining $\hat{R}(z,G)$ and $\mathcal{N}(\mu_{R,G},\Sigma_{R,G})$, we optimize Eqn.(2) for training. Recall that $\hat{R}(z,G)$ depends on $\varphi_{\rm 2D}$, $\varphi_{\rm 3D}$ and $\varphi_{\rm dec}$, while $\mu_{R,G}$ and $\Sigma_{R,G}$ depend on $\varphi_{\rm 3D}$.
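Putting steps (1)-(4) together, a schematic training step might look as follows; `phi_2d`, `phi_3d`, `phi_dec` and `l_rtp` are stand-ins for the actual networks and the Eqn.(1) loss, and `sample_z`/`kl_to_standard_normal` are the helpers sketched in Section 2.1:

```python
def training_step(G, R, phi_2d, phi_3d, phi_dec, beta):
    """One training step following steps (1)-(4); all module interfaces here
    are schematic stand-ins, not the released implementation."""
    HV0, HE0, U0, R0 = phi_2d(G)              # step (1): encode the 2D graph
    mu, log_var = phi_3d(R, G)                # step (2): posterior parameters
    z = sample_z(mu, log_var)                 # step (3): reparameterized z
    R_hat = phi_dec(HV0, HE0, U0, R0, z)      # step (3): decode a conformation
    # step (4): Eqn.(2) = invariant reconstruction loss + beta * KL
    return l_rtp(R, R_hat) + beta * kl_to_standard_normal(mu, log_var).mean()
```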

The inference workflow is shown in Figure 2(b), where the trained $\varphi_{\rm 2D}$ and $\varphi_{\rm dec}$ are used: (1) given a molecular graph $G$, we use $\varphi_{\rm 2D}$ to encode $G$ and obtain $\hat{R}^{(0)}$, $H_V^{(0)}$, $H_E^{(0)}$, $U^{(0)}$; (2) we sample a random variable $z$ from the Gaussian $\mathcal{N}(0,\bm{I})$; (3) we feed $\hat{R}^{(0)}$, $H_V^{(0)}$, $H_E^{(0)}$, $U^{(0)}$, $z$ into $\varphi_{\rm dec}$ and obtain the eventual conformation $\hat{R}(z,G)$. Note that $\varphi_{\rm 3D}$ is not used in the inference phase.
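A corresponding inference sketch under the same schematic interfaces; sampling several $z$ from the prior yields a diverse set of conformations:

```python
import torch

def generate_conformations(G, phi_2d, phi_dec, k, latent_dim):
    """Sample k diverse conformations for a molecular graph G following the
    three inference steps; phi_3d is not needed at inference time."""
    HV0, HE0, U0, R0 = phi_2d(G)                           # step (1)
    return [phi_dec(HV0, HE0, U0, R0, torch.randn(latent_dim))  # steps (2)-(3)
            for _ in range(k)]
```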

2.3 Model architecture

The encoders $\varphi_{\rm 2D}$, $\varphi_{\rm 3D}$ and the decoder $\varphi_{\rm dec}$ share the same architecture: they all stack $L$ identical blocks. We take the decoder $\varphi_{\rm dec}$ as an example to introduce its $l$-th block, and leave the details of $\varphi_{\rm 2D}$ and $\varphi_{\rm 3D}$ to Appendix A.1.

Figure 3 shows the architecture of the $l$-th block of $\varphi_{\rm dec}$. Roughly speaking, this block takes the outputs of its preceding block (the conformation $\hat{R}^{(l-1)}$, atom representations $H_V^{(l-1)}$, edge representations $H_E^{(l-1)}$ and the global representation $U^{(l-1)}$ of the whole molecule) and outputs a refined conformation and refined representations of the atoms, bonds and the whole graph. The process is repeated until the eventual output $\hat{R}^{(L)}$ is obtained. For the input of the first block (i.e., $l=1$), $H_V^{(0)}$, $H_E^{(0)}$, $U^{(0)}$ and $\hat{R}^{(0)}$ are the outputs of $\varphi_{\rm 2D}$.

We use a variant of the GN block (Battaglia et al., 2018; Addanki et al., 2021) as the backbone of our model due to its superior performance in molecular modeling. In each block, we first update bond representations, then atom representations, and finally the global molecule representation and the conformation.

Figure 3: Network architecture of the $l$-th block.

For ease of reference, let $h^{(l)}_{i}$ denote the representation of atom $i$ output by the $l$-th block, and $h^{(l)}_{ij}$ the representation of the bond between atoms $i$ and $j$. Also, let MLP denote a feed-forward network.

Mathematically, the $l$-th block performs the following operations:

(1) Update bond representations: We first incorporate the coordinate information into the representations by

$\bar{h}^{(l)}_{i}=h^{(l-1)}_{i}+\texttt{MLP}(\hat{R}^{(l-1)}_{i})+z,\;\forall i\in V,$   (3)
$\bar{h}^{(l)}_{ij}=h^{(l-1)}_{ij}+\texttt{MLP}(\|\hat{R}^{(l-1)}_{i}-\hat{R}^{(l-1)}_{j}\|),\;\forall(i,j)\in E,$

where $z\sim\mathcal{N}(\mu_{R,G},\Sigma_{R,G})$. After that, the bond representations are updated as follows: $\forall(i,j)\in E$,

$h^{(l)}_{ij}=h^{(l-1)}_{ij}+\texttt{MLP}(\bar{h}^{(l)}_{i},\bar{h}^{(l)}_{j},\bar{h}^{(l)}_{ij},U^{(l-1)}).$   (4)

(2) Update atom representations: for any atom $i\in V$,

$\tilde{h}_{i}^{(l)}=\sum_{j\in N(i)}\alpha_{j}W_{v}\,\texttt{concat}(\bar{h}^{(l)}_{ij},\bar{h}_{j}^{(l)}),\;\text{where}\;\alpha_{j}\propto\exp\big(\bm{a}^{\top}\zeta(W_{q}\bar{h}^{(l)}_{i}+W_{k}\,\texttt{concat}(\bar{h}^{(l)}_{j},\bar{h}^{(l)}_{ij}))\big);$   (5)
$h^{(l)}_{i}=h_{i}^{(l-1)}+\texttt{MLP}\big(\bar{h}^{(l)}_{i},\tilde{h}_{i}^{(l)},U^{(l-1)}\big).$

In Eqn.(5), $\bm{a}$, $W_q$, $W_v$ and $W_k$ are parameters to be learned, $\texttt{concat}(\cdot,\cdot)$ is the concatenation of two vectors, and $\zeta$ is the leaky ReLU activation. For atom $v_i$, we first use GATv2 (Brody et al., 2021) to aggregate the representations of its connected bonds into $\tilde{h}^{(l)}_{i}$, and then update $v_i$ based on $\tilde{h}_{i}^{(l)}$, $\bar{h}^{(l)}_{i}$ and $U^{(l-1)}$.

(3) Update the global molecule representation:

$U^{(l)}=U^{(l-1)}+\texttt{MLP}\Big(\frac{1}{|V|}\sum_{i=1}^{|V|}h^{(l)}_{i},\;\frac{1}{|E|}\sum_{i,j}h^{(l)}_{ij},\;U^{(l-1)}\Big).$   (6)

(4) Update the conformation: $\forall i\in V$,

$\bar{R}^{(l)}_{i}=\texttt{MLP}(h^{(l)}_{i}),\quad m^{(l)}=\frac{1}{|V|}\sum_{j=1}^{|V|}\bar{R}^{(l)}_{j},\quad\hat{R}^{(l)}_{i}=\bar{R}^{(l)}_{i}-m^{(l)}+\hat{R}^{(l-1)}_{i}.$   (7)

An important step in Eqn.(7) is that, after making the initial prediction $\bar{R}^{(l)}_{i}$, we calculate the center of the predicted coordinates and normalize them by moving the center to the origin. This normalization ensures that the coordinates generated by each block stay in a reasonable numeric range.

We use the $\hat{R}^{(L)}$ output by the last block of $\varphi_{\rm dec}$ as the final prediction of the conformation.
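For concreteness, the following PyTorch sketch implements one decoder block along the lines of Eqns (3)-(7); the single-head attention, the two-layer MLPs and the edge-list handling are our simplifications for clarity, not the exact released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scatter_sum(x, index, n):
    """Sum the rows of x into n buckets given by index (a simple scatter-add)."""
    out = torch.zeros(n, x.size(-1), dtype=x.dtype)
    return out.index_add_(0, index, x)

class DecoderBlock(nn.Module):
    """One decoder block implementing Eqns (3)-(7), in simplified form."""
    def __init__(self, d):
        super().__init__()
        self.coord_mlp = nn.Linear(3, d)         # MLP(R_i) in Eqn.(3)
        self.dist_mlp = nn.Linear(1, d)          # MLP(||R_i - R_j||) in Eqn.(3)
        self.bond_mlp = nn.Sequential(nn.Linear(4 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.node_mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.glob_mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.out_mlp = nn.Linear(d, 3)           # coordinate head in Eqn.(7)
        # GATv2-style attention (Eqn.(5)), single head for brevity.
        self.Wq = nn.Linear(d, d)
        self.Wk = nn.Linear(2 * d, d)
        self.Wv = nn.Linear(2 * d, d)
        self.att = nn.Linear(d, 1, bias=False)   # the vector a

    def forward(self, hV, hE, U, R, z, edges):
        # hV: (n, d); hE: (m, d); U: (d,); R: (n, 3); z: (d,);
        # edges: (m, 2) LongTensor containing both directions (i, j) and (j, i).
        n, m = hV.size(0), edges.size(0)
        i, j = edges[:, 0], edges[:, 1]
        # Eqn.(3): fold the coordinates and the latent z into the representations.
        hV_bar = hV + self.coord_mlp(R) + z
        dist = (R[i] - R[j]).norm(dim=-1, keepdim=True)
        hE_bar = hE + self.dist_mlp(dist)
        # Eqn.(4): update bond representations.
        U_e = U.expand(m, -1)
        hE_new = hE + self.bond_mlp(torch.cat([hV_bar[i], hV_bar[j], hE_bar, U_e], -1))
        # Eqn.(5): attentive aggregation over incident bonds (softmax over j).
        msg = torch.cat([hE_bar, hV_bar[j]], -1)
        score = self.att(F.leaky_relu(self.Wq(hV_bar)[i] + self.Wk(msg)))
        w = torch.exp(score - score.max())               # stabilized exponent
        alpha = w / (scatter_sum(w, i, n)[i] + 1e-12)    # per-node normalization
        h_tilde = scatter_sum(alpha * self.Wv(msg), i, n)
        U_n = U.expand(n, -1)
        hV_new = hV + self.node_mlp(torch.cat([hV_bar, h_tilde, U_n], -1))
        # Eqn.(6): update the global molecule representation.
        U_new = U + self.glob_mlp(torch.cat([hV_new.mean(0), hE_new.mean(0), U], -1))
        # Eqn.(7): predict coordinates, re-center at the origin, add the residual.
        R_bar = self.out_mlp(hV_new)
        R_new = R_bar - R_bar.mean(0, keepdim=True) + R
        return hV_new, hE_new, U_new, R_new
```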

3 Discussions with related work

CVGAE (Mansimov et al., 2019) is an early attempt to directly generate conformations. Unfortunately, its performance is not as good as the distance-based methods developed afterwards, like Shi et al. (2020) and Simm & Hernández-Lobato (2020). Our method, pursuing the same spirit, makes several finer designs: (1) We design a dedicated training objective that takes into account the invariance to both roto-translation and permutation of symmetric atoms. (2) We iteratively refine the output of each block, which is effective for conformation generation (see Figure 7 for the ablation study); in comparison, CVGAE only outputs the conformation at the last layer. (3) Our model integrates several advanced and more effective modules, including GATv2 (Brody et al., 2021) and the GN block (Battaglia et al., 2018), while CVGAE mainly leverages GRU (Bahdanau et al., 2015) and its variants on graphs, which are outperformed by the modules used in our model. GeoDiff (Xu et al., 2022) is a concurrent work that uses a diffusion-based method for conformation generation and also directly predicts coordinates without using intermediate distances. Compared with our method, GeoDiff does not consider the permutation invariance of symmetric atoms and is less efficient than our method due to its sequential sampling.

ConfGF (Shi et al., 2021) and DGSM (Luo et al., 2021b) are two recent works that can also directly output coordinates. They both model the gradient of the log-density w.r.t. interatomic distances and then generate coordinates by running Langevin dynamics using these gradients. The gradient model is learned via score matching. ConfGF considers the distances of 1-hop, 2-hop and 3-hop neighbors, and DGSM additionally considers the distances of randomly sampled node pairs to model non-bonded interactions. In comparison, we completely get rid of modeling distances. More importantly, the permutation invariance of symmetric atoms is not considered in those works. Ganea et al. (2021) propose another method for conformation generation: they first build the local structure (LS) by predicting the coordinates of non-terminal atoms, and then refine the LS using predicted distances and dihedral angles. In comparison, our method does not require refinement based on predicted distances and angles. Furthermore, although Ganea et al. (2021) use a permutation invariant loss, they only consider terminal atoms. According to our statistics on a subset of 40K molecules from GEOM-Drugs, besides terminal atoms, a molecule has on average 4.9 non-terminal symmetric atoms, accounting for 10.8% of all atoms. We consider all symmetric atoms.

Our method models roto-translation and permutation invariance through the loss function, while previous works model molecules using equivariant networks (Hoogeboom et al., 2022; Xu et al., 2022). More specifically, these works use diffusion models for conformation generation: rotational invariance of the conformation distribution is implemented using an invariant latent prior and an equivariant model structure (the reverse diffusion process) mapping from the latent space to the conformation space, which effectively makes the loss invariant in the latent space. Hoogeboom et al. (2022) also generate the composition of a molecule by leveraging continuous representations of ordinal/categorical variables. In comparison, our method removes the constraints on equivariant neural networks by introducing equivariance/invariance into the loss function, which differs from previous works that rely on specific network designs to ensure equivariance/invariance. A recent work (Du et al., 2022) points out that using only the radial direction to represent geometric information (as in the models of Hoogeboom et al. (2022) and Xu et al. (2022)) abandons high-order tensor information, thus bringing a direction degeneration problem and being insufficient to express complex geometric quantities. Therefore, our approach can adopt both equivariant network models and more general (non-equivariant) networks, enabling the use of more powerful non-equivariant neural models.

There are some other works on conformation generation, but they target different problems. G-SchNet (Gebauer et al., 2019; Hoogeboom et al., 2022) takes some properties as input (not a 2D graph) and outputs a conformation with the desired properties. Luo et al. (2021a) focus on generating a conformation that can bind to a specific binding pocket. We can combine our method with them in the future.

4 Experiments

4.1 Settings

Datasets: Following prior works (Xu et al., 2021a; Shi et al., 2021), we use the GEOM-QM9 and GEOM-Drugs datasets (Axelrod & Gomez-Bombarelli, 2021) for conformation generation. We verify our method in both a small-scale setting and a large-scale setting. For the small-scale setting, we use the same datasets as provided by Shi et al. (2021) for fair comparison with prior works. The training, validation and test sets of the two datasets consist of 200K, 2.5K and 22408 (for GEOM-QM9) / 14324 (for GEOM-Drugs) molecule-conformation pairs, respectively. After that, we work on the large-scale setting by sampling larger datasets from the original GEOM to validate the scalability of our method: we use all data in GEOM-QM9 and 2.2M molecule-conformation pairs for GEOM-Drugs. The training, validation and test sets of the larger GEOM-QM9 setting contain 1.37M, 165K and 174K pairs, and those of the larger GEOM-Drugs setting contain 2M, 100K and 100K pairs.

Model configuration: $\varphi_{\rm 2D}$, $\varphi_{\rm 3D}$ and $\varphi_{\rm dec}$ all have 6 blocks. The dimension $d$ of the features is 256. Inspired by the feed-forward layer in Transformer (Vaswani et al., 2017), each MLP consists of two sub-layers: the first maps the input features from dimension 256 to hidden states, followed by Batch Normalization and ReLU activation; the hidden states are then mapped back to dimension 256 by a linear mapping. Considering that our method outputs a conformation $\hat{R}^{(l)}$ at each block $l$, we also require each $\hat{R}^{(l)}$ to be similar to the groundtruth $R$. Therefore, the $\ell_{\rm RTP}$ in Eqn.(2) is implemented as

$\ell_{\rm RTP}(\hat{R}^{(L)},R)+\lambda\sum_{l=0}^{L-1}\ell_{\rm RTP}(\hat{R}^{(l)},R),$   (8)

where $L$ is the number of blocks in the decoder, $\hat{R}^{(0)}$ is the output of $\varphi_{\rm 2D}$, and $\lambda$ is determined according to validation performance. More details are summarized in Appendix A.2.
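A sketch of Eqn.(8), with `l_rtp` assumed to implement the Eqn.(1) loss:

```python
def total_rtp_loss(R, R_hats, lam):
    """Eqn.(8): loss on the final conformation plus lambda-weighted losses on
    the intermediate ones, R_hats = [R_hat^(0), ..., R_hat^(L)]."""
    return l_rtp(R_hats[-1], R) + lam * sum(l_rtp(r, R) for r in R_hats[:-1])
```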

Evaluation: Assume that molecule $x$ in the test set has $N_x$ conformations. Following Shi et al. (2020; 2021), for each molecule $x$ in the test set, we generate $2N_x$ conformations. Let $\mathbb{S}_g$ and $\mathbb{S}_r$ denote all generated and all groundtruth conformations, respectively. We use the coverage score (COV) and the matching score (MAT) to evaluate generation quality. To measure the difference between $R$ and $\hat{R}$, we use GetBestRMS in the RDKit package and denote the root-mean-square deviation as $\operatorname{RMSD}(R,\hat{R})$. The recall-based coverage and matching scores are defined as follows:

$\operatorname{COV}(\mathbb{S}_{g},\mathbb{S}_{r})=\frac{1}{|\mathbb{S}_{r}|}\Big|\big\{R\in\mathbb{S}_{r}\mid\operatorname{RMSD}(R,\hat{R})<\delta,\;\exists\hat{R}\in\mathbb{S}_{g}\big\}\Big|;$   (9)
$\operatorname{MAT}(\mathbb{S}_{g},\mathbb{S}_{r})=\frac{1}{|\mathbb{S}_{r}|}\sum_{R\in\mathbb{S}_{r}}\min_{\hat{R}\in\mathbb{S}_{g}}\operatorname{RMSD}(R,\hat{R}).$

A good method should have a high COV score and a low MAT score. Following Shi et al. (2021) and Xu et al. (2022), $\delta$ is set to 0.5 Å and 1.25 Å for GEOM-QM9 and GEOM-Drugs, respectively. The COV-$\delta$ curves are given in Figure 9 of the appendix. There are also precision-based COV and MAT scores, obtained by switching $\mathbb{S}_r$ and $\mathbb{S}_g$ in Eqn.(9); we leave the precision-based results to Appendix B.1.
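Once the pairwise RMSD matrix is available (e.g., via RDKit's GetBestRMS for each reference/generated pair), the recall-based scores of Eqn.(9) reduce to row-wise minima; a minimal NumPy sketch:

```python
import numpy as np

def cov_mat_scores(rmsd, delta):
    """Recall-based COV and MAT (Eqn.(9)) from a precomputed RMSD matrix of
    shape (|S_r|, |S_g|): rows are groundtruth conformations, columns are
    generated ones."""
    best = rmsd.min(axis=1)             # best match for each reference conformer
    cov = float((best < delta).mean())  # fraction of references covered
    mat = float(best.mean())            # mean best-match RMSD
    return cov, mat
```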

Table 1: Recall-based coverage and matching scores. The best results (bolded in the original table) correspond to the DMCG rows; the standard deviations of our method (five independent runs) on the small-scale datasets are reported.

Small-scale setting (left: GEOM-QM9; right: GEOM-Drugs):

Methods | QM9 COV(%)↑ Mean/Median | QM9 MAT(Å)↓ Mean/Median | Drugs COV(%)↑ Mean/Median | Drugs MAT(Å)↓ Mean/Median
RDKit | 83.26 / 90.78 | 0.3447 / 0.2935 | 60.91 / 65.70 | 1.2026 / 1.1252
CVGAE | 0.09 / 0.00 | 1.6713 / 1.6088 | 0.00 / 0.00 | 3.0702 / 2.9937
GraphDG | 73.33 / 84.21 | 0.4245 / 0.3973 | 8.27 / 0.00 | 1.9722 / 1.9845
CGCF | 78.05 / 82.48 | 0.4219 / 0.3900 | 53.96 / 57.06 | 1.2487 / 1.2247
ConfVAE | 80.42 / 85.31 | 0.4066 / 0.3891 | 53.14 / 53.98 | 1.2392 / 1.2447
GeoMol | 71.26 / 72.00 | 0.3731 / 0.3731 | 67.16 / 71.71 | 1.0875 / 1.0586
ConfGF | 88.49 / 94.13 | 0.2673 / 0.2685 | 62.15 / 70.93 | 1.1629 / 1.1596
DGSM | 91.49 / 95.92 | 0.2139 / 0.2137 | 78.73 / 94.39 | 1.0154 / 0.9980
GeoDiff | 90.54 / 94.61 | 0.2090 / 0.1988 | 89.13 / 97.88 | 0.8629 / 0.8529
DMCG | 96.23 / 99.26 | 0.2083 / 0.2014 | 96.52 / 100.00 | 0.7220 / 0.7161
(Std) | ±0.38 / ±0.37 | ±0.0052 / ±0.0040 | ±0.14 / ±0.00 | ±0.0027 / ±0.0061

Large-scale setting (left: GEOM-QM9; right: GEOM-Drugs):

Methods | QM9 COV(%)↑ Mean/Median | QM9 MAT(Å)↓ Mean/Median | Drugs COV(%)↑ Mean/Median | Drugs MAT(Å)↓ Mean/Median
RDKit | 81.61 / 85.71 | 0.2643 / 0.2472 | 69.42 / 77.45 | 1.0880 / 1.0333
CVGAE | 0.00 / 0.00 | 1.4687 / 1.3758 | 0.00 / 0.00 | 2.6501 / 2.5969
GraphDG | 13.48 / 5.71 | 0.9511 / 0.9180 | 1.95 / 0.00 | 2.6133 / 2.6132
CGCF | 81.48 / 86.95 | 0.3598 / 0.3684 | 57.47 / 62.09 | 1.2205 / 1.2003
ConfVAE | 80.18 / 85.87 | 0.3684 / 0.3776 | 57.63 / 63.75 | 1.2125 / 1.1986
ConfGF | 89.21 / 95.12 | 0.2809 / 0.2837 | 70.92 / 85.71 | 1.0940 / 1.0917
GeoMol | 91.05 / 95.55 | 0.2970 / 0.2993 | 69.74 / 83.56 | 1.1110 / 1.0864
DMCG | 98.34 / 100.00 | 0.1486 / 0.1340 | 96.22 / 100.00 | 0.6967 / 0.6552

Baselines: (1) RDKit, a widely used toolkit that generates conformations based on force fields; (2) CVGAE (Mansimov et al., 2019), an early attempt to generate raw coordinates; (3) GraphDG (Simm & Hernández-Lobato, 2020), a representative distance-based method with a VAE; (4) CGCF (Xu et al., 2021a), another distance-based method leveraging continuous normalizing flows; (5) ConfVAE (Xu et al., 2021b), an end-to-end framework for molecular conformation generation that still uses pairwise interatomic distances as intermediate variables; (6) ConfGF (Shi et al., 2021) and DGSM (Luo et al., 2021b), which use score matching to learn the gradients w.r.t. distances and then recover the conformation; (7) GeoDiff (Xu et al., 2022), which uses a diffusion model to generate conformations; (8) GeoMol (Ganea et al., 2021), which predicts local atomic 3D structures and torsion angles. Considering that Ganea et al. (2021) use a different data split from previous work, we reproduce their method following the more commonly used data split (Xu et al., 2021a; Shi et al., 2021).

4.2 Results

The recall-based results are shown in Table 1. For the small-scale datasets, we independently train our models with five different random seeds and report the means and standard deviations. We have the following observations:

(1) Our method achieves state-of-the-art results on all four settings in Table 1. A median COV(%) of 100% means that for more than half of the molecules, every groundtruth conformation has a generated conformation within the predefined RMSD threshold. These results show the effectiveness and scalability of our method.

(2) For the molecules in GEOM-QM9 and GEOM-Drugs, our method achieves larger improvements on molecules with more heavy atoms. Take the small-scale results in Table 1 as an example: on average, GEOM-QM9 and GEOM-Drugs molecules have 8.8 and 24.9 heavy atoms, respectively. In terms of mean MAT values, on GEOM-QM9 our method improves over ConfGF and GeoDiff by 22.7% and 1.2%, while on GEOM-Drugs the improvements are 37.9% and 16.3%. These results demonstrate the effectiveness of our method on large molecules.

More analysis is in Appendix B.5.

(3) Our method is much more sample-efficient than methods based on Langevin dynamics like ConfGF, since we generate i.i.d. samples free of the auto-correlation of a Markov chain. ConfGF requires 5000 sequential forward steps, while we only need to sample once from $\mathcal{N}(0,\bm{I})$ and forward through the model once. For a fair comparison, following the official implementation of ConfGF, we split the test sets of small-scale GEOM-QM9 and GEOM-Drugs into 200 batches. ConfGF requires 8511.60 and 11830.42 seconds to decode the QM9 and Drugs test sets, while our method only requires 32.68 and 54.89 seconds, respectively, a speedup of more than 200 times. Our method is also much more efficient than the recent GeoMol algorithm, which takes 99.34s and 668.95s to decode the above two test sets.

(4) As shown in Table 1, the standard deviations of our method are significantly smaller than the gains over the previously best results, which shows the effectiveness and robustness of our method. In addition, considering that our method takes a random conformation as input, to test the confidence interval, we run decoding with 10 different initial conformations. The mean COV and MAT scores on small-scale GEOM-QM9 are $96.24\pm 0.12$ and $0.2079\pm 0.0010$, and those on small-scale GEOM-Drugs are $96.38\pm 0.19$ and $0.7239\pm 0.0025$. Our method is not sensitive to the choice of initial conformations.

The number of rotatable bonds is an important metric of how flexible a molecule is. We report coverage score w.r.t. the number of rotatable bonds in Figure 4 based on small-scale GEOM-Drugs. More rotatable bonds indicate harder generation. Our method outperforms previous baselines.

Figure 4: Coverage scores w.r.t. the number of rotatable bonds.

Figure 5: Visualization of different conformations.

In Figure 5, we visualize conformations generated by different methods. We randomly select three molecules from the small-scale GEOM-Drugs dataset, generate several conformations, and visualize the ones best aligned with the groundtruth. Our method generates higher-quality conformations than previous methods; they are the most similar to the groundtruth.

Computation cost analysis: We use the PyTorch profiler (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) to analyze the training time of the following components: (1) model forward time, the time of computing the hidden representations from the input layer to the output layer; (2) transformation time, the time of computing the optimal roto-translation operation $\rho^{*}$; (3) permutation time, the time of enumerating all possible permutations in $\mathcal{S}$ and finding the optimal one $\sigma^{*}\in\mathcal{S}$ (note that we can use torch.no_grad to reduce time and memory); (4) loss forward time, the total time of computing the loss after obtaining $\rho^{*}$ and $\sigma^{*}$; (5) loss backward time, the time of gradient backpropagation.

The times are summarized in Table 2. Model forward and loss forward/backward take about 71.4% of the total computation time, while the transformation and permutation take 20.4% and 8.2%, respectively. Note that there are 7 transformation operations in the experiments (see Eqn.(8)). For the full training pipeline, where data loading, model forward, loss forward, gradient backpropagation, metric calculation and CPU/GPU communication are all considered, DMCG takes 20% more time than the same pipeline without roto-translation and permutation.
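A sketch of such a profiling harness with torch.profiler; the linear model and quadratic loss below are stand-ins for the actual generator and the Eqn.(2) loss:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(256, 3)      # stand-in for the conformation generator
x = torch.randn(128, 256)            # stand-in batch

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(100):             # 100 iterations, as in Table 2
        loss = model(x).pow(2).sum() # stand-in forward pass + loss
        loss.backward()
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```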

Table 2: Computation time statistics of each component over 100 iterations.

Model forward | Transformation | Permutation | Loss forward | Loss backward
5.515 (52.8%) | 2.136 (20.4%) | 0.858 (8.2%) | 0.052 (0.5%) | 1.886 (18.1%)

We use graph isomorphism algorithms to find $\mathcal{S}$. Although the general graph isomorphism problem is NP-hard, the size of drug-like molecules is largely bounded; otherwise the molecule's druggability is limited (see Lipinski's rule of five). Therefore, our method does not need to scale to very large graphs. In our experiments, it takes 4.9 seconds to process 10k molecules in GEOM-QM9 and 6.6 seconds to process 10k molecules in GEOM-Drugs. This is negligible compared to the training time, and we only need to process the data once, in the data preparation stage.

4.3 Molecular docking

Molecular docking (Roy et al., 2015) is a widely used technique in drug discovery, which aims to find the optimal binding conformation of a drug (i.e., a small molecule) in the pocket of a given target protein and the corresponding binding affinity. In most cases, molecular docking algorithms treat proteins as rigid bodies and take one conformation of the small molecule as the initial structure. The algorithms then search the conformation space of the small molecule for the optimal conformation, guided by a scoring function. However, due to the complexity of the conformation space, it is difficult for the algorithm to converge to a global minimum. Therefore, the choice of the initial structure often leads to different binding conformations and needs to be taken seriously.

Previously, RDKit was often used to generate the initial conformations of small molecules, which usually yields reasonable but not optimal results after docking. To verify the effectiveness of our method, we compare the docked poses obtained when conformations generated by our method (DMCG), ConfGF, GeoMol, GeoDiff and RDKit are used as the initial conformations for docking.

We use Smina (Koes et al., 2013) for molecular docking and evaluate on the PDBbind refined set (Liu et al., 2017), a comprehensive collection of experimentally measured binding affinities for biomolecular complexes deposited in the Protein Data Bank (https://www.rcsb.org/). We randomly select 100 protein-ligand pairs for evaluation. Appendix A.3 gives the detailed optimization hyper-parameters.

Figure 6: Histograms of docking scores and RMSD scores: (a) docking score (kcal/mol); (b) RMSD score (Å).

Two metrics are used to evaluate the docking results. One is the docking score (roughly, an estimate of binding affinity), which measures how well a molecule fits the binding site; a smaller value indicates better binding affinity. The other is the root-mean-square deviation (RMSD, the smaller, the better) compared to the crystal complex structure. As shown in Figure 6(a), the distributions for the compared methods have similar shapes, but that of our method is much more left-shifted than the others. This shows that for the same small molecule, our method tends to help docking find conformations with higher binding affinities. Furthermore, docking tends to find lower-RMSD binding conformations when using the conformation generated by our method as the initial conformation, suggesting that our method helps docking find binding conformations closer to the native crystal structures (Figure 6(b)). We also summarize the mean docking scores and RMSDs of the different algorithms in the legends of Figure 6. All these results show that our method provides better initial conformations for molecular docking and thus facilitates real applications in computer-aided drug discovery.

4.4 Property prediction

In addition to the conformation generation task, we also conduct experiments on the property prediction task, which predicts molecular properties based on an ensemble of generated conformations (Axelrod & Gomez-Bombarelli, 2021). We first randomly choose 30 molecules from the GEOM-QM9 test set and then sample 50 conformations for each molecule using RDKit, ConfGF and our method. We use the quantum chemistry package Psi4 (Smith et al., 2020) to calculate the energy, HOMO and LUMO of each generated conformation and each groundtruth conformation. Next, we calculate the ensemble properties of average energy $\overline{E}$, lowest energy $E_{\rm min}$, average HOMO-LUMO gap $\overline{\Delta\epsilon}$, minimum gap $\Delta\epsilon_{\rm min}$ and maximum gap $\Delta\epsilon_{\rm max}$ from the conformational properties of each molecule. (From a physics perspective, a Boltzmann-weighted average of the energies would be a better choice, but the required distribution is missing from the dataset; following Simm & Hernández-Lobato (2020), Shi et al. (2021) and Luo et al. (2021b), we use the unweighted average instead.) We use the mean absolute error to measure the property differences between the generated conformations and the groundtruth conformations.
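A small NumPy sketch of the ensemble-property computation and the per-molecule absolute errors, assuming the per-conformer energies and HOMO-LUMO gaps have already been computed (e.g., with Psi4):

```python
import numpy as np

def ensemble_props(energies, gaps):
    """Ensemble properties for one molecule from per-conformer energies and
    HOMO-LUMO gaps; simple (not Boltzmann-weighted) statistics, following
    the parenthetical note above."""
    return {"E_mean": np.mean(energies), "E_min": np.min(energies),
            "gap_mean": np.mean(gaps), "gap_min": np.min(gaps),
            "gap_max": np.max(gaps)}

def ensemble_abs_err(pred, ref):
    """Absolute error between generated and groundtruth ensemble properties;
    averaging these over molecules gives the MAE reported in Table 3."""
    return {k: abs(pred[k] - ref[k]) for k in pred}
```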

Table 3: Mean absolute error of predicted ensemble properties (unit: eV).

Methods | $\overline{E}$ | $E_{\rm min}$ | $\overline{\Delta\epsilon}$ | $\Delta\epsilon_{\rm min}$ | $\Delta\epsilon_{\rm max}$
RDKit | 0.8875 | 0.6530 | 0.3484 | 0.5570 | 0.2399
GraphDG | 45.1088 | 9.2868 | 3.8970 | 6.6997 | 1.7724
ConfGF | 2.8349 | 0.2012 | 0.6903 | 4.9221 | 0.1820
GeoMol | 4.5700 | 0.5096 | 0.5616 | 3.5083 | 0.2650
DMCG | 0.4324 | 0.1364 | 0.2057 | 1.3229 | 0.1509

The results are shown in Table 3. Our method significantly outperforms GraphDG, ConfGF and the recent GeoMol, which shows its effectiveness. We observe that RDKit achieves the best result on $\Delta\epsilon_{\rm min}$, and we will combine our method with RDKit in the future.

4.5 Ablation study

We conduct an ablation study on the small-scale GEOM-Drugs dataset. The results are shown in Table 4.

(1) We remove the permutation invariant loss and use the roto-translation invariant loss only, i.e., the $\ell_{\rm RTP}$ in Eqn.(2) is replaced with the $\ell_{\rm RT}$ defined in Section 2.1. The results are denoted as "No $\ell_{\rm P}$" in Table 4.

(2) We replace the attentive node aggregation with a simple MLP network. That is, Eqn.(5) is replaced by

$h^{(l)}_{i}=h^{(l-1)}_{i}+\texttt{MLP}\Big(h^{(l-1)}_{i},U^{(l-1)},\frac{1}{|N(i)|}\sum_{j\in N(i)}h^{(l-1)}_{j}\Big).$

The results are denoted as “No attention” in Table 4.

(3) We remove the normalization step in Eqn.(7), i.e., $m^{(l)}$ is not used. The results are denoted as "No normalization".

We can see that: (1) the permutation invariant loss is extremely important; without it, the mean COV drops by 18.91 points and the mean MAT increases by 0.3434 (we also visualize several cases in Appendix B.2 to compare the results with and without $\ell_{\rm P}$); (2) without attentive aggregation of the atom features, the mean COV drops by 1.70 points and the mean MAT increases by 0.0345; (3) without conformation normalization, the performance also degrades. These results demonstrate the importance of each component of our method.

Table 4: Ablation study on small-scale GEOM-Drugs.

Methods | COV(%)↑ Mean/Median | MAT(Å)↓ Mean/Median
DMCG | 96.52 / 100.00 | 0.7220 / 0.7161
No $\ell_{\rm P}$ | 77.78 / 86.09 | 1.0657 / 1.0563
No attention | 94.99 / 100.00 | 0.7611 / 0.7581
No normalization | 92.77 / 98.68 | 0.8002 / 0.7977

Finally, we compute the COV and MAT scores of $\hat{R}^{(l)}$, the conformation output by the $l$-th block of the decoder, against the groundtruth ($\hat{R}^{(0)}$ is the output of $\varphi_{\rm 2D}$). The results are shown in Figure 7. Iteratively refining the conformations improves the performance, which shows the effectiveness of our design. This phenomenon is consistent with findings in machine translation (Xia et al., 2017), image synthesis (Chen & Koltun, 2017) and protein structure prediction (Jumper et al., 2021).

Figure 7: The MAT and COV scores of $\hat{R}^{(l)}$ output by different blocks: (a) MAT score; (b) COV score.

We leave the discussion about additional constraints on loss functions, the comparison of model sizes and more discussions in Appendix B.

5 Conclusions and future work

In this work, we propose a new method that directly generates the coordinates of conformations. For this purpose, we design a dedicated loss function, which is invariant to roto-translation and to permutation of symmetric atoms. We also design a new model built from advanced modules (i.e., GATv2 and the GN block) that iteratively refines the conformations. Experimental results on both small-scale and large-scale GEOM-QM9 and GEOM-Drugs demonstrate the effectiveness of our method.

For future work, first, we will incorporate chemical rules into deep learning models to improve generation quality. Second, current methods are mainly non-autoregressive, where all coordinates are generated simultaneously; we will study the autoregressive setting so as to further improve accuracy. Third, Villar et al. (2021) point out that equivariance/invariance can be universally approximated through polynomial functions, which is a promising direction for molecular conformation generation. Fourth, when the number of permutation invariant mappings in a molecule is extremely large, enumerating all of them is not the best choice due to the exponentially increasing computation cost; we will improve our method along this direction. Fifth, we will collaborate closely with chemists and biologists on more case studies.

References

  • Addanki et al. (2021) Ravichandra Addanki, Peter W. Battaglia, David Budden, Andreea Deac, Jonathan Godwin, Thomas Keck, Wai Lok Sibon Li, Alvaro Sanchez-Gonzalez, Jacklynn Stott, Shantanu Thakoor, and Petar Velickovic. Large-scale graph representation learning with very deep gnns and self-supervision. CoRR, abs/2107.09422, 2021. URL https://arxiv.org/abs/2107.09422.
  • Axelrod & Gomez-Bombarelli (2021) Simon Axelrod and Rafael Gomez-Bombarelli. Geom: Energy-annotated molecular conformations for property prediction and molecular generation, 2021.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
  • Baseden & Tye (2014) Kyle A. Baseden and Jesse W. Tye. Introduction to density functional theory: Calculations by hand on the helium atom. Journal of Chemical Education, 91(12):2116–2123, 2014. doi: 10.1021/ed5004788.
  • Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • Brody et al. (2021) Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021.
  • Chen & Koltun (2017) Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. 2017 IEEE International Conference on Computer Vision (ICCV), pp.  1520–1529, 2017.
  • Cleves & Jain (2017) Ann E. Cleves and Ajay N. Jain. Forcegen 3d structure and conformer generation: from small lead-like molecules to macrocyclic drugs. Journal of Computer-Aided Molecular Design, 31(5):419–439, May 2017. ISSN 1573-4951. doi: 10.1007/s10822-017-0015-8. URL https://doi.org/10.1007/s10822-017-0015-8.
  • Dokmanic et al. (2015) Ivan Dokmanic, Reza Parhizkar, Juri Ranieri, and Martin Vetterli. Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 32(6):12–30, 2015.
  • Du et al. (2022) Weitao Du, He Zhang, Yuanqi Du, Qi Meng, Wei Chen, Nanning Zheng, Bin Shao, and Tie-Yan Liu. SE(3) equivariant graph neural networks with complete local frames. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  5583–5608. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/du22e.html.
  • Ganea et al. (2021) Octavian-Eugen Ganea, Lagnajit Pattanaik, Connor W. Coley, Regina Barzilay, Klavs Jensen, William Green, and Tommi S. Jaakkola. Geomol: Torsional geometric generation of molecular 3d conformer ensembles. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=af_hng9tuNj.
  • Gebauer et al. (2019) Niklas Gebauer, Michael Gastegger, and Kristof Schütt. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/a4d8e2a7e0d0c102339f97716d2fdfb6-Paper.pdf.
  • Halgren (1996) Thomas A Halgren. Merck molecular force field. i. basis, form, scope, parameterization, and performance of mmff94. Journal of computational chemistry, 17(5-6):490–519, 1996.
  • Hamilton (1840) W. R. Hamilton. On a new species of imaginary quantities, connected with the theory of quaternions. Proceedings of the Royal Irish Academy (1836-1869), 2:424–434, 1840. ISSN 03027597. URL http://www.jstor.org/stable/20520177.
  • Hoffmann & Noé (2019) Moritz Hoffmann and Frank Noé. Generating valid euclidean distance matrices. arXiv preprint arXiv:1910.03131, 2019.
  • Hoogeboom et al. (2022) Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3D. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  8867–8887. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/hoogeboom22a.html.
  • Hopcroft & Wong (1974) J. E. Hopcroft and J. K. Wong. Linear time algorithm for isomorphism of planar graphs (preliminary report). STOC ’74, pp.  172–184, New York, NY, USA, 1974. Association for Computing Machinery. ISBN 9781450374231. doi: 10.1145/800119.803896. URL https://doi.org/10.1145/800119.803896.
  • Hu et al. (2021) Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021.
  • Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, Aug 2021. ISSN 1476-4687. doi: 10.1038/s41586-021-03819-2. URL https://doi.org/10.1038/s41586-021-03819-2.
  • Kanal et al. (2018) Ilana Y. Kanal, John A. Keith, and Geoffrey R. Hutchison. A sobering assessment of small-molecule force field methods for low energy conformer predictions. International Journal of Quantum Chemistry, 118(5):e25512, 2018. doi: https://doi.org/10.1002/qua.25512. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/qua.25512.
  • Karney (2007) Charles FF Karney. Quaternions in molecular modeling. Journal of Molecular Graphics and Modelling, 25(5):595–604, 2007.
  • Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR 2014), Banff, Canada, 2014. ICLR Committee.
  • Koes et al. (2013) D. R. Koes, M. P. Baumgartner, and C. J. Camacho. Lessons learned in empirical scoring with smina from the csar 2011 benchmarking exercise. J Chem Inf Model, 53(8):1893–904, 2013. ISSN 1549-960X (Electronic) 1549-9596 (Linking). doi: 10.1021/ci300604z.
  • Kontoyianni (2017) Maria Kontoyianni. Docking and Virtual Screening in Drug Discovery, pp. 255–266. Springer New York, New York, NY, 2017. ISBN 978-1-4939-7201-2. doi: 10.1007/978-1-4939-7201-2_18. URL https://doi.org/10.1007/978-1-4939-7201-2_18.
  • Liu et al. (2017) Zhihai Liu, Minyi Su, Li Han, Jie Liu, Qifan Yang, Yan Li, and Renxiao Wang. Forging the basis for developing protein–ligand interaction scoring functions. Accounts of chemical research, 50(2):302–309, 2017.
  • Loshchilov & Hutter (2016) Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. CoRR, abs/1608.03983, 2016. URL http://arxiv.org/abs/1608.03983.
  • Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Luo et al. (2021a) Shitong Luo, Jiaqi Guan, Jianzhu Ma, and Jian Peng. A 3d generative model for structure-based drug design. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021a.
  • Luo et al. (2021b) Shitong Luo, Chence Shi, Minkai Xu, and Jian Tang. Predicting molecular conformation via dynamic graph score matching. Advances in Neural Information Processing Systems, 34, 2021b.
  • Mansimov et al. (2019) Elman Mansimov, Omar Mahmood, Seokho Kang, and Kyunghyun Cho. Molecular geometry prediction using a deep generative graph neural network. Scientific Reports, 9(1):20381, Dec 2019. ISSN 2045-2322. doi: 10.1038/s41598-019-56773-5. URL https://doi.org/10.1038/s41598-019-56773-5.
  • Meli & Biggin (2020) Rocco Meli and Philip C. Biggin. spyrmsd: Symmetry-corrected RMSD calculations in Python. Journal of Cheminformatics, 12(1):49, Aug 2020. ISSN 1758-2946. doi: 10.1186/s13321-020-00455-2. URL https://doi.org/10.1186/s13321-020-00455-2.
  • Parr (1980) Robert G. Parr. Density functional theory of atoms and molecules. In Kenichi Fukui and Bernard Pullman (eds.), Horizons of Quantum Chemistry, pp.  5–15, Dordrecht, 1980. Springer Netherlands. ISBN 978-94-009-9027-2.
  • Rappe et al. (1992) A. K. Rappe, C. J. Casewit, K. S. Colwell, W. A. Goddard, and W. M. Skiff. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. Journal of the American Chemical Society, 114(25):10024–10035, 1992. doi: 10.1021/ja00051a040.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.
  • Roy et al. (2015) Kunal Roy, Supratik Kar, and Rudra Narayan Das. Chapter 10 - other related techniques. In Kunal Roy, Supratik Kar, and Rudra Narayan Das (eds.), Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment, pp.  357–425. Academic Press, Boston, 2015. ISBN 978-0-12-801505-6. doi: https://doi.org/10.1016/B978-0-12-801505-6.00010-7. URL https://www.sciencedirect.com/science/article/pii/B9780128015056000107.
  • Shi et al. (2020) Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. GraphAF: A flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1esMkHYPr.
  • Shi et al. (2021) Chence Shi, Shitong Luo, Minkai Xu, and Jian Tang. Learning gradient fields for molecular conformation generation. In International Conference on Machine Learning, 2021.
  • Simm & Hernández-Lobato (2020) Gregor N. C. Simm and José Miguel Hernández-Lobato. A generative model for molecular distance geometry. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  8949–8958. PMLR, 13–18 Jul 2020. URL http://proceedings.mlr.press/v119/simm20a.html.
  • Smith et al. (2020) Daniel GA Smith, Lori A Burns, Andrew C Simmonett, Robert M Parrish, Matthew C Schieber, Raimondas Galvelis, Peter Kraus, Holger Kruse, Roberto Di Remigio, Asem Alenaizan, et al. Psi4 1.4: Open-source software for high-throughput quantum chemistry. The Journal of chemical physics, 152(18):184108, 2020.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, volume 28, pp.  3483–3491, 2015.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Villar et al. (2021) Soledad Villar, David W Hogg, Kate Storey-Fisher, Weichi Yao, and Ben Blum-Smith. Scalars are universal: Equivariant machine learning, structured like classical physics. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=NqYtJMX9g2t.
  • Winter et al. (2021) Robin Winter, Frank Noé, and Djork-Arné Clevert. Auto-encoding molecular conformations. arXiv preprint arXiv:2101.01618, 2021.
  • Xia et al. (2017) Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Deliberation networks: Sequence generation beyond one-pass decoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/c6036a69be21cb660499b75718a3ef24-Paper.pdf.
  • Xu et al. (2021a) Minkai Xu, Shitong Luo, Yoshua Bengio, Jian Peng, and Jian Tang. Learning neural generative dynamics for molecular conformation generation. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=pAbm1qfheGk.
  • Xu et al. (2021b) Minkai Xu, Wujie Wang, Shitong Luo, Chence Shi, Yoshua Bengio, Rafael Gomez-Bombarelli, and Jian Tang. An end-to-end framework for molecular conformation generation via bilevel programming. arXiv preprint arXiv:2105.07246, 2021b.
  • Xu et al. (2022) Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. GeoDiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=PzcvxEMzvQC.

Appendix A Procedure descriptions

A.1 Details of other model components

The model architectures of φ2D and φ3D are similar to that of φdec, with the following differences.

Comparing φ2D with φdec, the differences lie in the initial conformation R̂(0) and the initial features (i.e., HV(0), HE(0) and U(0)). φ2D takes as input a random conformation sampled from the uniform distribution on [-1, 1]. The initial atom and edge features are the embeddings of the atoms and edges, respectively. φ2D also outputs a prediction of the conformation. Note that the random variable z sampled from the Gaussian 𝒩(μR,G, ΣR,G) is not used in φ2D.

Comparing φ3D with φdec, the differences likewise lie in the initial conformation R̂(0) and the initial features (i.e., HV(0), HE(0) and U(0)). φ3D takes the groundtruth conformation as input, and the initial atom and edge features are again the embeddings of the atoms and edges, respectively. A further difference is that the fourth step of φdec, i.e., Eqn.(7), is not used.
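To make these differences concrete, here is a minimal sketch of the initial conformations fed to φ2D and φ3D; the function name and tensor shapes are illustrative, not from the released code:

```python
import torch

def initial_conformation(num_atoms, component, R_gt=None):
    """Initial R^(0) for the two encoders (a sketch).

    phi_2D starts from a random conformation drawn uniformly from [-1, 1];
    phi_3D starts from the groundtruth conformation.
    """
    if component == "phi_2D":
        return torch.empty(num_atoms, 3).uniform_(-1.0, 1.0)
    if component == "phi_3D":
        return R_gt  # groundtruth coordinates, shape (num_atoms, 3)
    raise ValueError(component)
```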

A.2 More details about training

We use the AdamW optimizer (Loshchilov & Hutter, 2019) with initial learning rate η0 = 2×10⁻⁴ and weight decay 0.01. In the first 4000 iterations, the learning rate is linearly increased from 10⁻⁶ to 2×10⁻⁴. After that, we use a cosine learning rate scheduler (Loshchilov & Hutter, 2016), in which the learning rate at the t-th iteration is η0(1 + cos(πt/T))/2, where T is half of the period (i.e., the number of iterations in 10 epochs in our setting). Similarly, we use a cosine scheduler to dynamically set β within the range [0.0001, βmax]. The batch size is fixed at 128, and all models are trained for 100 epochs. For the two small-scale settings, the experiments are conducted on a single V100 GPU; for the two large-scale settings, we use two V100 GPUs. The λ in Eqn.(8) is 0.1 for large-scale QM9 and 0.2 for the remaining settings. The detailed hyper-parameters are described in Table 5. We grid search over these hyper-parameters and select the final configuration by validation performance, i.e., the setting with the best coverage (COV) and matching (MAT) scores on the hold-out validation set.
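For concreteness, a minimal sketch of this learning-rate schedule follows; whether the cosine phase counts iterations from the end of warmup, and the placeholder value of T, are assumptions of the sketch:

```python
import math

def lr_at(t, eta0=2e-4, warmup=4000, T=50000):
    """Learning rate at iteration t: linear warmup, then cosine decay.

    T is half the cosine period (the number of iterations in 10 epochs
    in our setting; 50000 here is only a placeholder).
    """
    if t < warmup:
        # linear increase from 1e-6 to eta0
        return 1e-6 + (eta0 - 1e-6) * t / warmup
    # eta0 * (1 + cos(pi * t / T)) / 2, with t counted after warmup
    return eta0 * (1 + math.cos(math.pi * (t - warmup) / T)) / 2
```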

Table 5: Hyper-parameters for our experiments.
Small-Scale Large-Scale
Layer number 6 6
Dropout {0.1, 0.2} {0.1, 0.2}
Learning rate {1e-4, 2e-4, 5e-4} {1e-4, 2e-4, 5e-4}
Batch size 128 128
Epoch 100 100
β Min 0.0001 0.001
β Max {0.001, 0.002, 0.004, 0.008, 0.01} {0.005, 0.01, 0.02, 0.04, 0.05}
Latent size 256 256
Hidden dimension 1024 1024
GPU number 1×\times NVIDIA V100 2×\times NVIDIA V100

A.3 More details about molecular docking

For RDKit, we generated one initial conformation as input and set num_modes to 50 when performing docking (when using different random seeds, the conformations output by RDKit are not diverse enough, so we only use one here). For our method, ConfGF, GeoDiff and GeoMol, since the generated conformations are independent and diverse, we randomly selected five of them, performed five independent molecular docking calculations, and set num_modes to 10, so that every method produces an equal number of conformations. In the end, each method yields about 50 binding poses, and the pose with the lowest binding affinity is selected as the final docked pose.
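This selection protocol can be summarized by the sketch below, where dock is a hypothetical wrapper around the docking program (e.g., smina) that returns (pose, affinity) pairs, and receptor and conformations are assumed given:

```python
def best_docked_pose(receptor, conformations, dock):
    """Dock 5 independently generated conformations with num_modes=10 each
    (about 50 poses in total) and keep the pose with the lowest affinity."""
    best_pose, best_affinity = None, float("inf")
    for conf in conformations[:5]:
        for pose, affinity in dock(receptor, conf, num_modes=10):
            if affinity < best_affinity:
                best_pose, best_affinity = pose, affinity
    return best_pose, best_affinity
```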

Appendix B More experimental results

B.1 Precision-based results

The precision-based coverage and matching scores, computed over the set 𝕊g of generated conformations and the set 𝕊r of reference conformations, are defined as follows:

\operatorname{COV-P}(\mathbb{S}_{g},\mathbb{S}_{r})=\frac{1}{|\mathbb{S}_{g}|}\left|\left\{\hat{R}\in\mathbb{S}_{g}\mid\operatorname{RMSD}(R,\hat{R})<\delta,\ \exists R\in\mathbb{S}_{r}\right\}\right|; \qquad (10)
\operatorname{MAT-P}(\mathbb{S}_{g},\mathbb{S}_{r})=\frac{1}{|\mathbb{S}_{g}|}\sum_{\hat{R}\in\mathbb{S}_{g}}\min_{R\in\mathbb{S}_{r}}\operatorname{RMSD}(R,\hat{R}).
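Both scores depend only on each generated conformation's best-matching reference, so they can be computed from a pairwise RMSD matrix, as in the following minimal sketch (the RMSD matrix itself, e.g., obtained with a symmetry-corrected tool such as spyrmsd, is assumed precomputed):

```python
import numpy as np

def precision_scores(rmsd, delta):
    """rmsd[i, j]: RMSD between generated conformation i and reference j."""
    best = rmsd.min(axis=1)            # closest reference per generated conf
    cov_p = (best < delta).mean()      # COV-P: fraction matched below delta
    mat_p = best.mean()                # MAT-P: average best-match RMSD
    return cov_p, mat_p
```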

The results are in Table 6. The results of GraphDG, CGCF, ConfVAE, ConfGF and GeoDiff are taken from Xu et al. (2022). Our method remains the best in every setting.

Table 6: Precision-based coverage and matching scores. DMCG achieves the best result in every column.
Dataset Small-scale QM9 Small-scale Drugs
COV-P(%)↑ MAT-P(Å)↓ COV-P(%)↑ MAT-P(Å)↓
Methods Mean Median Mean Median Mean Median Mean Median
GraphDG 43.90 35.33 0.5809 0.5823 2.08 0.00 2.4340 2.4100
CGCF 36.49 33.57 0.6615 0.6427 21.68 13.72 1.8571 1.8066
ConfVAE 38.02 34.67 0.6215 0.6091 22.96 14.05 1.8287 1.8159
ConfGF 49.02 46.69 0.5111 0.4979 23.15 15.73 1.7304 1.7106
GeoDiff 52.79 50.29 0.4448 0.4267 61.47 64.55 1.1712 1.1232
GeoMol 84.98 89.90 0.3292 0.3269 75.54 94.13 1.0028 0.9082
DMCG 87.26 91.00 0.2872 0.2926 81.05 95.51 0.9210 0.8785
Dataset Large-scale QM9 Large-scale Drugs
COV-P(%)↑ MAT-P(Å)↓ COV-P(%)↑ MAT-P(Å)↓
Methods Mean Median Mean Median Mean Median Mean Median
ConfGF 46.23 44.87 0.5171 0.5133 28.23 20.71 1.6317 1.6155
GeoMol 78.28 81.03 0.3790 0.3861 41.46 36.79 1.5120 1.5107
DMCG 90.86 95.36 0.2305 0.2258 74.57 81.80 0.9940 0.9454

B.2 Combination with distance-based and angle-based loss functions

One may wonder whether adding distance-based and angle-based losses can further improve performance, since distances and angles are invariant to roto-translations of the coordinates. For ease of reference, let Ri denote the groundtruth coordinate of atom vi and R̂i the predicted coordinate of atom vi. Recall from Section 1 that E denotes the collection of all bonds. We define E2 = {(i, j, k) | (i, j) ∈ E, (i, k) ∈ E, k ≠ j}.

Inspired by Winter et al. (2021) and Ganea et al. (2021), we use the following two loss functions:

\ell_{\rm angle}=\frac{1}{|E_{2}|}\sum_{(i,j,k)\in E_{2}}\left(\operatorname{cosine}(R_{j}-R_{i},R_{k}-R_{i})-\operatorname{cosine}(\hat{R}_{j}-\hat{R}_{i},\hat{R}_{k}-\hat{R}_{i})\right)^{2}, \qquad (11)
\ell_{\rm bond}=\frac{1}{|E|}\sum_{(i,j)\in E}\left(\operatorname{distance}(R_{j},R_{i})-\operatorname{distance}(\hat{R}_{j},\hat{R}_{i})\right)^{2}, \qquad (12)

where cosine(a, b) = a⊤b/(‖a‖‖b‖) and distance(a, b) = ‖a − b‖ for vectors a and b. That is, we apply additional constraints to bond lengths and bond angles. Note that with these two auxiliary loss functions, our method still generates coordinates directly and does not need to produce intermediate distances or angles.
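A minimal sketch of these two auxiliary terms, assuming the index tensors E (bonded pairs) and E2 (angle triples) have been precomputed from the molecular graph:

```python
import torch

def aux_losses(R, R_hat, E, E2):
    """R, R_hat: (N, 3) groundtruth / predicted coordinates.
    E: (|E|, 2) long tensor of bonded pairs; E2: (|E2|, 3) angle triples."""
    i, j = E[:, 0], E[:, 1]
    d = (R[j] - R[i]).norm(dim=-1)
    d_hat = (R_hat[j] - R_hat[i]).norm(dim=-1)
    l_bond = ((d - d_hat) ** 2).mean()                 # Eqn. (12)

    i, j, k = E2[:, 0], E2[:, 1], E2[:, 2]
    cos = torch.cosine_similarity(R[j] - R[i], R[k] - R[i], dim=-1)
    cos_hat = torch.cosine_similarity(
        R_hat[j] - R_hat[i], R_hat[k] - R_hat[i], dim=-1)
    l_angle = ((cos - cos_hat) ** 2).mean()            # Eqn. (11)
    return l_angle, l_bond
```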

We verify the following three loss functions:

\mathcal{L}_{1}=\mathbb{E}_{\epsilon\sim\mathcal{N}(0,{\bm{I}})}\,\ell_{\rm RT}(R,\hat{R}(\mu_{R,G}+\Sigma_{R,G}\epsilon,G))+\beta D_{\rm KL}(\mathcal{N}(\mu_{R,G},\Sigma_{R,G})\,\|\,\mathcal{N}(0,{\bm{I}})), \qquad (13)
\mathcal{L}_{2}=\mathbb{E}_{\epsilon\sim\mathcal{N}(0,{\bm{I}})}\,\ell_{\rm RT}(R,\hat{R}(\mu_{R,G}+\Sigma_{R,G}\epsilon,G))+\beta D_{\rm KL}(\mathcal{N}(\mu_{R,G},\Sigma_{R,G})\,\|\,\mathcal{N}(0,{\bm{I}}))+\lambda_{1}(\ell_{\rm angle}+\ell_{\rm bond}), \qquad (14)
\mathcal{L}_{3}=\mathbb{E}_{\epsilon\sim\mathcal{N}(0,{\bm{I}})}\,\ell_{\rm RTP}(R,\hat{R}(\mu_{R,G}+\Sigma_{R,G}\epsilon,G))+\beta D_{\rm KL}(\mathcal{N}(\mu_{R,G},\Sigma_{R,G})\,\|\,\mathcal{N}(0,{\bm{I}}))+\lambda_{1}(\ell_{\rm angle}+\ell_{\rm bond}), \qquad (15)

where λ1 = 0.1. Note that Eqn.(13) and Eqn.(14) use only the roto-translation invariant loss, without the permutation invariant loss on symmetric atoms. We conduct experiments on GEOM-Drugs (small-scale setting); the results are reported in Table 7.
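In code, the three objectives differ only in the reconstruction term and in whether the auxiliary terms are added; a minimal sketch, assuming each term has already been computed for one sampled ε:

```python
def objectives(l_rt, l_rtp, kl, l_angle, l_bond, beta, lam1=0.1):
    L1 = l_rt + beta * kl                               # Eqn. (13)
    L2 = l_rt + beta * kl + lam1 * (l_angle + l_bond)   # Eqn. (14)
    L3 = l_rtp + beta * kl + lam1 * (l_angle + l_bond)  # Eqn. (15)
    return L1, L2, L3
```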

Table 7: Results of combining with constraints on bond lengths and bond angles.
Methods COV(%)↑ MAT (Å)↓
Mean Median Mean Median
DMCG 96.52 100.00 0.7220 0.7161
ℒ1 77.78 86.09 1.0657 1.0563
ℒ2 92.45 98.70 0.8983 0.9016
ℒ3 96.01 100.00 0.7235 0.7199

We have the following observations:

(1) Comparing ℒ1 with our method, we see that the permutation invariant loss on symmetric atoms is important: without it, the results drop significantly. (2) Comparing ℒ2 with our method, we see that when the permutation invariant loss is not used, additional constraints on bond lengths and bond angles help improve performance. (3) When both the permutation invariant loss and the roto-translation invariant loss are used, adding ℓbond and ℓangle brings no further significant improvement. These results demonstrate that, for molecular conformation generation, it is important to account for the permutation of symmetric atoms.

To illustrate the impact of the permutation invariant loss, we show two examples in Figure 8. In both examples, a rotatable ring sits at the end of the molecule and is symmetric about the bond connecting it to the rest of the molecule. Without the permutation invariant loss (the row "No ℓP"), our method fails to generate the coordinates of such rings and simply places their atoms along a line, because the model gets trapped in a local optimum. With the permutation invariant loss, the conformations of these rings are successfully recovered (the row "DMCG"). This shows the importance of the proposed permutation invariant loss ℓP.

Figure 8: Illustration of the impact of the permutation invariant loss. "No ℓP" means trained without the permutation invariant loss.

B.3 Gradient back-propagation through the roto-translation operation?

As introduced in Section 2.1, the optimal roto-translation operation ρ* can be obtained by computing the eigenvalues and eigenvectors of a matrix, which we implement with torch.linalg.eig. However, the official documentation warns: "Gradients computed using the eigenvectors tensor will only be finite when A has distinct eigenvalues. Furthermore, if the distance between any two eigenvalues is close to zero, the gradient will be numerically unstable, as it depends on the eigenvalues λi through the computation of 1 / min_{i≠j}(λi − λj)". Therefore, we disable the gradients through ρ* for stability.
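The following sketch illustrates cutting the gradient through the alignment. For brevity it uses the equivalent SVD-based Kabsch solution in place of the eigendecomposition, and handles a single pair of conformations:

```python
import torch

def aligned_mse(R, R_hat):
    """MSE between R and R_hat after optimally roto-translating R.

    The rotation is computed inside torch.no_grad(), so no gradients
    flow through rho* (mirroring how we disable them for stability).
    """
    P = R - R.mean(dim=0, keepdim=True)          # center groundtruth
    Q = R_hat - R_hat.mean(dim=0, keepdim=True)  # center prediction
    with torch.no_grad():
        U, _, Vt = torch.linalg.svd(P.T @ Q)     # Kabsch via SVD
        D = torch.eye(3)
        D[2, 2] = torch.sign(torch.det(Vt.T @ U.T))  # exclude reflections
        Rot = Vt.T @ D @ U.T
    return ((P @ Rot.T - Q) ** 2).mean()
```

Since Rot is a constant with respect to autograd, gradients reach the predicted coordinates only through Q, which is exactly the behavior we want.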

We compare performance with the gradients enabled and disabled. The results are in Table 8. Overall, enabling the gradients slightly hurts performance (especially MAT on GEOM-Drugs) and increases computation time, so we recommend disabling the gradients through ρ*.

Table 8: Results with/without gradient back-propagation through ρ*.
Small-scale QM9 Small-scale Drugs
COV(%)↑ MAT (Å)↓ COV(%)↑ MAT (Å)↓
Methods Mean Median Mean Median Mean Median Mean Median
with gradient 96.14 99.43 0.2090 0.2062 96.08 100.00 0.7345 0.7247
without gradient 96.23 99.26 0.2083 0.2014 96.52 100.00 0.7220 0.7161

B.4 Study of model parameters

In this section, we compare the performance of our method and ConfGF under matched model sizes. The ConfGF model has 0.81M parameters; we reduce the number of parameters of our model to 0.98M. The results are shown in Table 9.

Table 9: Comparison of our method and ConfGF with different model sizes
Dataset GEOM-QM9 GEOM-Drugs
Methods COV(%)↑ MAT (Å)↓ COV(%)↑ MAT (Å)↓
Mean Median Mean Median Mean Median Mean Median
ConfGF (0.81M) 88.49 94.13 0.2673 0.2685 62.15 70.93 1.1629 1.1596
DMCG (0.98M) 94.28 98.20 0.2399 0.2361 89.40 97.06 0.8653 0.8670
DMCG (normal) 96.23 99.26 0.2083 0.2014 96.52 100.00 0.7220 0.7161

Reducing the number of parameters degrades the performance of our method, but the smaller model still significantly outperforms ConfGF.

We also study the results of our method w.r.t. the representation dimension d (see Section 2.2) and the hidden dimension of the MLP layer (denoted dMLP). The results are reported in Table 10. We can see that our model benefits from more parameters.

Table 10: Results on small-scale GEOM-Drugs with different hidden dimensions (d, dMLP).
(d, dMLP) COV(%)↑ MAT (Å)↓
Mean Median Mean Median
(128, 512) 95.54 100.00 0.7641 0.7584
(128, 1024) 95.84 100.00 0.7514 0.7432
(256, 512) 95.88 100.00 0.7378 0.7387
(256, 1024) 96.52 100.00 0.7220 0.7161

B.5 More discussion on conformations with more heavy atoms

In Table 1, we observe that our method outperforms distance-based methods (including those that model distances directly and those that model the gradients of distances) on molecules with more heavy atoms. Our conjecture is as follows. Distance-based works usually extend the molecular graph with first-, second- and third-order neighbors, which in principle is sufficient to determine the 3D structure. On the GEOM-QM9 dataset, where molecules have fewer than 10 heavy atoms, this extended graph is nearly a complete graph and provides enough signal to reconstruct the 3D structure; hence distance-based methods perform well there. On GEOM-Drugs, molecules contain many more atoms than in GEOM-QM9. Although in theory the distances in a third-order extended graph can still determine the 3D structure, in practice the signal is insufficient. Our method does not rely on interatomic distances and therefore achieves good results on large molecules.

To verify our conjecture, on GEOM-Drugs we categorize the molecules by their number of heavy atoms; the number of heavy atoms in the i-th group lies in [10i+1, 10(i+1)]. We choose one of the five independently trained DMCG models for this analysis and compare our method against ConfGF (the code of DGSM is not available) and GraphDG. The results are in Table 11. We observe again that our method brings larger improvements over previous methods on larger molecules.

Table 11: COV and MAT mean scores w.r.t. the number of heavy atoms on small-scale GEOM-Drugs. Group i contains molecules whose heavy-atom count lies in [10i+1, 10(i+1)], i ∈ {1, 2, 3}.
Metric COV(%)↑ MAT (Å)↓
i=1 i=2 i=3 average i=1 i=2 i=3 average
ConfGF 99.95 66.28 15.34 62.54 0.7764 1.1510 1.5345 1.1637
GraphDG 15.11 1.78 0.00 3.12 2.0578 2.5863 2.9849 2.5847
DMCG 100.00 97.62 90.04 96.69 0.5305 0.7190 0.8794 0.7223

B.6 More results about property prediction

Table 12: Median absolute error of predicted ensemble properties. (Unit: eV).
Methods E¯ Emin Δϵ¯ Δϵmin Δϵmax
RDKit 0.8721 0.6119 0.3057 0.4414 0.1830
GraphDG 13.1707 1.9221 3.4136 7.6845 1.1663
ConfGF 1.5167 0.1972 0.6588 4.8920 0.1686
DMCG 0.4132 0.1100 0.1276 0.8486 0.1288

The median absolute errors of the property predictions are shown in Table 12. Our method still outperforms all deep-learning-based methods, which demonstrates its effectiveness.

Figure 9: The coverage score w.r.t. different thresholds δ. (a) GEOM-QM9; (b) GEOM-Drugs.

B.7 Results with different training sizes

To investigate whether our method relies on a large dataset, we subsample the training data of large-scale GEOM-QM9 and GEOM-Drugs to 10%, 25%, 50%, and 75%. The validation and test sets remain unchanged. The results are in Table 13.

Table 13: Results with different training sizes. The baseline methods are trained on the full dataset; DMCG is trained on 10%, 25%, 50%, 75%, and the full dataset.
Dataset GEOM-QM9 GEOM-Drugs
Methods COV(%)↑ MAT (Å)↓ COV(%)↑ MAT (Å)↓
Mean Median Mean Median Mean Median Mean Median
ConfGF 89.21 95.12 0.2809 0.2837 70.92 85.71 1.0940 1.0917
GeoMol 91.05 95.55 0.2970 0.2993 69.74 83.56 1.1110 1.0864
DMCG (10%) 91.48 97.45 0.2748 0.2692 94.47 100.00 0.7934 0.7624
DMCG (25%) 97.65 100.00 0.1858 0.1676 95.17 100.00 0.7475 0.7092
DMCG (50%) 98.23 100.00 0.1606 0.1457 96.38 100.00 0.7057 0.6771
DMCG (75%) 98.31 100.00 0.1544 0.1384 96.14 100.00 0.6947 0.6562
DMCG (100%) 98.34 100.00 0.1486 0.1340 96.22 100.00 0.6967 0.6552

We can see that:

  1. Generally, DMCG benefits from more training data.
  2. With only 10% of the training data, our method already outperforms the baselines ConfGF and GeoMol, which are trained on the full dataset.

B.8 Adding more blocks

In this section, we study whether adding more blocks is helpful. We increase the number of blocks from 6 to 12. The results are in Table 14. For GEOM-QM9, we do not observe any performance improvement from increasing the number of blocks. For GEOM-Drugs, increasing the number of blocks further improves performance. Our conjecture is that the molecules in GEOM-Drugs are more complex than those in GEOM-QM9 and therefore benefit more from larger models.

Table 14: Results with different numbers of blocks.
Dataset GEOM-QM9 GEOM-Drugs
# blocks COV(%)↑ MAT (Å)↓ COV(%)↑ MAT (Å)↓
Mean Median Mean Median Mean Median Mean Median
6 96.23 99.26 0.2083 0.2014 96.52 100.00 0.7220 0.7161
8 95.55 98.91 0.2217 0.2190 96.77 100.00 0.7122 0.7093
10 95.71 99.52 0.2215 0.2212 97.29 100.00 0.7089 0.7079
12 94.65 99.11 0.2280 0.2284 97.11 100.00 0.7092 0.6996

B.9 Iterative refinement vs. recursive refinement

Currently, the parameters of the blocks in the decoder (i.e., φdec) are not shared. Another option is a recursive model, in which the parameters of the different decoder blocks are shared. The results are in Table 15; a sketch of the two options follows the table. We can see that the recursive model hurts performance.

Table 15: Results with a recursive decoder.
Dataset GEOM-QM9 GEOM-Drugs
Methods COV(%)↑ MAT (Å)↓ COV(%)↑ MAT (Å)↓
Mean Median Mean Median Mean Median Mean Median
DMCG 96.23 99.26 0.2083 0.2014 96.52 100.00 0.7220 0.7161
DMCG with recursive decoder 94.03 98.00 0.2726 0.2771 95.20 100.00 0.7883 0.7862
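The two variants differ only in how the decoder blocks are constructed, as in this minimal sketch (block stands in for the actual decoder block module, which is assumed given):

```python
import copy
import torch.nn as nn

def make_decoder_blocks(block: nn.Module, num_blocks: int,
                        recursive: bool) -> nn.ModuleList:
    """Iterative refinement: independent blocks. Recursive: one shared block."""
    if recursive:
        # the same module object repeated, so all blocks share parameters
        return nn.ModuleList([block] * num_blocks)
    # independent copies, so each block has its own parameters
    return nn.ModuleList(copy.deepcopy(block) for _ in range(num_blocks))
```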

Appendix C Constraints on distances

Let G be a molecular graph with N atoms (N ≥ 3). Let dij denote the distance between atom i and atom j, and let D be the N×N distance matrix whose (i, j)-th entry is dij.

The triangle inequality means that for any three distinct atoms i, j, k, we have dij + djk ≥ dik.

A valid distance matrix D is induced from the 3N−6 degrees of freedom (DOF) of N 3D coordinates, excluding global translation and rotation, whereas the popular practice of independently generating distances to 2- or 3-hop neighbors (Xu et al., 2021a) often introduces more DOF.

Moreover, a distance matrix, after element-wise squaring, should have rank at most 5 (Dokmanic et al., 2015); in other words, the rank of the matrix D̃ = {d²ij}i,j is at most 5. Such a constraint is hard to guarantee even if the DOF is matched (Simm & Hernández-Lobato, 2020) (e.g., λI has one DOF but is almost surely full-rank). It also makes gradients ill-defined (Shi et al., 2021), because the other distances cannot all be held constant while making an infinitesimal change to a specific dij. Careful treatments (Hoffmann & Noé, 2019) often increase the order of computational complexity.
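As a quick numerical illustration of this rank constraint:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))                      # 20 random 3D points
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # element-wise squared distances
print(np.linalg.matrix_rank(D2))                      # 5 (almost surely), despite D2 being 20x20
```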