
MolGrow: A Graph Normalizing Flow for Hierarchical Molecular Generation

Maksim Kuznetsov, Daniil Polykovskiy
Abstract

We propose a hierarchical normalizing flow model for generating molecular graphs. The model produces new molecular structures from a single-node graph by recursively splitting every node into two. All operations are invertible and can be used as plug-and-play modules. The hierarchical nature of the latent codes allows for precise changes in the resulting graph: perturbations in the top layer cause global structural changes, while perturbations in subsequent layers change the resulting molecule only marginally. The proposed model outperforms existing generative graph models on the distribution learning task. We also show successful experiments on global and constrained optimization of chemical properties using latent codes of the model.

Introduction

Drug discovery is a challenging multidisciplinary task that combines domain knowledge in chemistry, biology, and computational science. Recent works demonstrated successful applications of machine learning to the drug development process, including synthesis planning (Segler, Preuss, and Waller 2018), protein folding (Senior et al. 2020), and hit discovery (Merk et al. 2018; Zhavoronkov et al. 2019; Kadurin et al. 2016). Advances in generative models enabled applications of machine learning to drug discovery, such as distribution learning and molecular property optimization. Distribution learning models train on a large dataset to produce novel compounds (Polykovskiy et al. 2020); property optimization models search the chemical space for molecules with desirable properties (Brown et al. 2019). Often researchers combine these tasks: they first train a distribution learning model and then use its latent codes to optimize molecular properties (Gómez-Bombarelli et al. 2018). For such models, proper latent codes are crucial for molecular space navigation.

We propose a new graph generative model—MolGrow. Starting with a single node, it iteratively splits every node into two. Our model is invertible and maps molecular structures onto a fixed-size hierarchical manifold. Top levels of the manifold define global structure, while the bottom levels influence local features.

Our contributions are three-fold:

  • We propose a hierarchical normalizing flow model for generating molecular graphs. The model gradually increases graph size during sampling, starting with a single node;

  • We propose a fragment-oriented atom ordering that improves our model over commonly used breadth-first search ordering;

  • We apply our model to distribution learning and property optimization tasks. We report distribution learning metrics (Fréchet ChemNet distance and fragment distribution) for graph generative models besides providing standard uniqueness and validity measures.


Figure 1: MolGrow architecture. Left: the full architecture combines multiple levels to generate latent codes $z^{L},\dots,z^{0}$ from a graph $(V,E)$ and vice versa. Middle: each level separates noise, merges node pairs, and applies multiple blocks and linear transformations. Right: each block applies three channel-wise transformations and two RealNVP layers.

Background: Normalizing Flows

Normalizing flows are generative models that transform a prior distribution $p(z)$ into a target distribution $p(x)$ by composing invertible functions $f_{k}$:

$z=f_{K}\circ\dots\circ f_{2}\circ f_{1}(x),$   (1)
$x=f^{-1}_{1}\circ\dots\circ f^{-1}_{K-1}\circ f^{-1}_{K}(z).$   (2)

We call Equation 1 a forward path, and Equation 2 an inverse path. The prior distribution $p(z)$ is often a standard multivariate normal distribution $\mathcal{N}(0,I)$. Such models are trained by maximizing training set log-likelihood using the change of variables formula:

$\log p(x)=\log p(z)+\sum\limits_{i=1}^{K}\log\left|\det\left(\frac{dh_{i}}{dh_{i-1}}\right)\right|,$   (3)

where $h_{i}=f_{i}(h_{i-1})$, $h_{0}=x$. To efficiently train the model and sample from it, inverse transformations and Jacobian determinants should be tractable and computationally efficient. In this work, we consider three types of layers: invertible linear layer, actnorm, and real-valued non-volume preserving transformation (RealNVP) (Dinh, Sohl-Dickstein, and Bengio 2017). We define these layers below for arbitrary $d$-dimensional vectors, and extend these layers for graph-structured data in the next section.
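To make the training objective concrete, a minimal sketch (ours, not the authors' released code) of how Equation 3 is typically accumulated is shown below; it assumes each invertible layer returns its output together with the log-determinant of its Jacobian.

```python
import torch

def flow_log_likelihood(x, layers, prior):
    """Evaluate Eq. 3 for a batch x by chaining invertible layers (a sketch; `layers`
    is assumed to be a list of modules whose forward returns (output, log_det))."""
    h = x
    total_log_det = torch.zeros(x.shape[0])
    for layer in layers:
        h, log_det = layer(h)                  # forward path: h_i = f_i(h_{i-1})
        total_log_det = total_log_det + log_det
    # prior log-density of z plus the accumulated log-determinants
    return prior.log_prob(h).sum(dim=-1) + total_log_det

# usage sketch: prior = torch.distributions.Normal(0.0, 1.0)
#               loss = -flow_log_likelihood(x, layers, prior).mean()
```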

We consider an invertible linear layer parameterization by Hoogeboom, Van Den Berg, and Welling (2019) that uses QR decomposition of a weight matrix: $h=QR\cdot z$, where $Q$ is an orthogonal matrix ($Q^{T}=Q^{-1}$), and $R$ is an upper triangular matrix with ones on the main diagonal. We use Householder reflections to parameterize $Q$:

$Q=\prod_{i=1}^{d^{\prime}}\left(I-2\frac{v_{i}v_{i}^{T}}{\|v_{i}\|^{2}}\right),$   (4)

where $v_{i}$ are learnable column vectors. The Jacobian determinant of this layer is $1$. An alternative formulation of the linear layer uses an LU decomposition (Kingma and Dhariwal 2018); however, in our experiments, the QR decomposition was more numerically stable.
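Below is a minimal sketch of such a layer (our illustration, not the authors' implementation); the number of reflections and the initialization of $R$ are assumptions loosely based on the appendix.

```python
import torch
import torch.nn as nn

class HouseholderLinear(nn.Module):
    """Invertible linear layer h = Q R z with Q a product of Householder reflections
    (Eq. 4) and R unit upper-triangular; the log-determinant is zero."""
    def __init__(self, dim, n_reflections=8):
        super().__init__()
        self.v = nn.Parameter(torch.randn(n_reflections, dim))        # reflection vectors v_i
        self.r_upper = nn.Parameter(0.05 * torch.randn(dim, dim))
        self.register_buffer("mask", torch.triu(torch.ones(dim, dim), diagonal=1))

    def weight(self):
        dim = self.v.shape[1]
        Q = torch.eye(dim)
        for v in self.v:                                               # I - 2 v v^T / ||v||^2
            Q = Q @ (torch.eye(dim) - 2.0 * torch.outer(v, v) / v.dot(v))
        R = torch.eye(dim) + self.r_upper * self.mask                  # ones on the diagonal
        return Q @ R

    def forward(self, z):                                              # z: (batch, dim)
        return z @ self.weight().T, torch.zeros(z.shape[0])            # log|det| = 0

    def inverse(self, h):
        return torch.linalg.solve(self.weight(), h.T).T
```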

Actnorm layer (Kingma and Dhariwal 2018) is a linear layer with a diagonal weight matrix: $h=s\odot z+m$, where $\odot$ is element-wise multiplication. Vectors $s$ and $m$ are initialized so that the output activations from this layer have zero mean and unit variance at the beginning of training. We use the first training batch for initialization. The Jacobian determinant of this layer is $\prod_{i=1}^{d}s_{i}$.
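The data-dependent initialization can be sketched as follows (our illustration; the exact initialization code of the model is not shown in the paper).

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Actnorm: h = s * z + m, with s and m initialized from the first batch so that
    outputs have zero mean and unit variance; log|det| = sum_i log|s_i|."""
    def __init__(self, dim):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(dim))
        self.m = nn.Parameter(torch.zeros(dim))
        self.initialized = False

    def forward(self, z):                                  # z: (batch, dim)
        if not self.initialized:
            with torch.no_grad():
                self.log_s.data = -torch.log(z.std(dim=0) + 1e-6)
                self.m.data = -z.mean(dim=0) * torch.exp(self.log_s)
            self.initialized = True
        h = torch.exp(self.log_s) * z + self.m
        log_det = self.log_s.sum() * torch.ones(z.shape[0])
        return h, log_det

    def inverse(self, h):
        return (h - self.m) * torch.exp(-self.log_s)
```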

RealNVP layer (Dinh, Sohl-Dickstein, and Bengio 2017) is a nonlinear invertible transformation. Consider a vector $z$ of length $d=2t$ with the first half of the components denoted as $z_{a}$, and the second half as $z_{b}$. Then, RealNVP and its inverse transformations are:

$\begin{pmatrix}h_{a}\\ h_{b}\end{pmatrix}=\begin{pmatrix}z_{b}\\ e^{s_{\theta}(z_{b})}\odot z_{a}+t_{\theta}(z_{b})\end{pmatrix}$   (5)
$\begin{pmatrix}z_{a}\\ z_{b}\end{pmatrix}=\begin{pmatrix}\left(h_{b}-t_{\theta}(h_{a})\right)/e^{s_{\theta}(h_{a})}\\ h_{a}\end{pmatrix}$   (6)

Functions $s_{\theta}$ and $t_{\theta}$ do not have to be invertible and usually take the form of a neural network. The Jacobian determinant of the RealNVP layer is $\prod_{i=1}^{d}e^{s_{\theta,i}(z_{b})}$. We sequentially apply two RealNVP layers to transform both components of $z$. We also use a permutation layer that deterministically shuffles input dimensions before RealNVP; this is equivalent to randomly splitting the data into the $a$ and $b$ parts.
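A compact sketch of one coupling layer (Equations 5 and 6) is shown below; it is our illustration, with the hidden sizes and the sigmoid on $s_{\theta}$ borrowed from the appendix rather than taken from released code.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP coupling (Eqs. 5-6): z_b passes through unchanged while z_a is scaled
    and shifted by networks of z_b; log|det| = sum_i s_{theta,i}(z_b)."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.s = nn.Sequential(nn.Linear(half, 4 * half), nn.ReLU(),
                               nn.Linear(4 * half, half), nn.Sigmoid())
        self.t = nn.Sequential(nn.Linear(half, 4 * half), nn.ReLU(),
                               nn.Linear(4 * half, half))

    def forward(self, z):
        z_a, z_b = z.chunk(2, dim=-1)
        s, t = self.s(z_b), self.t(z_b)
        h = torch.cat([z_b, torch.exp(s) * z_a + t], dim=-1)   # (h_a, h_b) of Eq. 5
        return h, s.sum(dim=-1)

    def inverse(self, h):
        h_a, h_b = h.chunk(2, dim=-1)
        z_a = (h_b - self.t(h_a)) * torch.exp(-self.s(h_a))    # Eq. 6
        return torch.cat([z_a, h_a], dim=-1)
```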

MolGrow (Molecular Graph Flow)

In this section, we present our generative model, MolGrow. MolGrow is a hierarchical normalizing flow model (Figure 1): it produces new molecular graphs from a single-node graph by recursively dividing every node into two. The final graph has $N=2^{L}$ nodes, where $L$ is the number of node-splitting layers in the model. To generate graphs with fewer nodes, we add special padding atoms. We choose $N$ to be large enough to fit any graph from the training dataset.

We represent a graph with a node attribute matrix $V\in\mathbb{R}^{N\times d_{v}}$ and an edge attribute tensor $E\in\mathbb{R}^{N\times N\times d_{e}}$, where $d_{v}$ and $d_{e}$ are feature dimensions. For the input data, $V_{i}$ defines the atom type and charge, and $E_{i,j}$ defines the edge type. Since molecular graphs are undirected, we preserve the symmetry constraint on all intermediate layers: $E_{i,j,k}=E_{j,i,k}$.

We illustrate MolGrow’s architecture in Figure 1. MolGrow consists of $L$ invertible levels; each level has its own latent code with a Gaussian prior. On the forward path, each level extracts the latent code and halves the graph size by merging node pairs. On the inverse path, each level does the opposite: it splits each node into two and injects noise. The final output of the forward path is a single-node graph $z^{0}=(V^{K},E^{K})$ and latent codes from each level: $z^{L},\dots,z^{1}$. We call $z^{0}$ the top-level latent code.

Dequantization

To avoid fitting discrete graphs with a continuous density model, we dequantize the data using uniform noise (Kingma and Dhariwal 2018):

$V^{0}_{i,j}=V_{i,j}+u^{v}_{i,j},$   (7)
$E^{0}_{i,j,k}=E^{0}_{j,i,k}=\begin{cases}E_{i,j,k}+u^{e}_{i,j,k},&i<j\\ 0,&i=j\end{cases}.$   (8)

Elements of $u^{v}$ and $u^{e}$ are independent samples from a uniform distribution $\mathcal{U}[0,c]$. Such dequantization is invertible for $c\in[0,1)$: the original data can be reconstructed by rounding down the elements of $V^{0}_{i,j}$ and $E^{0}_{i,j,k}$. We dequantize the data for each training batch independently and train the model on $(V^{0},E^{0})$. The dequantized graph $(V^{0},E^{0})$ is a complete graph.
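A sketch of this dequantization for a single graph is shown below (our illustration; the particular noise scale `c` is an assumption within the stated range).

```python
import torch

def dequantize(V, E, c=0.9):
    """Add U[0, c) noise to one-hot node and edge tensors (Eqs. 7-8); the diagonal of
    E stays zero and symmetry is preserved. V: (N, dv), E: (N, N, de)."""
    V0 = V + c * torch.rand_like(V)
    noise = c * torch.rand_like(E)
    upper = torch.triu(torch.ones(E.shape[0], E.shape[1]), diagonal=1).unsqueeze(-1)
    E0 = (E + noise) * upper            # noisy strictly-upper triangle, zero diagonal
    E0 = E0 + E0.transpose(0, 1)        # mirror so that E0[i, j] == E0[j, i]
    return V0, E0

def quantize(V0, E0):
    """Recover the original discrete tensors by rounding down."""
    return torch.floor(V0), torch.floor(E0)
```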

Node merging and splitting

We use node merging and splitting operations to control the graph size. These operations are inverses of each other, and both operate by rearranging node and edge features. Consider a graph $(V^{k},E^{k})$ with $N_{k}$ nodes. The node merging operation joins nodes $2i$ and $2i+1$ into a single node by concatenating their features and the features of the edge between them. We also concatenate the edge features connecting the merged nodes:

$\underbrace{V^{k+1}_{i}}_{2d_{v}+d_{e}}=\mathrm{cat}\big(\underbrace{V^{k}_{2i}}_{d_{v}},\underbrace{V^{k}_{2i+1}}_{d_{v}},\underbrace{E^{k}_{2i,2i+1}}_{d_{e}}\big),$   (9)
$\underbrace{E^{k+1}_{i,j}}_{4d_{e}}=\mathrm{cat}\big(\underbrace{E^{k}_{2i,2j}}_{d_{e}},\underbrace{E^{k}_{2i,2j+1}}_{d_{e}},\underbrace{E^{k}_{2i+1,2j}}_{d_{e}},\underbrace{E^{k}_{2i+1,2j+1}}_{d_{e}}\big).$   (10)

Node splitting is the inverse of the node merging layer: it slices the features back into the original components. See an example in Figure 2.
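Since merging only rearranges and concatenates features, it can be written as a few tensor reshapes; the sketch below is our illustration for a single graph with an even number of nodes.

```python
import torch

def merge_nodes(V, E):
    """Node merging (Eqs. 9-10): V is (N, dv), E is (N, N, de) with N even.
    Returns the merged graph with N/2 nodes; no information is lost."""
    N, dv = V.shape
    de = E.shape[-1]
    V_pairs = V.reshape(N // 2, 2 * dv)                                # cat(V_2i, V_2i+1)
    E_within = E[torch.arange(0, N, 2), torch.arange(1, N, 2)]         # E_{2i, 2i+1}
    V_new = torch.cat([V_pairs, E_within], dim=-1)                     # (N/2, 2*dv + de)
    # group edge features by node pairs: index [i, a, j, b] holds E_{2i+a, 2j+b}
    E_new = E.reshape(N // 2, 2, N // 2, 2, de).permute(0, 2, 1, 3, 4)
    E_new = E_new.reshape(N // 2, N // 2, 4 * de)                      # (N/2, N/2, 4*de)
    # node splitting is the inverse: slice V_new and E_new and undo these reshapes
    return V_new, E_new
```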


Figure 2: Node merging and splitting example for a $4$-node graph. We concatenate features of nodes $V_{1}$ and $V_{2}$ and edge $E_{1,2}$ to get new node features. We also concatenate edge features $E_{1,3}$, $E_{1,4}$, $E_{2,3}$, and $E_{2,4}$. Splitting operation slices the merged graph’s node and edge features.

Noise separation and injection

MolGrow produces a latent vector at each level. We derive the latent codes by separating half of the node and edge features before node merging and impose a Gaussian prior on these latent codes. During generation, we sample the latent code from the prior and concatenate it with the node and edge features. As we show in the experiments, latent codes at different levels affect the generated structure differently: latent codes from smaller intermediate graphs (top levels) influence the global structure, while bottom-level features define the local structure.
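A sketch of one level's noise separation and injection is given below (our illustration; splitting along the feature dimension, symmetrizing the sampled edge noise, and the temperature value are assumptions, the last one following the appendix).

```python
import torch

def separate_noise(V, E):
    """Forward path: half of the node and edge features become this level's latent
    code (Gaussian prior); the rest continues on to node merging."""
    dv, de = V.shape[-1] // 2, E.shape[-1] // 2
    z_level = (V[..., dv:], E[..., de:])
    return (V[..., :dv], E[..., :de]), z_level

def inject_noise(V, E, temperature=0.7):
    """Inverse path: sample the level's latent code from the prior and concatenate it
    back onto the node and edge features."""
    zV = temperature * torch.randn_like(V)
    zE = temperature * torch.randn_like(E)
    zE = 0.5 * (zE + zE.transpose(0, 1))     # keep the edge noise symmetric (assumption)
    return torch.cat([V, zV], dim=-1), torch.cat([E, zE], dim=-1)
```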

Block architecture

The basic building block in MolGrow (denoted block in Figure 1) consists of five layers. The first three layers (permutation, linear, and actnorm) serve as $1\times 1$ convolutions. Each layer contains two transformations: one transforms every node and the other transforms every edge. The number of Householder reflections in the linear layer's matrix $Q$ is smaller than the dimension of $Q$; hence, a combination of linear and permutation layers is not equivalent to a single linear layer.

The final two layers of the block are RealNVP layers. A RealNVP layer splits its input graph $(V^{k},E^{k})$ with $N_{k}$ nodes into $(V^{k,a},E^{k,a})$ and $(V^{k,b},E^{k,b})$ along the feature dimension. We transform $(V^{k,b},E^{k,b})$ by projecting node and edge features onto a low-dimensional manifold and applying the attention on complete graph edges (CAGE) architecture (Algorithm 1). We compute the final output of the RealNVP layer by applying fully-connected neural networks $s^{v}_{\theta}$, $t^{v}_{\theta}$, $s^{e}_{\theta}$, and $t^{e}_{\theta}$ to each node and edge independently:

$(\overline{V}^{k,b},\overline{E}^{k,b})=\mathrm{CAGE}(V^{k,b}W_{v},E^{k,b}W_{e})$   (11)
$V_{i}^{k+1,b}=\exp\left(s^{v}_{\theta}\left(\overline{V}_{i}^{k,b}\right)\right)\odot V_{i}^{k,a}+t^{v}_{\theta}\left(\overline{V}_{i}^{k,b}\right)$   (12)
$V_{i}^{k+1,a}=V_{i}^{k,b}$   (13)
$E_{i,j}^{k+1,b}=\exp\left(s^{e}_{\theta}\left(\overline{E}_{i,j}^{k,b}\right)\right)\odot E_{i,j}^{k,a}+t^{e}_{\theta}\left(\overline{E}_{i,j}^{k,b}\right)$   (14)
$E_{i,j}^{k+1,a}=E_{i,j}^{k,b}$   (15)

Similar to other attentive graph convolutions (Veličković et al. 2017; Guo, Zhang, and Lu 2019), the CAGE architecture uses multi-head attention (Vaswani et al. 2017). It also uses a gated recurrent unit (GRU) update function to stabilize training (Parisotto et al. 2020). The positional encoding in CAGE consists of two parts. The first $d_{v}-\lceil\log_{2}N_{k}\rceil$ dimensions are a standard sinusoidal positional encoding (Vaswani et al. 2017):

$\mathrm{pos}_{i,2j}=\sin\left(i/10000^{2j/d_{v}}\right),$   (16)
$\mathrm{pos}_{i,2j+1}=\cos\left(i/10000^{2j/d_{v}}\right).$   (17)

The last $\lceil\log_{2}N_{k}\rceil$ components of $\mathrm{pos}_{i}$ contain a binary code of $i$. We add multiple blocks before the first and after the last level in the full architecture.
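The full positional encoding can be assembled as in the sketch below (our illustration of Equations 16 and 17 plus the binary code).

```python
import math
import torch

def cage_positional_encoding(n, dv):
    """First dv - ceil(log2 n) dimensions: sinusoidal encoding (Eqs. 16-17);
    last ceil(log2 n) dimensions: binary code of the node index i."""
    n_bits = math.ceil(math.log2(n)) if n > 1 else 0
    d_sin = dv - n_bits
    pos = torch.zeros(n, dv)
    i = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    j = torch.arange(0, d_sin, 2, dtype=torch.float32)       # even dimension indices 2j
    angle = i / torch.pow(10000.0, j / dv)
    pos[:, 0:d_sin:2] = torch.sin(angle)
    pos[:, 1:d_sin:2] = torch.cos(angle[:, : d_sin // 2])
    if n_bits > 0:
        bits = (torch.arange(n).unsqueeze(1) >> torch.arange(n_bits)) & 1
        pos[:, d_sin:] = bits.float()
    return pos
```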

Algorithm 1 Attention on complete graph edges (CAGE)
1:  Input: Complete graph $(V,E)$ with node feature matrix $V\in\mathbb{R}^{n\times d_{v}}$ and edge feature tensor $E\in\mathbb{R}^{n\times n\times d_{e}}$.
2:  Output: Transformed complete graph $(\overline{V},\overline{E})$ of the same dimensions as $(V,E)$.
3:  Compute positional encoding matrix $\mathrm{pos}\in\mathbb{R}^{n\times d_{v}}$
4:  for $i=1$ to $n$ do
5:     Allocate $\mu\in\mathbb{R}^{n\times d_{v}}$
6:     for $j=1$ to $n$ do
7:        $\mu_{j}=f_{\theta}\left(\mathrm{cat}\left(E_{i,j},V_{i},V_{j}\right)\right)+\mathrm{pos}_{j}$  {compute message $j\to i$ of dimension $d_{v}$ using a fully-connected neural network $f_{\theta}$}
8:     end for
9:     $q=V_{i}+\mathrm{pos}_{i}$  {compute attention query}
10:     $r=\text{Multi-Head Attention}(\textrm{query}{=}q,\textrm{keys}{=}\mu,\textrm{values}{=}\mu)$  {aggregate messages}
11:     $\overline{V}_{i}=\textrm{GRU}_{V}(x{=}r,\,h{=}V_{i})$  {update node feature matrix using a GRU cell}
12:     $\nu_{i,j}=\mathrm{cat}\left(E_{i,j},\overline{V}_{i},\overline{V}_{j}\right),~\forall j$  {compute edge update vectors}
13:     $\overline{E}_{i,j}=\frac{1}{2}\textrm{GRU}_{E}(x{=}\nu_{i,j},\,h{=}E_{i,j})+\frac{1}{2}\textrm{GRU}_{E}(x{=}\nu_{j,i},\,h{=}E_{j,i}),~\forall j$  {update edge features}
14:  end for
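For readers who prefer code, a vectorized PyTorch sketch of Algorithm 1 follows. It is our interpretation under stated assumptions: a single (unbatched) graph, hidden sizes loosely following the appendix, a feature dimension divisible by the number of heads, and all node updates computed before the edge updates instead of sequentially per node.

```python
import torch
import torch.nn as nn

class CAGE(nn.Module):
    """Sketch of attention on complete graph edges (Algorithm 1), vectorized over nodes."""
    def __init__(self, dv, de, heads=16):
        super().__init__()                       # dv must be divisible by heads
        self.f = nn.Sequential(nn.Linear(de + 2 * dv, dv), nn.ReLU(), nn.Linear(dv, dv))
        self.attn = nn.MultiheadAttention(embed_dim=dv, num_heads=heads, batch_first=True)
        self.gru_v = nn.GRUCell(input_size=dv, hidden_size=dv)
        self.gru_e = nn.GRUCell(input_size=de + 2 * dv, hidden_size=de)

    def forward(self, V, E, pos):
        n, dv = V.shape
        # messages mu[i, j] = f(cat(E_ij, V_i, V_j)) + pos_j   (lines 5-8)
        Vi = V.unsqueeze(1).expand(n, n, dv)
        Vj = V.unsqueeze(0).expand(n, n, dv)
        mu = self.f(torch.cat([E, Vi, Vj], dim=-1)) + pos.unsqueeze(0)
        # aggregate messages with multi-head attention, query q_i = V_i + pos_i (lines 9-10)
        q = (V + pos).unsqueeze(1)                       # (n, 1, dv): one query per node
        r, _ = self.attn(q, mu, mu)                      # keys = values = messages
        V_new = self.gru_v(r.squeeze(1), V)              # line 11: GRU node update
        # edge update, symmetrized over (i, j) and (j, i) (lines 12-13)
        nu = torch.cat([E, V_new.unsqueeze(1).expand(n, n, dv),
                        V_new.unsqueeze(0).expand(n, n, dv)], dim=-1)
        upd = self.gru_e(nu.reshape(n * n, -1), E.reshape(n * n, -1)).reshape(n, n, -1)
        E_new = 0.5 * (upd + upd.transpose(0, 1))
        return V_new, E_new
```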

Layout and padding

Similar to previous works (Shi et al. 2020; You et al. 2018b), we achieved better results when learning on a fixed atom ordering instead of learning a distribution over all permutations. Previous works used breadth-first search (BFS) atom ordering, since it avoids long-range dependencies. However, BFS does not incorporate knowledge of common fragments and can mix their atoms (Figure 3a). We propose a new atom ordering that incorporates prior knowledge about frequent fragments; it better organizes the latent space and simplifies generation.


Figure 3: Different atom orderings. Numbers are atom indices in a particular ordering. Note that BFS ordering (a) generates two fragments in parallel (see nodes $15$-$18$), while our method completes a fragment before transitioning to the next one. For fragment-oriented ordering (b), we circled the extracted fragments.

We break the molecule into fragments by removing BRICS (Degen et al. 2008) bonds and bonds connecting rings, linkers, and decorations in the Bemis-Murcko scaffold (Bemis and Murcko 1996). We then enumerate the fragments and atoms in each fragment using BFS ordering (Figure 3b). We recursively choose padding positions, minimizing the number of edges after node merging layers (Algorithm 2).

Algorithm 2 Balanced padding (function “pad”)
1:  Input: List of fragments $f_{1:K}=[f_{1}\dots f_{K}]$, where $f_{i}$ contains the atom indices of the $i$-th fragment; $N$ is the target graph size, a power of $2$.
2:  Output: Atom order with $*$ indicating padding positions ($*_{N}$ indicates $N$ sequential paddings).
3:  if $K=1$ and $|f_{1}|\leq N/2$ then
4:     Randomly add padding: with 50% probability return $\mathrm{cat}(\mathrm{pad}(f_{1},N/2),*_{N/2})$, otherwise return $\mathrm{cat}(*_{N/2},\mathrm{pad}(f_{1},N/2))$
5:  end if
6:  Find possible splitting positions $B$: $b\in B$ if the left and right parts fit into subtrees: $|f_{1:b}|\leq N/2$ and $|f_{b+1:K}|\leq N/2$
7:  if $|B|>0$ then
8:     Sample any index $b\in B$ that minimizes the number of bonds between $f_{1:b}$ and $f_{b+1:K}$.
9:     Recursively add padding to the left and right parts:
10:     return $\mathrm{cat}\big(\mathrm{pad}(f_{1:b},N/2),\mathrm{pad}(f_{b+1:K},N/2)\big)$
11:  else
12:     Add padding to the right:
13:     return $\mathrm{cat}(f_{1:K},*_{N-|f_{1:K}|})$
14:  end if
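A plain-Python sketch of Algorithm 2 is given below (our illustration; `n_bonds_between` is a hypothetical helper that counts bonds between two groups of fragments, and the choice in line 8 is resolved by taking the minimum).

```python
import random

def pad(fragments, N, n_bonds_between):
    """Balanced padding (Algorithm 2). `fragments` is a list of atom-index lists,
    N is the target size (a power of two); '*' marks a padding position."""
    if len(fragments) == 1 and len(fragments[0]) <= N // 2:
        half = pad(fragments, N // 2, n_bonds_between)
        pads = ['*'] * (N // 2)
        return half + pads if random.random() < 0.5 else pads + half
    sizes = [len(f) for f in fragments]
    # splitting positions where both halves fit into an N/2 subtree
    B = [b for b in range(1, len(fragments))
         if sum(sizes[:b]) <= N // 2 and sum(sizes[b:]) <= N // 2]
    if B:
        b = min(B, key=lambda b: n_bonds_between(fragments[:b], fragments[b:]))
        return (pad(fragments[:b], N // 2, n_bonds_between)
                + pad(fragments[b:], N // 2, n_bonds_between))
    flat = [a for f in fragments for a in f]      # no valid split: pad on the right
    return flat + ['*'] * (N - len(flat))
```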
Table 1: Distribution learning metrics on MOSES dataset.
Method FCD/Test (↓) Frag/Test (↑) Unique@10k (↑) Novelty (↑)
Graph-based models
MolecularRNN (Popova et al. 2019) 23.13 0.56 98.6% 99.9%
GraphVAE (Simonovsky and Komodakis 2018) 49.39 0.0 5% 100%
GraphNVP (Madhawa et al. 2019) 29.95 0.62 99.7 % 99.9%
GraphAF (BFS) (Shi et al. 2020) 21.84 0.651 97% 99.9%
MoFlow (Zang and Wang 2020) 28.05 0.685 100% 99.99%
Proposed model
MolGrow (fragment-oriented) 6.284±0.986 0.929±0.025 99.28±0.62% 99.26±0.12%
MolGrow (BFS) 9.962±0.795 0.932±0.01 100±0.0% 99.37±0.08%
MolGrow (BFS on fragments) 16.15±1.026 0.868±0.018 100±0.0% 100±0.0%
MolGrow (random permutation) 40.17±4.709 0.051±0.034 58.96±38.11% 100±0.0%
MolGrow (GAT instead of CAGE) 6.523±0.302 0.941±0.013 99.36±0.3% 99.32±0.05%
MolGrow (No positional embedding) 6.771±0.555 0.937±0.006 99.49±0.19% 99.41±0.06%
SMILES and fragment-based models
CharRNN (from MOSES benchmark) 0.073 0.9998 99.73% 84.19%
VAE (from MOSES benchmark) 0.099 0.9994 99.84% 69.49%
JTN-VAE (from MOSES benchmark) 0.422 0.9962 100% 91.53%


Figure 4: Samples from molecular graph generative models trained on MOSES.

Related work

Many well-known generative models work out of the box for the molecular generation task. By representing molecules as strings, one can apply any sequence generation model: language models (Segler et al. 2018), variational autoencoders (Gómez-Bombarelli et al. 2018), and generative adversarial networks (Sanchez-Lengeling et al. 2017). Molecular graphs satisfy a formal set of rules: all atoms must have a proper valency, and a graph must have only one component. These constraints can be learned implicitly from the data or explicitly by specifying grammar rules (Kusner, Paige, and Hernández-Lobato 2017; O’Boyle and Dalke 2018; Krenn et al. 2019).

Multiple generative models for molecular graphs were proposed. Graph recurrent neural network (GraphRNN) (You et al. 2018b) and molecular recurrent neural network (MolecularRNN) (Popova et al. 2019) use node and edge generators: the node generator sequentially produces nodes; the edge generator sequentially predicts edge types for all the previous nodes from the hidden states of the node generator. Molecular generative adversarial network (MolGAN) (De Cao and Kipf 2018) trains a critic on generated graphs and passes the gradient to the generator using deep deterministic policy gradient (Lillicrap et al. 2015). Graph variational autoencoder (GraphVAE) (Simonovsky and Komodakis 2018) encodes and decodes molecules using edge-conditioned graph convolutions (Simonovsky and Komodakis 2017). Graph autoregressive flow (GraphAF) (Shi et al. 2020) iteratively produces nodes and edges; discrete one-hot vectors are dequantized, and tokens are decoded using argmax. The most similar work to ours is the graph non-volume preserving transformation (GraphNVP) (Madhawa et al. 2019) model. GraphNVP generation is not autoregressive: the model produces a dequantized adjacency matrix using a normalizing flow, turns it into a discrete set of edges by computing argmax, and obtains atom types using a normalizing flow. MoFlow (Zang and Wang 2020) exploits a two-stage graph generation process similar to GraphNVP: it first generates an adjacency matrix with the Glow architecture and then recovers node attributes with additional coupling layers. Unlike GraphNVP and MoFlow, we generate the graph hierarchically and inject noise on multiple levels; we also produce nodes and edges simultaneously. Tran et al. (2019) model discrete data directly with straight-through gradient estimators; the application of such models to graph-structured data is yet to be explored.

All graph generative models mentioned above are not permutation invariant, and most of these models employ a fixed atom order. GraphVAE aligns generated and target graphs using the Hungarian algorithm, which has high computational complexity. Vinyals, Bengio, and Kudlur (2016) report that order matters in set-to-set transformation problems, with specific orderings giving better results. Most models learn on a breadth-first search (BFS) atom ordering (You et al. 2018b). In BFS, new nodes are connected only to the nodes produced on the current or previous BFS layers, avoiding long-range dependencies. Alternatively, graph generative models could use a canonical depth-first search (DFS) order (Weininger, Weininger, and Weininger 1989). Note that unlike graph-based generators, string-based generators seem to benefit from augmenting atom orders (Bjerrum 2017).

Several works studied fragment-based molecular generation. Jin, Barzilay, and Jaakkola (2018) replace molecular fragments with nodes to form a junction tree; they produce molecules by first sampling a junction tree and then expanding it into a full molecule. The authors extended this approach to incorporate larger fragments (motifs) as tokens (Jin, Barzilay, and Jaakkola 2020).

Experiments

We consider three problems: distribution learning, global molecular property optimization and constrained optimization. For all the experiments, we provide model and optimization hyperparameters in supplementary material A; the source code for reproducing all the experiments is provided in supplementary materials. We consider hydrogen-depleted graphs, since hydrogens can be deduced from atom valence.

Distribution learning

In the distribution learning task, we assess how well models capture the data distribution. We compare the generated and test sets using the Fréchet ChemNet distance (FCD/Test) (Preuer et al. 2018). FCD/Test is a Wasserstein-1 distance between Gaussian approximations of ChemNet’s penultimate layer activations. We also compute the cosine similarity between fragment frequency vectors of the generated and test sets. We report the results on the MOSES (Polykovskiy et al. 2020) dataset in Table 1. MolGrow outperforms previous node-level graph generators by a large margin. Note that SMILES-based generators (CharRNN and VAE) and the fragment-level generator (JTN-VAE) outperform all node-level graph models; we hypothesize that such representations impose a strong prior on the generated structures. We provide samples from graph-based models in Figure 4. Note that baseline models tend to produce macrocycles that were not present in the training set, while molecules produced with GraphNVP contain too few rings. The ablation study demonstrates the advantage of fragment-oriented ordering and of CAGE over a standard graph attention network (GAT) (Veličković et al. 2017). We provide the results on the QM9 (Ramakrishnan et al. 2014) and ZINC250k (Kusner, Paige, and Hernández-Lobato 2017) datasets in supplementary material B. In Figure 10 of the supplementary materials, we show how resampling different latent codes affects the generated structure.

Global optimization

The goal of a global optimization task is to produce new molecules that maximize a given chemical property. Similar to the previous works, we selected two commonly used properties: penalized octanol-water partition coefficient (penalized logP) (Kusner, Paige, and Hernández-Lobato 2017) and quantitative estimation of drug-likeness (QED) (Gómez-Bombarelli et al. 2018). We considered genetic and predictor-guided optimization strategies.

For genetic optimization, we start by sampling 256 random molecules from the ZINC250k dataset and computing their latent codes. Then we hierarchically optimize the latent codes for 3000 iterations. At each iteration we generate a new population using crossing-over and mutation and keep the 256 molecules with the highest reward. In crossing-over, we randomly permute all molecules in the population to form 256 pairs. For each pair, we uniformly sample latent codes from the spherical linear interpolation (Slerp) trajectory (White 2016) and reconstruct the resulting molecule. We mutate one level’s latent code at each iteration: starting with the top level, we resample 10% of the components from a Gaussian distribution. For genetic optimization, we compare different mutation and crossing-over strategies, including top-level optimization with fixed bottom layers and vice versa.
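The crossing-over and mutation operators can be sketched as follows (our illustration; latent codes are treated as flattened vectors, and the interpolation point and resampled components are drawn uniformly at random).

```python
import torch

def slerp(z1, z2, t):
    """Spherical linear interpolation between two flattened latent codes (White 2016)."""
    z1n, z2n = z1 / z1.norm(), z2 / z2.norm()
    omega = torch.acos(torch.clamp(torch.dot(z1n, z2n), -1.0, 1.0))
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / so) * z1 + (torch.sin(t * omega) / so) * z2

def crossover(z1, z2):
    """Crossing-over: sample a point uniformly on the slerp trajectory between parents."""
    return slerp(z1, z2, torch.rand(1).item())

def mutate(z, fraction=0.1):
    """Mutation: resample a fraction of the components of one level's latent code
    from the standard Gaussian prior."""
    mask = (torch.rand_like(z) < fraction).float()
    return (1 - mask) * z + mask * torch.randn_like(z)
```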

In predictor-guided optimization, we followed the approach proposed by Jin, Barzilay, and Jaakkola (2018): we fine-tuned the pre-trained model jointly with a predictor of penalized logP from the high-level latent codes for one epoch (MAE = 0.41 for penalized logP, MAE = 0.07 for QED). We randomly sampled 2560 molecules from the prior distribution and took 200 constrained gradient ascent steps along the predictor’s gradient to modify the high-level latent codes; the low-level latent codes are resampled from the prior. We decrease the learning rate after each iteration and keep the best reconstructed molecule that falls into the constrained region. Intuitively, gradient ascent over the high-level latent codes guides the search towards a better global structure, while the low-level latent codes produce a diverse set of molecules with the same global structure and similar predicted values.
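A sketch of the constrained gradient ascent loop is shown below (our illustration; `predictor` and `decode` are hypothetical callables standing in for the fine-tuned property predictor and the model's inverse path, and the box constraint, step size, and decay rate are assumptions).

```python
import torch

def predictor_guided_ascent(z_top, predictor, decode, steps=200, lr=0.1, radius=1.0):
    """Constrained gradient ascent on a high-level latent code. `predictor(z)` returns a
    scalar property estimate; `decode(z)` returns (molecule, property_value) and resamples
    the low-level latent codes from the prior internally (hypothetical interfaces)."""
    z0 = z_top.clone()
    z = z_top.clone().requires_grad_(True)
    best, best_score = None, float("-inf")
    for _ in range(steps):
        score = predictor(z)
        grad, = torch.autograd.grad(score, z)
        with torch.no_grad():
            z += lr * grad                                  # ascend the predictor
            z = z0 + (z - z0).clamp(-radius, radius)        # stay in the constrained region
        z.requires_grad_(True)
        lr *= 0.99                                          # decay the learning rate
        mol, value = decode(z.detach())
        if value > best_score:
            best, best_score = mol, value
    return best
```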

We report the scores of the best molecules found during optimization in Table 2 and provide optimization trajectories in supplementary information C.

Table 2: Molecular property optimization: penalized octanol-water partition coefficient (penalized logP) and quantitative estimation of drug-likeness (QED). Results for baseline models from (Shi et al. 2020; Madhawa et al. 2019).
Method Penalized logP QED
1st 2nd 3rd 1st 2nd 3rd
ZINC250k 4.52 4.30 4.23 0.948 0.948 0.948
Graph-based models
GCPN (You et al. 2018a) 7.98 7.85 7.80 0.948 0.947 0.946
MolecularRNN (Popova et al. 2019) 8.63 6.08 4.73 0.844 0.796 0.736
GraphNVP (Madhawa et al. 2019) - - - 0.833 0.723 0.706
GraphAF (Shi et al. 2020) 12.23 11.29 11.05 0.948 0.948 0.948
MoFlow (Zang and Wang 2020) - - - 0.948 0.948 0.948
Proposed model
MolGrow (GE) 14.01±0.364 13.95±0.424 13.92±0.422 0.9484±0.0 0.9484±0.0 0.9484±0.0
MolGrow (GE, Top only) 11.66±0.31 11.65±0.319 11.63±0.306 0.9484±0.0 0.9484±0.0 0.9484±0.0
MolGrow (GE, Bottom only) 10.29±3.32 10.29±3.33 10.28±3.32 0.9484±0.0 0.9484±0.0 0.9484±0.0
MolGrow (predictor-guided optimization) 5.2±0.347 4.94±0.262 4.84±0.22 0.9484±0.0 0.9483±0.0 0.9483±0.0
MolGrow (REINFORCE) 4.81±0.285 4.47±0.145 4.39±0.126 0.9468±0.001 0.9459±0.001 0.9455±0.001
SMILES and fragment-based models
DD-VAE (Polykovskiy and Vetrov 2020) 5.86 5.77 5.64 - - -
Grammar VAE (Kusner, Paige, and Hernández-Lobato 2017) 2.94 2.88 2.80 - - -
SD-VAE (Dai et al. 2018) 4.04 3.50 2.96 - - -
JT-VAE (Jin, Barzilay, and Jaakkola 2018) 5.30 4.93 4.49 0.948 0.947 0.947

Constrained optimization

In this section, we apply MolGrow to constrained molecular optimization. In this task, we optimize a chemical property in the proximity of the initial molecule. Following previous works (Jin, Barzilay, and Jaakkola 2018; You et al. 2018a), we selected the 800 molecules with the lowest penalized octanol-water partition coefficient (logP) and constrained the minimum Tanimoto similarity $\delta$ between Morgan fingerprints (Rogers and Hahn 2010) of the initial and final molecules. For constrained optimization, we followed the predictor-guided approach described above and optimized each of the 800 starting molecules for 200 steps. In Table 3, we report the average penalized logP improvement and the similarity to the initial molecule. We also report the fraction of molecules for which we successfully discovered a new molecule with higher penalized logP. Note that unlike the GCPN and GraphAF baselines, we do not fine-tune the model for each starting molecule, reducing the time and memory costs of optimization.

Table 3: Constrained optimization of penalized octanol-water partition coefficient (logP). Mean±std over the 800 initial molecules with the worst penalized logP in ZINC250k.
δ GCPN GraphAF
Improvement Similarity Success Improvement Similarity Success
0.0 4.20±1.28 0.32±0.12 100% 13.13±6.89 0.29±0.15 100%
0.2 4.12±1.19 0.32±0.11 100% 11.90±6.86 0.33±0.12 100%
0.4 2.49±1.30 0.47±0.08 100% 8.21±6.51 0.49±0.09 99.88%
0.6 0.79±0.63 0.68±0.08 100% 4.98±6.49 0.66±0.05 96.88%
δ MoFlow MolGrow
Improvement Similarity Success Improvement Similarity Success
0.0 8.61±5.44 0.30±0.20 98.88% 14.84±5.786 0.048±0.038 100%
0.2 7.06±5.04 0.43±0.20 96.75% 11.99±6.45 0.23±0.045 99.88%
0.4 4.71±4.55 0.61±0.18 85.75% 8.337±6.85 0.44±0.048 99.88%
0.6 2.10±2.86 0.79±0.14 58.25% 4.063±5.609 0.65±0.068 97.78%

Conclusion

In this paper, we presented a new hierarchical molecular graph generative model that outperforms existing node-level models on distribution learning and molecular property optimization tasks. On distribution learning, string- and fragment-based generators still perform better than node-level models, since they explicitly handle valency and connectivity constraints. Similar to previous models, we obtained better performance when learning on a fixed atom ordering; our fragment-oriented ordering further improves the results over BFS. We compared the generated and test sets using standard distribution learning metrics and found that the distributions produced by previous node-level graph generators differ significantly from the test set, even though these models were trained for distribution learning.

References

  • Bemis and Murcko (1996) Bemis, G. W.; and Murcko, M. A. 1996. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39(15): 2887–2893.
  • Bjerrum (2017) Bjerrum, E. J. 2017. Smiles enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076 .
  • Brown et al. (2019) Brown, N.; Fiscato, M.; Segler, M. H.; and Vaucher, A. C. 2019. GuacaMol: benchmarking models for de novo molecular design. Journal of chemical information and modeling 59(3): 1096–1108.
  • Dai et al. (2018) Dai, H.; Tian, Y.; Dai, B.; Skiena, S.; and Song, L. 2018. Syntax-Directed Variational Autoencoder for Structured Data. arXiv preprint arXiv:1802.08786 .
  • De Cao and Kipf (2018) De Cao, N.; and Kipf, T. 2018. MolGAN: An implicit generative model for small molecular graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models .
  • Degen et al. (2008) Degen, J.; Wegscheid-Gerlach, C.; Zaliani, A.; and Rarey, M. 2008. On the art of compiling and using ’drug-like’ chemical fragment spaces. ChemMedChem 3(10): 1503–1507.
  • Dinh, Sohl-Dickstein, and Bengio (2017) Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2017. Density estimation using real nvp. International Conference on Learning Representations .
  • Gómez-Bombarelli et al. (2018) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 4(2): 268–276.
  • Guo, Zhang, and Lu (2019) Guo, Z.; Zhang, Y.; and Lu, W. 2019. Attention Guided Graph Convolutional Networks for Relation Extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 241–251. Florence, Italy: Association for Computational Linguistics. doi:10.18653/v1/P19-1024. URL https://www.aclweb.org/anthology/P19-1024.
  • Hoogeboom, Van Den Berg, and Welling (2019) Hoogeboom, E.; Van Den Berg, R.; and Welling, M. 2019. Emerging Convolutions for Generative Normalizing Flows. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 2771–2780. Long Beach, California, USA: PMLR. URL http://proceedings.mlr.press/v97/hoogeboom19a.html.
  • Jin, Barzilay, and Jaakkola (2018) Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction Tree Variational Autoencoder for Molecular Graph Generation. In Dy, J.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 2323–2332. Stockholmsmässan, Stockholm Sweden: PMLR.
  • Jin, Barzilay, and Jaakkola (2020) Jin, W.; Barzilay, R.; and Jaakkola, T. 2020. Hierarchical Generation of Molecular Graphs using Structural Motifs. International Conference on Machine Learning .
  • Kadurin et al. (2016) Kadurin, A.; Aliper, A.; Kazennov, A.; Mamoshina, P.; Vanhaelen, Q.; Khrabrov, K.; and Zhavoronkov, A. 2016. The Cornucopia of Meaningful Leads: Applying Deep Adversarial Autoencoders for New Molecule Development in Oncology. Oncotarget 8(7): 10883.
  • Kingma and Dhariwal (2018) Kingma, D. P.; and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 10215–10224.
  • Krenn et al. (2019) Krenn, M.; Häse, F.; Nigam, A.; Friederich, P.; and Aspuru-Guzik, A. 2019. SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry. arXiv preprint arXiv:1905.13741 .
  • Kusner, Paige, and Hernández-Lobato (2017) Kusner, M. J.; Paige, B.; and Hernández-Lobato, J. M. 2017. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1945–1954. JMLR. org.
  • Lillicrap et al. (2015) Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 .
  • Madhawa et al. (2019) Madhawa, K.; Ishiguro, K.; Nakago, K.; and Abe, M. 2019. GraphNVP: An invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600 .
  • Merk et al. (2018) Merk, D.; Friedrich, L.; Grisoni, F.; and Schneider, G. 2018. De novo design of bioactive small molecules by artificial intelligence. Molecular informatics 37(1-2): 1700153.
  • O’Boyle and Dalke (2018) O’Boyle, N.; and Dalke, A. 2018. DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. ChemRxiv .
  • Parisotto et al. (2020) Parisotto, E.; Song, H. F.; Rae, J. W.; Pascanu, R.; Gulcehre, C.; Jayakumar, S. M.; Jaderberg, M.; Kaufman, R. L.; Clark, A.; Noury, S.; et al. 2020. Stabilizing transformers for reinforcement learning. International Conference on Machine Learning .
  • Polykovskiy and Vetrov (2020) Polykovskiy, D.; and Vetrov, D. 2020. Deterministic Decoding for Discrete Data in Variational Autoencoders. In Chiappa, S.; and Calandra, R., eds., Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, 3046–3056. PMLR.
  • Polykovskiy et al. (2020) Polykovskiy, D.; Zhebrak, A.; Sanchez Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; et al. 2020. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Frontiers in Pharmacology 11: 1931.
  • Popova et al. (2019) Popova, M.; Shvets, M.; Oliva, J.; and Isayev, O. 2019. MolecularRNN: Generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372 .
  • Preuer et al. (2018) Preuer, K.; Renz, P.; Unterthiner, T.; Hochreiter, S.; and Klambauer, G. 2018. Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery. J. Chem. Inf. Model. 58(9): 1736–1741.
  • Ramakrishnan et al. (2014) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; and Von Lilienfeld, O. A. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data 1: 140022.
  • Rogers and Hahn (2010) Rogers, D.; and Hahn, M. 2010. Extended-connectivity fingerprints. Journal of chemical information and modeling 50(5): 742–754.
  • Sanchez-Lengeling et al. (2017) Sanchez-Lengeling, B.; Outeiral, C.; Guimaraes, G. L.; and Aspuru-Guzik, A. 2017. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC) .
  • Segler, Preuss, and Waller (2018) Segler, M. H.; Preuss, M.; and Waller, M. P. 2018. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555(7698): 604–610.
  • Segler et al. (2018) Segler, M. H. S.; Kogej, T.; Tyrchan, C.; and Waller, M. P. 2018. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent Sci 4(1): 120–131.
  • Senior et al. (2020) Senior, A. W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A. W.; Bridgland, A.; et al. 2020. Improved protein structure prediction using potentials from deep learning. Nature 1–5.
  • Shi et al. (2020) Shi, C.; Xu, M.; Zhu, Z.; Zhang, W.; Zhang, M.; and Tang, J. 2020. GraphAF: a flow-based autoregressive model for molecular graph generation. International Conference on Learning Representations .
  • Simonovsky and Komodakis (2017) Simonovsky, M.; and Komodakis, N. 2017. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3693–3702.
  • Simonovsky and Komodakis (2018) Simonovsky, M.; and Komodakis, N. 2018. GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders. In Kůrková, V.; Manolopoulos, Y.; Hammer, B.; Iliadis, L.; and Maglogiannis, I., eds., Artificial Neural Networks and Machine Learning – ICANN 2018, 412–422. Cham: Springer International Publishing. ISBN 978-3-030-01418-6.
  • Tran et al. (2019) Tran, D.; Vafa, K.; Agrawal, K.; Dinh, L.; and Poole, B. 2019. Discrete flows: Invertible generative models of discrete data. In Advances in Neural Information Processing Systems, 14719–14728.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
  • Veličković et al. (2017) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. International Conference on Learning Representations .
  • Vinyals, Bengio, and Kudlur (2016) Vinyals, O.; Bengio, S.; and Kudlur, M. 2016. Order matters: Sequence to sequence for sets. International Conference on Learning Representations .
  • Weininger, Weininger, and Weininger (1989) Weininger, D.; Weininger, A.; and Weininger, J. L. 1989. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of chemical information and computer sciences 29(2): 97–101.
  • White (2016) White, T. 2016. Sampling generative networks. arXiv preprint arXiv:1609.04468 .
  • You et al. (2018a) You, J.; Liu, B.; Ying, Z.; Pande, V.; and Leskovec, J. 2018a. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in neural information processing systems, 6410–6421.
  • You et al. (2018b) You, J.; Ying, R.; Ren, X.; Hamilton, W.; and Leskovec, J. 2018b. GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models. In Dy, J.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5708–5717. Stockholmsmässan, Stockholm Sweden: PMLR. URL http://proceedings.mlr.press/v80/you18a.html.
  • Zang and Wang (2020) Zang, C.; and Wang, F. 2020. MoFlow: An Invertible Flow Model for Generating Molecular Graphs. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining doi:10.1145/3394486.3403104. URL http://dx.doi.org/10.1145/3394486.3403104.
  • Zhavoronkov et al. (2019) Zhavoronkov, A.; Ivanenkov, Y. A.; Aliper, A.; Veselov, M. S.; Aladinskiy, V. A.; Aladinskaya, A. V.; Terentiev, V. A.; Polykovskiy, D. A.; Kuznetsov, M. D.; Asadulaev, A.; et al. 2019. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature biotechnology 1–4.

Appendix A Appendix A. Hyperparameters

The MolGrow model has 6 steps for the ZINC250K dataset, 5 steps for the MOSES dataset, and 4 steps for the QM9 dataset. Each step contains 2 blocks if its input graph contains at most 16 nodes, and 4 blocks otherwise. We apply 4 additional blocks to the initial one-node latent codes.

CAGE projects node and edge feature vectors onto a 64-dimensional manifold; the number of attention heads is 16. Networks $s^{v}_{\theta}$, $t^{v}_{\theta}$, $s^{e}_{\theta}$, and $t^{e}_{\theta}$ are 2-layer fully-connected neural networks with ReLU activations. The hidden size of these networks is 4 times larger than the output size. The final layer of $s^{e}_{\theta}$ and $s^{v}_{\theta}$ is a sigmoid function.

In the QR decomposition for linear layers, we parameterize matrix $Q$ with 128 Householder reflections. We sample $v_{i}$ from $\mathcal{N}(0,I)$ and normalize them; we re-normalize $v_{i}$ after each model optimization step. We initialize matrix $R$ with elements from $\mathcal{N}(0,0.05\cdot I)$ above the diagonal, zeros below the diagonal, and ones on the diagonal.

We train the model with the Adam optimizer with learning rate 0.001, which decreases by a factor of 0.8 after each training epoch for MOSES, and by a factor of 0.92 after each training epoch for the QM9 and ZINC250K datasets. The batch size is 256 for the MOSES and QM9 datasets, and 80 for the ZINC250K dataset. We train the model for 10 epochs on MOSES and 30 epochs on ZINC250k and QM9.

In our experiments we used the QM9, ZINC250k, and MOSES datasets. For QM9 we use $N=16$ nodes with $d_{v}=6$ atom types, including the padding atom type. For MOSES we use $N=32$ nodes with $d_{v}=9$ atom types, and for ZINC250K we use $N=64$ nodes with $d_{v}=14$ (11 for atom types and 3 for charge types). The number of bond types is $d_{e}=4$ (none, single, double, triple).

We sample the latent codes from a multivariate normal distribution $\mathcal{N}(0,I)$ and multiply them by a temperature parameter 0.7, similar to the previous works.

We performed a hyperparameter search over the number of blocks in each level (considering 2, 4, 8, and 16 blocks) and the internal shape of CAGE (considering 8, 16, 32, and 64 elements in the hidden layer). The best hyperparameters are described above. To train MolGrow on one dataset we used a Tesla K80 GPU; the training procedure took approximately 2 days for each dataset.

Appendix B Appendix B. Distribution learning

In this section, we provide additional experiments on the QM9 (Ramakrishnan et al. 2014) and ZINC250k (Kusner, Paige, and Hernández-Lobato 2017) datasets (Table 4 and Table 5). To compute validity for MolGrow, we removed edges that exceed the maximum atom valency. If a final graph contains more than one fragment, we keep only the biggest one. Validity without cleanup corresponds to the fraction of valid molecules when no additional manipulations are performed on the generated graph. We check validity using RDKit and additionally consider molecules with multiple components invalid. We provide property distribution plots for models trained on MOSES in Figure 5.


Figure 5: Property distributions for models trained on MOSES: quantitative estimate of drug-likeness (QED), synthetic accessibility score (SA), molecular weight (weight), and octanol-water partition coefficient (logP).

Appendix C Appendix C. Molecular property optimization

In this section, we illustrate genetic optimization of molecular properties. In Figures 11 and 13, we show best molecules in intermediate populations without consecutive duplicates. In Figures 12 and 14, we show the best molecules in the final population.

In Figure 6, Figure 7, and Figure 8, we provide random samples from MolGrow trained on the QM9, ZINC250K, and MOSES datasets, respectively.

Table 4: Results for distribution learning on ZINC250k dataset. Baseline results from (Jin, Barzilay, and Jaakkola 2018; Popova et al. 2019; Madhawa et al. 2019; Shi et al. 2020; Zang and Wang 2020)
Method Validity Validity without cleanup Unique@10k Novelty Reconstruction
JT-VAE 100% - 100% 100% 76.7%
MolecularRNN 100% 65% 99.89% 100% -
GraphNVP 42.6% - 94.8% 100% 100%
GraphAF 100% 68% 99.1% 100% 100%
MoFlow 100±0.0% 81.76±0.21% 99.99±0.01% 100±0.0% 100±0.0%
MolGrow 100±0.0% 57.8±7.75% 99.06±0.46% 99.96±0.01% 100±0.0%
Table 5: Results for distribution learning on QM9 dataset. Baseline results from (Shi et al. 2020).
Method Validity Validity without cleanup Unique@10k Novelty Reconstruction
GraphVAE - 55% 76% 61% 55%
GraphNVP 83% - 99.2% 58.2% 100%
GraphAF 100% 67% 94.5% 88.8% 100%
MoFlow 100±0.0% 96.17±0.18% 99.20±0.12% 98.03±0.14% 100±0.0%
MolGrow 100±0.0% 86.80±0.12% 99.15±0.05% 81.43±0.53% 100±0.0%
Table 6: Performance metrics for baseline and proposed models on MOSES benchmark
Model Valid (\uparrow) Unique@1k (\uparrow) Unique@10k (\uparrow) FCD (\downarrow) SNN (\uparrow) Frag (\uparrow) Scaf (\uparrow) IntDiv (\uparrow) IntDiv2 (\uparrow) Filters (\uparrow) Novelty (\uparrow)
Test TestSF Test TestSF Test TestSF Test TestSF
Graph-based models
MolecularRNN 1.0 0.999 0.9861 23.1302 24.1749 0.279 0.2681 0.5657 0.5579 0.0457 0.0075 0.9062 0.8985 0.6143 0.9996
GraphVAE 1.0 0.207 0.0495 49.3917 50.7191 0.2082 0.189 0.2304 0.2148 0.0 0.0 0.8238 0.7769 0.963 1.0
GraphNVP 1.0 0.999 0.9974 29.9517 31.2123 0.3057 0.2836 0.6297 0.6144 0.0266 0.0139 0.8574 0.8448 0.6008 0.9999
GraphAF 1.0 0.989 0.9707 21.85 23.2515 0.3056 0.2891 0.6513 0.6346 0.1255 0.0137 0.9036 0.8927 0.4145 0.9999
MoFlow 1.0 1.0 1.0 28.05 29.30 0.2554 0.2426 0.6858 0.6683 0.0224 0.0068 0.8923 0.8849 0.4126 0.9999
Proposed model
MolGrow(fragment-oriented) 1.0 1.0 0.998 5.7888 6.3861 0.4335 0.4208 0.9518 0.9464 0.6697 0.0769 0.8663 0.8599 0.8274 0.9938
MolGrow(BFS) 1.0 1.0 0.9999 9.2024 9.8472 0.3997 0.3891 0.9476 0.9414 0.6628 0.0631 0.8724 0.8665 0.6673 0.9928
MolGrow(BFS on fragments) 1.0 1.0 1.0 15.0707 15.9881 0.3379 0.3269 0.8902 0.8794 0.1875 0.0308 0.8883 0.8819 0.6169 0.9993
MolGrow(random perm) 1.0 0.967 0.8861 36.6702 37.9284 0.2292 0.2136 0.0729 0.0696 0.0002 0.0 0.8735 0.8616 0.5777 1.0
MolGrow(GAT instead of CAGE) 1.0 1.0 0.9953 6.4564 6.9704 0.4383 0.4263 0.9499 0.9449 0.6599 0.0827 0.8704 0.863 0.8367 0.9936
MolGrow(No positional embeddings) 1.0 1.0 0.9926 7.5529 8.1794 0.4242 0.4138 0.9320 0.9279 0.6029 0.0828 0.8744 0.8668 0.8177 0.9949
SMILES and fragment-based models
CharRNN 0.9884 1.0 0.9992 0.0615 0.4967 0.6109 0.5709 0.9999 0.9984 0.9269 0.107 0.8566 0.8507 0.996 0.8232
VAE 0.9766 1.0 0.9982 0.1094 0.5548 0.626 0.5776 0.9993 0.9987 0.9411 0.0485 0.8556 0.8496 0.9971 0.6871
JTN-VAE 1.0 1.0 0.9992 0.4224 0.9962 0.5561 0.5273 0.9962 0.9948 0.8925 0.1005 0.8512 0.8453 0.9778 0.9153


Figure 6: Samples from MolGrow model trained on QM9 dataset.


Figure 7: Samples from MolGrow model trained on ZINC250K dataset.


Figure 8: Samples from MolGrow model trained on MOSES dataset.


Figure 9: Spherical interpolation between two molecules in the latent space of MolGrow model trained on MOSES dataset.


Figure 10: Resampling of latent codes on different hierarchical levels of MolGrow model trained on MOSES dataset.


Figure 11: Best molecules in intermediate populations for penalized octanol-water partition coefficient (logP).


Figure 12: Best molecules in the final population during genetic optimization of penalized octanol-water partition coefficient (logP).


Figure 13: Best molecules in intermediate populations for drug-likeness (QED).


Figure 14: Best molecules in the final population during genetic optimization of quantitative estimation of drug-likeness (QED).


Figure 15: Results of constrained optimization for penalized octanol-water partition coefficient (logP): trajectories for molecules with best improvement of property.