
String-based Molecule Generation via Multi-decoder VAE

Kisoo Kwon1, Kuhwan Jung1, Junghyun Park1, Hwidong Na1, Jinwoo Shin2
1Samsung Advanced Institute of Technology, Kyungki, Korea
2Korea Advanced Institute of Science and Technology, Daejeon, Korea
{kisoo.kwon, kuhwan.jung, jhpsy.park, hwidong.na}@samsung.com, jinwoos@kaist.ac.kr
Abstract

In this paper, we investigate the problem of string-based molecular generation via variational autoencoders (VAEs), which have served as a popular generative approach for various tasks in artificial intelligence. We propose a simple yet effective idea to improve the performance of VAEs for this task. Our main idea is to maintain multiple decoders while sharing a single encoder, i.e., a type of ensemble technique. Here, we first found that training each decoder independently may not be effective, as the bias of the ensemble decoder increases severely under its auto-regressive inference. To maintain both small bias and small variance of the ensemble model, our proposed technique is two-fold: (a) a different latent variable is sampled for each decoder (from the mean and variance estimated by the shared encoder) to encourage diverse characteristics among decoders, and (b) a collaborative loss is used during training to control the aggregated quality of decoders using the different latent variables. In our experiments, the proposed VAE model performs particularly well at generating samples from out-of-distribution conditions.

1 Introduction

For material discovery via molecular generation, several machine-learning-based approaches are being actively adopted Sanchez-Lengeling and Aspuru-Guzik (2018); Shi et al. (2020); Jin et al. (2018); Gómez-Bombarelli et al. (2018); Dai et al. (2018). Developing a novel molecule with desired properties is challenging because it requires a huge amount of time and budget for laboratory experiments. We aim to provide an effective deep learning model that generates novel molecules in order to reduce this cost. Instead of human experts designing novel molecules using domain knowledge, such a machine learning model can do a similar job by generating candidate molecules based on existing data annotated with physical properties. In order to encourage generated candidates to have the desired properties, the target properties are given in addition to the molecule itself during training.

For generating string representations of novel molecules, we investigate variational autoencoders (VAEs) Kingma and Welling (2014) with auto-regressive decoders, which are a popular choice of deep generative model for such a generation task. A VAE minimizes the Kullback-Leibler divergence (KLD) of the variational approximation from the true posterior by maximizing the evidence lower bound (ELBO). Although extensive research efforts have been made to improve vanilla VAE models, they typically focus on traditional AI domains such as images Pu et al. (2016), language Dai et al. (2018) and speech Hsu et al. (2017); it is not clear whether such complex techniques are useful for our task of finding novel molecules, which presumably follow an out-of-distribution (OOD) domain.

Instead, we aim to provide an orthogonal, simple, yet effective idea. To this end, we use ‘ensemble’ techniques, which are arguably the most trustworthy techniques for boosting the performance of machine learning models. They have played a critical role in the machine learning community for obtaining better predictive performance than could be obtained from any of the constituent learning models alone, e.g., Bayesian model/parameter averaging Domingos (2000), boosting Freund et al. (1999) and bagging Breiman (1996). However, under the VAE framework, ensemble techniques have drawn less attention, as it is not straightforward how to aggregate the predictions of multiple encoders and decoders of a VAE to improve them.

In this paper, we design an ensemble version of VAE by utilizing multiple decoders, while sharing the encoder. Specifically, our proposed idea ensembles the logits from the decoders in an auto-regressive generation process. Here, we found that training each decoder independently is not effective, as the bias of the ensemble model increases under its auto-regressive inference (for example, in our experiments, multiple decoders trained independently showed a higher reconstruction loss than a vanilla VAE of similar model size). To maintain both small bias and small variance of the ensemble model, each decoder is trained not only individually, but also in a way that collaborates with the others. Furthermore, to encourage diverse characteristics among decoders, we also propose to sample a different latent variable for each decoder (from the mean and variance estimated by the shared encoder) given a molecule during training (see Figure 1 for an illustration of the proposed scheme).

We conducted experiments on a publicly available dataset for the molecular generation task to demonstrate the superiority of the proposed multi-decoder VAE model (MD-VAE). MD-VAE achieved a lower training loss and a higher reconstruction accuracy than baselines including VAE with k-annealing and Controllable VAE Bowman et al. (2016); Shao et al. (2020). As mentioned above, generation efficiency for the OOD domain is more important than that for the in-domain case. We emphasize that MD-VAE achieved a 31.4% higher relative generative efficiency than Controllable VAE (ControlVAE) from this perspective (ControlVAE: 36.3% → MD-VAE: 47.7%). This is remarkable, as VAEs are often reported to be poor at generating OOD-domain samples Montero et al. (2020). Although we apply MD-VAE to the molecular generation task, we believe that it is of interest for a broader range of domains, e.g., image generation with desired styles or text generation with desired sentiments.

2 Related Work

2.1 Ensemble method

Ensembling is arguably the most trustworthy technique for improving the performance of a given machine learning model Opitz and Maclin (1999); Wang et al. (2020). The ensemble method gives room to appropriately control the trade-off between the bias and variance of the model. The effect of an ensemble is largely associated with the expertise of the individual models; that is, diversity among the models is one of the essential factors for a successful ensemble Kuncheva and Christopher J. (2003). For this reason, many ensemble methods seek to promote diversity among the models they combine Brown et al. (2005). As a standard choice, multiple predictive models trained independently are averaged in order to obtain an ensemble effect. Recently, several attempts have been made to improve performance by directly learning a (collaborative) loss over the ensemble models Wang et al. (2021). While the preceding methods in this line consider standard classification tasks, our proposed method targets the more complex task of auto-regressive generation, in which each token is generated by ensembling the logits of the decoders.

2.2 Molecular design

Traditional molecular design using domain knowledge relies on simulation techniques due to time and cost issues. High-throughput computational screening, a brute-force simulation of a large number of molecules, has been conducted Bleicher et al. (2003), but recently a research direction using deep learning, namely inverse design, has been actively explored. It studies models that generate molecules matching given physical properties. Studies applying generative models such as VAEs are in progress in this field, and they can be roughly divided into two directions according to the molecular representation. The first expresses molecules as sequences (or strings) such as SMILES Weininger et al. (1989), while the other represents molecules as graphs Jin et al. (2018).

Figure 1: (a) Structure of MD-VAE. (b) Reconstruction loss according to the number of decoders (from 2 to 7); the x-axis denotes the model size (MB). The encoder and each decoder of MD-VAE (MD_dif,col) have the same size, so the model size of MD-VAE increases with the number of decoders. MD-VAE outperformed ControlVAE at all model sizes. (c) Reconstruction success rate (%), evaluated under teacher forcing. The performance gap between MD-VAE and ControlVAE is remarkably large in the unseen case.

3 Conditional and Controllable VAE

The conditional VAE (cVAE) Kang and Cho (2018); Kingma et al. (2014) is designed to generate data given certain conditions such as classes or labels. In the cVAE, the SMILES of a molecule $x$ is assumed to be generated from $p_{\theta}(x|y,z)$, conditioned on target molecular properties $y$ and a latent variable $z$. The prior distribution of $z$ is assumed to be Gaussian, i.e., $p(z)=\mathcal{N}(z|0,I)$. We use variational inference to approximate the posterior distribution of $z$ given $x$ and $y$ by

q_{\phi}(z|x,y)=\mathcal{N}(z|\mu_{\phi}(x,y),\text{diag}(\sigma_{\phi}(x,y))). \quad (1)

From the auto-encoder perspective, $q_{\phi}(z|x,y)$ and $p_{\theta}(x|y,z)$ are called the encoder and the decoder, respectively. To deal with string data, a structure such as a Gated Recurrent Unit (GRU) Cho et al. (2014) or Long Short-Term Memory (LSTM) Sak et al. (2014) is usually used. In this paper, since the transformer Vaswani et al. (2017) shows better performance in the cVAE than GRU and LSTM, transformer encoders are used for $\mu_{\phi}(x,y)$ and $\sigma_{\phi}(x,y)$, and a transformer decoder is used for $p_{\theta}(x|y,z)$.
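For concreteness, the following is a minimal PyTorch sketch of such an encoder head; the module name, layer sizes, and the way the property vector is injected are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class PropertyConditionedEncoder(nn.Module):
    """Illustrative sketch: transformer encoder that maps a token sequence x
    and property vector y to the Gaussian parameters mu(x, y), sigma(x, y)."""
    def __init__(self, vocab_size=39, d_model=256, n_props=3, z_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.prop_proj = nn.Linear(n_props, d_model)            # inject conditions y
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.to_mu = nn.Linear(d_model, z_dim)
        self.to_logvar = nn.Linear(d_model, z_dim)

    def forward(self, tokens, props):
        # prepend a property "token" so self-attention can condition on y
        h = torch.cat([self.prop_proj(props).unsqueeze(1), self.embed(tokens)], dim=1)
        h = self.encoder(h)[:, 0]                               # pooled representation
        return self.to_mu(h), self.to_logvar(h)                 # mu, log sigma^2
```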

The objective of the cVAE is to maximize ELBO, which is a lower bound of the marginal log-likelihood:

\log p_{\theta}(x|y) \geq \mathbb{E}_{q_{\phi}(z|x,y)}[\log p_{\theta}(x|y,z)]-\text{KLD}(q_{\phi}(\cdot|x,y)\,\|\,p(\cdot)). \quad (2)

We define $\mathcal{L}_{\rm recon}=-\mathbb{E}_{q_{\phi}(z|x,y)}[\log p_{\theta}(x|y,z)]$, as it can be regarded as a reconstruction loss, and $\mathcal{L}_{\rm reg}=\text{KLD}(q_{\phi}(\cdot|x,y)\,\|\,p(\cdot))$, which acts as a regularizing term. In summary, the parameters $\theta$ and $\phi$ are jointly optimized to minimize $\mathcal{L}_{\rm total}=\mathcal{L}_{\rm recon}+\mathcal{L}_{\rm reg}$. Given target molecular properties $y$, a new molecule $x$ having these properties is generated in the following way:

z\sim p(z), \quad x\sim p_{\theta}(x|y,z). \quad (3)

Various ideas have been proposed to avoid the posterior vanishing problem in VAE models. VAE with k-annealing Bowman et al. (2016) and its variants have shown that it is possible to prevent posterior vanishing using an appropriate scale $\beta$ of the regularizing term, i.e., $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{recon}}+\beta\mathcal{L}_{\text{reg}}$. Since it is difficult to compare generative models fairly when the converged value of $\mathcal{L}_{\rm reg}$ depends on a hyper-parameter, controlling $\mathcal{L}_{\rm reg}$ directly is practically more useful than treating $\beta$ as a hyper-parameter to be tuned. ControlVAE Shao et al. (2020) further enhanced this idea by automatically tuning $\beta$ so that $\mathcal{L}_{\rm reg}$ converges to a specific value, based on automatic control theory. This does not necessarily lead to convergence to the best $\mathcal{L}_{\rm reg}$, but it makes it possible to drive $\mathcal{L}_{\rm reg}$ to a desired value. $\beta$ is updated toward the target $\mathcal{L}_{\rm reg}$ at every training step $t$:

\mathcal{L}_{\rm total}=\mathcal{L}_{\rm recon}+\beta(t)\mathcal{L}_{\rm reg}. \quad (4)
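The sketch below illustrates the feedback idea behind Eq. (4) with a simple PI-style update of $\beta(t)$ toward a target $\mathcal{L}_{\rm reg}$; the gains and clipping range are illustrative assumptions and not ControlVAE's exact controller.

```python
class BetaController:
    """PI-style feedback sketch for beta(t): push L_reg toward a target value
    (e.g. 15) at every training step. The gains and clipping range below are
    illustrative assumptions, not ControlVAE's published controller."""
    def __init__(self, target_reg=15.0, k_p=0.01, k_i=1e-4, beta_min=0.0, beta_max=1.0):
        self.target, self.k_p, self.k_i = target_reg, k_p, k_i
        self.beta_min, self.beta_max = beta_min, beta_max
        self.integral = 0.0

    def update(self, current_reg):
        # If L_reg sits above the target, raise beta to regularize harder;
        # if it falls below, relax beta so reconstruction dominates.
        error = current_reg - self.target
        self.integral += error
        beta = self.k_p * error + self.k_i * self.integral
        return min(max(beta, self.beta_min), self.beta_max)
```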

4 Multi-decoder VAE

In this paper, we propose a neural architecture for VAE, which we name MD-VAE. Our main idea is to use multiple decoders while sharing a single encoder, as illustrated in Figure 1. In particular, we consider an auto-regressive model (e.g., a recurrent neural network or transformer) as each decoder architecture. Then, to aggregate the outputs of the multiple decoders, the decoders' logit values are averaged to predict the next token in an auto-regressive manner. (We tested two schemes for the ensemble computation: pre-softmax (logit-level) and post-softmax (probability-level) interpolation. In our preliminary experiments, the difference between the two schemes across the metrics we used was negligible, but pre-softmax interpolation converges faster than post-softmax; hence we conducted the rest of the experiments with the pre-softmax scheme, i.e., we average the logits instead of the softmax probabilities.) Here, each decoder has its own separate parameters and produces a different logit value. One may expect that such an ensemble of different logits provides a more robust prediction than that from an individual decoder. However, to the best of our knowledge, such an ensemble version of VAE has not been explored in the literature.
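The following sketch shows what this pre-softmax ensembling looks like at inference time; the decoder call signature is a hypothetical interface assumed for illustration.

```python
import torch

@torch.no_grad()
def ensemble_greedy_decode(decoders, z, y, bos_id, eos_id, max_len=120):
    """Sketch of MD-VAE inference: at every step the K decoders' logits are
    averaged before softmax/argmax, so a single next token is chosen from the
    ensemble. `dec(tokens, y, z)` returning logits of shape
    (batch, seq, vocab) is an assumed interface."""
    tokens = torch.full((z.size(0), 1), bos_id, dtype=torch.long, device=z.device)
    for _ in range(max_len):
        avg_logits = torch.stack(
            [dec(tokens, y, z)[:, -1] for dec in decoders]).mean(dim=0)
        next_tok = avg_logits.argmax(dim=-1, keepdim=True)   # greedy choice
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return tokens
```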

One reason is that a naive ensemble of independently trained decoders (while sharing a single encoder) may increase the model bias (although it may decrease the model variance), and this bias can grow significantly under VAEs with auto-regressive decoders. To mitigate such a negative effect, we propose to optimize the following additional collaborative loss to train our model:

\mathcal{L}_{\rm col}=-\mathbb{E}_{z}\left[\log\frac{1}{K}\sum\nolimits_{k}p_{\theta_{k}}(x|y,z)\right] \quad (5)

Here, $K$ indicates the number of decoders of MD-VAE. We remark that such a collaborative loss may not be effective for ensembles of non-auto-regressive models, e.g., standard classification or regression models, as it increases the model variance. However, in our case of auto-regressive models, it is effective because reducing the model bias is more crucial than reducing the model variance. Nevertheless, to also reduce the model variance, we suggest sampling a different latent variable from the shared encoder for each decoder during training. This encourages the decoders to produce diverse outputs, and hence reduces the variance of the ensemble model. Specifically, each latent variable is sampled from the approximate posterior distribution of $z$ given $x$ and $y$, yielding $K$ sampled latent variables $z_{1},z_{2},\ldots,z_{K}$ where

z_{k}\sim\mathcal{N}(z_{k}|\mu_{\phi}(x,y),\text{diag}(\sigma_{\phi}(x,y))). \quad (6)

In summary, we integrate two ideas to train the proposed MD-VAE model: the collaborative loss promotes small bias of the ensemble prediction over multiple decoders, while sampling different latent variables for the decoders promotes small variance of the ensemble model. Namely, we consider the following additional loss to train the decoders in MD-VAE:

\mathcal{L}_{\text{dif,col}}=-\mathbb{E}_{z_{1},\ldots,z_{K}}\left[\log\frac{1}{K}\sum\nolimits_{k}p_{\theta_{k}}(x|y,z_{k})\right]. \quad (7)

The only difference between Eq. (5) and Eq. (7) is whether the shared latent variable $z$ or the decoder-specific $z_{k}$ is used.
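A minimal sketch of computing $\mathcal{L}_{\text{dif,col}}$ in Eq. (7) under teacher forcing is given below, assuming a hypothetical decoder interface that returns per-step logits; the mixture is evaluated stably in log space.

```python
import math
import torch
import torch.nn.functional as F

def collaborative_loss(decoders, x, y, mu, logvar):
    """Sketch of L_dif,col (Eq. 7): draw a separate z_k per decoder via the
    reparameterization trick, get each decoder's sequence log-likelihood under
    teacher forcing, and take -log of their average (in log space).
    `dec(x_in, y, z)` returning logits of shape (batch, seq, vocab) is an
    assumed interface; padding positions should be masked in practice."""
    x_in, x_out = x[:, :-1], x[:, 1:]              # shifted input / target tokens
    seq_logliks = []
    for dec in decoders:
        z_k = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # z_k ~ q_phi(z|x,y)
        logits = dec(x_in, y, z_k)
        token_ll = -F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                    x_out.reshape(-1),
                                    reduction="none").view(x_out.shape)
        seq_logliks.append(token_ll.sum(dim=1))                   # log p_k(x|y, z_k)
    stacked = torch.stack(seq_logliks)                            # shape (K, batch)
    # log( (1/K) * sum_k p_k ) computed with logsumexp for numerical stability
    log_mix = torch.logsumexp(stacked, dim=0) - math.log(len(decoders))
    return -log_mix.mean()
```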

The total reconstruction loss is a linear interpolation between collaborative and individual loss functions:

\mathcal{L}_{\text{recon}}^{\text{MD}}=\alpha\mathcal{L}_{\text{dif,col}}-\frac{(1-\alpha)}{K}\sum_{k=1}^{K}\log p_{\theta_{k}}(x|y,z_{k}). \quad (8)

Eq. (8) is minimized together with the regularizing term $\mathcal{L}_{\text{reg}}$ in Eq. (4):

\mathcal{L}_{\text{total}}^{\text{MD}}=\mathcal{L}_{\text{recon}}^{\text{MD}}+\beta(t)\mathcal{L}_{\rm reg}. \quad (9)
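Given the per-decoder sequence log-likelihoods (e.g., from the previous sketch), Eq. (8) and Eq. (9) combine as follows; this is an illustrative sketch rather than the authors' exact training code.

```python
import math
import torch

def md_vae_total_loss(per_decoder_loglik, kld, alpha, beta):
    """Illustrative combination of Eq. (8) and Eq. (9).
    per_decoder_loglik: (K, batch) tensor of log p_{theta_k}(x | y, z_k)
    computed under teacher forcing; kld: the regularizing term L_reg;
    alpha, beta: scalars (beta is the controlled beta(t))."""
    K = per_decoder_loglik.size(0)
    # collaborative term (Eq. 7): -log of the average of the K likelihoods
    l_dif_col = -(torch.logsumexp(per_decoder_loglik, dim=0) - math.log(K)).mean()
    # individual term: mean of the K per-decoder negative log-likelihoods
    l_individual = -per_decoder_loglik.mean()
    l_recon_md = alpha * l_dif_col + (1.0 - alpha) * l_individual   # Eq. (8)
    return l_recon_md + beta * kld                                  # Eq. (9)
```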

The two ideas mentioned above (using multiple latent variables and using the collaborative loss) are also applicable to the vanilla single-decoder VAE (SD-VAE). In this case, however, SD-VAE cannot utilize an ensemble scheme at the inference phase.

5 Experiments

Table 1: The model size, reconstruction loss, regularizing loss and KLD between decoders (ZINC250K dataset). Each model's regularizing loss is controlled to about 15, except for Base. (Each model was trained 3 times, and each figure is the average of the 3 results.)
Model Size $\mathcal{L}_{\rm recon}$ $\mathcal{L}_{\rm reg}$ KLD
Base 142MB 17.276 0.000 -
ControlVAE 142MB 6.851 15.168 -
SD_dif,col 142MB 6.657 15.189 -
MD 138MB 7.001 15.207 0.2179
MD_col 138MB 5.508 14.937 0.5582
MD_dif 138MB 6.555 15.145 0.2723
MD_dif,col 138MB 4.482 15.068 0.4384
Figure 2: An example of SMILES in ZINC: COc1ccc(N2CC(C(=O)Oc3cc(C)ccc3C)CC2=O)cc1.

5.1 Experiment Setup

ZINC Sterling and Irwin (2015) is a database comprising information on various drug-like molecules. ZINC contains 3D structural information of molecules and molecular physical properties such as molecular weight (molWt), partition coefficient (LogP), and quantitative estimation of drug-likeness (QED). Figure 2 shows the SMILES of a molecule in ZINC, and we used two subsets of ZINC: ZINC250K Kusner et al. (2017) and ZINC310K Yan et al. (2019); Kang and Cho (2018). The vocabulary for SMILES contains 39 different symbols {1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, =, #, (, ), [, ], H, B, C, N, O, F, Si, P, S, Cl, Br, Sn, I, c, n, o, p, s, \, /, @, @@}. The minimum, median, and maximum lengths of a SMILES string in ZINC250K are 9, 44, and 120, respectively (for ZINC310K, 8, 42, and 86). The average values of molWt, LogP, and QED are about 330, 2.457, and 0.7318 in ZINC250K, and 313, 1.9029, and 0.7527 in ZINC310K, respectively.
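As an illustration of the preprocessing this vocabulary implies, the sketch below tokenizes a SMILES string so that multi-character symbols (Cl, Br, Si, Sn, @@) are matched before single characters; the exact tokenizer used by the authors is not specified.

```python
import re

# Multi-character tokens of the 39-symbol vocabulary must be matched before
# single characters (an illustrative tokenizer, not the authors' exact code).
SMILES_TOKEN_RE = re.compile(r"Cl|Br|Si|Sn|@@|[1-9]|[+\-=#()\[\]HBCNOFPSIcnops\\/@]")

def tokenize_smiles(smiles):
    """Split a SMILES string into vocabulary symbols."""
    return SMILES_TOKEN_RE.findall(smiles)

# Example on the Figure 2 molecule:
print(tokenize_smiles("COc1ccc(N2CC(C(=O)Oc3cc(C)ccc3C)CC2=O)cc1")[:10])
# ['C', 'O', 'c', '1', 'c', 'c', 'c', '(', 'N', '2']
```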

Table 2: The reconstruction success rate for the seen (ZINC250K) and unseen (ZINC310K) datasets. Numbers in boldface are better than ControlVAE. (Each figure is the average of 3 results.)
Model Unseen Seen Average
Base 0.783 0.906 0.845
ControlVAE 0.880 0.979 0.930
SD_dif,col 0.880 0.989 0.935
MD 0.898 0.974 0.936
MD_col 0.891 0.978 0.934
MD_dif 0.902 0.975 0.938
MD_dif,col 0.909 0.983 0.946
Table 3: The molecular generative efficiency (the probability that all three conditions, validity, uniqueness and novelty, are satisfied). The condition values for the conditional VAE were divided into two cases. For the in-domain case, we chose three condition values based on the property distribution of each property. For the y-extrapolation (OOD) case, two condition values ($\mu\pm 4\sigma$) were used. For each condition value, generation was attempted 2,000 times. Numbers in boldface are better than ControlVAE.
In-domain Out-of-distribution domain
molWt logP QED Average molWt logP QED Average
Base 0.224 0.308 0.329 0.287 0.100 0.069 0.110 0.0929
ControlVAE 0.910 0.916 0.913 0.913 0.377 0.311 0.403 0.363
SD_dif,col 0.903 0.907 0.911 0.907 0.307 0.278 0.398 0.328
MD 0.946 0.944 0.946 0.945 0.489 0.415 0.472 0.458
MD_col 0.915 0.922 0.909 0.915 0.431 0.398 0.414 0.414
MD_dif 0.928 0.936 0.938 0.934 0.457 0.427 0.468 0.451
MD_dif,col 0.935 0.933 0.932 0.933 0.457 0.539 0.436 0.477

Each encoder and decoder of the VAE models consists of three layers with self-attention, as in the transformer; the dimension of the latent variable is set to 100 and the dimension of the properties (conditions) is 3. We used the Adam optimizer Ruder (2016) with $\beta_{1}=0.9$, $\beta_{2}=0.999$, and $\epsilon=10^{-6}$, and the initial learning rate was 0.001. Each model was trained for 100 epochs (in our experiments, single-decoder VAE models trained for 1,000 epochs showed a worse reconstruction success rate on the unseen dataset (ZINC310K) than models trained for 100 epochs; accordingly, the average reconstruction success rate over the seen and unseen datasets at 100 epochs was also better than at 1,000 epochs, as was the molecular generative efficiency), and the batch size was 128.

We compared the following models; their details are as follows:

\bullet Base: vanilla VAE with k-annealing Bowman et al. (2016)

\bullet ControlVAE: controllable VAE Shao et al. (2020). The vanilla VAE suffered from posterior collapse in the molecular domain even with k-annealing. A small $\beta$ for $\mathcal{L}_{\text{reg}}$ helps to reduce posterior collapse and also enhances molecular generative performance Yan et al. (2019). According to our experimental results, ControlVAE showed proper performance when $\mathcal{L}_{\text{reg}}$ was controlled to about 15.

\bullet SD_dif,col: ControlVAE with $z$ sampled 3 times and the collaborative loss. However, it cannot exploit the diverse specialities of multiple decoders.

\bullet MD: ControlVAE with multiple decoders, using only the individual loss and the same sampled $z$ for all decoders.

\bullet MD_col: ControlVAE with multiple decoders and the collaborative loss.

\bullet MD_dif: ControlVAE with multiple decoders and multi-$z$ sampling. Each decoder is trained with a different $z$ from the same input data.

\bullet MD_dif,col: ControlVAE with multiple decoders, the collaborative loss and multi-$z$ sampling.

5.2 Evaluation: Training Phase

ZINC250K was used as the training data, and the reconstruction loss of each model was measured as the evaluation metric. For the single-decoder VAEs (Base, ControlVAE, SD_dif,col), the reconstruction loss is the sum of the cross-entropy for each token between the decoder's output and the true label. For the MDs (MD, MD_col, MD_dif and MD_dif,col), the ensemble of the decoders' outputs is used as the output for the reconstruction loss.

We fixed the number of decoders in the MDs to 3, which is sufficient to exploit the advantages of the proposed method. For a fair comparison, the model size of the MDs was adjusted to be the same as that of the SDs (each decoder of the MDs has the same size as the encoder).

The size and reconstruction loss of each model are shown in Table 1. Although we applied k-annealing to Base, its training results were not satisfactory. ControlVAE performed far better than Base, so we decided to compare our proposed methods against ControlVAE. All MD-VAE models except MD showed a smaller reconstruction loss than ControlVAE. Through these comparisons, it can be seen that the proposed methods are effective in reducing the reconstruction error during training. We also measured the KLD between the 3 decoders of the MDs, and confirmed that the KLD of MD_dif,col (0.4384) is greater than that of MD (0.2179). This means that using different latent variables and the collaborative loss for the decoders in MD-VAE results in decoders with more diverse features, which is important for enhancing the generalization capacity.

5.3 Evaluation: Reconstruction Success Rate

It is important how accurately the molecule $x$ is restored from $\mu_{\phi}(x,y)$ through the decoder. This shows whether the $z$-space appropriately represents the molecule space of not only the training dataset but also an unseen dataset. In order to generate new molecules, the reconstruction success rate on unseen data is a crucial point. To verify this, we checked whether molecules were properly restored at the token level. This evaluation was performed under teacher forcing without the reparameterization trick. We report the reconstruction success rates for the training (ZINC250K) and unseen (ZINC310K) datasets in Table 2. For this experiment, duplicate samples were removed from ZINC310K. All MDs outperformed Base and ControlVAE in both the seen and unseen cases. In particular, the MDs showed a bigger improvement in the unseen (OOD) case. This tendency grows with longer training (when each model was trained for 1,000 epochs, ControlVAE and MD_dif,col showed reconstruction success rates of 0.846 and 0.894 for the unseen case, respectively).
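The evaluation protocol can be sketched as follows, assuming hypothetical `encode` and `decode_teacher_forcing` interfaces; a reconstruction counts as a success only if every token matches.

```python
import torch

@torch.no_grad()
def reconstruction_success_rate(model, loader, device="cuda"):
    """Sketch: encode each molecule, decode from the posterior mean mu(x, y)
    under teacher forcing (no reparameterization), and count a success only
    when every token is reproduced exactly. `model.encode` and
    `model.decode_teacher_forcing` are assumed interfaces; padding positions
    should be masked in practice."""
    n_success, n_total = 0, 0
    for tokens, props in loader:
        tokens, props = tokens.to(device), props.to(device)
        mu, _ = model.encode(tokens, props)                    # use mu, not a sample
        logits = model.decode_teacher_forcing(tokens[:, :-1], props, mu)
        pred = logits.argmax(dim=-1)                           # (batch, seq_len)
        exact = (pred == tokens[:, 1:]).all(dim=1)             # token-level exact match
        n_success += exact.sum().item()
        n_total += tokens.size(0)
    return n_success / n_total
```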

Table 4: The conditional satisfaction in terms of the MAE between a condition and the simulated value of the top-1 molecule generated by each model (the lower the better). For conditional molecular generation, the quality of the generated molecules is one of the essential factors. For each condition and property, the best generated molecules are compared. For each case, molecular generation was attempted 2,000 times by each model. (QED = 1.2861 is an improper condition because it lies in a physically absent region; the maximum QED from RDKit is 0.948.)
Top1 MAE In-domain Out-of-distribution domain
Property molWt LogP QED molWt LogP QED
Condition 434, 330, 230 4.816, 2.457, 0.098 0.9598, 0.7318, 0.5038 580 84 8.194 -3.281 1.2861 0.1778
Base 0.1520 0.0008 0.0041 0.0810 0.0780 0.0013 - - 0.0006
ControlVAE 0.1177 0.0005 0.0041 0.0800 0.0740 0.0005 1.3598 0.5771 0.0008
SD_dif,col 0.0850 0.0002 0.0044 0.0370 0.0780 0.0001 - 0.6720 0.0002
MD 0.0940 0.0003 0.0040 0.1740 0.0980 0.0003 0.0204 0.4405 0.0015
MD_col 0.0497 0.0013 0.0042 0.0760 0.0740 0.0011 0.0069 0.5592 0.0002
MD_dif 0.0797 0.0007 0.0041 0.0470 0.0074 0.0051 0.0003 0.3638 0.0002
MD_dif,col 0.0513 0.0004 0.0041 0.0620 0.1140 0.0005 0.0013 0.4242 0.0006

5.4 Evaluation: Molecular Generative Efficiency

The training loss and reconstruction success rate can be important indicators for verifying molecular generation. However, they are insufficient for judging the performance of molecular generative models. In this subsection, our proposed methods are evaluated in terms of molecular generative efficiency. The generative efficiency is the rate of generated molecules satisfying validity, uniqueness, and novelty. Validity means that the generated molecule has a sound structure, determined using the RDKit package RDKit, online. A molecule is said to be novel if it is not in the training database. Uniqueness refers to how many molecules are generated without duplication. For example, if the generative efficiency is 0.9, 90 molecules are sound, distinct and novel when molecular generation is attempted 100 times.
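A minimal sketch of this metric with RDKit is shown below; canonical SMILES are used for the uniqueness and novelty checks, which is a bookkeeping assumption rather than a detail stated in the paper.

```python
from rdkit import Chem

def generative_efficiency(generated_smiles, train_smiles):
    """Fraction of generated strings that are simultaneously valid (parsed by
    RDKit), unique (no duplicates among the generated set) and novel (absent
    from the training set); comparisons use canonical SMILES."""
    train_canon = {Chem.MolToSmiles(m)
                   for m in (Chem.MolFromSmiles(s) for s in train_smiles)
                   if m is not None}
    seen, hits = set(), 0
    for s in generated_smiles:
        mol = Chem.MolFromSmiles(s)
        if mol is None:                                # validity
            continue
        canon = Chem.MolToSmiles(mol)
        if canon in seen or canon in train_canon:      # uniqueness / novelty
            continue
        seen.add(canon)
        hits += 1
    return hits / len(generated_smiles)
```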

Instead of using the latent space approximated from a given molecule, latent variables are sampled from the prior and new molecules are generated as in Eq. (3). In order to generate molecules, it is necessary to specify the conditions that the generated molecules should have. Each molecule of the ZINC dataset has three properties: molWt, LogP and QED. One of the properties was determined manually as below, and the other properties were sampled from the conditional probability distribution.

We specified two types of conditions: in-domain and y-extrapolation (OOD). In the in-domain case, the mean $\mu$ and the lower and upper limits of the 90% confidence interval ($\mu\pm 1.645\sigma$) of the training dataset were used (molWt $\in\{330,434,230\}$, LogP $\in\{2.457,4.816,0.098\}$, QED $\in\{0.7318,0.9598,0.5038\}$). Meanwhile, in the OOD case, outlier values ($\mu\pm 4\sigma$) were used (molWt $\in\{580,84\}$, LogP $\in\{8.194,-3.281\}$, QED $\in\{0.1775,1.2861\}$). For each condition, 2,000 molecules were generated by each method (for each model, generation was attempted 30,000 times in total).
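The condition values above can be reproduced from the per-property statistics of the training set, as in the following sketch.

```python
import numpy as np

def condition_values(prop_values):
    """Per-property condition grid: the mean and the 90% confidence-interval
    bounds (mu +/- 1.645*sigma) for the in-domain case, and mu +/- 4*sigma
    for the OOD (y-extrapolation) case."""
    mu, sigma = float(np.mean(prop_values)), float(np.std(prop_values))
    in_domain = [mu, mu + 1.645 * sigma, mu - 1.645 * sigma]
    ood = [mu + 4.0 * sigma, mu - 4.0 * sigma]
    return in_domain, ood
```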

Table 3 reports the generative efficiency for each property. In the in-domain case, the MDs (except MD_col for QED) showed higher efficiency than ControlVAE, and MD outperformed ControlVAE by 3.5% in relative terms. In the OOD case, the relative improvement was larger than in the in-domain case (the relative improvements of MD_dif,col for the in-domain and OOD cases are 2.2% and 31.4%, respectively).

molWt: 580 → 580.047   molWt: 84 → 84.074
LogP: 8.194 → 8.1945   LogP: -3.281 → -3.2813
QED: 0.9598 → 0.948094   QED: 0.1775 → 0.177306
Figure 3: Top-1 generated molecules of the proposed method. The text below each molecular structure gives the property, the condition, and the value predicted by RDKit.

5.5 Evaluation: Conditional Satisfaction

Another important indicator is the conditional satisfaction of novel molecules. In the actual molecular generation task, discovering new molecules with properties close to the target properties is essential. For this reason, we measured the mean absolute error (MAE) of the top-1 molecule, i.e., the generated molecule with the smallest MAE with respect to the input conditions, as shown in Table 4. The three properties of the generated molecules were all calculated using RDKit. In the table, a dash (-) means that the model did not generate even a single valid molecule. When the condition is QED = 1.2861, most models showed a high MAE because 1.2861 is a value that cannot physically exist for the QED property (the maximum QED value from RDKit is 0.948). In the in-domain case, most proposed models achieved a lower MAE than ControlVAE. In the OOD case, MD_dif,col outperformed ControlVAE except for the condition molWt = 84. In particular, the MDs showed remarkable performance when the LogP condition was -3.281. In a nutshell, the proposed methods on average showed a lower MAE than ControlVAE and Base. The top-1 molecules of the proposed methods for the OOD case are presented in Figure 3.
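The top-1 MAE can be computed from the generated SMILES and RDKit property calculators as in the sketch below; the property-to-function mapping is the standard RDKit one, while the helper itself is illustrative.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def top1_mae(generated_smiles, target_value, prop="LogP"):
    """Recompute the chosen property with RDKit for every valid generated
    molecule and report the smallest absolute error to the target condition
    (the top-1 MAE of Table 4)."""
    calc = {"molWt": Descriptors.MolWt,
            "LogP": Descriptors.MolLogP,
            "QED": QED.qed}[prop]
    errors = [abs(calc(m) - target_value)
              for m in (Chem.MolFromSmiles(s) for s in generated_smiles)
              if m is not None]
    return min(errors) if errors else None   # None corresponds to "-" in Table 4
```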

5.6 Evaluation: Performance Comparison as the Number of Decoder

So far, we have shown the proposed method's performance when the number of decoders was 3. In this subsection, we evaluate the performance of MD_dif,col according to the number of decoders. Since Base showed poor performance, we compared ControlVAE with MD_dif,col; ControlVAE was trained at several model sizes. Figure 1(b) shows the reconstruction loss according to the number of decoders and the model size (MB). Each model was trained for 100 epochs, the number of decoders of MD_dif,col ranges from 2 to 7, and the size of the encoder of MD_dif,col is kept the same. In all cases, MD_dif,col showed a lower reconstruction loss than ControlVAE.

Figure 1(c) shows the reconstruction success rate across model sizes. MD_dif,col outperformed ControlVAE in not only the unseen case but also the seen case. In particular, although ControlVAE could not maintain its OOD performance as the model size increased, MD_dif,col showed stable performance.

6 Conclusion

In this paper, we propose a simple method for the molecular generation task. Our main idea is to maintain multiple decoders while sharing a single encoder. To facilitate synergy between decoders, we introduce a collaborative loss function. We also propose to utilize a different latent variable for each decoder in order to diversify the decoders while they collaborate to achieve a common goal. In our experiments, especially for out-of-distribution conditions, MD-VAE is far better than the baselines with respect to the efficiency and conditional satisfaction of generated molecules.

References

  • Bleicher et al. [2003] Konrad H Bleicher, Hans-Joachim Böhm, Klaus Müller, and Alexander I Alanine. Hit and lead generation: beyond high-throughput screening. Nature reviews Drug discovery, 2(5):369–378, 2003.
  • Bowman et al. [2016] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, 2016.
  • Breiman [1996] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
  • Brown et al. [2005] Gavin Brown, Jeremy Wyatt, Rachel Harris, and Xin Yao. Diversity creation methods: a survey and categorisation. Information fusion, 6(1):5–20, 2005.
  • Cho et al. [2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP 2014, 2014.
  • Dai et al. [2018] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. In International Conference on Learning Representations, 2018.
  • Domingos [2000] Pedro Domingos. Bayesian averaging of classifiers and the overfitting problem. International Conference on Machine Learning (ICML), 2000.
  • Freund et al. [1999] Yoav Freund, Robert Schapire, and Naoki Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14:771–780, 1999.
  • Gómez-Bombarelli et al. [2018] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268–276, 2018.
  • Hsu et al. [2017] Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
  • Jin et al. [2018] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2018.
  • Kang and Cho [2018] Seokho Kang and Kyunghyun Cho. Conditional molecular design with deep generative models. Journal of chemical information and modeling, 59(1):43–52, 2018.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  • Kingma et al. [2014] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581–3589, 2014.
  • Kuncheva and Christopher J. [2003] Ludmila I. Kuncheva and Whitaker Christopher J. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning, 51(2):181–207, 2003.
  • Kusner et al. [2017] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1945–1954. JMLR. org, 2017.
  • Montero et al. [2020] Milton Llera Montero, Casimir JH Ludwig, Rui Ponte Costa, Gaurav Malhotra, and Jeffrey Bowers. The role of disentanglement in generalisation. In International Conference on Learning Representations, 2020.
  • Opitz and Maclin [1999] David Opitz and Richard Maclin. Popular ensemble methods: An empirical study. Journal of artificial intelligence research, 11:169–198, 1999.
  • Pu et al. [2016] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Carin Lawrence. Variational autoencoder for deep learning of images, labels and captions. Advances in neural information processing systems, 29:2352–2360, 2016.
  • [20] RDKit: Open-source cheminformatics. http://www.rdkit.org. [Online; accessed 11-April-2013].
  • Ruder [2016] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • Sak et al. [2014] Hasim Sak, Senior Andrew W., and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. arXiv preprint arXiv:1402.1128, 2014.
  • Sanchez-Lengeling and Aspuru-Guzik [2018] Benjamin Sanchez-Lengeling and Alán Aspuru-Guzik. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361(6400):360–365, 2018.
  • Shao et al. [2020] Huajie Shao, Shuochao Yao, Dachun Sun, Aston Zhang, Liu Liu, Shengzhong, Dongxin, Jun Wang, and Tarek Abdelzaher. Controlvae: Controllable variational autoencoder. International Conference on Machine Learning (ICML), 2020.
  • Shi et al. [2020] Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. Graphaf: a flow-based autoregressive model for molecular graph generation. Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  • Sterling and Irwin [2015] Teague Sterling and John J Irwin. Zinc 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11):2324–2337, 2015.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 2017-Decem, pages 5999–6009, jun 2017.
  • Wang et al. [2020] Feng Wang, Yixuan Li, Fanshu Liao, and Hongyang Yan. An ensemble learning based prediction strategy for dynamic multi-objective optimization. Applied Soft Computing, 96(106592), 2020.
  • Wang et al. [2021] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X Yu. Long-tailed recognition by routing diverse distribution-aware experts. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  • Weininger et al. [1989] David Weininger, Arthur Weininger, and Joseph L Weininger. Smiles. 2. algorithm for generation of unique smiles notation. Journal of chemical information and computer sciences, 29(2):97–101, 1989.
  • Yan et al. [2019] Chaochao Yan, Sheng Wang, Jinyu Yang, Tingyang Xu, and Junzhou Huang. Re-balancing variational autoencoder loss for molecule sequence generation. arXiv preprint arXiv:1910.00698, 2019.