

LICO: Large Language Models for In-Context Molecular Optimization

Tung Nguyen, Aditya Grover
University of California, Los Angeles
{tungnd,adityag}@cs.ucla.edu
Abstract

Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark comprising over 20 objective functions.

1 Introduction

Black-box optimization (BBO) is the problem of optimizing an unknown, often complex objective function without direct access to its structure or derivatives. This problem is ubiquitous in many science and engineering fields, including material discovery [18], protein engineering [8, 40, 2], molecular design [16], mechanical design [5, 30], and neural architecture search [52]. Typically, BBO involves an iterative process where each step constructs a surrogate model to approximate the objective function. This model then guides the selection of promising candidates for subsequent evaluation. The main challenge of this approach lies in learning an effective surrogate function that can accurately estimate the objective from limited historical data.

In stark contrast, we have seen impressive generalization abilities of Large Language Models (LLMs) [10, 1, 11, 43, 44, 45, 22, 23] for language-driven reasoning over many kinds of domains. By pretraining on Internet-scale data, LLMs have demonstrated exceptional pattern-matching abilities and generalization from limited observations in both natural language [10, 25, 48] and other domains [32, 35, 17]. This positions LLMs as a promising solution for enhancing surrogate modeling in BBO. Several recent works have indeed shown the potential of LLMs for solving optimization problems [50, 12, 51, 3]. The main idea behind these methods is to frame the optimization problem in natural language, and to prompt the language model with previously collected observations to make predictions for new data points [3] or to propose better candidates [50, 12, 51, 33, 38, 34, 28, 7, 31]. However, this approach has several limitations. First, performing optimization in the text space requires the problem and solution to be expressed in natural language, limiting this approach to selected domains. Second, the scarcity of domain-relevant data in the text corpora used to train language models poses generalization challenges when applying these models to general scientific domains such as molecular optimization. Consequently, existing works have only demonstrated the success of LLMs in neural architecture search [3, 12, 51], prompt optimization [50], and code generation [33, 28], domains that are well-represented in the training data of common language models [10, 44, 22]. Third, relying on verbose textual descriptions for both the problem and its solution imposes practical constraints by inflating the context length, thereby reducing the number of historical observations the model can effectively utilize.

In this work, we propose Large Language Models for In-Context Optimization (LICO), a general-purpose model that leverages LLMs for black-box optimization, with a particular application to the molecular domain. To generalize a language model to a new scientific domain unseen during pretraining, we equip the model with two embedding layers that embed previously collected molecules and their scores, and a prediction head that predicts the score of unseen candidates. Intuitively, the embedding layers map the molecules and their scores to the feature space already learned by the language model, allowing the model to perform in-context learning in this space instead of the raw text space. Unlike previous methods, this approach is applicable to domains that may not be easily described in natural language, such as molecular optimization. Moreover, avoiding verbose textual descriptions enables the model to condition on more historical observations, thus scaling better to harder problems that cannot be solved within a few steps.

We train the new layers together with the (frozen) LLM to perform in-context predictions on a family of functions. Specifically, for each function sampled from this family, we condition the model on a set of inputs and their corresponding evaluations, and task the model to predict the function value of the remaining data points. This task mimics surrogate modeling in BBO, where the surrogate model has to iteratively update its estimation of the underlying objective by conditioning on historical data. An ideal function family to train the model should be close to the target objective functions we want to optimize, but also be diverse enough to encourage generalization. Therefore, we propose to combine intrinsic functions and synthetically generated functions for training LICO. Intrinsic functions are inherent properties of the input that are easy to evaluate. In molecular optimization, for example, intrinsic functions include molecular weight, the number of rings, or heavy atom count, which are obtained via simple computation on the molecule. These intrinsic functions are closely related to the actual objective functions we want to optimize such as bioactivities against a target disease. To facilitate generalization outside of the intrinsic functions, we additionally train LICO on synthetic functions defined over the same target domain that are generated by Gaussian Processes. Our empirical evidence shows the importance of learning from both intrinsic and synthetic functions to the performance of the model on downstream tasks. Figure 1 illustrates our approach.

After training, LICO is capable of optimizing a wide range of molecular properties purely via in-context prompting. While the methodology of LICO applies to general scientific domains, in this paper we focus on molecular optimization. This problem plays a pivotal role in advancing drug and material discovery. The complexity of molecular structures and the vastness of the chemical space present unique challenges to black-box optimization algorithms. Moreover, since molecule-relevant data is likely under-represented in the pretraining corpora of existing language models, molecular optimization is a good testbed for the performance and applicability of LICO. We evaluate LICO against state-of-the-art methods on Practical Molecular Optimization (PMO) [14], a challenging molecular optimization benchmark with over 20 objective functions. The experiments show that LICO achieves the best performance and is the highest-ranked method on the benchmark.


Figure 1: Our proposed approach. We equip a pretrained LLM with an embedding layer for $x$, an embedding layer for $y$, and a prediction layer. We train the model on semi-synthetic data to predict $y$ given $x$ and previous $(x,y)$ pairs. We prepend each $x$ with a special token <x> and each $y$ with a special token <y> to guide in-context reasoning.

2 Problem Statement

Let $f:\mathcal{X}\rightarrow\mathbb{R}$ be a real-valued function that operates on a $d$-dimensional space $\mathcal{X}\subseteq\mathbb{R}^{d}$. In black-box optimization (BBO), the goal is to find the input $x^{\star}$ that maximizes $f$:

x^{\star} \in \operatorname*{arg\,max}_{x \in \mathcal{X}} f(x), \qquad (1)

where we do not have direct access to the structure or gradient information of $f$. In molecular optimization, $\mathcal{X}$ is the space of all possible molecules, and $f$ is a certain property of the molecule we want to optimize, such as bioactivity against a target disease. While $f$ is unknown, we often have access to an unlabeled dataset $\mathcal{D}_{\text{u}}$ that consists of molecules $x$ without the corresponding function values $y$. ZINC [42] is such a dataset, containing thousands to millions of unlabeled molecules.

To solve the optimization in (1), we can query $f$ only within a limited budget, since evaluation often involves expensive physical experiments. To overcome this challenge, a common BBO approach learns a surrogate model $f_{\theta}$ that approximates the objective $f$ from past observations $\mathcal{D}_{\text{obs}}=\{(x_{i},y_{i})\}_{i=1}^{n}$, which starts empty and incrementally expands with the new data points $(x, f(x))$ we query at each iteration. Formally, a surrogate model represents a predictive distribution $p_{\theta}(y \mid x, \mathcal{D}_{\text{obs}})$ of the function value $y$ conditioned on the input $x$ and the evolving observed dataset $\mathcal{D}_{\text{obs}}$. The predictions of this surrogate guide the selection of candidates, balancing exploration and exploitation, for subsequent function evaluation. The newly selected points are then added to $\mathcal{D}_{\text{obs}}$, and the process continues.

The success of this approach highly depends on the efficiency of the surrogate model $f_{\theta}$ in estimating $f$ from limited data in $\mathcal{D}_{\text{obs}}$ at each iteration. This resembles few-shot prediction, a setting in which Large Language Models (LLMs) have proven to excel. By pretraining on vast Internet-scale data, LLMs can learn generalizable patterns from limited data, and are capable of adapting to multiple functions at test time simply via in-context prompting [10, 35, 26, 27]. A recent line of works [50, 51, 12, 3] has exploited this ability of LLMs for optimization, but they rely on natural language as the interface, thus lacking generality to scientific domains. In this work, we propose a more general and efficient approach to leveraging LLMs for black-box optimization.

3 Related Work

LLMs for Optimization Recent works have explored the use of LLMs for optimization. The general idea behind these works is to prompt the model with the textual description of the optimization problem and historical evaluations for few-shot reasoning. Yang et al. [50], Liu et al. [31], Zhang et al. [51], Ma et al. [33] propose to prompt the language model to directly suggest better candidates to evaluate given the past inputs and their corresponding scores. Meyerson et al. [34], Lehman et al. [28], Bradley et al. [7] integrate LLMs with evolutionary algorithms, and prompt the model to perform crossover and mutation operations based on the population at each optimization step. Anonymous [3] study the use of LLMs to enhance several components in Bayesian optimization, including warmstarting, surrogate modeling, and candidate generation.

The common approach in existing works has several inherent limitations. First, for general scientific domains, the input $x$ may not be easily described in natural language. Second, even when a textual description of the input exists (for instance, molecules can be represented by SMILES strings [49]), molecule-relevant data is likely significantly under-represented in the text corpora used for training existing language models. This hinders the generalization of LLMs to new domains outside of their training distribution. These challenges are why existing works only consider hyperparameter optimization, prompt optimization, and code generation, domains that are easy to describe in text and well-represented in the training data of LLMs. Furthermore, from an engineering perspective, naively prompting a language model with verbose textual descriptions of the input $x$ results in an excessively long context, reducing the number of examples the model can condition on. For example, an LLM with a maximum context length of 4000 tokens can only utilize up to 100 past observations, assuming an average length of 40 tokens per data point. This practically limits the scalability of this approach to harder problems that require more steps to solve.

LLMs for Non-language Tasks In addition to optimization, several works have studied extending pretrained LLMs to non-language domains, following two main directions. The first direction considers problems that can be described in natural language, and prompts a pretrained LLM to solve the problem directly in the text space [35, 13, 17]. The second direction tackles more general problems by learning separate encoders for the new domain and aligning them with the embedding space of the pretrained LLM [32, 41, 47, 29]. Our work is closely related to the latter direction. However, as discussed in the following sections, while many of these works leave the word space entirely, we find it beneficial to include language instruction while training the new modules.

4 Method

We introduce LICO, a methodology for extending arbitrary base LLMs for surrogate modeling in black-box optimization. While the method applies to broad scientific domains, we choose molecular optimization to demonstrate LICO in this paper. We aim to develop a model capable of efficiently adapting to various objective functions after training. To achieve this, we propose a simple extension to existing LLMs and an unsupervised objective using semi-synthetic data to facilitate generalization.

4.1 Model Architecture

In black-box optimization, a surrogate model $f_{\theta}$ estimates the distribution of the function value $y$ given the input $x$ and the past observations $\mathcal{D}_{\text{obs}}=\{(x_{i},y_{i})\}_{i=1}^{n}$ the model has collected up to optimization iteration $t$:

p_{\theta}(y \mid x, x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{n}, y_{n}), \qquad (2)

where $x_{i}$ and $y_{i}=f(x_{i})$ are drawn from an objective function $f$. Our goal is to use LLMs to model $p_{\theta}$. As discussed earlier, we make no assumption that the domain $\mathcal{X}$ can be expressed in natural language. To extend a pretrained language model to an arbitrary new domain, we equip the model with 3 new layers: an embedding layer for the inputs $x$, an embedding layer for the function values $y$, and a prediction layer for predicting the unknown function value $y$. Learning separate embedding layers offers several benefits. First, the new embedding layers encode $x$ and $y$ into the shared hidden space the language model has learned via pretraining, enabling the model to escape the raw text space and perform in-context reasoning in the hidden space instead. Moreover, by embedding each input $x$ into a single hidden vector instead of spanning it over several tokens, we effectively reduce the sequence length and thus allow the model to scale to more conditioning examples.

However, it is challenging for the model to perform this prediction task without any context about the task. From the model's point of view, the embeddings of $x$ and $y$ are nothing more than high-dimensional vectors; the model does not know what task it should perform or what each token in the embedding sequence represents. To address this issue, we prepend each sequence with a task prompt, and prepend each input $x$ with a special token <x> and each function value $y$ with a special token <y>. The task prompt instructs the model to perform the task, while the special tokens <x> and <y> inform the model of the position of each input $x$ and its corresponding function value $y$. In other words, we use a language the model has mastered (natural language) to guide the learning of a new "foreign language" (e.g., molecules). In practice, the task prompt is "Each x is a molecule and each y is the property of the corresponding molecule. Predict y given x.", and <x> and <y> are the single characters "x" and "y". Finally, we apply the prediction layer on top of each token <y> to predict the function value given the tokens preceding it. Each prediction consists of a mean and a standard deviation, which are used for the selection of candidates during optimization. Figure 1 illustrates the architecture of LICO.
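To make this concrete, the following is a minimal PyTorch sketch of how the new layers could be wired around a frozen base LLM. The module names, the Hugging Face-style `inputs_embeds` interface, and the Gaussian output head are our illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class LICOSurrogate(nn.Module):
    """Sketch: a frozen decoder-only LLM with new x/y embedding and prediction layers."""

    def __init__(self, llm, tokenizer, x_dim, hidden_dim):
        super().__init__()
        self.llm = llm                                # frozen base model (e.g., LlamaModel)
        self.x_embed = nn.Linear(x_dim, hidden_dim)   # molecule fingerprint -> hidden space
        self.y_embed = nn.Linear(1, hidden_dim)       # scalar score -> hidden space
        self.head = nn.Linear(hidden_dim, 2)          # (mean, log_std) at each <y> position
        prompt = ("Each x is a molecule and each y is the property of the "
                  "corresponding molecule. Predict y given x.")
        self.prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        self.x_tok = tokenizer("x", return_tensors="pt", add_special_tokens=False).input_ids
        self.y_tok = tokenizer("y", return_tensors="pt", add_special_tokens=False).input_ids

    def forward(self, xs, ys):
        # xs: (n, x_dim) fingerprints; ys: (n, 1) function values
        word_emb = self.llm.get_input_embeddings()
        seq, y_positions = [word_emb(self.prompt_ids)[0]], []
        for i in range(xs.shape[0]):
            seq.append(word_emb(self.x_tok)[0])             # special token <x>
            seq.append(self.x_embed(xs[i]).unsqueeze(0))    # embedded molecule
            seq.append(word_emb(self.y_tok)[0])             # special token <y>
            y_positions.append(sum(s.shape[0] for s in seq) - 1)
            seq.append(self.y_embed(ys[i]).unsqueeze(0))    # embedded score
        hidden = self.llm(inputs_embeds=torch.cat(seq).unsqueeze(0)).last_hidden_state[0]
        mean, log_std = self.head(hidden[y_positions]).unbind(-1)
        return mean, log_std.exp()   # one Gaussian prediction per <y> token
```

Because attention is causal, the prediction at the $i$-th <y> token sees only the task prompt and $x_{1}, y_{1}, \dots, x_{i}$, matching the conditioning in (2).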

It is worth noting that the combination of natural language and domain-specific embeddings is the main distinction between LICO and previous works such as FPT [32], which applies pretrained LLMs to sequence classification tasks in non-language modalities. FPT also learns new embedding layers for the new domain, but relies entirely on the pretrained self-attention layers to model these embeddings, without any language instructions. This distinction stems from the different nature of the tasks we aim to tackle. In sequence classification, the model produces a single prediction for the entire sequence, so a good representation of the sequence via self-attention is sufficient. For in-context learning, however, the model needs to associate each input $x$ with its corresponding value $y$ to infer the underlying function $f$ and make predictions for unknown $y$. A language instruction that specifies where $x$ is and where $y$ is helps the model identify this association and improves its in-context reasoning. Our ablation study in Section 5.2.1 confirms the utility of retaining language tokens.

4.2 Semi-synthetic Training

Our goal is to train LICO on the unlabeled data $\mathcal{D}_{\text{u}}$ with an unsupervised objective that facilitates efficient generalization to an arbitrary objective function $f$ in the same domain $\mathcal{X}$ after training. Our key insight is that if we train the model to perform the estimation in (2) for a wide range of functions, it should be able to adapt to any objective function after training. While the true function values are unknown before optimization, we can use the unlabeled data $x$ to generate training data from other functions. Assume we have access to a family of functions $\tilde{\mathcal{F}}$ that operate on the same input domain $\mathcal{X}$. For each function $\tilde{f}$ drawn from $\tilde{\mathcal{F}}$, we sample a set of function evaluations $\{(x_{i},y_{i})\}_{i=1}^{n}$ and train the model to autoregressively predict $y$ given the input $x$ and preceding $(x,y)$ pairs:

\mathcal{L}(\theta) = \mathbb{E}\left[\sum_{i=1}^{n} \log p_{\theta}(y_{i} \mid x_{i}, x_{<i}, y_{<i})\right], \qquad (3)

in which the expectation is taken with respect to $\tilde{f} \sim \tilde{\mathcal{F}}$, $x_{1:n} \sim \mathcal{D}_{\text{u}}$, and $y_{1:n} = \tilde{f}(x_{1:n})$. After training, the estimation in (2) can be performed purely via in-context prompting, where we condition the model on past observations to make predictions for new data points.
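Assuming the predictive distribution at each <y> position is a Gaussian parameterized by the mean and standard deviation mentioned above, the objective in (3) amounts to an autoregressive Gaussian log-likelihood. A minimal sketch, reusing the `LICOSurrogate` interface from the earlier block:

```python
import torch

def lico_loss(model, xs, ys):
    """Negative of Eq. (3) for one sampled function: autoregressive Gaussian
    NLL of each y_i given x_i and the preceding (x, y) pairs."""
    mean, std = model(xs, ys)                      # one (mean, std) per <y> position
    dist = torch.distributions.Normal(mean, std)
    # Causal attention guarantees prediction i conditions only on x_i, x_<i, y_<i.
    return -dist.log_prob(ys.squeeze(-1)).mean()   # minimizing NLL maximizes Eq. (3)
```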

Ideally, the function family $\tilde{\mathcal{F}}$ should be close to the downstream objective $f$, but also be diverse enough to encourage broad generalization across functions. To achieve this, we propose to train LICO on a mix of intrinsic and synthetic functions, which we term semi-synthetic training. Intrinsic functions map each input molecule $x$ to an inherent property of $x$. For example, molecular weight, the number of rings, or heavy atom count are intrinsic properties of the molecule that are known from domain knowledge or can be easily computed using standard tools. These intrinsic properties are closely related to many downstream objective functions. For example, the biological activity of a drug molecule, such as its ability to inhibit a particular enzyme, is often closely related to the molecule's shape or conformation. Training LICO on these functions therefore encourages the model to learn useful representations of the input $x$ and to acquire good prior knowledge about the optimization domain.
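For intuition, a handful of such intrinsic functions can be computed with RDKit in a few lines. The subset below is illustrative (the full list of 47 is in Appendix A.1), and the dictionary layout is ours:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# A few intrinsic functions: cheap, deterministic computations on the molecular graph.
INTRINSIC_FUNCTIONS = {
    "molecular_weight": Descriptors.MolWt,
    "num_rings": Descriptors.RingCount,
    "heavy_atom_count": Descriptors.HeavyAtomCount,
    "num_h_donors": Descriptors.NumHDonors,
    "num_rotatable_bonds": Descriptors.NumRotatableBonds,
}

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
labels = {name: fn(mol) for name, fn in INTRINSIC_FUNCTIONS.items()}
# e.g., {'molecular_weight': 180.159, 'num_rings': 1, 'heavy_atom_count': 13, ...}
```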

However, it is important to note that we are ultimately interested in optimizing functions outside of the intrinsic function set. Training the model only on a limited set of intrinsic functions may result in overfitting and poor generalization to unseen functions. To diversify the training data, we additionally train the model on synthetically generated functions. A synthetic function family should be easy to sample from and capable of producing diverse functions. Many such families exist, including Gaussian Processes (GPs), randomly constructed Gaussian Mixture Models, and randomly initialized neural networks. We choose to generate synthetic functions from Gaussian Processes with a Tanimoto kernel due to its simplicity and efficiency. The Tanimoto kernel, also known as the Jaccard coefficient, measures the similarity between two binary vectors, a representation widely used in scientific domains such as chemistry, drug discovery, and bioinformatics. Specifically, each synthetic function $\tilde{f}$ is sampled as follows:

\tilde{f} \sim \mathcal{GP}(0, \mathcal{K}), \qquad \mathcal{K}(x, x^{\prime}) = \frac{x \cdot x^{\prime}}{\|x\|^{2} + \|x^{\prime}\|^{2} - x \cdot x^{\prime}}, \qquad (4)

where $\mathcal{K}(x, x^{\prime})$ is the Tanimoto kernel that measures the similarity between $x$ and $x^{\prime}$.
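In practice, sampling a synthetic function at a fixed set of fingerprints reduces to one draw from a multivariate Gaussian with the Tanimoto covariance of Eq. (4). A minimal sketch; the jitter term is a standard numerical-stability assumption, not part of the formulation:

```python
import numpy as np

def tanimoto_kernel(X1, X2):
    """Eq. (4) over rows of binary fingerprint matrices X1, X2."""
    dot = X1 @ X2.T
    n1 = (X1 ** 2).sum(-1, keepdims=True)   # ||x||^2 (= bit count for 0/1 vectors)
    n2 = (X2 ** 2).sum(-1, keepdims=True)
    return dot / (n1 + n2.T - dot)

def sample_synthetic_function(X, jitter=1e-6, rng=np.random.default_rng()):
    """Draw y at the given fingerprints X from a GP(0, K_tanimoto) sample."""
    K = tanimoto_kernel(X, X) + jitter * np.eye(len(X))  # jitter keeps K well-conditioned
    return rng.multivariate_normal(np.zeros(len(X)), K)
```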

The final family of functions $\tilde{\mathcal{F}}$ used to train LICO is a mixture of intrinsic and synthetic functions with a certain ratio. This design choice is critical to the model's performance. Intuitively, training on both types of functions ensures proximity to the downstream objectives and good coverage of the function space for efficient generalization. The use of intrinsic functions is also the main difference between our work and ExPT [37], a recent method that studies purely synthetic pretraining for optimization. We hypothesize that while synthetic data alone is sufficient for ExPT on a few simple tasks, for a more complex domain such as molecular optimization, synthetic training provides too little relevant signal for the model to generalize to downstream objectives. We empirically show the importance of both intrinsic and synthetic functions in the ablation study in Section 5.2.2.
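Putting the two sources together, one training sequence could be drawn as follows. The sampling procedure is our illustrative assumption; it reuses `sample_synthetic_function` from the GP sketch above, and `intrinsic_fns` is a list of callables such as the RDKit descriptors shown earlier:

```python
import numpy as np

def sample_training_sequence(mols, X, intrinsic_fns, n, synthetic_ratio=0.1,
                             rng=np.random.default_rng()):
    """One (x, y) sequence from the mixed family: `mols` are RDKit molecules
    from the unlabeled dataset and `X` their fingerprints (row-aligned)."""
    idx = rng.choice(len(mols), size=n, replace=False)
    if rng.random() < synthetic_ratio:                     # synthetic branch
        y = sample_synthetic_function(X[idx], rng=rng)
    else:                                                  # intrinsic branch
        f = intrinsic_fns[rng.integers(len(intrinsic_fns))]
        y = np.array([f(mols[i]) for i in idx])
    return X[idx], y
```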

4.3 LICO for Black-box Optimization

After training, a single LICO model can be used to optimize various objective functions within the domain $\mathcal{X}$. Optimization is an iterative process. At each iteration $t$, we generate a set of candidates $\{x_{i}\}_{i=1}^{C}$, for which the model predicts the mean $\mu_{i}$ and standard deviation $\sigma_{i}$ conditioned on the prior observations $\mathcal{D}_{\text{obs}}$, the dataset of $(x,y)$ pairs collected up to iteration $t$. An acquisition function $\alpha$ then calculates a utility score from $\mu_{i}$ and $\sigma_{i}$ for each candidate, balancing exploration (favoring high $\sigma$) and exploitation (favoring high $\mu$). The top $k$ candidates by utility score are evaluated with the objective function $f$. These $k$ candidates and their evaluations are added to $\mathcal{D}_{\text{obs}}$, and the cycle repeats until the evaluation budget $B$ is exhausted. Algorithm 1 summarizes the optimization process.
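A compact sketch of this loop is shown below. The UCB-style acquisition and the `generate_candidates` and `model.predict` interfaces are illustrative placeholders; the paper specifies only that an acquisition function $\alpha$ trades off $\mu$ and $\sigma$:

```python
import numpy as np

def optimize(model, objective, generate_candidates, budget, k, beta=0.1):
    """Sketch of the optimization loop (Algorithm 1) with a UCB acquisition."""
    D_obs = []                                      # observed (x, y) pairs
    while budget > 0:
        cands = generate_candidates(D_obs)          # e.g., GA-style proposals
        mu, sigma = model.predict(cands, D_obs)     # in-context surrogate predictions
        utility = mu + beta * sigma                 # UCB: exploitation + exploration
        for i in np.argsort(-utility)[:k]:          # evaluate the top-k candidates
            if budget == 0:
                break
            y = objective(cands[i])                 # expensive black-box call
            D_obs.append((cands[i], y))
            budget -= 1
    return max(D_obs, key=lambda pair: pair[1])     # best molecule found
```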

5 Experiments

We evaluate LICO on molecular optimization, where the goal is to design new molecules with desired properties such as high chemical stability, low toxicity, or selective inhibition against a target disease. This problem plays a pivotal role in advancing drug and material discovery.

5.1 PMO Benchmark

Benchmark We evaluate LICO on Practical Molecular Optimization (PMO) [14], a standard benchmark for molecular optimization with a focus on sample efficiency. We experiment on 21 optimization objectives provided by PMO, including QED [6], DRD2 [39], and 19 objective functions from Guacamol [9]. QED assesses a molecule's drug-likeness by identifying certain "red flags". DRD2 is a machine learning model trained on experimental data to predict bioactivities for specific target diseases. Guacamol objectives emulate drug discovery goals through a multi-property objective (MPO) approach, considering factors like similarity to a target molecule, molecular weight, and CLogP. All objective values range from 0 to 1, with 1 indicating the best outcome.

Baselines We compare LICO against three leading methods in PMO, namely REINVENT [39], Graph GA [21], and GP BO [46]. REINVENT is a reinforcement learning method that refines a pretrained RNN for generating SMILES strings. Graph GA, inspired by evolutionary processes, uses crossover operations derived from graph matching, together with mutations at the atom and fragment levels, to explore the molecule space. GP BO is a Bayesian optimization method that constructs a surrogate function using Gaussian Processes and employs an acquisition function combining the surrogate's predictions with uncertainty estimates to guide candidate selection. In addition to the PMO baselines, we also compare LICO with TNP [36], a state-of-the-art transformer model for few-shot learning. GP BO and TNP are the most similar baselines to LICO; the only difference is that LICO uses an LLM for surrogate modeling instead of a GP or a TNP.

LICO training We use ZINC 250K as the unlabeled dataset $\mathcal{D}_{\text{u}}$. ZINC 250K contains around 250,000 molecules sampled from the full ZINC database [42] with moderate size and high pharmaceutical relevance and popularity. We adopt 2-radius, 2048-bit molecular fingerprints as the input feature of each molecule. To generate training data, we use 47 intrinsic properties of the molecule as the intrinsic functions, which we present in detail in Appendix A.1. We train LICO for 20000 iterations with a batch size of 4, where each data point is a sequence of $(x,y)$ pairs sampled from an intrinsic or synthetic function. The ratio of synthetic data is 0.1. We use Llama-2-7b [45] as the base LLM, and use LoRA [19] for parameter-efficient finetuning.
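For reference, this featurization is a standard RDKit recipe for radius-2, 2048-bit Morgan fingerprints (the helper name is ours):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def featurize(smiles, radius=2, n_bits=2048):
    """Map a SMILES string to the binary fingerprint used as the input x."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

x = featurize("c1ccccc1O")  # phenol -> 0/1 vector of length 2048
```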

Optimization details We limit the optimization budget of all methods to 1000 function calls. We report the area under the curve (AUC) of the top-10 average objective value against the number of function calls as the performance metric. The AUC metric favors methods that obtain high values with a smaller number of function calls, thus evaluating both optimization capability and sample efficiency. We min-max scale the AUC values to $[0,1]$. We aggregate the performance of each method across 5 random seeds for better reproducibility, as suggested by PMO.
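Since all objective values lie in [0, 1], one simple way to approximate this metric is the discrete average of the running top-10 mean over the query sequence, which is already normalized; the exact PMO implementation may differ in details:

```python
import numpy as np

def auc_top10(history):
    """Approximate AUC of the top-10 average objective value vs. oracle calls.
    `history` lists objective values in the order they were queried."""
    scores, running_top10_means = [], []
    for y in history:
        scores.append(y)
        top10 = sorted(scores, reverse=True)[:10]
        running_top10_means.append(np.mean(top10))
    return float(np.mean(running_top10_means))   # in [0, 1] for objectives in [0, 1]
```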

Table 1: The performance of LICO and the baselines on 21 optimization tasks in PMO. Higher is better. We report the mean and standard deviation of scores over 5 random seeds. The best method for each task is marked with * and the second-best with †.
Task GP BO Graph GA LICO REINVENT TNP
albuterol_similarity 0.636±0.106† 0.583±0.065 0.656±0.125* 0.496±0.020 0.550±0.034
amlodipine_mpo 0.519±0.014† 0.501±0.016 0.541±0.026* 0.472±0.008 0.491±0.014
celecoxib_rediscovery 0.411±0.046 0.424±0.049 0.447±0.073* 0.370±0.029 0.429±0.048†
deco_hop 0.593±0.013† 0.581±0.006 0.596±0.010* 0.572±0.006 0.586±0.003
drd2 0.857±0.080† 0.833±0.065 0.859±0.066* 0.775±0.086 0.758±0.066
fexofenadine_mpo 0.707±0.021* 0.666±0.009 0.700±0.023† 0.650±0.007 0.680±0.015
isomers_c7h8n2o2 0.545±0.158 0.735±0.112† 0.779±0.099* 0.725±0.064 0.694±0.123
isomers_c9h10n2o2pf2cl 0.599±0.059 0.630±0.086 0.672±0.075* 0.630±0.032 0.635±0.071†
median1 0.213±0.020† 0.208±0.015 0.217±0.019* 0.205±0.012 0.210±0.012
median2 0.203±0.009* 0.181±0.009 0.193±0.009† 0.188±0.010 0.186±0.013
mestranol_similarity 0.427±0.025* 0.362±0.017 0.423±0.016† 0.379±0.026 0.368±0.005
osimertinib_mpo 0.766±0.006* 0.751±0.005 0.759±0.008† 0.737±0.007 0.752±0.006
perindopril_mpo 0.458±0.019† 0.435±0.016 0.473±0.009* 0.404±0.009 0.433±0.010
qed 0.912±0.010 0.914±0.007 0.925±0.005* 0.921±0.002† 0.917±0.002
ranolazine_mpo 0.701±0.023* 0.620±0.014 0.687±0.029† 0.574±0.044 0.613±0.033
scaffold_hop 0.478±0.009† 0.461±0.008 0.480±0.008* 0.447±0.010 0.469±0.015
sitagliptin_mpo 0.232±0.083 0.229±0.053 0.315±0.097* 0.261±0.026† 0.221±0.034
thiothixene_rediscovery 0.351±0.039* 0.322±0.023 0.343±0.035† 0.311±0.021 0.307±0.034
troglitazone_rediscovery 0.313±0.018* 0.267±0.015 0.292±0.028† 0.246±0.009 0.266±0.009
valsartan_smarts 0.000±0.000 0.000±0.000 0.000±0.000 0.000±0.000 0.000±0.000
zaleplon_mpo 0.392±0.034 0.374±0.024 0.404±0.022† 0.406±0.017* 0.377±0.018
Sum of scores (↑) 10.313† 10.076 10.760* 9.772 9.944
Mean rank (↓) 2.33† 3.62 1.48* 3.86 3.65

Results Table 1 summarizes the performance of the 5 considered methods across 21 optimization tasks. Overall, LICO is the leading method in this benchmark, achieving the highest aggregated score and the lowest mean rank. Specifically, LICO achieves the best performance in 12/21 tasks and the second-best performance in 8/21 tasks. It is important to note that LICO achieves this impressive result without being explicitly trained on data from the downstream objectives. This shows the effectiveness of semi-synthetic training in enabling generalization to a broad range of functions via in-context prompting. TNP performs poorly in this benchmark, despite sharing a similar architecture with LICO. This performance gap highlights the importance of the pattern-matching capabilities LLMs acquire through extensive pretraining, which are crucial for adapting to new domains.

The second best-performing method in this benchmark is GP BO, which is very similar to LICO; the only difference between the two is the surrogate model. This indicates the superiority of LICO over Gaussian Processes, a popular surrogate model for black-box optimization. To verify this, we compare the predictive performance of LICO and GP on several objective functions. We first label the ZINC unlabeled dataset with the objective functions and randomly choose a subset of the labeled data points for evaluation. For each task, we vary the number of examples given to each method from 32 to 512, and evaluate on 128 held-out data points. We use negative log-likelihood, mean squared error, and root mean squared calibration error as the evaluation metrics. Figure 2 compares the predictive performance of LICO and GP on 3 objective functions: median1, ranolazine_mpo, and troglitazone_rediscovery. The figure shows that optimization performance closely aligns with the predictive performance of the surrogate model. In median1 and ranolazine_mpo, where LICO outperforms GP in optimization score, the model also achieves lower negative log-likelihood, mean squared error, and calibration error. Conversely, LICO has worse predictive performance on troglitazone_rediscovery, where it underperforms GP. This verifies our hypothesis and demonstrates the effectiveness of LICO for surrogate modeling.
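All three metrics can be computed directly from the Gaussian predictions. The coverage-based calibration estimator below is one common recipe and our assumption; the exact estimator used for Figure 2 may differ:

```python
import numpy as np
from scipy import stats

def predictive_metrics(mu, sigma, y, n_levels=10):
    """NLL, MSE, and root mean squared calibration error for Gaussian
    predictions (mu, sigma) against held-out labels y (all 1-D arrays)."""
    nll = float(-stats.norm.logpdf(y, loc=mu, scale=sigma).mean())
    mse = float(((mu - y) ** 2).mean())
    # Calibration: expected vs. observed coverage of the predictive CDF.
    levels = np.linspace(0.05, 0.95, n_levels)
    observed = np.array([(stats.norm.cdf(y, loc=mu, scale=sigma) <= p).mean()
                         for p in levels])
    rmsce = float(np.sqrt(np.mean((observed - levels) ** 2)))
    return nll, mse, rmsce
```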

Figure 2: The predictive performance of LICO and GP on 3 objective functions in PMO with different metrics and varying numbers of observations.
Table 2: Performance of LICO on 5 tasks with different language instructions. The best value in each column is marked with *.
Task albuterol_similarity amlodipine_mpo celecoxib_rediscovery deco_hop drd2 Sum (↑)
LICO w/o Language 0.615±0.104 0.491±0.018 0.396±0.051 0.585±0.010 0.840±0.063 2.927
LICO w/o Task prompt 0.641±0.107 0.523±0.018 0.457±0.041* 0.595±0.006 0.844±0.105 3.060
LICO 0.656±0.125* 0.541±0.026* 0.447±0.073 0.596±0.010* 0.859±0.066* 3.099*
Table 3: Performance of LICO on 5 tasks with different ratios of synthetic data. The best value in each column is marked with *.
Task albuterol_similarity amlodipine_mpo celecoxib_rediscovery deco_hop drd2 Sum (↑)
LICO Intrinsic 0.598±0.115 0.524±0.029 0.412±0.042 0.585±0.005 0.891±0.032 3.010
LICO 0.1 Synthetic 0.656±0.125 0.541±0.026* 0.447±0.073* 0.596±0.010* 0.859±0.066 3.099*
LICO 0.5 Synthetic 0.663±0.140* 0.504±0.016 0.402±0.016 0.588±0.006 0.907±0.020* 3.063
LICO Synthetic 0.547±0.080 0.498±0.026 0.404±0.103 0.585±0.003 0.902±0.012 2.936

We also note that there is a discrepancy between Table 1 and the results reported in PMO, for several reasons. First, we use a smaller optimization budget of 1000 queries compared to 10000 in PMO. We believe 1000 is a more reasonable budget while still allowing optimization methods to achieve meaningful performance. Second, we found the GP BO implementation in PMO to be suboptimal, specifically in how it generates the candidate pool. The original implementation applies crossover and mutation to a mix of the best individuals and randomly selected individuals from the last iteration to generate the candidate pool for the current iteration. We found that using only the best individuals from the last iteration results in much better performance. With this change, GP BO becomes a much stronger baseline than Graph GA and REINVENT.

5.2 Ablation Analysis

We perform various ablation studies to understand the importance of different components and design choices in LICO. For the ablation experiments, we consider only the first 5 tasks in Table 1. We report the aggregated performance of different models using AUC Top-10 across 5 random seeds.

5.2.1 LICO without language instruction

First, we examine the importance of language instructions to the performance of LICO. We compare three variants of LICO: 1) LICO without any language instruction, 2) LICO with special tokens <x> and <y> but without a task prompt, and 3) LICO with both special tokens and the task prompt. Table 2 compares the performance of the three variants. LICO performs the best in 4/5 tasks, followed by LICO without the task prompt. LICO without any language instruction performs the worst, often by a large margin. This result confirms the importance of guiding a pretrained LLM with language instruction when applying the model to in-context reasoning in a completely new domain.

5.2.2 LICO with different synthetic ratios

We investigate the importance of training LICO on both intrinsic and synthetic data. To do this, we gradually increase the ratio of synthetic functions in the training data from 0 (intrinsic-only) to 1 (synthetic-only), and compare the performance of LICO across the different ratios. Table 3 shows that LICO with semi-synthetic training performs the best, outperforming both intrinsic-only and synthetic-only training. Training on synthetic data only performs the worst, which is expected, as synthetic functions generated by a GP encode none of the domain knowledge captured by the intrinsic functions. In other words, synthetic data alone provides too little relevant signal for the model to generalize to unseen downstream objectives. Training on intrinsic functions only, on the other hand, achieves good performance on most tasks. However, in tasks like albuterol_similarity, semi-synthetic training outperforms this baseline by a large margin. We hypothesize that the underlying objective in albuterol_similarity is far from the intrinsic functions used to train LICO, leading to poor generalization. Finally, training with small (0.1) to moderate (0.5) ratios of synthetic data achieves similarly good performance.

Table 4: Performance of pretrained vs randomly initialized LLMs.
Task Pretrained LLM Scratch LLM
albuterol_similarity 0.656±0.125 0.575±0.064
amlodipine_mpo 0.541±0.026 0.503±0.029
celecoxib_rediscovery 0.447±0.073 0.410±0.034
deco_hop 0.596±0.010 0.583±0.005
drd2 0.859±0.066 0.827±0.085
Sum 3.099 2.898
Figure 3: LICO with different LLM sizes.

5.2.3 Randomly initialized vs Pretrained LLMs

To understand the importance of using a pretrained LLM, we compare LICO with an autoregressive transformer model of the same size (7B). The transformer architecture is the same as in [15], and we train this model from scratch to perform in-context learning on the semi-synthetic data. Table 4 shows the comparison. The scratch model performs much worse than LICO with a pretrained LLM on all tasks, despite having the same number of parameters. This highlights the importance of the pattern-matching capabilities that LLMs like Llama-2 acquire through extensive language pretraining.

5.2.4 LICO with different LLM sizes

Previous works have shown favorable scaling laws for Large Language Models, where larger models consistently perform better on downstream tasks [24]. In this section, we investigate the scaling properties of LLMs in the context of black-box optimization. Specifically, we compare 4 base LLMs of different sizes: Qwen-1.5 1.8B and 4B [4], Phi-2 2.7B [20], and Llama-2 7B [45]. We use the same language instructions for all models. We evaluate each model on the first 8 tasks in Table 1 and average the results across 5 random seeds. We report the sum of performance across the 8 tasks.

The comparison in Figure 3 shows that optimization performance scales consistently with model size, with Llama-2 7B performing best. This experiment indicates that larger LLMs not only perform better on language tasks but also acquire stronger pattern-matching capabilities that transfer to a completely different domain. Given this scaling trend, the performance of LICO could likely be improved further by scaling up the base LLM.

6 Conclusion and Future Work

We develop LICO, a new method that leverages pretrained Large Language Models for black-box optimization. LICO extends existing LLMs to non-language domains with separate embedding and prediction layers. To enable efficient generalization to various optimization tasks, we train LICO on a diverse set of semi-synthetic functions for few-shot predictions. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark with over 20 objective functions. Ablation analyses highlight the importance of incorporating language instruction to guide in-context learning and semi-synthetic training for better generalization.

One limitation of our method is the assumption of an accessible set of intrinsic functions. While this is true for molecular optimization, it may not apply to other scientific domains. In such cases, a better synthetic data generation process incorporating domain knowledge is needed to aid generalization.

Future directions include evaluating LICO in other domains to test its applicability and generality, exploring other prompts that better exploit the capabilities of pretrained LLMs, and using LLMs for other aspects of optimization, such as candidate suggestion or exploration.

Acknowledgements

This work is supported by grants from Amazon, Cisco, Microsoft, and Samsung.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Angermueller et al. [2020] Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. In International Conference on Learning Representations, 2020.
  • Anonymous [2024] Anonymous. Large language models to enhance bayesian optimization, 2024. URL https://openreview.net/forum?id=OOxotBmGol.
  • Bai et al. [2023] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  • Berkenkamp et al. [2016] Felix Berkenkamp, Angela P Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with gaussian processes. In 2016 IEEE international conference on robotics and automation (ICRA), pages 491–496. IEEE, 2016.
  • Bickerton et al. [2012] G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90–98, 2012.
  • Bradley et al. [2024] Herbie Bradley, Honglu Fan, Theodoros Galanos, Ryan Zhou, Daniel Scott, and Joel Lehman. The openelm library: Leveraging progress in language models for novel evolutionary algorithms. Genetic Programming Theory and Practice XX. Springer, 2024.
  • Brookes et al. [2019] David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International conference on machine learning, pages 773–782. PMLR, 2019.
  • Brown et al. [2019] Nathan Brown, Marco Fiscato, Marwin HS Segler, and Alain C Vaucher. Guacamol: benchmarking models for de novo molecular design. Journal of chemical information and modeling, 59(3):1096–1108, 2019.
  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
  • Bubeck et al. [2023] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
  • Chen et al. [2023] Angelica Chen, David M Dohan, and David R So. Evoprompting: Language models for code-level neural architecture search. arXiv preprint arXiv:2302.14838, 2023.
  • Dinh et al. [2022] Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, and Kangwook Lee. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems, 35:11763–11784, 2022.
  • Gao et al. [2022] Wenhao Gao, Tianfan Fu, Jimeng Sun, and Connor Coley. Sample efficiency matters: a benchmark for practical molecular optimization. Advances in Neural Information Processing Systems, 35:21342–21357, 2022.
  • Garg et al. [2022] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
  • Gaulton et al. [2012] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. Chembl: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2012.
  • Gruver et al. [2023] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. arXiv preprint arXiv:2310.07820, 2023.
  • Hamidieh [2018] Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346–354, 2018.
  • Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Javaheripi et al. [2023] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023.
  • Jensen [2019] Jan H Jensen. A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space. Chemical science, 10(12):3567–3572, 2019.
  • Jiang et al. [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Jiang et al. [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  • Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • Krishnamoorthy et al. [2023a] Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Generative pretraining for black-box optimization. In ICML, 2023a.
  • Krishnamoorthy et al. [2023b] Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Diffusion models for black-box optimization. In ICML, 2023b.
  • Lehman et al. [2023] Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331–366. Springer, 2023.
  • Li et al. [2022] Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems, 35:31199–31212, 2022.
  • Liao et al. [2019] Thomas Liao, Grant Wang, Brian Yang, Rene Lee, Kristofer Pister, Sergey Levine, and Roberto Calandra. Data-efficient learning of morphology and controller for a microrobot. In 2019 International Conference on Robotics and Automation (ICRA), pages 2488–2494. IEEE, 2019.
  • Liu et al. [2023] Fei Liu, Xi Lin, Zhenkun Wang, Shunyu Yao, Xialiang Tong, Mingxuan Yuan, and Qingfu Zhang. Large language model for multi-objective evolutionary optimization. arXiv preprint arXiv:2310.12541, 2023.
  • Lu et al. [2022] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Frozen pretrained transformers as universal computation engines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7628–7636, 2022.
  • Ma et al. [2023] Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.
  • Meyerson et al. [2023] Elliot Meyerson, Mark J Nelson, Herbie Bradley, Arash Moradi, Amy K Hoover, and Joel Lehman. Language model crossover: Variation through few-shot prompting. arXiv preprint arXiv:2302.12170, 2023.
  • Mirchandani et al. [2023] Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, and Andy Zeng. Large language models as general pattern machines. arXiv preprint arXiv:2307.04721, 2023.
  • Nguyen and Grover [2022] Tung Nguyen and Aditya Grover. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. In ICML, 2022.
  • Nguyen et al. [2023] Tung Nguyen, Sudhanshu Agrawal, and Aditya Grover. Expt: Synthetic pretraining for few-shot experimental design. In NeurIPS, 2023.
  • Nie et al. [2023] Allen Nie, Ching-An Cheng, Andrey Kolobov, and Adith Swaminathan. Importance of directional feedback for llm-based optimizers. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
  • Olivecrona et al. [2017] Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9(1):1–14, 2017.
  • Sarkisyan et al. [2016] Karen S Sarkisyan, Dmitry A Bolotin, Margarita V Meer, Dinara R Usmanova, Alexander S Mishin, George V Sharonov, Dmitry N Ivankov, Nina G Bozhanova, Mikhail S Baranov, Onuralp Soylemez, et al. Local fitness landscape of the green fluorescent protein. Nature, 533(7603):397–401, 2016.
  • Shen et al. [2023] Junhong Shen, Liam Li, Lucio M Dery, Corey Staten, Mikhail Khodak, Graham Neubig, and Ameet Talwalkar. Cross-modal fine-tuning: Align then refine. arXiv preprint arXiv:2302.05738, 2023.
  • Sterling and Irwin [2015] Teague Sterling and John J Irwin. Zinc 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11):2324–2337, 2015.
  • Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Tripp et al. [2021] Austin Tripp, Gregor NC Simm, and José Miguel Hernández-Lobato. A fresh look at de novo molecular design benchmarks. In NeurIPS 2021 AI for Science Workshop, 2021.
  • Tsimpoukelli et al. [2021] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.
  • Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Weininger [1988] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
  • Yang et al. [2023] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
  • Zhang et al. [2023] Michael R Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, and Jimmy Ba. Using large language models for hyperparameter optimization. arXiv preprint, 2023.
  • Zoph and Le [2016] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Appendix A LICO implementation details

A.1 Molecular intrinsic functions

We utilize 47 intrinsic properties of molecules for pretraining LICO. Table 5 shows these properties and their explanations.

Table 5: Intrinsic Properties of Molecules and their Explanations
Property Explanation
Molecular Weight Total mass of all atoms in the molecule.
Number of Rotatable Bonds Bonds that allow free rotation around themselves.
Number of Rings Count of ring structures in the molecule.
Number of H Donors Atoms in the molecule that can donate a hydrogen atom.
Number of H Acceptors Atoms in the molecule capable of accepting a hydrogen atom.
Num Aromatic Rings Count of rings with a pattern of alternating single and double bonds.
Num Aliphatic Rings Count of non-aromatic rings in the molecule.
Num Saturated Rings Rings with single bonds only.
Num Heteroatoms Atoms other than carbon or hydrogen.
Fraction Csp3 Fraction of carbon atoms that are sp3-hybridized.
Heavy Atom Count Count of all atoms except hydrogen.
Num Valence Electrons Total number of electrons that can participate in the formation of chemical bonds.
Num Aromatic CarboRings Aromatic rings composed solely of carbon atoms.
Num Aromatic HeteroRings Aromatic rings containing at least one heteroatom.
Num Saturated CarboRings Saturated rings made only of carbon atoms.
Num Saturated HeteroRings Saturated rings containing at least one heteroatom.
BalabanJ Topological index to quantify molecular branching.
BertzCT A measure of structural complexity of the molecule.
Ipc Information content of the characteristic polynomial of the molecular adjacency matrix.
HallKierAlpha Hall-Kier alpha value, a hybridization correction term used in molecular shape analysis.
Kappa1 First-order Kier molecular shape index.
Kappa2 Second-order Kier molecular shape index.
Kappa3 Third-order Kier molecular shape index.
Chi0 Randić molecular connectivity index.
Chi1 First order Randić molecular connectivity index.
Chi0n Randić connectivity index normalized.
Chi1n Valence modified Randić connectivity index normalized.
Chi2n Second order Randić connectivity index normalized.
Chi3n Third order Randić connectivity index normalized.
Chi4n Fourth order Randić connectivity index normalized.
Chi0v Randić connectivity index for valence electrons.
Chi1v First order valence molecular connectivity index.
Chi2v Second order valence molecular connectivity index.
Chi3v Third order valence molecular connectivity index.
Chi4v Fourth order valence molecular connectivity index.
Molar Refractivity Measure of the molecule’s polarizability.
AMW Average molecular weight of all atoms in the molecule.
Max Partial Charge Maximum partial charge in the molecule.
Min Partial Charge Minimum partial charge in the molecule.
Max Abs Partial Charge Maximum absolute value of the partial charges in the molecule.
Min Abs Partial Charge Minimum absolute value of the partial charges in the molecule.
Labute ASA Labute’s Approximate Surface Area, an estimate of the molecular surface area.
Max EState Index Maximum electrotopological state index of the atoms in the molecule.
Min EState Index Minimum electrotopological state index of the atoms in the molecule.
Max Abs EState Index Maximum absolute value of the electrotopological state indices in the molecule.
Min Abs EState Index Minimum absolute value of the electrotopological state indices in the molecule.
fr_C_O Count of carbonyl oxygen (C=O) groups in the molecule.
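
These descriptors are standard cheminformatics quantities that can be computed directly from a SMILES string, for example with RDKit. The sketch below shows how a handful of the properties in Table 5 map to RDKit calls; the subset shown and the dictionary keys are illustrative and are not necessarily the exact pretraining pipeline.

```python
# A minimal sketch of computing a few Table 5 descriptors with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors

def intrinsic_properties(smiles: str) -> dict:
    """Compute a handful of the intrinsic molecular properties from Table 5."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return {
        "molecular_weight": Descriptors.MolWt(mol),
        "num_rotatable_bonds": Descriptors.NumRotatableBonds(mol),
        "num_rings": Descriptors.RingCount(mol),
        "num_h_donors": Descriptors.NumHDonors(mol),
        "num_h_acceptors": Descriptors.NumHAcceptors(mol),
        "num_aromatic_rings": Descriptors.NumAromaticRings(mol),
        "fraction_csp3": Descriptors.FractionCSP3(mol),
        "balaban_j": Descriptors.BalabanJ(mol),
        "bertz_ct": Descriptors.BertzCT(mol),
    }

print(intrinsic_properties("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```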

A.2 Training details

The x embedding layer, y embedding layer, and prediction layer in LICO are MLPs with a hidden dimension of 1024. We train LICO for 20,000 steps with a batch size of 4. For each data point in the batch, we randomly decide whether to sample an intrinsic or a synthetic function, with the probability of choosing a synthetic function being 0.1. Each data point is a sequence of (x, y) pairs with length n ~ U[64, 800]. If the function is an intrinsic function, we uniformly sample a property from Table 5; otherwise, we sample synthetic data following Equation (4).
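
For concreteness, the per-example sampling procedure can be sketched as follows. The helpers `sample_molecules`, `intrinsic_value`, `sample_synthetic`, and `INTRINSIC_PROPERTIES` are hypothetical placeholders standing in for the unlabeled molecule pool, the Table 5 property evaluator, the synthetic-function sampler of Equation (4), and the list of 47 properties.

```python
import random

def sample_training_sequence(p_synthetic: float = 0.1):
    """Hedged sketch of sampling one finetuning sequence of (x, y) pairs."""
    n = random.randint(64, 800)  # sequence length n ~ U[64, 800]
    if random.random() < p_synthetic:
        return sample_synthetic(n)  # synthetic function with probability 0.1
    prop = random.choice(INTRINSIC_PROPERTIES)  # one of the 47 properties in Table 5
    xs = sample_molecules(n)
    return [(x, intrinsic_value(x, prop)) for x in xs]
```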

We use Llama-2-7b [45] as the base LLM, and use LoRA [19] for parameter-efficient finetuning, with a rank of 16 and an α scale of 16. We use a base learning rate of 5e-4 with a linear warmup for 1,000 steps and a cosine decay for the remaining 19,000 steps.
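
A minimal sketch of this finetuning setup with the HuggingFace transformers and peft libraries is shown below. The AdamW optimizer, the model identifier, and peft's default LoRA target modules are assumptions, and the separate x/y embedding and prediction MLPs are omitted for brevity.

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters (rank 16, alpha 16) to the base LLM.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM"))

# Base LR 5e-4, linear warmup for 1,000 steps, cosine decay over the remaining 19,000.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=20_000
)
```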

A.3 Black-box optimization hyperparameters

We use Algorithm 1 to optimize a black-box function with LICO. We initialize the observed dataset D_obs with a population of 34 molecules sampled randomly from ZINC. At each iteration, we use the best 34 candidates in D_obs to generate new candidates via crossover and mutation operations, with a mutation rate of 0.01. The candidate pool size C is 100. We predict the mean μ_i and standard deviation σ_i for each candidate x_i in the pool using LICO. We employ a UCB acquisition function to compute the utility score u_i = μ_i + β σ_i, which balances exploration and exploitation. Following [14], we set β = 10^b, where b ~ U[-0.5, 1.5]. We then pick the k = 15 candidates with the highest utility scores, evaluate each selected candidate x_j using the black-box function f, and add the new data point (x_j, y_j) to the observed dataset D_obs. The process continues with the updated observed dataset, and stops when |D_obs| = 1000.
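
In code, this acquisition step reduces to a few lines. The sketch below is a minimal illustration: it assumes β is resampled once per scoring round, and the toy μ and σ arrays are placeholders for LICO's predictions over the candidate pool.

```python
import numpy as np

def ucb_utility(mu: np.ndarray, sigma: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """UCB utility u = mu + beta * sigma with beta = 10**b, b ~ U[-0.5, 1.5]."""
    beta = 10.0 ** rng.uniform(-0.5, 1.5)
    return mu + beta * sigma

rng = np.random.default_rng(0)
mu = np.array([0.42, 0.55, 0.31])     # predicted means for the candidate pool
sigma = np.array([0.08, 0.12, 0.05])  # predicted standard deviations
u = ucb_utility(mu, sigma, rng)
selected = np.argsort(-u)[:15]        # indices of the k = 15 highest-utility candidates
```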

When predicting μ_i, σ_i = f_θ(x_i, D_obs), we normalize all the y values in D_obs to have mean 0 and standard deviation 1. This is to resemble the finetuning data distribution of LICO. We then denormalize μ_i and σ_i back to the original space.
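
A sketch of this normalization wrapper, assuming the model exposes a callable (x, D_obs) -> (μ, σ) interface:

```python
import numpy as np

def predict_with_normalization(f_theta, x, D_obs):
    """Standardize observed y values, predict in the normalized space,
    then map the predicted mean and std back to the original scale."""
    ys = np.array([y for _, y in D_obs], dtype=float)
    m, s = ys.mean(), ys.std()
    D_norm = [(xi, (yi - m) / s) for xi, yi in D_obs]
    mu, sigma = f_theta(x, D_norm)  # in-context prediction (assumed interface)
    return mu * s + m, sigma * s    # the mean shifts and scales; the std only scales
```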

A.4 Black-box optimization with LICO

Algorithm 1 outlines the optimization algorithm using LICO as the surrogate model.

Algorithm 1 Black-box optimization with LICO
Require: objective f, LICO model f_θ, budget B, candidate pool size C, acquisition function α, batch size k
  Initialize D_obs = {}
  while |D_obs| < B do
     Generate a set of candidates {x_i}, i = 1, …, C
     for each candidate x_i do
        Predict μ_i, σ_i = f_θ(x_i, D_obs)
        Compute the utility score u_i = α(μ_i, σ_i)
     end for
     Select the k candidates with the highest utility scores
     for each selected candidate x_j do
        Evaluate x_j using the actual objective: y_j = f(x_j)
        Add (x_j, y_j) to the observed dataset D_obs
     end for
  end while
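
For reference, a runnable Python rendering of Algorithm 1 might look as follows. The `propose_candidates` helper is a hypothetical stand-in for the crossover/mutation proposal step, and `f_theta` and `alpha` follow the interfaces sketched above.

```python
def optimize(f, f_theta, B, C, alpha, k, propose_candidates):
    """Sketch of Algorithm 1: black-box optimization with a LICO surrogate."""
    D_obs = []                                      # observed dataset
    while len(D_obs) < B:                           # stop when the budget is exhausted
        pool = propose_candidates(D_obs, C)         # C candidates via crossover/mutation
        utilities = []
        for x in pool:
            mu, sigma = f_theta(x, D_obs)           # in-context surrogate prediction
            utilities.append(alpha(mu, sigma))      # e.g., UCB utility
        ranked = sorted(zip(utilities, pool), key=lambda t: t[0], reverse=True)
        for _, x in ranked[:k]:                     # evaluate the top-k candidates
            D_obs.append((x, f(x)))                 # query the true objective
    return max(D_obs, key=lambda t: t[1])           # best (x, y) pair found
```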

Appendix B Baseline details

TNP

is a transformer-based architecture for in-context learning. We refer to Nguyen and Grover [36] for more details about TNP. We train a TNP model with 16 attention layers and a hidden dimension of 2048. Other hyperparameters are the same as for LICO. After training, we use TNP for black-box optimization using Algorithm 1 with the same optimization hyperparameters but replace LICO with TNP.

GP BO

replaces the LICO surrogate model in Algorithm 1 with a Gaussian Process with a Tanimoto kernel. We optimize the Gaussian Process hyperparameters via maximum likelihood estimation on the initial population sampled from ZINC. Other optimization hyperparameters are the same as for LICO.
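
The Tanimoto kernel measures similarity between fingerprint vectors as k(x, x') = <x, x'> / (|x|^2 + |x'|^2 - <x, x'>). A minimal NumPy sketch, assuming candidates are featurized as rows of (binary or count) fingerprint matrices, is shown below; the baseline's actual GP implementation may differ in featurization and hyperparameter handling.

```python
import numpy as np

def tanimoto_kernel(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Tanimoto kernel between fingerprint matrices X (n x d) and Y (m x d):
    k(x, y) = <x, y> / (|x|^2 + |y|^2 - <x, y>)."""
    dot = X @ Y.T
    x_sq = np.sum(X * X, axis=1)[:, None]
    y_sq = np.sum(Y * Y, axis=1)[None, :]
    return dot / (x_sq + y_sq - dot)
```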

Graph GA

is a model-free variant of Algorithm 1. Specifically, at each iteration, Graph GA generates 15 candidates using the same crossover and mutation operations, and directly evaluates and adds them to D_obs, since it does not employ a surrogate model.

REINVENT

adopts a policy-based RL approach to finetune a pretrained RNN to generate SMILES strings with high returns. At each optimization iteration, we sample 16 molecules from the finetuned RNN, evaluate these molecules using the black-box function f, and add the new data points to D_obs. We refer to Gao et al. [14] for more details of the algorithm and other hyperparameters.

Appendix C Additional results

C.1 Additional metrics

In addition to AUC Average Top-10, we measure the optimization performance of different methods on AUC Average Top-1 and AUC Average Top-100 for a more comprehensive comparison. Tables 6 and 7 show the AUC Average Top-1 and AUC Average Top-100 performance, respectively.

Table 6: The performance of LICO and the baselines on 21 optimization tasks in PMO with the AUC Average Top-1 metric. A higher score is better. We report the mean and stddev of scores averaged over 5 random seeds. For each task, * marks the best method and † the second-best.
Task GP BO Graph GA LICO REINVENT TNP
albuterol_similarity 0.672±0.109† 0.647±0.080 0.695±0.150* 0.572±0.026 0.611±0.042
amlodipine_mpo 0.538±0.016† 0.526±0.017 0.560±0.026* 0.500±0.016 0.513±0.016
celecoxib_rediscovery 0.434±0.052 0.466±0.062 0.492±0.079* 0.415±0.031 0.482±0.067†
deco_hop 0.598±0.013† 0.590±0.005 0.603±0.012* 0.585±0.010 0.597±0.002
drd2_current 0.895±0.067 0.898±0.048† 0.902±0.055* 0.867±0.077 0.831±0.043
fexofenadine_mpo 0.728±0.022* 0.691±0.011 0.719±0.025† 0.696±0.012 0.706±0.014
isomers_c7h8n2o2 0.576±0.154 0.815±0.120 0.834±0.109† 0.846±0.070* 0.761±0.145
isomers_c9h10n2o2pf2cl 0.644±0.053 0.708±0.083 0.714±0.084† 0.724±0.043* 0.701±0.086
median1 0.235±0.016 0.233±0.018 0.242±0.020* 0.229±0.015 0.238±0.015†
median2 0.212±0.010* 0.193±0.011 0.201±0.009 0.209±0.013† 0.200±0.018
mestranol_similarity 0.449±0.028* 0.387±0.020 0.445±0.014† 0.433±0.034 0.406±0.011
osimertinib_mpo 0.788±0.008* 0.777±0.008 0.781±0.007† 0.780±0.009 0.776±0.007
perindopril_mpo 0.475±0.019† 0.460±0.025 0.492±0.011* 0.432±0.010 0.457±0.012
qed 0.926±0.011 0.930±0.004 0.935±0.002* 0.934±0.003† 0.931±0.001
ranolazine_mpo 0.729±0.024* 0.684±0.015 0.711±0.028† 0.657±0.048 0.669±0.032
scaffold_hop 0.486±0.010† 0.475±0.008 0.491±0.013* 0.468±0.010 0.484±0.019
sitagliptin_mpo 0.268±0.098 0.281±0.069 0.363±0.114* 0.333±0.030† 0.274±0.044
thiothixene_rediscovery 0.371±0.046* 0.351±0.029 0.368±0.041† 0.345±0.026 0.332±0.041
troglitazone_rediscovery 0.329±0.019* 0.289±0.021 0.309±0.033† 0.276±0.009 0.286±0.012
valsartan_smarts 0.000±0.000† 0.000±0.000 0.000±0.000 0.000±0.000* 0.000±0.000
zaleplon_mpo 0.431±0.031 0.418±0.022 0.435±0.027† 0.456±0.020* 0.428±0.022
Sum of scores (↑) 10.784 10.818† 11.291* 10.755 10.683
Mean rank (↓) 2.52† 3.57 1.62* 3.48 3.75

Table 7: The performance of LICO and the baselines on 21 optimization tasks in PMO with the AUC Average Top-100 metric. A higher score is better. We report the mean and stddev of scores averaged over 5 random seeds. For each task, * marks the best method and † the second-best.
Task GP BO Graph GA LICO REINVENT TNP
albuterol_similarity 0.548±0.100† 0.470±0.042 0.563±0.093* 0.395±0.012 0.448±0.028
amlodipine_mpo 0.458±0.008† 0.422±0.014 0.486±0.025* 0.407±0.005 0.420±0.013
celecoxib_rediscovery 0.363±0.040† 0.346±0.036 0.372±0.070* 0.296±0.024 0.346±0.026
deco_hop 0.579±0.013† 0.563±0.006 0.583±0.009* 0.550±0.006 0.568±0.004
drd2_current 0.741±0.097* 0.605±0.086 0.725±0.092† 0.615±0.098 0.556±0.095
fexofenadine_mpo 0.645±0.018* 0.588±0.008 0.636±0.022† 0.549±0.004 0.599±0.016
isomers_c7h8n2o2 0.300±0.142 0.535±0.091* 0.450±0.149 0.511±0.058† 0.492±0.115
isomers_c9h10n2o2pf2cl 0.474±0.038† 0.441±0.068 0.535±0.067* 0.445±0.027 0.447±0.049
median1 0.175±0.022* 0.168±0.013 0.166±0.019 0.162±0.007 0.170±0.008†
median2 0.184±0.006* 0.158±0.008 0.175±0.010† 0.155±0.006 0.162±0.009
mestranol_similarity 0.379±0.020* 0.311±0.016 0.361±0.030† 0.302±0.016 0.314±0.003
osimertinib_mpo 0.706±0.006* 0.667±0.008 0.694±0.010† 0.623±0.014 0.671±0.006
perindopril_mpo 0.405±0.019† 0.357±0.012 0.424±0.007* 0.332±0.011 0.359±0.010
qed 0.853±0.010 0.854±0.011 0.882±0.007* 0.874±0.003† 0.857±0.003
ranolazine_mpo 0.633±0.020* 0.462±0.022 0.617±0.021† 0.436±0.040 0.468±0.042
scaffold_hop 0.462±0.006† 0.435±0.008 0.462±0.006* 0.415±0.009 0.440±0.010
sitagliptin_mpo 0.133±0.062 0.103±0.032 0.171±0.045* 0.134±0.016† 0.100±0.023
thiothixene_rediscovery 0.311±0.030* 0.270±0.015 0.299±0.026† 0.256±0.015 0.261±0.024
troglitazone_rediscovery 0.283±0.014* 0.228±0.008 0.258±0.024† 0.201±0.008 0.230±0.005
valsartan_smarts 0.000±0.000† 0.000±0.000 0.000±0.000 0.000±0.000* 0.000±0.000
zaleplon_mpo 0.301±0.036† 0.258±0.016 0.318±0.018* 0.296±0.009 0.257±0.013
Sum of scores (↑) 8.933† 8.242 9.175* 7.954 8.167
Mean rank (↓) 1.95† 3.62 1.76* 4.14 3.45

Appendix D Broader impact

Our work studies the application of large language models to black-box optimization, particularly in the domain of molecular optimization. This intersection of machine learning and optimization holds promise for advancing our understanding of LLMs’ capabilities and limitations, and has significant potential in areas like material science and drug discovery. While our main goal is to advance machine learning and optimization techniques, it is also important to consider how these advancements might affect society, for example by accelerating the development of new medicines and materials.

Appendix E Compute Resources

All experiments in this paper are run on a cluster of 4 A6000 GPUs, each with 49GB of memory.

NeurIPS Paper Checklist

  1. Claims

    Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    Answer: [Yes]

    Justification: The paper introduces LICO, a novel method for leveraging arbitrary base LLMs for black-box optimization. LICO achieves state-of-the-art performance on PMO, a benchmark for molecular optimization. Both the abstract and introduction include these claims.

    Guidelines:

    • The answer NA means that the abstract and introduction do not include the claims made in the paper.

    • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

  2. Limitations

    Question: Does the paper discuss the limitations of the work performed by the authors?

    Answer: [Yes]

    Justification: We discuss the limitations of the paper in Section 6.

    Guidelines:

    • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    • The authors are encouraged to create a separate "Limitations" section in their paper.

    • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

  3. Theory Assumptions and Proofs

    Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

    Answer: [N/A]

    Justification: NA

    Guidelines:

    • The answer NA means that the paper does not include theoretical results.

    • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    • All assumptions should be clearly stated or referenced in the statement of any theorems.

    • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    • Theorems and Lemmas that the proof relies upon should be properly referenced.

  4. Experimental Result Reproducibility

    Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

    Answer: [Yes]

    Justification: All training and evaluation details are outlined in Sections 4, 5, and Appendix A.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

      (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

      (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

      (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

      (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

  5. Open access to data and code

    Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

    Answer: [Yes]

    Justification: Code, data, and checkpoints will be released upon acceptance.

    Guidelines:

    • The answer NA means that the paper does not include experiments requiring code.

    • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

  6. Experimental Setting/Details

    Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

    Answer: [Yes]

    Justification: All training and evaluation details are outlined in Sections 4, 5, and Appendix A.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    • The full details can be provided either with the code, in appendix, or as supplemental material.

  7. Experiment Statistical Significance

    Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

    Answer: [Yes]

    Justification: Each result is averaged over 5 random seeds.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    • The assumptions made should be given (e.g., Normally distributed errors).

    • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    • If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

  8. Experiments Compute Resources

    Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

    Answer: [Yes]

    Justification: We specify the compute resources in Appendix E.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

  9. Code of Ethics

    Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

    Answer: [Yes]

    Justification: The research in the paper conforms with the NeurIPS Code of Ethics in every respect.

    Guidelines:

    • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

  10. Broader Impacts

    Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

    Answer: [Yes]

    Justification: We discuss the broader impacts of the paper in Appendix D.

    Guidelines:

    • The answer NA means that there is no societal impact of the work performed.

    • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

  11. Safeguards

    Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

    Answer: [N/A]

    Justification: NA

    Guidelines:

    • The answer NA means that the paper poses no such risks.

    • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

  12. Licenses for existing assets

    Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

    Answer: [Yes]

    Justification: We used and cited PMO, an open-source benchmark for molecular optimization.

    Guidelines:

    • The answer NA means that the paper does not use existing assets.

    • The authors should cite the original paper that produced the code package or dataset.

    • The authors should state which version of the asset is used and, if possible, include a URL.

    • The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

  13. New Assets

    Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

    Answer: [N/A]

    Justification: NA

    Guidelines:

    • The answer NA means that the paper does not release new assets.

    • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    • The paper should discuss whether and how consent was obtained from people whose asset is used.

    • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

  14. Crowdsourcing and Research with Human Subjects

    Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

    Answer: [N/A]

    Justification: NA

    Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

  15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

    Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

    Answer: [N/A]

    Justification: NA

    Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.