LICO: Large Language Models for In-Context Molecular Optimization

Tung Nguyen, Aditya Grover
University of California, Los Angeles
{tungnd,adityag}@cs.ucla.edu

Abstract

Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark comprising over $20$ objective functions.

1 Introduction

Black-box optimization (BBO) is the problem of optimizing an unknown, often complex objective function without direct access to its structure or derivatives. This problem is ubiquitous in many science and engineering fields, including material discovery [18], protein engineering [8, 40, 2], molecular design [16], mechanical design [5, 30], and neural architecture search [52]. Typically, BBO involves an iterative process where each step constructs a surrogate model to approximate the objective function. This model then guides the selection of promising candidates for subsequent evaluation. The main challenge of this approach lies in learning an effective surrogate function that can accurately estimate the objective from limited historical data.

In stark contrast, we have seen impressive generalization abilities of Large Language Models (LLMs) [10, 1, 11, 43, 44, 45, 22, 23] for language-driven reasoning over many kinds of domains. By pretraining on Internet-scale data, LLMs have demonstrated exceptional pattern-matching abilities and generalization from limited observations in both natural language [10, 25, 48] and other domains [32, 35, 17]. This positions LLMs as a promising solution for enhancing surrogate modeling for BBO. Some recent works have indeed shown great potential for using LLMs for solving optimization problems [50, 12, 51, 3]. The main idea behind these methods is to frame the optimization problem in natural language, and prompt the language model using previously collected observations to make predictions for new data points [3] or to propose better candidates [50, 12, 51, 33, 38, 34, 28, 7, 31]. However, this approach has several limitations. First, performing optimization in the text space requires the problem and solution to be expressed in natural language, thus limiting this approach to selected domains. Second, the scarcity of domain-relevant data in the text corpora used to train language models poses generalization challenges when using these models for general scientific domains such as molecular optimization. Therefore, existing works have only demonstrated the success of LLMs in neural architecture search [3, 12, 51], prompt optimization [50], and code generation [33, 28], corresponding to domains that are well-represented in the training dataset for common language models [10, 44, 22]. Third, relying on verbose textual descriptions for both the problem and its solution imposes practical constraints by inflating the context length and thereby reducing the number of historical observations the model can effectively utilize.

In this work, we propose Large Language Models for In-Context Optimization (LICO), a general-purpose model that leverages LLMs for black-box optimization, with a particular application to the molecular domain. To generalize a language model to a new scientific domain unseen during pretraining, we equip the model with two embedding layers for embedding the previously collected molecules and their scores, and a prediction head to predict the score of unseen candidates. Intuitively, the embedding layers map the molecules and their scores to the same feature space already learned by the language model, allowing the model to perform in-context learning in this space instead of the raw text space. Unlike previous methods, this approach is applicable to domains that may not be easily described in natural language such as molecular optimization. Moreover, avoiding verbose textual descriptions enables the model to condition on more historical observations, thus scaling better to harder problems that cannot be solved within a few steps.

We train the new layers together with the (frozen) LLM to perform in-context predictions on a family of functions. Specifically, for each function sampled from this family, we condition the model on a set of inputs and their corresponding evaluations, and task the model to predict the function value of the remaining data points. This task mimics surrogate modeling in BBO, where the surrogate model has to iteratively update its estimation of the underlying objective by conditioning on historical data. An ideal function family to train the model should be close to the target objective functions we want to optimize, but also be diverse enough to encourage generalization. Therefore, we propose to combine intrinsic functions and synthetically generated functions for training LICO. Intrinsic functions are inherent properties of the input that are easy to evaluate. In molecular optimization, for example, intrinsic functions include molecular weight, the number of rings, or heavy atom count, which are obtained via simple computation on the molecule. These intrinsic functions are closely related to the actual objective functions we want to optimize such as bioactivities against a target disease. To facilitate generalization outside of the intrinsic functions, we additionally train LICO on synthetic functions defined over the same target domain that are generated by Gaussian Processes. Our empirical evidence shows the importance of learning from both intrinsic and synthetic functions to the performance of the model on downstream tasks. Figure 1 illustrates our approach.

After training, LICO is capable of optimizing a wide range of molecular properties purely via in-context prompting. While the methodology of LICO applies to general scientific domains, in this paper we focus on molecular optimization. This problem plays a pivotal role in advancing drug and material discovery. The complexity of molecular structures and the vastness of the chemical space present unique challenges to black-box optimization algorithms. Moreover, since molecule-relevant data is likely under-represented in the pretraining corpora of existing language models, molecular optimization is a good problem to test the performance and applicability of LICO. We evaluate LICO against the state-of-the-art methods on Practical Molecular Optimization (PMO) [14], a challenging molecular optimization benchmark with over $20$ objective functions. The experiments show that LICO achieves the best performance and is the highest-ranked method in the benchmark.

Figure 1: Our proposed approach. We equip a pretrained LLM with an embedding layer for $x$, an embedding layer for $y$, and a prediction layer. We train the model on semi-synthetic data to predict $y$ given $x$ and previous $(x,y)$ pairs. We prepend each $x$ with a special token <x> and each $y$ with a special token <y> to guide in-context reasoning.

2 Problem Statement

Let $f:\mathcal{X}\rightarrow\mathbb{R}$ be a real-valued function that operates on a $d$ -dimensional space $\mathcal{X}\subseteq\mathbb{R}^{d}$ . In black-box optimization (BBO), the goal is to find the input $x^{\star}$ that maximizes $f$ :

\begin{equation}
x^{\star}\in\operatorname*{arg\,max}_{x\in\mathcal{X}}f(x), \tag{1}
\end{equation}

where we do not have direct access to the structure or gradient information of $f$. In molecular optimization, $\mathcal{X}$ is the space of all possible molecules, and $f$ is a property of the molecule we want to optimize, such as bioactivity against a disease. While $f$ is unknown, we often have access to an unlabeled dataset $\mathcal{D}_{\text{u}}$ that consists of molecules $x$ without the corresponding function values $y$. ZINC [42] is such a dataset, which contains thousands to millions of unlabeled molecules.

To solve the optimization in (1), we can query $f$ only a limited number of times, since each evaluation often involves expensive physical experiments. To cope with this constraint, a common BBO approach learns a surrogate model $f_{\theta}$ that approximates the objective $f$ from past observations $\mathcal{D}_{\text{obs}}=\{(x_{i},y_{i})\}_{i=1}^{n}$, which starts empty and grows with the new data points $(x,f(x))$ we query at each iteration. Formally, a surrogate model represents a predictive distribution $p_{\theta}(y\mid x,\mathcal{D}_{\text{obs}})$ of the function value $y$ conditioned on the input $x$ and the evolving observed dataset $\mathcal{D}_{\text{obs}}$. The predictions of this surrogate guide the selection of candidates for subsequent function evaluation, balancing exploration and exploitation. The newly selected points are then added to $\mathcal{D}_{\text{obs}}$, and the process continues.

The success of this approach highly depends on the efficiency of the surrogate model $f_{\theta}$ in estimating $f$ from limited data in $\mathcal{D}_{\text{obs}}$ at each iteration. This resembles few-shot prediction, a setting that Large Language Models (LLMs) have proven to excel in. By pretraining on vast Internet-scale data, LLMs can learn generalizable patterns from limited data, and are capable of adapting to multiple functions at test time simply via in-context prompting [10, 35, 26, 27]. A recent line of works [50, 51, 12, 3] has exploited this ability of LLMs for optimization, but they relied on natural language as the interface, thus lacking generality to scientific domains. In this work, we propose a more general and efficient approach to leveraging LLMs for black-box optimization.

3 Related Work

LLMs for Optimization Recent works have explored the use of LLMs for optimization. The general idea behind these works is to prompt the model with the textual description of the optimization problem and historical evaluations for few-shot reasoning. Yang et al. [50], Liu et al. [31], Zhang et al. [51], Ma et al. [33] propose to prompt the language model to directly suggest better candidates to evaluate given the past inputs and their corresponding scores. Meyerson et al. [34], Lehman et al. [28], Bradley et al. [7] integrate LLMs with evolutionary algorithms, and prompt the model to perform crossover and mutation operations based on the population at each optimization step. Anonymous [3] study the use of LLMs to enhance several components in Bayesian optimization, including warmstarting, surrogate modeling, and candidate generation.

The common approach in existing works has several inherent limitations. First, for general scientific domains, the input $x$ may not be easily described in natural language. Second, even when a textual description of the input exists (for instance, molecules can be represented by SMILES strings [49]), molecule-relevant data is likely to be significantly under-represented in the text corpora used for training existing language models. This hinders the generalization of LLMs to new domains outside of their training distribution. These challenges are why existing works only consider hyperparameter optimization, prompt optimization, and code generation, domains that are easy to describe in text and are well-represented in the training data of LLMs. Furthermore, from an engineering perspective, naively prompting a language model with verbose textual descriptions of the input $x$ results in an excessively long context, thus reducing the number of examples the model can condition on. For example, an LLM with a maximum context length of $4000$ tokens can utilize at most $100$ past observations, assuming each data point takes $40$ tokens on average. This limits the scalability of this approach to harder problems that require more steps to solve.

LLMs for Non-language Tasks In addition to optimization, several works have studied the extension of pretrained LLMs to non-language domains with two main directions. The first direction considers problems that can be described in natural language, and prompts a pretrained LLM to solve the problem directly in the text space [35, 13, 17]. The second direction tackles more general problems, and does it by learning separate encoders for the new domain and aligning it with the embedding space of the pretrained LLM [32, 41, 47, 29]. Our work is closely related to the latter direction. However, as discussed in the following sections, while many of these works completely leave the word space, we find it beneficial to include language instruction while training the new modules.

4 Method

We introduce LICO, a methodology for extending arbitrary base LLMs for surrogate modeling in black-box optimization. While the method applies to broad scientific domains, we choose molecular optimization to demonstrate LICO in this paper. We aim to develop a model capable of efficiently adapting to various objective functions after training. To achieve this, we propose a simple extension to existing LLMs and an unsupervised objective using semi-synthetic data to facilitate generalization.

4.1 Model Architecture

In black-box optimization, a surrogate model $f_{\theta}$ estimates the distribution of the function value $y$ given the input $x$ and past observations $\mathcal{D}_{\text{obs}}=\{(x_{i},y_{i})\}_{i=1}^{n}$ the model has collected until the optimization iteration $t$ :

\begin{equation}
p_{\theta}(y\mid x,x_{1},y_{1},x_{2},y_{2},\dots,x_{n},y_{n}), \tag{2}
\end{equation}

where $x_{i}$ and $y_{i}=f(x_{i})$ are drawn from an objective function $f$. Our goal is to use LLMs to model $p_{\theta}$. As discussed earlier, we do not assume that the domain $\mathcal{X}$ can be expressed in natural language. To extend a pretrained language model to an arbitrary new domain, we equip the model with $3$ new layers: an embedding layer for the inputs $x$, an embedding layer for the function values $y$, and a prediction layer for predicting the unknown function value $y$. Learning separate embedding layers offers several benefits. First, the new embedding layers encode $x$ and $y$ into the shared hidden space that the language model learned during pretraining, which enables the model to escape the raw text space and perform in-context reasoning in the hidden space instead. Second, by embedding each input $x$ as a single hidden vector instead of spreading it over several tokens, we reduce the sequence length and thus allow the model to scale to more conditioning examples.

However, it is challenging for the model to perform this prediction task without any context information about the task. From the model's point of view, the embeddings of $x$ and $y$ are nothing more than high-dimensional vectors: the model does not know what task it should perform or what each token in the embedding sequence represents. To address this issue, we prepend each sequence with a task prompt, prepend each input $x$ with a special token <x>, and prepend each function value $y$ with a special token <y>. The task prompt instructs the model to perform the task, while the special tokens <x> and <y> inform the model of the position of each input $x$ and the corresponding function value $y$. In other words, we use a language the model has mastered (natural language) to guide the learning of a new “foreign language” (e.g., molecules). In practice, the task prompt is “Each x is a molecule and each y is the property of the corresponding molecule. Predict y given x.”, whereas <x> and <y> are the two single characters “x” and “y”. Finally, we apply the prediction layer on top of each token <y> to predict the function value given the tokens preceding it. Each prediction consists of a mean and a standard deviation, which are used for the selection of candidates during optimization. Figure 1 illustrates the architecture of LICO.
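The sequence layout described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LICO implementation: the linear maps `W_x` and `W_y`, the toy hidden size, the five-token prompt, and the helper `build_sequence` are all hypothetical stand-ins, and the LLM itself is omitted. It only shows how the task prompt, the special tokens, and the per-pair embeddings interleave, and where the prediction layer would read its input.

```python
import numpy as np

HIDDEN = 16      # toy hidden size (assumption; a real LLM is far wider)
FP_BITS = 2048   # length of the binary feature vector used for x

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the new, trainable layers (plain linear maps here).
W_x = rng.normal(scale=0.01, size=(FP_BITS, HIDDEN))  # embedding layer for x
W_y = rng.normal(size=(1, HIDDEN))                    # embedding layer for y
# Hypothetical stand-ins for the LLM's own embeddings of the tokens "x" and "y".
tok_x = rng.normal(size=HIDDEN)
tok_y = rng.normal(size=HIDDEN)


def build_sequence(prompt_emb, xs, ys):
    """Lay out [prompt, <x>, e(x_1), <y>, e(y_1), <x>, e(x_2), ...].

    Returns the (T, HIDDEN) input sequence and the index of every <y> token,
    which is where the prediction layer would read the LLM's output.
    """
    blocks = [prompt_emb]
    y_token_positions = []
    offset = prompt_emb.shape[0]
    for x, y in zip(xs, ys):
        pair = np.stack([tok_x, x @ W_x, tok_y, np.array([y]) @ W_y])
        blocks.append(pair)
        y_token_positions.append(offset + 2)  # <y> is the third slot of the pair
        offset += 4                           # each (x, y) pair costs four positions
    return np.concatenate(blocks, axis=0), y_token_positions


prompt_emb = rng.normal(size=(5, HIDDEN))  # pretend the task prompt is 5 tokens
xs = rng.integers(0, 2, size=(3, FP_BITS)).astype(float)
ys = [0.2, 0.7, 0.4]
seq, y_pos = build_sequence(prompt_emb, xs, ys)
print(seq.shape, y_pos)
```

Under this layout each observation occupies four positions regardless of how long its textual description would be, which is the source of the sequence-length saving discussed above.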

It is worth noting that the combination of natural language and domain-specific embeddings is the main distinction between LICO and previous works such as FPT [32], which applies pretrained LLMs to sequence classification tasks in non-language modalities. FPT also learns new embedding layers for the new domain, but relies entirely on the pretrained self-attention layers to model these embeddings without any language instructions. This distinction stems from the different nature of the tasks we aim to tackle. In sequence classification, the model produces a single prediction for the entire sequence, so a good representation of the sequence via self-attention is sufficient. For in-context learning, however, the model needs to associate each input $x$ with its corresponding value $y$ to infer the underlying function $f$ and make predictions for unknown $y$. A language instruction that specifies where $x$ is and where $y$ is helps the model identify this association and improves its in-context reasoning. Our ablation study in Section 5.2.1 confirms the utility of retaining language tokens.

4.2 Semi-synthetic Training

Our goal is to train LICO on the unlabeled data $\mathcal{D}_{\text{u}}$ with an unsupervised objective to facilitate efficient generalization to an arbitrary objective function $f$ in the same domain $\mathcal{X}$ after training. Our key insight is that if we train the model to perform the estimation in (2) for a wide range of functions, it should be able to adapt to any objective function after training. While the true function values are unknown before optimization, we can use the unlabeled inputs $x$ to generate training data from other functions. Assume we have access to a family of functions $\tilde{\mathcal{F}}$ that operate on the same input domain $\mathcal{X}$. For each function $\tilde{f}$ drawn from $\tilde{\mathcal{F}}$, we sample a set of function evaluations $\{(x_{i},y_{i})\}_{i=1}^{n}$ and train the model to autoregressively predict $y$ given the input $x$ and preceding $(x,y)$ pairs:

\begin{equation}
\mathcal{L}(\theta)=\mathbb{E}\left[\sum_{i=1}^{n}\log p_{\theta}(y_{i}\mid x_{i},x_{<i},y_{<i})\right], \tag{3}
\end{equation}

in which the expectation is with respect to $\tilde{f}\sim\tilde{\mathcal{F}}$ , $x_{1:n}\sim\mathcal{D}_{\text{u}}$ , and $y_{1:n}=\tilde{f}(x_{1:n})$ . After training, the estimation in (2) can be done purely via in-context prompting, where we condition the model on past observations to make predictions for new data points.
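As a worked instance of (3), suppose the mean $\mu_{i}$ and standard deviation $\sigma_{i}$ produced by the prediction layer parameterize a Gaussian predictive distribution. This is an assumption made here for illustration; the objective above is stated only in terms of $p_{\theta}$. Each term of the sum then expands to:

```latex
\log p_{\theta}(y_{i}\mid x_{i},x_{<i},y_{<i})
  = -\frac{1}{2}\log\!\left(2\pi\sigma_{i}^{2}\right)
    -\frac{(y_{i}-\mu_{i})^{2}}{2\sigma_{i}^{2}},
\qquad
(\mu_{i},\sigma_{i}) = \text{prediction at the $i$-th \texttt{<y>} token}.
```

Under this assumption, maximizing (3) is equivalent to minimizing the Gaussian negative log-likelihood summed over all $n$ positions of the sequence, so a single sequence provides $n$ prediction targets with context sizes ranging from $0$ to $n-1$.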

Ideally, the function family $\tilde{\mathcal{F}}$ should be close to the downstream objective $f$, but also be diverse enough to encourage broad generalization across functions. To achieve this, we propose to train LICO on a mix of intrinsic and synthetic functions, which we term semi-synthetic training. Intrinsic functions map each input molecule $x$ to an inherent property of $x$. For example, molecular weight, the number of rings, and heavy atom count are intrinsic properties of the molecule that are known from domain knowledge or can be easily computed using standard tools. These intrinsic properties are closely related to many downstream objective functions. For instance, the biological activity of a drug molecule, such as its ability to inhibit a particular enzyme, is often closely related to the molecule’s shape or conformation. Therefore, training LICO on these functions encourages the model to learn useful representations of the input $x$ and obtain good prior knowledge about the optimization domain.

However, we are ultimately interested in optimizing functions outside of the intrinsic function set. Training the model only on a limited set of intrinsic functions may result in overfitting and poor generalization to unseen functions. To diversify the training data, we additionally train the model on synthetically generated functions. A synthetic function family should be easy to sample from and be capable of producing diverse functions. Many such families exist, including Gaussian Processes (GPs), randomly constructed Gaussian Mixture Models, and randomly initialized neural networks. We choose to generate synthetic functions from Gaussian Processes with a Tanimoto kernel due to their simplicity and efficiency. The Tanimoto kernel, also known as the Jaccard coefficient, measures the similarity between two vectors of binary values, a representation that is widely used in scientific domains such as chemistry, drug discovery, and bioinformatics. Specifically, each synthetic function $\tilde{f}$ is sampled as follows,

\begin{equation}
\tilde{f}\sim\mathcal{G}\mathcal{P}(0,\mathcal{K}),\quad\mathcal{K}(x,x^{\prime})=\frac{x\cdot x^{\prime}}{||x||^{2}+||x^{\prime}||^{2}-x\cdot x^{\prime}}, \tag{4}
\end{equation}

where $\mathcal{K}(x,x^{\prime})$ is the Tanimoto kernel that measures the similarity between $x$ and $x^{\prime}$ .
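The sampling step in (4) can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names, the diagonal jitter added for numerical stability, the convention for two all-zero vectors, and the random 64-bit inputs are assumptions. Drawing from $\mathcal{GP}(0,\mathcal{K})$ at a finite set of inputs reduces to sampling a multivariate Gaussian whose covariance is the kernel matrix.

```python
import numpy as np


def tanimoto_kernel(X, Z):
    """Tanimoto similarity between every row of X and every row of Z."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    dot = X @ Z.T
    x_sq = np.sum(X ** 2, axis=1)[:, None]
    z_sq = np.sum(Z ** 2, axis=1)[None, :]
    denom = x_sq + z_sq - dot
    # Two all-zero vectors give 0/0; we treat them as identical (assumption).
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, dot / safe, 1.0)


def sample_function_values(X, rng, jitter=1e-6):
    """Draw y = f(X) for one function f ~ GP(0, K) evaluated at the rows of X."""
    K = tanimoto_kernel(X, X)
    # Jitter keeps the Cholesky factorization stable when rows are near-duplicates.
    L = np.linalg.cholesky(K + jitter * np.eye(len(K)))
    return L @ rng.standard_normal(len(K))


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(8, 64))   # toy binary feature vectors
y = sample_function_values(X, rng)     # one synthetic labeling of X
```

Calling `sample_function_values` again on the same inputs yields a different synthetic function, which is what makes this family cheap to sample and diverse.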

The final family of functions $\tilde{\mathcal{F}}$ used to train LICO is a mixture of intrinsic and synthetic functions with a fixed mixing ratio. This design choice is critical to the model’s performance. Intuitively, training on both types of functions ensures proximity to the downstream objectives and good coverage of the function space for efficient generalization. The use of intrinsic functions is also the main difference between our work and ExPT [37], a recent method that studies purely synthetic pretraining for optimization. We hypothesize that while synthetic data alone is sufficient for ExPT on a few simple tasks, for a more complex domain such as molecular optimization, synthetic training provides too little relevant signal for the model to generalize to downstream objectives. We empirically show the importance of both intrinsic and synthetic functions in the ablation study in Section 5.2.2.

4.3 LICO for Black-box Optimization

After training, a single LICO model can be used for optimizing various objective functions within the domain $\mathcal{X}$. Optimization is an iterative process. At each iteration $t$, we generate a set of candidates $\{x_{i}\}_{i=1}^{C}$ for which the model predicts the mean $\mu_{i}$ and standard deviation $\sigma_{i}$ conditioned on prior observations $\mathcal{D}_{\text{obs}}$, the dataset of $(x,y)$ pairs collected until $t$. An acquisition function $\alpha$ then calculates a utility score from $\mu_{i}$ and $\sigma_{i}$ for each candidate, balancing exploration (favoring high $\sigma$) and exploitation (favoring high $\mu$). The top $k$ candidates by utility score are evaluated using the objective function $f$. These $k$ candidates and their corresponding evaluations are added to the dataset $\mathcal{D}_{\text{obs}}$, and the cycle repeats. The process terminates once we exhaust the evaluation budget $B$. Algorithm 1 summarizes the optimization process.
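The loop described above can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: the upper-confidence-bound acquisition `ucb` is one common choice assumed here, and `propose`, `predict`, and all `toy_*` helpers are hypothetical placeholders for the candidate generator and the trained surrogate.

```python
import random


def ucb(mu, sigma, beta=1.0):
    """Upper confidence bound: high mu exploits, high sigma explores (assumed choice)."""
    return mu + beta * sigma


def optimize(objective, propose, predict, budget, k):
    """Surrogate-guided optimization; returns the list of evaluated (x, y) pairs."""
    observed = []  # D_obs starts empty
    while len(observed) < budget:
        candidates = propose(observed)
        ranked = sorted(
            candidates,
            key=lambda x: ucb(*predict(x, observed)),
            reverse=True,
        )
        n_eval = min(k, budget - len(observed))  # never exceed the budget B
        for x in ranked[:n_eval]:
            observed.append((x, objective(x)))
    return observed


# Toy usage on a 1-D problem with a nearest-neighbour "surrogate".
rng = random.Random(0)


def toy_objective(x):
    return 1.0 - (x - 0.3) ** 2


def toy_propose(observed, n=20):
    return [rng.random() for _ in range(n)]


def toy_predict(x, observed):
    if not observed:
        return 0.0, 1.0  # uninformed prior before any evaluation
    nearest_x, nearest_y = min(observed, key=lambda p: abs(p[0] - x))
    return nearest_y, abs(nearest_x - x)


history = optimize(toy_objective, toy_propose, toy_predict, budget=30, k=5)
best_x, best_y = max(history, key=lambda p: p[1])
```

In LICO the role of `toy_predict` would be played by the trained model prompted with `observed` as in-context examples; the surrounding loop stays the same for any surrogate that returns a mean and a standard deviation.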

5 Experiments

We evaluate LICO on molecular optimization, where the goal is to design new molecules with desired properties such as high chemical stability, low toxicity, or selective inhibition against a target disease. This problem plays a pivotal role in advancing drug and material discovery.

5.1 PMO Benchmark

Benchmark We evaluate LICO on Practical Molecular Optimization (PMO) [14], a standard benchmark for molecular optimization with a focus on sample efficiency. We experiment on $21$ optimization objectives provided by PMO, including QED [6], DRD2 [39], and $19$ objective functions from Guacamol [9]. QED assesses a molecule’s drug-likeness by identifying certain “red flags”. DRD2 is a machine learning model trained on experimental data to predict bioactivities for specific target diseases. Guacamol objectives emulate drug discovery goals through a multi-property objective (MPO) approach, considering factors like target molecule similarity, molecular weights, and CLogP. All objective values range from $0$ to $1$, with $1$ indicating the best outcome.

Baselines We compare LICO against $3$ leading methods in PMO, namely REINVENT [39], Graph GA [21], and GP BO [46]. REINVENT is a reinforcement learning method that refines a pretrained RNN for generating SMILES strings. Graph GA, inspired by evolutionary processes, utilizes crossover operations derived from graph matching and mutation at atom and fragment levels to explore the molecule space. GP BO is a Bayesian optimization method that constructs a surrogate function using Gaussian Processes and employs an acquisition function that combines the surrogate’s predictions with uncertainty estimates to guide candidate selection. In addition to the PMO baselines, we also compare LICO with TNP [36], a state-of-the-art transformer model for few-shot learning. GP BO and TNP follow the same surrogate-based procedure as LICO; the difference is that LICO uses an LLM as the surrogate model instead of a GP or TNP.

LICO training We use ZINC 250K as the unlabeled dataset $\mathcal{D}_{\text{u}}$. ZINC 250K contains around $250000$ molecules sampled from the full ZINC database [42] with moderate size and high pharmaceutical relevance and popularity. We adopt radius-$2$, $2048$-bit molecular fingerprints as the input features of each molecule. To generate training data, we use $47$ intrinsic properties of the molecule as the intrinsic functions, which we present in detail in Appendix A.1. We train LICO for $20000$ iterations with a batch size of $4$, where each data point is a sequence of $(x,y)$ pairs sampled from an intrinsic or synthetic function. The ratio of synthetic data is $0.1$. We use Llama-2-7b [45] as the base LLM, and use LoRA [19] for parameter-efficient finetuning.

Optimization details We limit the optimization budget of all methods to $1000$ function calls. We report the area under the curve (AUC) of the top-$10$ average objective value against the number of function calls as the performance metric. The AUC metric favors methods that obtain high values with fewer function calls, thus evaluating both optimization capability and sample efficiency. We min-max scale the AUC values to $[0,1]$. We aggregate the performance of each method across $5$ seeds for better reproducibility, as suggested by PMO.
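To make the metric concrete, the sketch below computes a top-$k$ AUC from the sequence of objective values in the order they were evaluated. It is a simplified illustration under stated assumptions and not the benchmark's reference implementation: it averages the running top-$k$ mean after every single call, averages over however many values exist while fewer than $k$ have been seen, and holds the last value if a run stops before the budget is spent. The function name `top_k_auc` is hypothetical.

```python
def top_k_auc(values, k=10, budget=None):
    """Mean over the budget of the running top-k average objective value.

    values: objective values in evaluation order (each assumed to lie in [0, 1]).
    budget: total number of allowed calls; defaults to len(values).
    """
    if budget is None:
        budget = len(values)
    top = []        # the k best values seen so far, best first
    total = 0.0
    for step in range(budget):
        if step < len(values):
            top = sorted(top + [values[step]], reverse=True)[:k]
        # If the run ended early, `top` is simply held at its last state.
        total += sum(top) / len(top) if top else 0.0
    return total / budget


early = top_k_auc([1.0, 0.0], k=1)  # finds the best value on the first call
late = top_k_auc([0.0, 1.0], k=1)   # finds it only on the second call
```

Both toy runs end with the same best value, yet `early` is larger than `late`, which is the sample-efficiency preference described above.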

Table 1: The performance of LICO and the baselines on $21$ optimization tasks in PMO. Higher score is better. We report the mean and stddev of scores averaged over $5$ random seeds. We use blue and violet to denote the best and second-best method for each task.

Task	GP BO	Graph GA	LICO	REINVENT	TNP
albuterol_similarity	${\color{violet}\mathbf{0.636\pm 0.106}}$	$0.583\pm 0.065$	${\color{blue}\mathbf{0.656\pm 0.125}}$	$0.496\pm 0.020$	$0.550\pm 0.034$
amlodipine_mpo	${\color{violet}\mathbf{0.519\pm 0.014}}$	$0.501\pm 0.016$	${\color{blue}\mathbf{0.541\pm 0.026}}$	$0.472\pm 0.008$	$0.491\pm 0.014$
celecoxib_rediscovery	$0.411\pm 0.046$	$0.424\pm 0.049$	${\color{blue}\mathbf{0.447\pm 0.073}}$	$0.370\pm 0.029$	${\color{violet}\mathbf{0.429\pm 0.048}}$
deco_hop	${\color{violet}\mathbf{0.593\pm 0.013}}$	$0.581\pm 0.006$	${\color{blue}\mathbf{0.596\pm 0.010}}$	$0.572\pm 0.006$	$0.586\pm 0.003$
drd2	${\color{violet}\mathbf{0.857\pm 0.080}}$	$0.833\pm 0.065$	${\color{blue}\mathbf{0.859\pm 0.066}}$	$0.775\pm 0.086$	$0.758\pm 0.066$
fexofenadine_mpo	${\color{blue}\mathbf{0.707\pm 0.021}}$	$0.666\pm 0.009$	${\color{violet}\mathbf{0.700\pm 0.023}}$	$0.650\pm 0.007$	$0.680\pm 0.015$
isomers_c7h8n2o2	$0.545\pm 0.158$	${\color{violet}\mathbf{0.735\pm 0.112}}$	${\color{blue}\mathbf{0.779\pm 0.099}}$	$0.725\pm 0.064$	$0.694\pm 0.123$
isomers_c9h10n2o2pf2cl	$0.599\pm 0.059$	$0.630\pm 0.086$	${\color{blue}\mathbf{0.672\pm 0.075}}$	$0.630\pm 0.032$	${\color{violet}\mathbf{0.635\pm 0.071}}$
median1	${\color{violet}\mathbf{0.213\pm 0.020}}$	$0.208\pm 0.015$	${\color{blue}\mathbf{0.217\pm 0.019}}$	$0.205\pm 0.012$	$0.210\pm 0.012$
median2	${\color{blue}\mathbf{0.203\pm 0.009}}$	$0.181\pm 0.009$	${\color{violet}\mathbf{0.193\pm 0.009}}$	$0.188\pm 0.010$	$0.186\pm 0.013$
mestranol_similarity	${\color{blue}\mathbf{0.427\pm 0.025}}$	$0.362\pm 0.017$	${\color{violet}\mathbf{0.423\pm 0.016}}$	$0.379\pm 0.026$	$0.368\pm 0.005$
osimertinib_mpo	${\color{blue}\mathbf{0.766\pm 0.006}}$	$0.751\pm 0.005$	${\color{violet}\mathbf{0.759\pm 0.008}}$	$0.737\pm 0.007$	$0.752\pm 0.006$
perindopril_mpo	${\color{violet}\mathbf{0.458\pm 0.019}}$	$0.435\pm 0.016$	${\color{blue}\mathbf{0.473\pm 0.009}}$	$0.404\pm 0.009$	$0.433\pm 0.010$
qed	$0.912\pm 0.010$	$0.914\pm 0.007$	${\color{blue}\mathbf{0.925\pm 0.005}}$	${\color{violet}\mathbf{0.921\pm 0.002}}$	$0.917\pm 0.002$
ranolazine_mpo	${\color{blue}\mathbf{0.701\pm 0.023}}$	$0.620\pm 0.014$	${\color{violet}\mathbf{0.687\pm 0.029}}$	$0.574\pm 0.044$	$0.613\pm 0.033$
scaffold_hop	${\color{violet}\mathbf{0.478\pm 0.009}}$	$0.461\pm 0.008$	${\color{blue}\mathbf{0.480\pm 0.008}}$	$0.447\pm 0.010$	$0.469\pm 0.015$
sitagliptin_mpo	$0.232\pm 0.083$	$0.229\pm 0.053$	${\color{blue}\mathbf{0.315\pm 0.097}}$	${\color{violet}\mathbf{0.261\pm 0.026}}$	$0.221\pm 0.034$
thiothixene_rediscovery	${\color{blue}\mathbf{0.351\pm 0.039}}$	$0.322\pm 0.023$	${\color{violet}\mathbf{0.343\pm 0.035}}$	$0.311\pm 0.021$	$0.307\pm 0.034$
troglitazone_rediscovery	${\color{blue}\mathbf{0.313\pm 0.018}}$	$0.267\pm 0.015$	${\color{violet}\mathbf{0.292\pm 0.028}}$	$0.246\pm 0.009$	$0.266\pm 0.009$
valsartan_smarts	$0.000\pm 0.000$	$0.000\pm 0.000$	$0.000\pm 0.000$	$0.000\pm 0.000$	$0.000\pm 0.000$
zaleplon_mpo	$0.392\pm 0.034$	$0.374\pm 0.024$	${\color{violet}\mathbf{0.404\pm 0.022}}$	${\color{blue}\mathbf{0.406\pm 0.017}}$	$0.377\pm 0.018$
Sum of scores ($\uparrow$)	${\color{violet}\mathbf{10.313}}$	$10.076$	${\color{blue}\mathbf{10.760}}$	$9.772$	$9.944$
Mean rank ($\downarrow$)	${\color{violet}\mathbf{2.33}}$	$3.62$	${\color{blue}\mathbf{1.48}}$	$3.86$	$3.65$

Results. Table 1 summarizes the performance of the $5$ considered methods across $21$ optimization tasks. Overall, LICO is the leading method in this benchmark, achieving the highest aggregated score and the lowest mean rank. Specifically, LICO achieves the best performance in $12/21$ tasks and the second-best performance in $8/21$ tasks; in the one remaining task, valsartan_smarts, every method scores $0$. It is important to note that LICO achieves this result without being explicitly trained on data from downstream objectives. This shows the effectiveness of semi-synthetic training in enabling generalization to a broad range of functions via in-context prompting. TNP performs poorly in this benchmark, despite sharing a similar architecture with LICO. This performance gap highlights the importance of the pattern-matching capabilities LLMs acquire through extensive pretraining, which are crucial for adapting to new domains.
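To make the mean-rank row of Table 1 concrete, the per-task ranking can be sketched as follows. This is a minimal illustration rather than the evaluation code behind Table 1; the function name `average_ranks` and the convention that tied methods share the average of their ranks are assumptions.

```python
def average_ranks(scores):
    """Rank methods on one task (1 = best); tied methods share the average rank.

    scores: dict mapping a method name to its score on the task.
    """
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        # Extend j to cover every method tied with the one at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of the tied positions
        for name in order[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def mean_rank(per_task_ranks, method):
    """Average the rank of one method over a list of per-task rank dicts."""
    return sum(r[method] for r in per_task_ranks) / len(per_task_ranks)
```

Under this tie convention, a task on which all five methods score $0$ assigns rank $3$ to every method. Twelve first places, eight second places, and one such tie then give $(12+16+3)/21\approx 1.48$, which is consistent with the mean rank reported for LICO.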

The second best-performing method in this benchmark is GP BO, which differs from LICO only in the surrogate model. This suggests that LICO is a stronger surrogate than GP, a popular surrogate model for black-box optimization. To verify this, we compare the predictive performance of LICO and GP on several objective functions. We first label the unlabeled ZINC dataset with the objective functions and randomly choose a subset of the labeled data points for evaluation. For each task, we vary the number of examples given to each method from $32$ to $512$ and evaluate their performance on $128$ held-out data points. We use negative log-likelihood, mean squared error, and root mean squared calibration error as the evaluation metrics. Figure 2 compares the predictive performance of LICO and GP on $3$ objective functions: median1, ranolazine_mpo, and troglitazone_rediscovery. The figure shows that the optimization performance of each method closely aligns with the predictive performance of its surrogate model. In median1 and ranolazine_mpo, where LICO outperforms GP in terms of optimization score, the model also achieves lower negative log-likelihood, mean squared error, and calibration error. Similarly, LICO has worse predictive performance in troglitazone_rediscovery, where it underperforms GP. This supports our hypothesis that the quality of the surrogate model drives the optimization performance.
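The three metrics can be sketched for a surrogate that outputs a Gaussian mean and standard deviation per held-out point. This is an illustrative sketch only: the function names are hypothetical, and the quantile-based form of the calibration error shown here is one common definition that we assume, not necessarily the exact one used for Figure 2.

```python
import math
from statistics import NormalDist


def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of targets y under N(mu, sigma^2)."""
    total = 0.0
    for t, m, s in zip(y, mu, sigma):
        total += 0.5 * math.log(2 * math.pi * s * s) + (t - m) ** 2 / (2 * s * s)
    return total / len(y)


def mean_squared_error(y, mu):
    """Mean squared error between targets and predicted means."""
    return sum((t - m) ** 2 for t, m in zip(y, mu)) / len(y)


def rms_calibration_error(y, mu, sigma, num_levels=10):
    """RMS gap between nominal quantile levels and observed coverage (assumed definition)."""
    std_normal = NormalDist()
    # Predictive CDF evaluated at each target.
    cdf_vals = [std_normal.cdf((t - m) / s) for t, m, s in zip(y, mu, sigma)]
    levels = [(i + 1) / (num_levels + 1) for i in range(num_levels)]
    sq_gaps = []
    for p in levels:
        observed = sum(c <= p for c in cdf_vals) / len(cdf_vals)
        sq_gaps.append((observed - p) ** 2)
    return math.sqrt(sum(sq_gaps) / len(sq_gaps))
```

A well-calibrated surrogate places roughly a fraction $p$ of the held-out targets below its predicted $p$-quantile, so the calibration error approaches $0$ as the number of held-out points grows.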

Figure 2: The predictive performance of LICO and GP on $3$ objective functions in PMO with different metrics and varying numbers of observations.

Table 2: Performance of LICO on $5$ tasks with different language instructions.

Task	albuterol_similarity	amlodipine_mpo	celecoxib_rediscovery	deco_hop	drd2	Sum ( $\uparrow$ )
LICO w/o Language	$0.615\pm 0.104$	$0.491\pm 0.018$	$0.396\pm 0.051$	$0.585\pm 0.010$	$0.840\pm 0.063$	$2.927$
LICO w/o Task prompt	$0.641\pm 0.107$	$0.523\pm 0.018$	$\mathbf{0.457\pm 0.041}$	$0.595\pm 0.006$	$0.844\pm 0.105$	$3.060$
LICO	$\mathbf{0.656\pm 0.125}$	$\mathbf{0.541\pm 0.026}$	$0.447\pm 0.073$	$\mathbf{0.596\pm 0.010}$	$\mathbf{0.859\pm 0.066}$	$\mathbf{3.099}$

Table 3: Performance of LICO on $5$ tasks with different ratios of synthetic data.

Task	albuterol_similarity	amlodipine_mpo	celecoxib_rediscovery	deco_hop	drd2	Sum ( $\uparrow$ )
LICO Intrinsic	$0.598\pm 0.115$	$0.524\pm 0.029$	$0.412\pm 0.042$	$0.585\pm 0.005$	$0.891\pm 0.032$	$3.010$
LICO 0.1 Synthetic	$0.656\pm 0.125$	$\mathbf{0.541\pm 0.026}$	$\mathbf{0.447\pm 0.073}$	$\mathbf{0.596\pm 0.010}$	$0.859\pm 0.066$	$\mathbf{3.099}$
LICO 0.5 Synthetic	$\mathbf{0.663\pm 0.140}$	$0.504\pm 0.016$	$0.402\pm 0.016$	$0.588\pm 0.006$	$\mathbf{0.907\pm 0.020}$	$3.063$
LICO Synthetic	$0.547\pm 0.080$	$0.498\pm 0.026$	$0.404\pm 0.103$	$0.585\pm 0.003$	$0.902\pm 0.012$	$2.936$

We also note that the results in Table 1 differ from those reported in PMO, for two reasons. First, we use a smaller optimization budget of $1000$ queries compared to $10000$ in PMO. We believe $1000$ is a more reasonable budget that still allows optimization methods to achieve meaningful performance. Second, we found the GP BO implementation in PMO to be suboptimal, specifically in how it generates the candidate pool. The original implementation applies crossover and mutation to a mix of the best individuals and randomly selected individuals from the last iteration to generate the candidate pool for the current iteration. However, we found that using only the best individuals from the last iteration results in much better performance. With this change, GP BO becomes a much stronger baseline than Graph GA and REINVENT.

5.2 Ablation Analysis

We perform various ablation studies to understand the importance of different components and design choices in LICO. For the ablation experiments, we consider only the first $5$ tasks in Table 1. We report the performance of each model using AUC Top-$10$, averaged across $5$ random seeds.
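For reference, the AUC Top-$10$ metric can be sketched as the normalized area under the curve of the running top-$10$ average against the number of oracle calls. The code below is a simplified illustration under assumptions: scores are taken to lie in $[0,1]$, the curve is accumulated once per oracle call rather than at a coarser logging interval, and the function name `auc_top_k` is hypothetical.

```python
def auc_top_k(scores, k=10, budget=None):
    """Normalized area under the running top-k-average curve.

    scores: objective values in the order they were queried (assumed in [0, 1]).
    budget: total number of oracle calls; if the run is shorter than the
            budget, the last top-k average is carried forward.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    budget = budget or len(scores)
    best = []    # running list of the k highest scores seen so far
    area = 0.0
    for n in range(budget):
        if n < len(scores):
            best.append(scores[n])
            best.sort(reverse=True)
            best = best[:k]
        area += sum(best) / len(best)
    return area / budget
```

Because the curve is integrated over the whole budget, a method that finds good molecules early scores higher than one that reaches the same final top-$10$ average only at the end.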

5.2.1 LICO without language instruction

First, we examine the importance of language instructions to the performance of LICO. We compare $3$ variants of LICO: 1) LICO without any language instruction, 2) LICO with special tokens <x> and <y> but without a task prompt, and 3) LICO with both special tokens and the task prompt. Table 2 compares the performance of the $3$ variants. LICO performs the best in $4/5$ tasks, followed by LICO without the task prompt. LICO without any language instruction performs the worst, often by a large margin. This result confirms the importance of guiding a pretrained LLM with language instruction when applying the model to in-context reasoning in a completely new domain.

5.2.2 LICO with different synthetic ratios

We investigate the importance of training LICO on both intrinsic and synthetic data. To do this, we gradually increase the ratio of synthetic functions in the training data from $0$ (intrinsic-only) to $1$ (synthetic-only), and compare the performance of LICO across different ratios. Table 3 shows that LICO with semi-synthetic training performs the best, outperforming both intrinsic-only and synthetic-only training. Training with synthetic data only performs the worst, which is expected because synthetic functions generated by a GP do not include the domain knowledge that is encoded by the intrinsic functions. In other words, synthetic data alone provides too little relevant signal for the model to generalize to unseen downstream objectives. Training with intrinsic functions only, on the other hand, results in quite good performance on most tasks. However, in tasks like albuterol_similarity, semi-synthetic training outperforms this baseline by a large margin. We hypothesize that the underlying objective in albuterol_similarity is far from the intrinsic functions used to train LICO, leading to poor generalization. Finally, training with small ( $0.1$ ) to moderate ( $0.5$ ) ratios of synthetic data achieves similarly good performance.

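Table 4: Performance of LICO with a pretrained LLM and with a randomly initialized (scratch) LLM on $5$ tasks.

Task	Pretrained LLM	Scratch LLM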
albuterol_similarity	$\mathbf{0.656\pm 0.125}$	$0.575\pm 0.064$
amlodipine_mpo	$\mathbf{0.541\pm 0.026}$	$0.503\pm 0.029$
celecoxib_rediscovery	$\mathbf{0.447\pm 0.073}$	$0.410\pm 0.034$
deco_hop	$\mathbf{0.596\pm 0.010}$	$0.583\pm 0.005$
drd2	$\mathbf{0.859\pm 0.066}$	$0.827\pm 0.085$
Sum	$\mathbf{3.099}$	$2.898$

5.2.3 Randomly initialized vs Pretrained LLMs

To understand the importance of using a pretrained LLM, we compare LICO with an autoregressive transformer model of the same size (7B). The transformer architecture is the same as in [15], and we train this model to perform in-context learning on the semi-synthetic data from scratch. Table 4 shows the comparison. The scratch model performs much worse than LICO with a pretrained LLM on all tasks despite sharing the same number of parameters. This highlights the importance of the pattern-matching capabilities that LLMs like Llama-2 acquire through extensive language pretraining.

5.2.4 LICO with different LLM sizes

Previous works have shown the favorable scaling laws of Large Language Models where larger models consistently perform better on downstream tasks [24]. In this section, we investigate the scaling properties of LLMs but in the context of black-box optimization. Specifically, we compare $4$ different base LLMs with different sizes – Qwen-1.5 1.8B and 4B [4], Phi-2 2.7B [20], and Llama-2 7B [45]. We use the same language instructions for all models. We evaluate each model on the first $8$ tasks in Table 1 and average the results across $5$ random seeds. We report the sum of performance across $8$ tasks.

The comparison in Figure 3 shows that the optimization performance scales consistently with the model size, with Llama-2 7B being the best-performing model. This experiment indicates that larger LLMs not only perform better in language tasks but also obtain stronger pattern-matching capabilities that can be transferred to a completely different domain. This trend suggests that scaling up the base LLM may further improve the performance of LICO.

6 Conclusion and Future Work

We develop LICO, a new method that leverages pretrained Large Language Models for black-box optimization. LICO extends existing LLMs to non-language domains with separate embedding and prediction layers. To enable efficient generalization to various optimization tasks, we train LICO on a diverse set of semi-synthetic functions for few-shot predictions. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark with over 20 objective functions. Ablation analyses highlight the importance of incorporating language instruction to guide in-context learning and semi-synthetic training for better generalization.

One limitation of our method is the assumption of an accessible set of intrinsic functions. While this is true for molecular optimization, it may not apply to other scientific domains. In such cases, a better synthetic data generation process incorporating domain knowledge is needed to aid generalization.

Future directions include evaluating LICO in other domains to test its applicability and generality, exploring other prompts that better exploit the capabilities of pretrained LLMs, and using LLMs for other aspects of optimization, such as candidate suggestion or exploration.

Acknowledgements

This work is supported by grants from Amazon, Cisco, Microsoft, and Samsung.

References

Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Angermueller et al. [2020] Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. 2020.
Anonymous [2024] Anonymous. Large language models to enhance bayesian optimization, 2024. URL https://openreview.net/forum?id=OOxotBmGol.
Bai et al. [2023] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
Berkenkamp et al. [2016] Felix Berkenkamp, Angela P Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with gaussian processes. In 2016 IEEE international conference on robotics and automation (ICRA), pages 491–496. IEEE, 2016.
Bickerton et al. [2012] G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90–98, 2012.
Bradley et al. [2024] Herbie Bradley, Honglu Fan, Theodoros Galanos, Ryan Zhou, Daniel Scott, and Joel Lehman. The openelm library: Leveraging progress in language models for novel evolutionary algorithms. Genetic Programming Theory and Practice XX. Springer, 2024.
Brookes et al. [2019] David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International conference on machine learning, pages 773–782. PMLR, 2019.
Brown et al. [2019] Nathan Brown, Marco Fiscato, Marwin HS Segler, and Alain C Vaucher. Guacamol: benchmarking models for de novo molecular design. Journal of chemical information and modeling, 59(3):1096–1108, 2019.
Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
Bubeck et al. [2023] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
Chen et al. [2023] Angelica Chen, David M Dohan, and David R So. Evoprompting: Language models for code-level neural architecture search. arXiv preprint arXiv:2302.14838, 2023.
Dinh et al. [2022] Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, and Kangwook Lee. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems, 35:11763–11784, 2022.
Gao et al. [2022] Wenhao Gao, Tianfan Fu, Jimeng Sun, and Connor Coley. Sample efficiency matters: a benchmark for practical molecular optimization. Advances in Neural Information Processing Systems, 35:21342–21357, 2022.
Garg et al. [2022] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
Gaulton et al. [2012] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. Chembl: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2012.
Gruver et al. [2023] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. arXiv preprint arXiv:2310.07820, 2023.
Hamidieh [2018] Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346–354, 2018.
Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Javaheripi et al. [2023] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023.
Jensen [2019] Jan H Jensen. A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space. Chemical science, 10(12):3567–3572, 2019.
Jiang et al. [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Jiang et al. [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
Krishnamoorthy et al. [2023a] Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Generative pretraining for black-box optimization. In ICML, 2023a.
Krishnamoorthy et al. [2023b] Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Diffusion models for black-box optimization. In ICML, 2023b.
Lehman et al. [2023] Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331–366. Springer, 2023.
Li et al. [2022] Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems, 35:31199–31212, 2022.
Liao et al. [2019] Thomas Liao, Grant Wang, Brian Yang, Rene Lee, Kristofer Pister, Sergey Levine, and Roberto Calandra. Data-efficient learning of morphology and controller for a microrobot. In 2019 International Conference on Robotics and Automation (ICRA), pages 2488–2494. IEEE, 2019.
Liu et al. [2023] Fei Liu, Xi Lin, Zhenkun Wang, Shunyu Yao, Xialiang Tong, Mingxuan Yuan, and Qingfu Zhang. Large language model for multi-objective evolutionary optimization. arXiv preprint arXiv:2310.12541, 2023.
Lu et al. [2022] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Frozen pretrained transformers as universal computation engines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7628–7636, 2022.
Ma et al. [2023] Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.
Meyerson et al. [2023] Elliot Meyerson, Mark J Nelson, Herbie Bradley, Arash Moradi, Amy K Hoover, and Joel Lehman. Language model crossover: Variation through few-shot prompting. arXiv preprint arXiv:2302.12170, 2023.
Mirchandani et al. [2023] Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, and Andy Zeng. Large language models as general pattern machines. arXiv preprint arXiv:2307.04721, 2023.
Nguyen and Grover [2022] Tung Nguyen and Aditya Grover. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. In ICML, 2022.
Nguyen et al. [2023] Tung Nguyen, Sudhanshu Agrawal, and Aditya Grover. Expt: Synthetic pretraining for few-shot experimental design. In NeurIPS, 2023.
Nie et al. [2023] Allen Nie, Ching-An Cheng, Andrey Kolobov, and Adith Swaminathan. Importance of directional feedback for llm-based optimizers. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
Olivecrona et al. [2017] Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9(1):1–14, 2017.
Sarkisyan et al. [2016] Karen S Sarkisyan, Dmitry A Bolotin, Margarita V Meer, Dinara R Usmanova, Alexander S Mishin, George V Sharonov, Dmitry N Ivankov, Nina G Bozhanova, Mikhail S Baranov, Onuralp Soylemez, et al. Local fitness landscape of the green fluorescent protein. Nature, 533(7603):397–401, 2016.
Shen et al. [2023] Junhong Shen, Liam Li, Lucio M Dery, Corey Staten, Mikhail Khodak, Graham Neubig, and Ameet Talwalkar. Cross-modal fine-tuning: Align then refine. arXiv preprint arXiv:2302.05738, 2023.
Sterling and Irwin [2015] Teague Sterling and John J Irwin. Zinc 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11):2324–2337, 2015.
Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Tripp et al. [2021] Austin Tripp, Gregor NC Simm, and José Miguel Hernández-Lobato. A fresh look at de novo molecular design benchmarks. In NeurIPS 2021 AI for Science Workshop, 2021.
Tsimpoukelli et al. [2021] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.
Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Weininger [1988] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
Yang et al. [2023] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
Zhang et al. [2023] Michael R Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, and Jimmy Ba. Using large language models for hyperparameter optimization. arXiv e-prints, pages arXiv–2312, 2023.
Zoph and Le [2016] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Appendix A LICO implementation details

A.1 Molecular intrinsic functions

We utilize $47$ intrinsic properties of molecules for pretraining LICO. Table 5 lists the intrinsic properties and their explanations.

Table 5: Intrinsic properties of molecules and their explanations

Property	Explanation
Molecular Weight	Total mass of all atoms in the molecule.
Number of Rotatable Bonds	Bonds that allow free rotation around themselves.
Number of Rings	Count of ring structures in the molecule.
Number of H Donors	Atoms in the molecule that can donate a hydrogen atom.
Number of H Acceptors	Atoms in the molecule capable of accepting a hydrogen atom.
Num Aromatic Rings	Count of rings with a pattern of alternating single and double bonds.
Num Aliphatic Rings	Count of non-aromatic rings in the molecule.
Num Saturated Rings	Rings with single bonds only.
Num Heteroatoms	Atoms other than carbon or hydrogen.
Fraction Csp3	Fraction of carbon atoms that are sp3-hybridized.
Heavy Atom Count	Count of all atoms except hydrogen.
Num Valence Electrons	Total number of electrons that can participate in the formation of chemical bonds.
Num Aromatic CarboRings	Aromatic rings composed solely of carbon atoms.
Num Aromatic HeteroRings	Aromatic rings containing at least one heteroatom.
Num Saturated CarboRings	Saturated rings made only of carbon atoms.
Num Saturated HeteroRings	Saturated rings containing at least one heteroatom.
BalabanJ	Topological index to quantify molecular branching.
BertzCT	A measure of structural complexity of the molecule.
Ipc	Information content on the vertex degree.
HallKierAlpha	Valence connectivity index used in molecular shape analysis.
Kappa1	Shape descriptor based on the skeleton of the molecule.
Kappa2	Hydrogen suppressed graph descriptor.
Kappa3	Hydrogen complete graph descriptor.
Chi0	Randić molecular connectivity index.
Chi1	Valence modified Randić molecular connectivity index.
Chi0n	Randić connectivity index normalized.
Chi1n	Valence modified Randić connectivity index normalized.
Chi2n	Second order Randić connectivity index normalized.
Chi3n	Third order Randić connectivity index normalized.
Chi4n	Fourth order Randić connectivity index normalized.
Chi0v	Randić connectivity index for valence electrons.
Chi1v	First order valence molecular connectivity index.
Chi2v	Second order valence molecular connectivity index.
Chi3v	Third order valence molecular connectivity index.
Chi4v	Fourth order valence molecular connectivity index.
Molar Refractivity	Measure of the molecule’s polarizability.
AMW	Average molecular weight of all atoms in the molecule.
Max Partial Charge	Maximum partial charge in the molecule.
Min Partial Charge	Minimum partial charge in the molecule.
Max Abs Partial Charge	Maximum absolute value of the partial charges in the molecule.
Min Abs Partial Charge	Minimum absolute value of the partial charges in the molecule.
Labute ASA	Labute’s Approximate Surface Area, an estimate of the molecular surface area.
Max EState Index	Maximum electrotopological state index of the atoms in the molecule.
Min EState Index	Minimum electrotopological state index of the atoms in the molecule.
Max Abs EState Index	Maximum absolute value of the electrotopological state indices in the molecule.
Min Abs EState Index	Minimum absolute value of the electrotopological state indices in the molecule.
fr_C_O	Number of carbonyl oxygen atoms in the molecule.

A.2 Training details

The $x$ embedding layer, $y$ embedding layer, and prediction layer in LICO are MLPs with a hidden dimension of $1024$ . We train LICO for $20000$ steps with a batch size of $4$ . For each data point in the batch, we randomly decide whether to sample an intrinsic or a synthetic function, with the probability of choosing a synthetic function being $0.1$ . Each data point is a sequence of $(x,y)$ pairs with length $n\sim\mathcal{U}[64,800]$ . If the function is intrinsic, we uniformly sample a property from Table 5; otherwise, we sample synthetic data following Equation (4).
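The sampling procedure for one training sequence can be sketched as follows. This is a minimal illustration, not the training code: the function name `sample_training_task` and the returned dictionary layout are hypothetical, and we assume $\mathcal{U}[64,800]$ denotes a discrete uniform distribution that includes both endpoints.

```python
import random


def sample_training_task(intrinsic_names, p_synthetic=0.1,
                         min_len=64, max_len=800, rng=random):
    """Pick the function family and sequence length for one training sequence.

    intrinsic_names: list of intrinsic property names (e.g. the rows of Table 5).
    Returns a dict describing which kind of function to evaluate and on how
    many (x, y) pairs; generating the pairs themselves is left out.
    """
    n = rng.randint(min_len, max_len)  # inclusive on both ends (assumption)
    if rng.random() < p_synthetic:
        # Synthetic function; its values would come from the GP-based
        # generator of Equation (4), which is not reproduced here.
        return {"kind": "synthetic", "name": None, "length": n}
    return {"kind": "intrinsic", "name": rng.choice(intrinsic_names), "length": n}
```

With `p_synthetic=0.1`, about one in ten sequences in expectation comes from a synthetic function, which corresponds to the "0.1 Synthetic" setting in Table 3.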

We use Llama-2-7b [45] as the base LLM, and use LoRA [19] for parameter-efficient finetuning. We use a base learning rate of $5\times 10^{-4}$ with a linear warmup for $1000$ steps and a cosine decay for the remaining $19000$ steps. We use LoRA with a rank of $16$ and $\alpha$ scale of $16$ .
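The learning rate schedule can be sketched as a function of the training step. This is an illustrative sketch under assumptions: the function name `learning_rate` is hypothetical, the warmup is assumed to reach the base rate on its last step, and the cosine decay is assumed to end at zero, since the final value is not stated.

```python
import math


def learning_rate(step, base_lr=5e-4, warmup_steps=1000, total_steps=20000):
    """Linear warmup to base_lr, then cosine decay (assumed to end at zero)."""
    if step < warmup_steps:
        # Linear ramp: reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

Halfway through the decay phase (step $10500$ with these defaults) the rate equals half of the base learning rate.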

A.3 Black-box optimization hyperparameters

We use Algorithm 1 to optimize a black-box function with LICO. We initialize the observed dataset $\mathcal{D}_{\text{obs}}$ with a population of $34$ molecules sampled randomly from ZINC. At each iteration, we use the best $34$ candidates in $\mathcal{D}_{\text{obs}}$ to generate new candidates via crossover and mutation operations, with the mutation rate being $0.01$ . The candidate pool size $C$ is $100$ . We predict the mean $\mu_{i}$ and standard deviation $\sigma_{i}$ for each candidate $x_{i}$ in the pool using LICO. We employ a UCB acquisition function to compute the utility score $u_{i}=\mu_{i}+\beta\sigma_{i}$ , which balances exploration and exploitation. Following [14], we set $\beta=10^{b}$ , where $b\sim\mathcal{U}[-0.5,1.5]$ . We then pick $k=15$ candidates with the highest utility scores. We evaluate each selected candidate $x_{j}$ using the black-box function $f$ , and add the new data point $(x_{j},y_{j})$ to the observed dataset $\mathcal{D}_{\text{obs}}$ . The process continues with the updated observed dataset, and stops once $|\mathcal{D}_{\text{obs}}|$ reaches the budget of $1000$ evaluations.
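The UCB selection step can be sketched as follows, given the predicted means and standard deviations for the candidate pool. This is a minimal sketch: the function name `select_batch` is hypothetical, and we assume that $\beta$ is resampled once per call (that is, once per optimization iteration).

```python
import random


def select_batch(mu, sigma, k=15, rng=random):
    """Return indices of the k candidates with the highest UCB utility.

    mu, sigma: predicted means and standard deviations, one per candidate.
    """
    b = rng.uniform(-0.5, 1.5)   # b ~ U[-0.5, 1.5]
    beta = 10.0 ** b             # beta = 10^b, so beta is always positive
    utility = [m + beta * s for m, s in zip(mu, sigma)]
    order = sorted(range(len(utility)), key=lambda i: utility[i], reverse=True)
    return order[:k]
```

Sampling $b$ on a log scale lets $\beta$ range from about $0.32$ to about $31.6$, so some iterations favor candidates with high predicted mean while others favor candidates with high predictive uncertainty.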

When predicting $\mu_{i},\sigma_{i}=f_{\theta}(x_{i},\mathcal{D}_{\text{obs}})$ , we normalize all the $y$ values in $\mathcal{D}_{\text{obs}}$ to have mean $0$ and standard deviation $1$ . This matches the data distribution that LICO sees during finetuning. We then denormalize $\mu_{i}$ and $\sigma_{i}$ to the original space.
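The normalization and its inverse can be sketched as follows. The helper names are hypothetical, and the use of the population standard deviation and of a small floor `eps` to avoid division by zero are assumptions.

```python
from statistics import mean, pstdev


def normalize_targets(ys, eps=1e-8):
    """Standardize observed y values; also return the statistics used."""
    m = mean(ys)
    s = max(pstdev(ys), eps)  # floor guards against constant targets
    return [(y - m) / s for y in ys], m, s


def denormalize_prediction(mu_norm, sigma_norm, m, s):
    """Map a prediction from the normalized space back to the original space."""
    # The mean is shifted and scaled; the standard deviation is only scaled.
    return mu_norm * s + m, sigma_norm * s
```

Since the transformation is affine, the ordering of the utility scores $u_{i}=\mu_{i}+\beta\sigma_{i}$ across candidates is the same in the normalized and the original space.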

A.4 Black-box optimization with LICO

Algorithm 1 outlines the optimization algorithm using LICO as the surrogate model.

Algorithm 1 Black-box optimization with LICO

Require: objective $f$, LICO model $f_{\theta}$, budget $B$, candidate pool size $C$, acquisition function $\alpha$, batch size $k$
Initialize $\mathcal{D}_{\text{obs}}=\{\}$
while $|\mathcal{D}_{\text{obs}}|<B$ do
    Generate a set of candidates $\{x_{i}\}_{i=1}^{C}$
    for each candidate $x_{i}$ do
        Predict $\mu_{i},\sigma_{i}=f_{\theta}(x_{i},\mathcal{D}_{\text{obs}})$
        Compute utility score $u_{i}=\alpha(\mu_{i},\sigma_{i})$
    end for
    Select $k$ candidates with the highest utility scores
    for each selected candidate $x_{j}$ do
        Evaluate $x_{j}$ using the actual objective $y_{j}=f(x_{j})$
        Add $(x_{j},y_{j})$ to the observation dataset $\mathcal{D}_{\text{obs}}$
    end for
end while

Appendix B Baseline details

TNP is a transformer-based architecture for in-context learning. We refer to Nguyen and Grover [36] for more details about TNP. We train a TNP model with $16$ attention layers and $2048$ hidden dimensions. Other hyperparameters are the same as for LICO. After training, we use TNP for black-box optimization using Algorithm 1 with the same optimization hyperparameters but replace LICO with TNP.

GP BO replaces the LICO surrogate model in Algorithm 1 with a Gaussian Process with a Tanimoto kernel. We optimize the Gaussian Process hyperparameters via maximum likelihood estimation on the initial population sampled from ZINC. Other optimization hyperparameters are the same as for LICO.
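For intuition, the Tanimoto kernel between two molecules can be sketched on binary fingerprints. This is a minimal illustration under assumptions: fingerprints are represented as sets of on-bit indices, the function names are hypothetical, and the value returned for two empty fingerprints is a convention chosen here.

```python
def tanimoto(a, b):
    """Tanimoto similarity |a & b| / |a | b| between two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention for two empty fingerprints (assumption)
    return len(a & b) / union


def kernel_matrix(fingerprints):
    """Gram matrix of pairwise Tanimoto similarities."""
    return [[tanimoto(x, y) for y in fingerprints] for x in fingerprints]
```

The kernel equals $1$ for identical fingerprints and $0$ for fingerprints that share no bits, so the GP treats molecules with overlapping substructures as having correlated objective values.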

Graph GA is a model-free variant of Algorithm 1. Specifically, at each iteration, Graph GA generates $15$ candidates using the same crossover and mutation operations, and directly evaluates and adds them to $\mathcal{D}_{\text{obs}}$ , since it does not employ a surrogate model.

REINVENT adopts a policy-based RL approach to finetune a pretrained RNN to generate SMILES strings with high returns. At each optimization iteration, we sample $16$ molecules from the finetuned RNN, evaluate these molecules using the black-box function $f$ , and add the new data points to $\mathcal{D}_{\text{obs}}$ . We refer to Gao et al. [14] for more details of the algorithm and other hyperparameters.

Appendix C Additional results

C.1 Additional metrics

In addition to AUC Average Top-10, we measure the optimization performance of different methods on AUC Average Top-1 and AUC Average Top-100 for a more comprehensive comparison. Tables 6 and 7 show AUC Average Top-1 and AUC Average Top-100 performance, respectively.

Table 6: The performance of LICO and the baselines on $21$ optimization tasks in PMO with AUC Average Top-1 metric. A higher score is better. We report the mean and stddev of scores averaged over $5$ random seeds. We use blue and violet to denote the best and second-best method for each task.

Task	GP BO	Graph GA	LICO	REINVENT	TNP
albuterol_similarity	${\color{violet}\mathbf{0.672\pm 0.109}}$	$0.647\pm 0.080$	${\color{blue}\mathbf{0.695\pm 0.150}}$	$0.572\pm 0.026$	$0.611\pm 0.042$
amlodipine_mpo	${\color{violet}\mathbf{0.538\pm 0.016}}$	$0.526\pm 0.017$	${\color{blue}\mathbf{0.560\pm 0.026}}$	$0.500\pm 0.016$	$0.513\pm 0.016$
celecoxib_rediscovery	$0.434\pm 0.052$	$0.466\pm 0.062$	${\color{blue}\mathbf{0.492\pm 0.079}}$	$0.415\pm 0.031$	${\color{violet}\mathbf{0.482\pm 0.067}}$
deco_hop	${\color{violet}\mathbf{0.598\pm 0.013}}$	$0.590\pm 0.005$	${\color{blue}\mathbf{0.603\pm 0.012}}$	$0.585\pm 0.010$	$0.597\pm 0.002$
drd2_current	$0.895\pm 0.067$	${\color{violet}\mathbf{0.898\pm 0.048}}$	${\color{blue}\mathbf{0.902\pm 0.055}}$	$0.867\pm 0.077$	$0.831\pm 0.043$
fexofenadine_mpo	${\color{blue}\mathbf{0.728\pm 0.022}}$	$0.691\pm 0.011$	${\color{violet}\mathbf{0.719\pm 0.025}}$	$0.696\pm 0.012$	$0.706\pm 0.014$
isomers_c7h8n2o2	$0.576\pm 0.154$	$0.815\pm 0.120$	${\color{violet}\mathbf{0.834\pm 0.109}}$	${\color{blue}\mathbf{0.846\pm 0.070}}$	$0.761\pm 0.145$
isomers_c9h10n2o2pf2cl	$0.644\pm 0.053$	$0.708\pm 0.083$	${\color{violet}\mathbf{0.714\pm 0.084}}$	${\color{blue}\mathbf{0.724\pm 0.043}}$	$0.701\pm 0.086$
median1	$0.235\pm 0.016$	$0.233\pm 0.018$	${\color{blue}\mathbf{0.242\pm 0.020}}$	$0.229\pm 0.015$	${\color{violet}\mathbf{0.238\pm 0.015}}$
median2	${\color{blue}\mathbf{0.212\pm 0.010}}$	$0.193\pm 0.011$	$0.201\pm 0.009$	${\color{violet}\mathbf{0.209\pm 0.013}}$	$0.200\pm 0.018$
mestranol_similarity	${\color{blue}\mathbf{0.449\pm 0.028}}$	$0.387\pm 0.020$	${\color{violet}\mathbf{0.445\pm 0.014}}$	$0.433\pm 0.034$	$0.406\pm 0.011$
osimertinib_mpo	${\color{blue}\mathbf{0.788\pm 0.008}}$	$0.777\pm 0.008$	${\color{violet}\mathbf{0.781\pm 0.007}}$	$0.780\pm 0.009$	$0.776\pm 0.007$
perindopril_mpo	${\color{violet}\mathbf{0.475\pm 0.019}}$	$0.460\pm 0.025$	${\color{blue}\mathbf{0.492\pm 0.011}}$	$0.432\pm 0.010$	$0.457\pm 0.012$
qed	$0.926\pm 0.011$	$0.930\pm 0.004$	${\color{blue}\mathbf{0.935\pm 0.002}}$	${\color{violet}\mathbf{0.934\pm 0.003}}$	$0.931\pm 0.001$
ranolazine_mpo	${\color{blue}\mathbf{0.729\pm 0.024}}$	$0.684\pm 0.015$	${\color{violet}\mathbf{0.711\pm 0.028}}$	$0.657\pm 0.048$	$0.669\pm 0.032$
scaffold_hop	${\color{violet}\mathbf{0.486\pm 0.010}}$	$0.475\pm 0.008$	${\color{blue}\mathbf{0.491\pm 0.013}}$	$0.468\pm 0.010$	$0.484\pm 0.019$
sitagliptin_mpo	$0.268\pm 0.098$	$0.281\pm 0.069$	${\color{blue}\mathbf{0.363\pm 0.114}}$	${\color{violet}\mathbf{0.333\pm 0.030}}$	$0.274\pm 0.044$
thiothixene_rediscovery	${\color{blue}\mathbf{0.371\pm 0.046}}$	$0.351\pm 0.029$	${\color{violet}\mathbf{0.368\pm 0.041}}$	$0.345\pm 0.026$	$0.332\pm 0.041$
troglitazone_rediscovery	${\color{blue}\mathbf{0.329\pm 0.019}}$	$0.289\pm 0.021$	${\color{violet}\mathbf{0.309\pm 0.033}}$	$0.276\pm 0.009$	$0.286\pm 0.012$
valsartan_smarts	${\color{violet}\mathbf{0.000\pm 0.000}}$	$0.000\pm 0.000$	$0.000\pm 0.000$	${\color{blue}\mathbf{0.000\pm 0.000}}$	$0.000\pm 0.000$
zaleplon_mpo	$0.431\pm 0.031$	$0.418\pm 0.022$	${\color{violet}\mathbf{0.435\pm 0.027}}$	${\color{blue}\mathbf{0.456\pm 0.020}}$	$0.428\pm 0.022$
Sum of scores ($\uparrow$)	$10.784$	${\color{violet}\mathbf{10.818}}$	${\color{blue}\mathbf{11.291}}$	$10.755$	$10.683$
Mean rank ($\downarrow$)	${\color{violet}\mathbf{2.52}}$	$3.57$	${\color{blue}\mathbf{1.62}}$	$3.48$	$3.75$

Table 7: The performance of LICO and the baselines on $21$ optimization tasks in PMO with the AUC Average Top-100 metric. A higher score is better. We report the mean and standard deviation of scores over $5$ random seeds. We use blue and violet to denote the best and second-best method for each task.

Task	GP BO	Graph GA	LICO	REINVENT	TNP
albuterol_similarity	${\color{violet}\mathbf{0.548\pm 0.100}}$	$0.470\pm 0.042$	${\color{blue}\mathbf{0.563\pm 0.093}}$	$0.395\pm 0.012$	$0.448\pm 0.028$
amlodipine_mpo	${\color{violet}\mathbf{0.458\pm 0.008}}$	$0.422\pm 0.014$	${\color{blue}\mathbf{0.486\pm 0.025}}$	$0.407\pm 0.005$	$0.420\pm 0.013$
celecoxib_rediscovery	${\color{violet}\mathbf{0.363\pm 0.040}}$	$0.346\pm 0.036$	${\color{blue}\mathbf{0.372\pm 0.070}}$	$0.296\pm 0.024$	$0.346\pm 0.026$
deco_hop	${\color{violet}\mathbf{0.579\pm 0.013}}$	$0.563\pm 0.006$	${\color{blue}\mathbf{0.583\pm 0.009}}$	$0.550\pm 0.006$	$0.568\pm 0.004$
drd2_current	${\color{blue}\mathbf{0.741\pm 0.097}}$	$0.605\pm 0.086$	${\color{violet}\mathbf{0.725\pm 0.092}}$	$0.615\pm 0.098$	$0.556\pm 0.095$
fexofenadine_mpo	${\color{blue}\mathbf{0.645\pm 0.018}}$	$0.588\pm 0.008$	${\color{violet}\mathbf{0.636\pm 0.022}}$	$0.549\pm 0.004$	$0.599\pm 0.016$
isomers_c7h8n2o2	$0.300\pm 0.142$	${\color{blue}\mathbf{0.535\pm 0.091}}$	$0.450\pm 0.149$	${\color{violet}\mathbf{0.511\pm 0.058}}$	$0.492\pm 0.115$
isomers_c9h10n2o2pf2cl	${\color{violet}\mathbf{0.474\pm 0.038}}$	$0.441\pm 0.068$	${\color{blue}\mathbf{0.535\pm 0.067}}$	$0.445\pm 0.027$	$0.447\pm 0.049$
median1	${\color{blue}\mathbf{0.175\pm 0.022}}$	$0.168\pm 0.013$	$0.166\pm 0.019$	$0.162\pm 0.007$	${\color{violet}\mathbf{0.170\pm 0.008}}$
median2	${\color{blue}\mathbf{0.184\pm 0.006}}$	$0.158\pm 0.008$	${\color{violet}\mathbf{0.175\pm 0.010}}$	$0.155\pm 0.006$	$0.162\pm 0.009$
mestranol_similarity	${\color{blue}\mathbf{0.379\pm 0.020}}$	$0.311\pm 0.016$	${\color{violet}\mathbf{0.361\pm 0.030}}$	$0.302\pm 0.016$	$0.314\pm 0.003$
osimertinib_mpo	${\color{blue}\mathbf{0.706\pm 0.006}}$	$0.667\pm 0.008$	${\color{violet}\mathbf{0.694\pm 0.010}}$	$0.623\pm 0.014$	$0.671\pm 0.006$
perindopril_mpo	${\color{violet}\mathbf{0.405\pm 0.019}}$	$0.357\pm 0.012$	${\color{blue}\mathbf{0.424\pm 0.007}}$	$0.332\pm 0.011$	$0.359\pm 0.010$
qed	$0.853\pm 0.010$	$0.854\pm 0.011$	${\color{blue}\mathbf{0.882\pm 0.007}}$	${\color{violet}\mathbf{0.874\pm 0.003}}$	$0.857\pm 0.003$
ranolazine_mpo	${\color{blue}\mathbf{0.633\pm 0.020}}$	$0.462\pm 0.022$	${\color{violet}\mathbf{0.617\pm 0.021}}$	$0.436\pm 0.040$	$0.468\pm 0.042$
scaffold_hop	${\color{violet}\mathbf{0.462\pm 0.006}}$	$0.435\pm 0.008$	${\color{blue}\mathbf{0.462\pm 0.006}}$	$0.415\pm 0.009$	$0.440\pm 0.010$
sitagliptin_mpo	$0.133\pm 0.062$	$0.103\pm 0.032$	${\color{blue}\mathbf{0.171\pm 0.045}}$	${\color{violet}\mathbf{0.134\pm 0.016}}$	$0.100\pm 0.023$
thiothixene_rediscovery	${\color{blue}\mathbf{0.311\pm 0.030}}$	$0.270\pm 0.015$	${\color{violet}\mathbf{0.299\pm 0.026}}$	$0.256\pm 0.015$	$0.261\pm 0.024$
troglitazone_rediscovery	${\color{blue}\mathbf{0.283\pm 0.014}}$	$0.228\pm 0.008$	${\color{violet}\mathbf{0.258\pm 0.024}}$	$0.201\pm 0.008$	$0.230\pm 0.005$
valsartan_smarts	${\color{violet}\mathbf{0.000\pm 0.000}}$	$0.000\pm 0.000$	$0.000\pm 0.000$	${\color{blue}\mathbf{0.000\pm 0.000}}$	$0.000\pm 0.000$
zaleplon_mpo	${\color{violet}\mathbf{0.301\pm 0.036}}$	$0.258\pm 0.016$	${\color{blue}\mathbf{0.318\pm 0.018}}$	$0.296\pm 0.009$	$0.257\pm 0.013$
Sum of scores ($\uparrow$)	${\color{violet}\mathbf{8.933}}$	$8.242$	${\color{blue}\mathbf{9.175}}$	$7.954$	$8.167$
Mean rank ($\downarrow$)	${\color{violet}\mathbf{1.95}}$	$3.62$	${\color{blue}\mathbf{1.76}}$	$4.14$	$3.45$
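The two summary rows in the tables above aggregate per-task results. The sketch below shows one way to compute them from mean scores, using three rows of Table 7 as input. It is illustrative only: the function name `summarize` is hypothetical, and ties are broken by column order, which is an assumption since tie handling (for example on valsartan_smarts) is not specified here.

```python
def summarize(table):
    """table: {task: {method: mean score}} -> (sum of scores, mean rank) per method."""
    methods = list(next(iter(table.values())))
    total = {m: 0.0 for m in methods}
    rank_sum = {m: 0 for m in methods}
    for scores in table.values():
        # Rank 1 is the highest score; sorted() is stable, so ties keep column order.
        ordered = sorted(methods, key=lambda m: scores[m], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            rank_sum[m] += rank
        for m in methods:
            total[m] += scores[m]
    n_tasks = len(table)
    return total, {m: rank_sum[m] / n_tasks for m in methods}


# Mean AUC Average Top-100 scores copied from three rows of Table 7.
rows = {
    "albuterol_similarity": {"GP BO": 0.548, "Graph GA": 0.470, "LICO": 0.563,
                             "REINVENT": 0.395, "TNP": 0.448},
    "drd2_current": {"GP BO": 0.741, "Graph GA": 0.605, "LICO": 0.725,
                     "REINVENT": 0.615, "TNP": 0.556},
    "qed": {"GP BO": 0.853, "Graph GA": 0.854, "LICO": 0.882,
            "REINVENT": 0.874, "TNP": 0.857},
}
totals, mean_rank = summarize(rows)
```

On this three-task subset, LICO ranks first, second, and first, giving a mean rank of $4/3$. The full tables apply the same aggregation across all $21$ tasks.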

Appendix D Broader impact

Our work studies the application of large language models to black-box optimization, particularly in the domain of molecular optimization. This intersection of machine learning and optimization holds significant promise for advancing our understanding of LLMs’ capabilities and limitations, and has significant potential in areas like material science and drug discovery. Our main goal is to enhance machine learning and optimization techniques, but it’s also important to consider how these advancements might affect society, such as speeding up the development of new medicines and materials.

Appendix E Compute Resources

All experiments in this paper are run on a cluster of $4$ A6000 GPUs, each with 48 GB of memory.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Answer: [Yes]
Justification: The paper introduces LICO, a novel method for leveraging arbitrary base LLMs for black-box optimization. LICO achieves state-of-the-art performance on PMO, a benchmark for molecular optimization. Both the abstract and introduction include these claims.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Yes, we mention the limitations of the paper in Section 6.
Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [N/A]
Justification: NA
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: All training and evaluation details are outlined in Sections 4, 5, and Appendix A.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Code, data, and checkpoints will be released upon acceptance.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: All training and evaluation details are outlined in Sections 4, 5, and Appendix A.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Yes, each result is averaged over $5$ random seeds.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We specify the compute resources in Appendix E.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The research in the paper conforms with the NeurIPS Code of Ethics in every respect.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We discuss the broader impacts of the paper in Appendix D.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [N/A]
Justification: NA
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We used and cited PMO, an open-source benchmark for molecular optimization.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset’s creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [N/A]
Justification: NA
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [N/A]
Justification: NA
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [N/A]
Justification: NA
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.