
GROOT: Effective Design of Biological Sequences
with Limited Experimental Data

Thanh V. T. Tran (FPT Software AI Center, Hanoi, Vietnam; ThanhTVT1@fpt.com), Nhat Khang Ngo (FPT Software AI Center, Ho Chi Minh, Vietnam; KhangNN4@fpt.com), Viet Anh Nguyen (FPT Software AI Center, Hanoi, Vietnam; AnhNV117@fpt.com), and Truong Son Hy (The University of Alabama at Birmingham, Birmingham, AL 35294, United States; thy@uab.edu)
(2018)
Abstract.

Latent space optimization (LSO) is a powerful method for designing discrete, high-dimensional biological sequences that maximize expensive black-box functions, such as wet lab experiments. This is accomplished by learning a latent space from available data and using a surrogate model $f_{\Phi}$ to guide optimization algorithms toward optimal outputs. However, existing methods struggle when labeled data is limited, as training $f_{\Phi}$ with few labeled data points can lead to subpar outputs, offering no advantage over the training data itself. We address this challenge by introducing GROOT, a GRaph-based Latent SmOOThing for Biological Sequence Optimization. In particular, GROOT generates pseudo-labels for neighbors sampled around the training latent embeddings. These pseudo-labels are then refined and smoothed by Label Propagation. Additionally, we theoretically and empirically justify our approach, demonstrating GROOT’s ability to extrapolate to regions beyond the training set while maintaining reliability within an upper bound of their expected distances from the training regions. We evaluate GROOT on various biological sequence design tasks, including protein optimization (GFP and AAV) and three tasks with exact oracles from Design-Bench. The results demonstrate that GROOT matches or surpasses existing methods without requiring access to black-box oracles or vast amounts of labeled data, highlighting its practicality and effectiveness. We release our code at https://anonymous.4open.science/r/GROOT-D554.

Protein Optimization, Latent Space Optimization, Landscape Smoothing, Label Propagation
†: Equal contribution. *: Corresponding author.
CCS concepts: Computing methodologies, Neural networks; Applied computing, Molecular evolution.

1. Introduction

Proteins are crucial biomolecules that play diverse and essential roles in every living organism. Enhancing protein functions or cellular fitness is vital for industrial, research, and therapeutic applications (Huang et al., 2016; Wang et al., 2021). One powerful approach to achieve this is directed evolution, which involves iteratively performing random mutations and screening for proteins with desired phenotypes (Arnold, 1996). However, the vast space of possible proteins makes exhaustive searches impractical, whether in nature, in the laboratory, or computationally. Consequently, recent machine learning (ML) methods have been developed to improve the sample efficiency of this evolutionary search (Jain et al., 2022; Ren et al., 2022; Lee et al., 2024). When experimental data is available, these approaches use a surrogate model $f_{\Phi}$, trained to guide optimization algorithms toward optimal outputs.

While these methods have achieved state-of-the-art results on various benchmarks, most neglect the scenario of extremely limited labeled data and fail to utilize the abundant unlabeled data. This is a significant issue because exploring protein function and characteristics through iterative mutation in the wet lab is costly (Fowler and Fields, 2014). As later shown in Section 4.2, previous works perform poorly in this context, indicating that training $f_{\Phi}$ with few labeled data points can lead to subpar designs, offering no advantage over the training data itself. One potential solution for dealing with noisy and limited data is to regularize the fitness landscape model, smoothing the sequence representation and facilitating the use of gradient-based optimization algorithms. Although some works have addressed this problem (Section 2), none can optimize effectively when labels are scarce.

1.1. Contribution

Our work builds on the observation by Trabucco et al. (2022) that surrogate models $f_{\Phi}$ trained on limited labeled data are vulnerable to noisy labels. This sensitivity can lead the model to sample false negatives or become trapped in suboptimal solutions (local minima) when used for guiding offline optimization algorithms. To address this issue, the ultimate objective is to enhance the predictive ability of $f_{\Phi}$, reducing the risk of finding suboptimal solutions. To this end, we propose GROOT, a novel framework designed to tackle the problem of label scarcity by generating synthetic sample representations from existing data. From the features produced by the encoder, we formulate sequences as a graph with fitness values as node attributes and apply Label Propagation (Zhou et al., 2003) to generate labels for newly created nodes synthesized through interpolation from existing nodes. The smoothed data is then fitted with a neural network for optimization. Figure 1 provides an overview of our framework. We evaluate our method on two fitness optimization tasks: Green Fluorescent Proteins (GFP) (Sarkisyan et al., 2016) and Adeno-Associated Virus (AAV) (Bryant et al., 2021). Our results show that GROOT outperforms previous state-of-the-art baselines across all difficulties of both tasks. We also demonstrate that our method performs stably in extreme cases where there are fewer than 100 labeled training sequences, unlike other approaches. Our contributions are summarized as follows:

  • We introduce GROOT, a novel framework that uses graph-based smoothing to train a smoothed fitness model, which is then used in the optimization process.

  • We theoretically show that our smoothing technique can expand into extrapolation regions while keeping a reasonable distance from the training data. This helps to reduce errors occurring when the surrogate model makes predictions for unseen points too far from the training set.

  • We empirically show that GROOT achieves state-of-the-art results across all difficulties in two protein design benchmarks, AAV and GFP. Notably, in extreme cases with very limited labeled data, our approach performs stably, achieving a $6$-fold fitness improvement in GFP and $1.3$ times higher fitness in AAV compared to the training set.

  • We further evaluate GROOT on diverse tasks of different domains (e.g., robotics, DNA) within the Design-Bench. Our experimental results are competitive with state-of-the-art approaches, highlighting the method’s domain-agnostic capabilities.

2. Related work

LSO for Sequence Design

Directed evolution is a traditional paradigm in protein sequence design that has achieved notable successes (Arnold, 2018). Within this framework, various machine learning algorithms have been proposed to improve the sample efficiency of evolutionary searches (Ren et al., 2022; Qiu and Wei, 2022; Emami et al., 2023; Tran and Hy, 2024). However, most of these methods optimize protein sequences directly in the sequence space, dealing with discrete, high-dimensional decision variables. Alternatively, Gómez-Bombarelli et al. (2018) and Castro et al. (2022) employed a VAE model and applied gradient ascent to optimize the latent representation, which is then decoded into biological sequences. Similarly, Lee et al. (2023) used an off-policy reinforcement learning method to facilitate updates in the representation space. Another notable approach (Stanton et al., 2022) involves training a denoising autoencoder with a discriminative multi-task Gaussian process head, enabling gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. Recently, latent diffusion has been introduced for designing novel proteins, leveraging the capabilities of a protein language model (Chen et al., 2023).

Protein Fitness Regularization

The NK model was an early effort to represent protein fitness smoothness using a statistical model of epistasis (Kauffman and Weinberger, 1989). Gómez-Bombarelli et al. (2018) approached this by mapping sequences to a regularized latent fitness landscape, incorporating an auxiliary network to predict fitness from the latent space. Castro et al. (2022) further enhanced this work by introducing negative sampling and interpolation-based regularization of the latent representation. Additionally, Frey et al. (2024) proposed to learn a smoothed energy function, allowing sequences to be sampled from the smoothed data manifold using Markov Chain Monte Carlo (MCMC). GGS (Kirjner et al., 2024) is the work most closely related to ours; it applies discrete regularization through graph-based smoothing. However, GGS enforces similar sequences to have similar fitness, potentially leading to suboptimal performance since fitness can change significantly with a single mutation (Brookes et al., 2022a). Our work addresses this by encoding sequences into the latent space and constructing a graph based on the similarity of latent vectors. Furthermore, since our framework operates in the latent space, it is not constrained to biological domains and can be applied to other fields like robotics, as demonstrated in Appendix F.

3. Method

We begin this section by formally defining the problem. Next, we detail the construction of the protein latent space and present our proposed latent graph-based smoothing framework, GROOT. Figure 1 provides a visual overview of our method.

3.1. Problem Formulation

We address the challenge of designing proteins to find high-fitness sequences $s$ within the sequence space $\mathcal{A}^{L}$, where $\mathcal{A}$ represents the amino acid vocabulary (i.e., $|\mathcal{A}|=20$, since both animal and plant proteins are made up of about 20 common amino acids) and $L$ is the desired sequence length. Our goal is to design sequences that maximize a black-box protein fitness function $\mathcal{O}:\mathcal{A}^{L}\mapsto\mathbb{R}$, which can only be evaluated through wet-lab experiments:

$$ s^{*}=\underset{s\in\mathcal{A}^{L}}{\operatorname{argmax}}\ \mathcal{O}(s). \tag{1} $$

For in-silico evaluation, given a static dataset of all known sequences and fitness measurements $\mathcal{D}^{*}=\{(s,y)\,|\,s\in\mathcal{A}^{L},y\in\mathbb{R}\}$, an oracle $\mathcal{O}_{\psi}$ parameterized by $\psi$ is trained to minimize the prediction error on $\mathcal{D}^{*}$. Afterward, $\mathcal{O}_{\psi}$ is used as an approximator for the black-box function $\mathcal{O}$ to evaluate computational methods that are developed on a training subset $\mathcal{D}$ of $\mathcal{D}^{*}$. In other words, given $\mathcal{D}$, our task is to generate sequences $\hat{s}$ that optimize the fitness approximated by $\mathcal{O}_{\psi}$. To do this, we train a surrogate model $f$ on $\mathcal{D}$ and use it to guide the search for optimal candidates, which are later evaluated by $\mathcal{O}_{\psi}$. This setup is similar to that of Kirjner et al. (2024) and Jain et al. (2022).

Figure 1. Overall framework of GROOT. After encoding sequences into the latent space, we generate new samples by adding Gaussian noise to existing vectors. These synthetic data lie outside the training set’s convex hull but within a reliable zone, as their distances from the hull are below a certain upper bound. We construct a kNN graph and run label propagation to smooth and refine node labels. These nodes and their fitness values are then used to train the surrogate model, which is subsequently employed for optimization.

3.2. Constructing Latent Space of Protein Sequences

Unsupervised learning has achieved remarkable success in domains like natural language processing (Brown et al., 2020) and computer vision (Kirillov et al., 2023). This method efficiently learns data representations by identifying underlying patterns without requiring labeled data. The label-free nature of unsupervised learning aligns well with protein design challenges, where experimental fitness evaluations are costly, while unlabeled sequences are abundant. In this work, considering a dataset of unlabeled sequences of a protein family $s\in\mathcal{A}^{L}$ denoted as $\mathcal{D}_{u}=\{s_{i}\}_{i=1}^{N_{u}}$, we train a VAE comprising an encoder $\phi:\mathcal{A}^{L}\mapsto\mathbb{R}^{d_{h}}$ and a decoder $\theta:\mathbb{R}^{d}\mapsto\mathcal{A}^{L}$. Each sequence $s_{i}$ is encoded into a low-dimensional vector $h_{i}\in\mathbb{R}^{d_{h}}$. Subsequently, the mean $\mu_{i}\in\mathbb{R}^{d}$ and log-variance $\log\sigma_{i}\in\mathbb{R}^{d}$ of the variational posterior approximation are computed from $h_{i}$ using two feed-forward networks, $\mu_{\phi}:\mathbb{R}^{d_{h}}\mapsto\mathbb{R}^{d}$ and $\sigma_{\phi}:\mathbb{R}^{d_{h}}\mapsto\mathbb{R}^{d}$. A latent vector $x_{i}\in\mathbb{R}^{d}$ is then sampled from the Gaussian distribution $\mathcal{N}(\mu_{i},\sigma_{i}^{2})$, and the decoder $\theta$ maps $x_{i}$ back to the reconstructed protein sequence $\hat{s}_{i}\in\mathcal{A}^{L}$. The training objective involves the cross-entropy loss $\mathcal{C}(\hat{s}_{i},s_{i})$ between the ground truth sequence $s_{i}$ and the generated $\hat{s}_{i}$, as well as the Kullback-Leibler (KL) divergence between $\mathcal{N}(\mu_{i},\sigma_{i}^{2})$ and $\mathcal{N}(0,I_{d})$:

$$ \mathcal{L}_{vae}=\frac{1}{N_{u}}\sum_{i=1}^{N_{u}}\mathcal{C}(\hat{s}_{i},s_{i})+\frac{\eta}{N_{u}}\sum_{i=1}^{N_{u}}D_{\text{KL}}\big(\mathcal{N}(\mu_{i},\sigma_{i}^{2})\,\|\,\mathcal{N}(0,I_{d})\big). \tag{2} $$

Here, $\eta$ is the hyperparameter that controls the disentanglement property of the VAE’s latent space. A critical distinction must be made between the latent space of protein sequences and the protein fitness landscape. While training the VAE-based model, we do not access protein fitness scores; we only learn continuous representations of the sequences. Consequently, this VAE-based model is trained on the entirety of available protein family sequences. Subsequently, labeled sequences are encoded by the pretrained VAE into $d$-dimensional latent points, enabling the supervised training of a surrogate model, $f_{\Phi}:\mathbb{R}^{d}\mapsto\mathbb{R}$, to predict fitness values and approximate the fitness landscape.
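To make the objective in Equation 2 concrete, the following is a minimal pure-Python sketch, not the authors’ implementation. The function names and the batch layout (dictionaries with `probs`, `target`, `mu`, `log_var`) are hypothetical. It assumes that the cross-entropy is summed over sequence positions and that `log_var` stores $\log\sigma^{2}$ per latent dimension; a practical implementation would use an automatic-differentiation library instead.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def cross_entropy(probs, target):
    """Negative log-likelihood of the target tokens, summed over positions.

    probs[t][a] is the decoder's predicted probability of token a at position t.
    """
    return -sum(math.log(p[a]) for p, a in zip(probs, target))


def vae_loss(batch, eta):
    """Batch average of the reconstruction term plus the eta-weighted KL term."""
    n = len(batch)
    reconstruction = sum(cross_entropy(b["probs"], b["target"]) for b in batch) / n
    kl = sum(kl_to_standard_normal(b["mu"], b["log_var"]) for b in batch) / n
    return reconstruction + eta * kl
```

In this sketch, a larger `eta` pulls the posterior of every sequence toward $\mathcal{N}(0,I_{d})$, which is the property that Assumption 2 in Section 3.4 relies on.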

3.2.1. VAE’s Architecture

We go into detail regarding the architecture of the VAE used in our study.

Encoder

The encoder incorporates a pre-trained ESM-2 (Lin et al., 2023) followed by a latent encoder to compute the latent representation $z$. In our study, we leverage the powerful representation of the pre-trained 30-layer ESM-2 by making it the encoder of our model. Given an input sequence $s=\langle s_{0},s_{1},\cdots,s_{L}\rangle$, where $s_{i}\in\mathcal{A}$, ESM-2 computes representations for each token $s_{i}$ in $s$, resulting in a token-level hidden representation $H=\langle h_{0},h_{1},\cdots,h_{L}\rangle$, $h_{i}\in\mathbb{R}^{d_{h}}$. We calculate the global representation $h\in\mathbb{R}^{d_{h}}$ of $s$ via a weighted sum of its tokens:

$$ h=\sum_{i=1}^{L}\frac{\omega^{T}\exp(h_{i})}{\sum_{i=1}^{L}\omega^{T}\exp(h_{i})}h_{i}. \tag{3} $$

Here, $\omega$ is a learnable global attention vector. Then, two multi-layer perceptrons (MLPs) are used to compute $\mu=\text{MLP}_{1}(h)$ and $\log\sigma=\text{MLP}_{2}(h)$, where the latent dimension is $d$. Finally, a latent representation $x\in\mathbb{R}^{d}$ is sampled from $\mathcal{N}(\mu,\sigma^{2})$ and passed to the decoder to reconstruct the sequence $\hat{s}$.
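The pooling step of Equation 3 can be sketched in a few lines of pure Python. This is an illustrative reading of the formula as written, with the exponential applied elementwise to each token vector; the function name `attention_pool` is hypothetical, and the token vectors would in practice come from ESM-2 rather than hand-written lists.

```python
import math


def attention_pool(H, omega):
    """Weighted sum of token vectors following Equation 3.

    H is a list of token vectors h_i and omega is the attention vector.
    Each token receives the score omega^T exp(h_i); scores are normalized
    to sum to one and used as weights on the token vectors themselves.
    """
    scores = [sum(w * math.exp(v) for w, v in zip(omega, h)) for h in H]
    total = sum(scores)
    weights = [s / total for s in scores]
    dim = len(H[0])
    return [sum(wt * h[k] for wt, h in zip(weights, H)) for k in range(dim)]
```

Note that the weights are guaranteed to be positive only when the entries of `omega` are positive; the sketch does not enforce this and simply follows the formula.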

Decoder

To decode latent points into sequences, we employ a deep convolutional network as proposed by Castro et al. (2022), consisting of four one-dimensional convolutional layers. ReLU activations and batch normalization layers are applied between the convolutional layers, except for the final layer.

3.3. Latent Graph-based Smoothing

Graph Construction

We need to build a graph $G=(\mathcal{V},\mathcal{E})$ from the sequences in $\mathcal{D}$ to perform label propagation. Algorithm 1 presents our pipeline for constructing a kNN graph from the latent vectors $x$. The graph’s nodes are created by sampling $N$ sequences $s\in\mathcal{D}$, and the edges connect each node to its $k$ nearest neighbors among the latent embeddings $x=\phi(s)$, where $\phi$ is the pre-trained encoder. We introduce new nodes to $G$ by interpolating the learned latent $x$ with random noise $\epsilon\sim\mathcal{N}(0,I_{d})$. We argue that Levenshtein distance, as used by Kirjner et al. (2024), is an inadequate metric for computing protein sequence similarity because marginal amino acid variations can result in substantial fitness disparities (Maynard Smith, 1970; Brookes et al., 2022b). Meanwhile, high-dimensional latent embeddings are effective at capturing implicit patterns within protein sequences. Consequently, Euclidean distance can serve as a suitable metric to quantify sequence similarity for constructing kNN graphs, as proteins with comparable properties (indicated by small fitness score differences) tend to cluster closely in the latent space.

Algorithm 1 CreateGraph: Latent Graph Construction
Input: training embeddings $\mathcal{V}$, number of graph nodes $N$, constant $\beta$
1: while $|\mathcal{V}|<N$ do
2:     $\overline{x}\sim\mathcal{U}(\mathcal{V})$
3:     $\epsilon\sim\mathcal{N}(0,I_{d})$    ▷ Sample Gaussian noise $\epsilon$.
4:     $z\leftarrow\beta*\overline{x}+(1-\beta)*\epsilon$    ▷ Interpolate between $\overline{x}$ and $\epsilon$.
5:     $\mathcal{V}\leftarrow\mathcal{V}\cup\{z\}$
6: end while
7: $\mathcal{E}\leftarrow\bigcup_{x\in\mathcal{V}}\texttt{kNN}(x,\mathcal{V})$    ▷ Construct edges (Algorithm 4)
8: return $G=(\mathcal{V},\mathcal{E})$
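As an illustration of Algorithm 1, the following is a minimal pure-Python sketch, not the authors’ implementation. The name `create_graph`, the neighbor count `k`, the `seed` argument, and the returned `is_synthetic` flags are hypothetical. It assumes that anchors are drawn from the original training embeddings, matching the setting of the propositions in Section 3.4 (Algorithm 1 as written samples from the growing node set), and that kNN edges are stored as undirected pairs, since the kNN routine itself (Algorithm 4) is not shown here.

```python
import math
import random


def create_graph(train_embeddings, num_nodes, beta, k, seed=0):
    """Grow the node set with noisy copies of training points, then link kNN."""
    rng = random.Random(seed)
    n_train = len(train_embeddings)
    dim = len(train_embeddings[0])
    nodes = [list(x) for x in train_embeddings]

    while len(nodes) < num_nodes:
        # Assumption: anchors come from the training embeddings only.
        x_bar = rng.choice(train_embeddings)
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Interpolate between the anchor and standard Gaussian noise.
        z = [beta * a + (1.0 - beta) * e for a, e in zip(x_bar, eps)]
        nodes.append(z)

    edges = set()
    for i, x in enumerate(nodes):
        # Brute-force k nearest neighbors under Euclidean distance.
        dists = sorted((math.dist(x, y), j) for j, y in enumerate(nodes) if j != i)
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge (assumption)

    is_synthetic = [i >= n_train for i in range(len(nodes))]
    return nodes, sorted(edges), is_synthetic
```

The brute-force neighbor search is quadratic in the number of nodes, which is acceptable for a sketch; a larger graph would call for a spatial index or an approximate nearest-neighbor library.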
Generating Pseudo-label

We consider a constructed graph $G$. For each node $v\in\mathcal{V}$, we assign a fitness label $y$, creating a set of fitness labels $Y$. This set $Y$ can be further decomposed into $Y=\{Y_{n},Y_{u}\}$, representing the known and unknown labels, respectively. Known labels $Y_{n}$ are obtained from the training dataset, while unknown labels $Y_{u}$ are initialized to $0$ and assigned to randomly generated nodes, as shown in line 3 of Algorithm 2. We then run the following label propagation step $m$ times:

$$ Y^{\prime}=\alpha D^{-1/2}AD^{-1/2}Y+(1-\alpha)Y, \tag{4} $$

where $\alpha\in[0,1]$ is a weighting coefficient and $A$ is a weighted adjacency matrix, defined as:

$$ A_{ij}=\begin{cases}\frac{\gamma}{d(x_{i},x_{j})}&\text{if }i\neq j,\\ 0&\text{if }i=j,\end{cases} \tag{5} $$

where $d(\cdot,\cdot)$ is the Euclidean distance and $\gamma$ is a controllable factor. In summary, Algorithm 2 details our smoothing strategy by label propagation.

Algorithm 2 Smooth: Latent Graph-based Smoothing
Input: graph $G=(\mathcal{V},\mathcal{E})$, label propagation layers $m$
1: $A\leftarrow\texttt{WeightedAdjacencyMatrix}(\mathcal{E})$    ▷ Equation 5
2: $D\leftarrow\texttt{DegreeMatrix}(\mathcal{V})$    ▷ Compute degree matrix.
3: $Y\leftarrow[0\text{ if }\texttt{is\_synthetic}\text{ else }y_{i}]_{i=1}^{N}$
4: for $i=0,\ldots,m-1$ do
5:     $Y\leftarrow\texttt{LabelPropagation}(A,D,Y)$    ▷ Equation 4
6: end for
7: return $Y$
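The smoothing loop of Algorithm 2 can be sketched as follows in pure Python. This is an illustrative reading rather than the authors’ code: the name `smooth` and its argument layout are hypothetical, the degree matrix is assumed to hold weighted degrees (row sums of $A$), weights are placed only on the given edges, and all connected nodes are assumed to sit at distinct positions so that Equation 5 never divides by zero.

```python
import math


def smooth(nodes, edges, labels, is_synthetic, alpha, gamma, num_layers):
    """Run num_layers rounds of the update in Equation 4 on an undirected graph."""
    n = len(nodes)

    # Equation 5: inverse-distance weights on the graph edges, zero elsewhere.
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w = gamma / math.dist(nodes[i], nodes[j])
        A[i][j] = A[j][i] = w

    # Assumption: D holds weighted degrees, i.e. the row sums of A.
    deg = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]

    # Synthetic nodes start from 0; training nodes keep their measured labels.
    y = [0.0 if syn else lab for lab, syn in zip(labels, is_synthetic)]

    for _ in range(num_layers):
        # propagated = D^{-1/2} A D^{-1/2} y
        propagated = [
            inv_sqrt[i] * sum(A[i][j] * inv_sqrt[j] * y[j] for j in range(n))
            for i in range(n)
        ]
        y = [alpha * p + (1.0 - alpha) * v for p, v in zip(propagated, y)]
    return y
```

Because the update mixes each label with those of its neighbors, the labels of training nodes also change during smoothing; `alpha` controls how strongly.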

3.4. Theoretical Justification

This section delves into the theoretical foundations of our proposed method. We begin by discussing the concepts of interpolation and extrapolation within high-dimensional latent embedding spaces. These concepts ensure the correctness of generating additional pseudo-samples within the latent space of a VAE-based model. Subsequently, leveraging the definition of convex hulls, we establish an upper bound for the expected distance between the generated nodes and the collection of training latent embeddings. This bound informs the intuition behind our kNN graph construction, detailed in  Algorithm 1. Consequently, employing label propagation within this bounded distance emerges as a rational strategy for assigning pseudo-labels to the newly generated nodes.

We first list the definitions of a convex hull, interpolation, and extrapolation as follows:

Definition 3.1.

Let $\mathbb{X}=\{x_{1},\dots,x_{N}\}$ be a set of $N$ points in $\mathbb{R}^{d}$. The convex hull of $\mathbb{X}$ is defined as:

$$ \text{Conv}(\mathbb{X})\triangleq\bigg\{\sum_{i=1}^{N}\lambda_{i}x_{i}\,\bigg|\,\lambda_{i}\geq 0\text{ and }\sum_{i=1}^{N}\lambda_{i}=1\bigg\}. $$
Definition 3.2 (Balestriero et al. (2021)).

Interpolation occurs for a sample $x$ whenever this sample belongs to $\text{Conv}(\mathbb{X})$; if not, extrapolation occurs.

Assumptions

Given a set of $N$ training protein sequences $\mathbb{S}=\{s_{1},s_{2},\dots,s_{N}\}$, we use the encoder $\phi$ of the pretrained VAE to compute their latent embeddings, resulting in a set of latent vectors $\mathbb{X}=\{x_{1},x_{2},\dots,x_{N}\}$, where $x_{i}=\phi(s_{i})$ (see Section 3.2). In this paper, we consider the scenario wherein we have limited access to experimental data, and we train a VAE-based model to represent continuous $d$-dimensional latent vectors of those available sequences. This, therefore, gives rise to two mild assumptions in our work.

  • Assumption 1: $N$ is significantly smaller than the exponential in $d$, where $N$ represents the number of data samples.

  • Assumption 2: Each latent vector $x$ is sampled from $\mathcal{N}(0,I_{d})$. This is achieved by regularizing the model with the KL divergence loss, as shown in Equation 2.

We assess GROOT’s extrapolation capabilities by computing the probability of its generated samples falling outside the convex hull of the training data within the latent space. Subsequently, we derive an upper bound for these reliable extrapolation regions. These findings are formalized in the following propositions.

Proposition 3.3.

Let $\mathbb{X}=\{x_{i}\in\mathbb{R}^{d}\}_{i=1}^{N}$ be a set of $N$ i.i.d. $d$-dimensional samples from $\mathcal{N}(0,I_{d})$. Assume that $N\ll\exp\big(\frac{d}{2(C^{2}_{\beta}+2)}\big)$ and let $z=\beta*\overline{x}+(1-\beta)*\epsilon$ for some $\overline{x}\in\mathbb{X}$, $\beta\in(0,1)$, $C_{\beta}=\frac{1+\beta}{1-\beta}$ and $\epsilon\sim\mathcal{N}(0,I_{d})$. Then

$$ \lim_{d\rightarrow\infty}\mathbf{P}(z\notin\text{Conv}(\mathbb{X}))=1. $$

We leave the proof of the presented proposition to Section A.1. Proposition 3.3 elucidates that in a limited data scenario (i.e., $N\ll\exp\left(\frac{d}{2(C^{2}_{\beta}+2)}\right)$), the probability of a synthetic latent node $z$ lying outside the convex hull of the training set $\mathbb{X}$ goes to 1 as the latent dimension $d$ grows. Based on Definition 3.2, this theoretical result guarantees the correctness of our formula shown in line 4 of Algorithm 1 in expanding the latent space of protein sequences. However, when generated nodes are located far from the training convex hull, their embedding vectors differ significantly from those in the training set. This disparity hinders the identification of meaningful similarities and can result in unreliable fitness scores determined through label propagation. This brings us to Proposition 3.4.

Proposition 3.4.

Let $\mathbb{X}=\{x_{i}\in\mathbb{R}^{d}\}_{i=1}^{N}$ be a set of $N$ i.i.d. $d$-dimensional samples from $\mathcal{N}(0,I_{d})$. For any $\overline{x}\in\mathbb{X}$, define $z=\beta*\overline{x}+(1-\beta)*\epsilon$ with $\epsilon\sim\mathcal{N}(0,I_{d})$ and $\beta\in(0,1)$. Then the following holds:

$$ \mathbb{E}[D(z,\text{Conv}(\mathbb{X}))]<2(1-\beta)\sqrt{d}, $$

where $D(z,\text{Conv}(\mathbb{X}))\triangleq\inf_{x\in\text{Conv}(\mathbb{X})}d(z,x)$ is the distance from a point $z\in\mathbb{R}^{d}$ to $\text{Conv}(\mathbb{X})$, and $d(\cdot,\cdot)$ is the Euclidean distance.

The proof of this proposition can be found in Section A.2. Proposition 3.4 allows us to quantify the expected distance from a node randomly generated by our formula to the available training set. The derived upper bound grows linearly in the square root of the latent dimension $d$, at a rate of $2(1-\beta)$. It is worth noting that $\beta$ is used to control the exploration rate of our algorithm. As $\beta\rightarrow 1$, the generated node $z$ closely resembles a training node $\overline{x}$. Conversely, a lower $\beta$ introduces more noise into the latent vectors, resulting in samples that are further from the source nodes. This upper bound ensures that synthetic nodes remain within a controllable “reliable zone”, making label propagation a sensible choice, as the label values of these synthetic nodes cannot deviate too far from the overall distribution of existing nodes.
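A quick numerical sanity check of the bound in Proposition 3.4 can be written in pure Python. This is not the proof from Section A.2; it only uses the elementary fact that the anchor $\overline{x}$ lies in $\text{Conv}(\mathbb{X})$, so the distance from $z$ to the hull is at most $d(z,\overline{x})$. The function names are hypothetical, and the anchors are drawn from $\mathcal{N}(0,I_{d})$ as in Assumption 2.

```python
import math
import random


def mean_anchor_distance(dim, beta, trials=2000, seed=0):
    """Monte Carlo estimate of E[d(z, x_bar)], an upper bound on E[D(z, Conv(X))]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x_bar = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Assumption 2
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z = [beta * a + (1.0 - beta) * e for a, e in zip(x_bar, eps)]
        total += math.dist(z, x_bar)
    return total / trials


def proposition_bound(dim, beta):
    """Right-hand side of Proposition 3.4: 2 (1 - beta) sqrt(d)."""
    return 2.0 * (1.0 - beta) * math.sqrt(dim)
```

Comparing `mean_anchor_distance(d, beta)` with `proposition_bound(d, beta)` for a few values of `d` and `beta` shows how both quantities shrink as `beta` approaches 1, which is the sense in which $\beta$ controls exploration.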

In summary, Propositions 3.3 and 3.4 offer theoretical underpinnings for our latent-based smoothing approach. By mapping discrete sequences to a continuous $d$-dimensional latent space, we establish a quantitative relationship between newly added and original training nodes, governed by parameters $d$ and $\beta$. This quantitative connection enhances the interpretability of our method.

3.5. Model-based Optimization Algorithm

Latent space model-based optimization (MBO) employs surrogate models, $f_{\Phi}:\mathbb{R}^{d}\mapsto\mathbb{R}$, to efficiently guide the search for optimal values that maximize expensive black-box functions. The effectiveness of MBO depends on the accuracy of the surrogate model, as poorly trained surrogates can hinder the optimization process by providing misleading information. As shown in Algorithm 3, lines 1, 4, and 5 encompass the standard MBO process of encoding training vectors into latent space and training a surrogate model to predict fitness scores $Y$ based on these embeddings. Meanwhile, lines 2 and 3 are two additional steps introduced by GROOT before surrogate training. GROOT is agnostic to domains, as we can use any encoding method to compute latent embeddings of target domains. Moreover, GROOT can be used in tandem with any surrogate model, as it only accesses the data representations. These two properties make GROOT a versatile approach for optimization tasks in a wide range of domains.

Algorithm 3 GROOT
Input: training dataset $\mathcal{D}$, pre-trained encoder $\phi$, number of graph nodes $N$, constant $\beta$, label propagation layers $m$
1: $\mathcal{V}\leftarrow\{x_{i}=\phi(s_{i}),s_{i}\in\mathcal{D}\}_{i=1}^{n}$    ▷ Compute latent embeddings.
2: $\mathcal{V}^{\prime},\mathcal{E}\leftarrow\texttt{CreateGraph}(\mathcal{V},N,\beta)$    ▷ Algorithm 1
3: $\hat{Y}\leftarrow\texttt{Smooth}\big((\mathcal{V}^{\prime},\mathcal{E}),m\big)$    ▷ Algorithm 2
4: $\Phi\leftarrow\arg\min_{\Phi}\mathbb{E}_{(x,y)\sim(\mathcal{V}^{\prime},\hat{Y})}\big[(y-f_{\Phi}(x))^{2}\big]$    ▷ Train surrogate model.
5: $X^{*},Y^{*}\leftarrow\texttt{MBO}(\Phi,\mathcal{V}^{\prime})$    ▷ Model-based optimization.
6: return $X^{*},Y^{*}$
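The control flow of Algorithm 3 can be summarized in a short Python sketch that takes every component as a caller-supplied function. All names and signatures here are hypothetical, chosen only to show how the steps connect; in particular, the sketch assumes that `create_graph` returns the training nodes first, followed by the synthetic ones, and it does not prescribe any particular surrogate or optimizer.

```python
def groot(dataset, encode, create_graph, smooth, fit_surrogate, optimize):
    """Wire together the steps of the pipeline using caller-supplied components.

    dataset is a list of (sequence, fitness) pairs; every other argument is a
    callable standing in for the corresponding component.
    """
    sequences, fitness = zip(*dataset)

    # Encode labeled sequences into latent vectors.
    embeddings = [encode(s) for s in sequences]

    # Add synthetic nodes and build the kNN graph.
    nodes, edges, is_synthetic = create_graph(embeddings)

    # Synthetic nodes receive placeholder labels that smoothing overwrites.
    labels = list(fitness) + [0.0] * (len(nodes) - len(fitness))
    smoothed = smooth(nodes, edges, labels, is_synthetic)

    # Fit the surrogate on smoothed labels, then search the latent space.
    surrogate = fit_surrogate(nodes, smoothed)
    return optimize(surrogate, nodes)
```

Passing the components as arguments mirrors the claim that the smoothing steps are independent of the encoder and the surrogate: swapping either one does not change the graph construction or label propagation code.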

4. Experiments

We demonstrate the benefits of our proposed GROOT on two protein tasks with varying levels of label scarcity. Furthermore, since obtaining the ground-truth fitness of generated sequences is costly, we further evaluate our method on Design-Bench (Trabucco et al., 2022). These benchmarks have exact oracles, providing a better validation of our method’s performance on real-world datasets.

Design-Bench

Due to space constraints, we leave the presentation of tasks, baselines, evaluation metrics, and implementation details in Appendix F. Table 1 shows the performance of various methods across three Design-Bench tasks and their mean scores. While most methods perform well, GROOT achieves the highest average score in maximum performance and is slightly behind the SOTA baseline in median and mean benchmarks. Since these tasks are evaluated by an exact oracle and our performance is competitive with other SOTA methods, this confirms that our approach is effective on real-world data, enhancing the reliability of our work.

Table 1. Comparison of GROOT and the baselines on 3 Design-Bench tasks (score, higher is better). Bold results indicate the best value, and underlined results indicate the second-best value. The standard deviation of 3 runs with different random seeds is indicated in parentheses.

| Metric | Method | D’Kitty | Ant | TF Bind 8 | Mean |
|---|---|---|---|---|---|
| Median | MINs | 0.86 (0.01) | 0.49 (0.15) | 0.42 (0.02) | 0.59 |
| Median | BONET | 0.85 (0.01) | 0.60 (0.12) | 0.44 (0.00) | 0.63 |
| Median | BDI | 0.59 (0.02) | 0.40 (0.02) | 0.54 (0.03) | 0.51 |
| Median | ExPT | 0.90 (0.01) | 0.71 (0.02) | 0.47 (0.01) | 0.69 |
| Median | GROOT | 0.82 (0.01) | 0.62 (0.01) | 0.52 (0.01) | 0.65 |
| Max | MINs | 0.93 (0.01) | 0.89 (0.01) | 0.81 (0.03) | 0.88 |
| Max | BONET | 0.91 (0.01) | 0.84 (0.04) | 0.69 (0.15) | 0.81 |
| Max | BDI | 0.92 (0.01) | 0.81 (0.09) | 0.91 (0.07) | 0.88 |
| Max | ExPT | 0.97 (0.01) | 0.97 (0.00) | 0.93 (0.04) | 0.96 |
| Max | GROOT | 0.98 (0.01) | 0.97 (0.00) | 0.98 (0.02) | 0.98 |
| Mean | MINs | 0.62 (0.03) | 0.01 (0.01) | 0.42 (0.03) | 0.35 |
| Mean | BONET | 0.84 (0.02) | 0.58 (0.02) | 0.45 (0.01) | 0.62 |
| Mean | BDI | 0.57 (0.03) | 0.39 (0.01) | 0.54 (0.03) | 0.50 |
| Mean | ExPT | 0.87 (0.02) | 0.64 (0.03) | 0.48 (0.01) | 0.66 |
| Mean | GROOT | 0.78 (0.03) | 0.60 (0.01) | 0.55 (0.02) | 0.64 |

4.1. Experiment Setup

Datasets and Oracles

Following Kirjner et al. (2024), we evaluate our method on two proteins, GFP (Sarkisyan et al., 2016) and AAV (Bryant et al., 2021). The length $L$ is 237 for GFP and 28 for the functional segment of AAV. The full GFP dataset $\mathcal{D}^{*}$ comprises 54,025 mutant sequences with associated log-fluorescence intensity, while the AAV dataset contains 44,156 sequences linked to their ability to package a DNA payload. The fitness values in both datasets are min-max normalized for evaluation but remain unchanged during training and inference. To demonstrate the advantages of our smoothing method, we use the harder1, harder2, and harder3 level benchmarks proposed by Kirjner et al. (2024) to sample training datasets $\mathcal{D}$ for each task, simulating scenarios with scarce labeled data (results for other difficulties are presented in Appendix C). Table 2 provides statistics for each benchmark level. All methods, including baselines, start optimization using the entire training set $\mathcal{D}$ and are evaluated on the best 128 generated sequences, with approximated fitness predicted by oracles provided by Kirjner et al. (2024). It is important to note that the oracles do not participate in the optimization process and are only used as in-silico evaluators to validate the final proposed designs of each method.

Table 2. Statistics of benchmarks.

| Task | Difficulty | Fitness Range (%) | Mutational Gap | Best Fitness | $\lvert\mathcal{D}\rvert$ |
|---|---|---|---|---|---|
| AAV | Harder1 | < 30th | 13 | 0.33 | 1157 |
| AAV | Harder2 | < 20th | 13 | 0.29 | 920 |
| AAV | Harder3 | < 10th | 13 | 0.24 | 476 |
| GFP | Harder1 | < 30th | 8 | 0.10 | 1129 |
| GFP | Harder2 | < 20th | 8 | 0.01 | 792 |
| GFP | Harder3 | < 10th | 8 | 0.01 | 397 |
Table 3. AAV and GFP optimization results for GROOT and baseline methods. The standard deviation of 5 runs with different random seeds is indicated in parentheses.
AAV harder1 task AAV harder2 task AAV harder3 task
Method Fitness \uparrow Diversity Novelty Fitness \uparrow Diversity Novelty Fitness \uparrow Diversity Novelty
AdaLead 0.38 (0.0) 5.5 (0.5) 7.0 (0.7) 0.43 (0.0) 4.2 (0.7) 7.8 (0.8) 0.37 (0.0) 6.22 (0.9) 8.0 (1.2)
CbAS 0.02 (0.0) 22.9 (0.1) 18.5 (0.5) 0.01 (0.0) 23.2 (0.1) 19.3 (0.4) 0.01 (0.0) 23.2 (0.1) 19.3 (0.4)
BO 0.00 (0.0) 20.4 (0.3) 21.8 (0.4) 0.01 (0.0) 20.4 (0.0) 22.0 (0.0) 0.01 (0.0) 20.6 (0.3) 22.0 (0.0)
GFN-AL 0.00 (0.0) 15.4 (6.2) 21.6 (0.5) 0.00 (0.0) 8.1 (3.5) 21.6 (1.0) 0.00 (0.0) 7.6 (0.8) 22.6 (1.4)
PEX 0.23 (0.0) 6.4 (0.5) 3.8 (0.7) 0.30 (0.0) 7.8 (0.4) 5.0 (0.0) 0.26 (0.0) 7.3 (0.7) 4.4 (0.5)
GGS 0.30 (0.0) 13.6 (0.2) 14.5 (0.3) 0.27 (0.0) 16.0 (0.0) 19.4 (0.0) 0.38 (0.0) 7.0 (0.1) 9.6 (0.1)
ReLSO 0.15 (0.0) 20.9 (0.0) 13.0 (0.0) 0.17 (0.0) 20.3 (0.0) 13.0 (0.0) 0.22 (0.0) 17.8 (0.0) 11.0 (0.0)
S-ReLSO 0.24 (0.0) 11.5 (0.0) 13.0 (0.0) 0.28 (0.0) 16.4 (0.0) 6.5 (0.0) 0.27 (0.0) 17.7 (0.0) 11.0 (0.0)
GROOT 0.46 (0.1) 9.8 (1.6) 12.2 (0.5) 0.45 (0.0) 9.9 (0.8) 13.0 (0.0) 0.42 (0.1) 11.0 (2.0) 13.0 (0.0)
GFP harder1 task GFP harder2 task GFP harder3 task
Method Fitness ↑ Diversity Novelty Fitness ↑ Diversity Novelty Fitness ↑ Diversity Novelty
AdaLead 0.39 (0.0) 8.4 (3.2) 9.0 (1.2) 0.4 (0.0) 7.3 (2.8) 9.8 (0.4) 0.42 (0.0) 6.4 (2.3) 9.0 (1.2)
CbAS -0.08 (0.0) 172.2 (35.7) 201.5 (1.5) -0.09 (0.0) 158.4 (34.8) 202.0 (0.7) -0.08 (0.0) 186.4 (33.4) 201.5 (0.9)
BO -0.08 (0.1) 58.9 (1.9) 192.3 (11.3) -0.04 (0.1) 57.1 (1.7) 192.3 (11.3) -0.07 (0.1) 57.8 (2.2) 177.9 (41.2)
GFN-AL 0.21 (0.1) 74.3 (55.3) 219.2 (3.3) 0.14 (0.2) 27.0 (9.5) 223.5 (2.4) 0.21 (0.0) 37.5 (21.7) 219.8 (4.3)
PEX 0.13 (0.0) 12.6 (1.2) 7.1 (1.1) 0.17 (0.0) 12.6 (1.2) 7.1 (1.1) 0.19 (0.0) 12.2 (1.1) 7.8 (1.7)
GGS 0.67 (0.0) 4.7 (0.2) 9.1 (0.1) 0.60 (0.0) 5.4 (0.2) 9.8 (0.1) 0.00 (0.0) 15.7 (0.4) 19.0 (2.2)
ReLSO 0.94 (0.0) 0.0 (0.0) 8.0 (0.0) 0.94 (0.0) 0.0 (0.0) 8.0 (0.0) 0.94 (0.0) 0.0 (0.0) 8.0 (0.0)
S-ReLSO 0.94 (0.0) 0.0 (0.0) 8.0 (0.0) 0.94 (0.0) 0.0 (0.0) 8.0 (0.0) 0.94 (0.0) 0.0 (0.0) 8.0 (0.0)
GROOT 0.88 (0.0) 3.0 (0.2) 7.0 (0.0) 0.87 (0.0) 3.0 (0.1) 7.5 (0.5) 0.62 (0.2) 7.6 (1.5) 8.6 (1.5)
† indicates that the generated population has collapsed (i.e., producing only a single sequence).
Baselines

We evaluate our method against several representative baselines using the open-source toolkit FLEXS (Sinai et al., 2020): AdaLead (Sinai et al., 2020), CbAS (Brookes et al., 2019), and Bayesian Optimization (BO) (Wilson et al., 2017). In addition to these algorithms, we benchmark against several recent methods: GFN-AL (Jain et al., 2022), ReLSO (Castro et al., 2022), PEX (Ren et al., 2022), and GGS (Kirjner et al., 2024), using and modifying the code provided by their respective authors. Furthermore, we assess ReLSO enhanced with our smoothing strategy, termed S-ReLSO.

Evaluation Metrics

We use three metrics proposed by Jain et al. (2022): fitness, diversity, and novelty. Let the optimized sequences be $\mathcal{G}^{*}=\{g_{1}^{*},\ldots,g_{K}^{*}\}$. Fitness is the median of the evaluated fitness of the $K=128$ proposed designs. Diversity is the median of the distances between every pair of sequences in $\mathcal{G}^{*}$. Novelty is the median of the distances from each generated sequence to the training set. Mathematical definitions of these metrics are given in Appendix B.
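
The three metrics can be sketched as follows. This is a minimal illustration rather than the evaluation code: it assumes the sequence distance is the Levenshtein (edit) distance and that the distance from a sequence to the training set is the minimum over training sequences. The exact definitions are those of Appendix B, and all function names here are hypothetical.

```python
from itertools import combinations
from statistics import median

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def fitness_metric(scores):
    """Median oracle fitness of the proposed designs."""
    return median(scores)

def diversity_metric(generated):
    """Median pairwise distance within the generated set."""
    return median(levenshtein(a, b) for a, b in combinations(generated, 2))

def novelty_metric(generated, training):
    """Median over designs of the distance to the closest training sequence (assumed)."""
    return median(min(levenshtein(g, t) for t in training) for g in generated)

# Toy example with made-up sequences and scores.
generated = ["ACDE", "ACDF", "AGDF"]
training = ["ACDE", "TCDE"]
scores = [0.2, 0.5, 0.4]

fit = fitness_metric(scores)
div = diversity_metric(generated)
nov = novelty_metric(generated, training)
```

A collapsed population (all designs identical) yields a diversity of zero under this definition, which is how the zero-diversity entries in Table 3 should be read.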

Implementation Details

For each task, we use the full dataset $\mathcal{D}^{*}$ to train the VAE, whose architecture is described in Section 3.2. We use the pretrained ESM-2 checkpoint esm2_t30_150M_UR50D (https://huggingface.co/facebook/esm2_t30_150M_UR50D), fine-tuning only the last layer while freezing the rest. The latent dimension is set to $d=320$. For the latent graph-based smoothing, we set the number of nodes to $N=20{,}000$, the label propagation coefficient to $\alpha=0.2$, the controller factor to $\gamma=1.0$, the number of propagation layers to $N_{\text{layers}}=1$, and the number of neighbors to $k=8$. It is important to note that these hyperparameters are not finely tuned. The hyperparameter tuning process is described in Appendix E. After smoothing, the features and refined labels are used to train a smoothed surrogate model, specifically a shallow two-layer multilayer perceptron (MLP) with a dropout rate of $0.2$ in the first layer. Despite its simplicity, we consider the MLP a suitable choice due to its proven effectiveness in previous studies (Huang et al., 2021). For model-based optimization, we select two gradient-based algorithms that can exploit the smoothness of the surrogate model: Gradient Ascent (GA) with a learning rate of $0.005$ for $400$ iterations and Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) (Liu and Nocedal, 1989) for a maximum of $6$ iterations. Our primary focus is not on comparing optimization methods but on demonstrating the effectiveness of the smoothing strategy. We empirically evaluate these two algorithms and report the best results.
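
To make the roles of $N$, $\alpha$, $N_{\text{layers}}$, and $k$ more concrete, the following NumPy sketch samples synthetic latent nodes, builds a $k$-nearest-neighbor graph, and runs label propagation. It is an illustration under several assumptions, not the released implementation: new nodes are drawn as $\beta\overline{x}+(1-\beta)\epsilon$ (the form used in Appendix A), their initial pseudo-labels are simply copied from the anchor point $\overline{x}$, the propagation step follows the update of Zhou et al. (2003), and the controller factor $\gamma$ is omitted. All function names are hypothetical.

```python
import numpy as np

def sample_nodes(X, y, num_new, beta, rng):
    """Draw z = beta * x_bar + (1 - beta) * eps around random training latents."""
    idx = rng.integers(0, len(X), size=num_new)
    eps = rng.standard_normal((num_new, X.shape[1]))
    Z = beta * X[idx] + (1.0 - beta) * eps
    # Assumption: the initial pseudo-label is copied from the anchor point.
    return Z, y[idx].copy()

def knn_adjacency(Z, k):
    """Symmetric 0/1 adjacency matrix of the k-nearest-neighbor graph."""
    sq = np.sum(Z ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    np.fill_diagonal(dist, np.inf)  # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((len(Z), len(Z)))
    rows = np.repeat(np.arange(len(Z)), k)
    W[rows, nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)

def propagate(W, y0, alpha, num_layers):
    """Label propagation: y <- alpha * S y + (1 - alpha) * y0, S = D^-1/2 W D^-1/2."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    y = y0.copy()
    for _ in range(num_layers):
        y = alpha * (S @ y) + (1.0 - alpha) * y0
    return y

# Toy setting: 50 labeled latents in 16 dimensions, 150 synthetic nodes.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 16))
y = rng.random(50)
Z, y_new = sample_nodes(X, y, num_new=150, beta=0.5, rng=rng)

nodes = np.vstack([X, Z])
labels0 = np.concatenate([y, y_new])
W = knn_adjacency(nodes, k=8)
labels = propagate(W, labels0, alpha=0.2, num_layers=1)
```

In this sketch, `nodes` and `labels` play the role of the features and refined labels on which the surrogate MLP would then be trained.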

4.2. Numerical Results

We report the mean and standard deviation of the evaluation metrics over five runs with different random seeds in Table 3. First, compared to the best fitness of the training data (Table 2), our method successfully optimizes all tasks. For AAV tasks, our method achieves state-of-the-art results across all difficulty levels. Notably, although our method only slightly exceeds AdaLead in the harder2 task, GROOT generates sequences whose novelty falls within the mutational gap, indicating appropriate extrapolation and a higher likelihood of producing synthesizable proteins. Additionally, applying our smoothing strategy to ReLSO improves its fitness by $60\%$ in the harder1, $64.7\%$ in the harder2, and $22.7\%$ in the harder3 tasks. For GFP tasks, GROOT outperforms all baselines that produce diverse populations across all difficulties, especially in the harder3 difficulty, where GGS fails to optimize properly. ReLSO and S-ReLSO report higher fitness on GFP but fail to generate diverse sequences, as indicated by zero diversity across five different runs. Overall, GROOT achieves the highest fitness among non-collapsed methods while maintaining respectable diversity and novelty.
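
The relative improvements quoted for S-ReLSO over ReLSO follow directly from the AAV fitness values in Table 3 (harder1, harder2, and harder3, respectively):

\[
\frac{0.24-0.15}{0.15}=60\%,\qquad
\frac{0.28-0.17}{0.17}\approx 64.7\%,\qquad
\frac{0.27-0.22}{0.22}\approx 22.7\%.
\]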

4.3. Analysis

Impact of smoothing on extrapolation and general performance

For each benchmark, we assess the impact of smoothing on extrapolation by measuring the Mean Absolute Error (MAE) of the surrogate model on the benchmark's training and holdout datasets relative to the experimental ground truth. Table 4 shows the benefits of smoothing for extrapolation to held-out experimental data: smoothing reduces the MAE on both the training and holdout sets. This reduction occurs because our smoothing strategy acts as a data augmentation technique in the latent space, enhancing the robustness of the supervised model. We also observe that the MAE of the smoothed model on both sets increases gradually as the task becomes more difficult. We hypothesize that this is due to the smaller training set in more difficult tasks, which requires a greater number of new samples to construct the graph. While the labels of these new samples are not exact, they help to smooth the latent landscape, resulting in a smoother but not perfectly accurate surrogate model, which in turn increases the MAE.

Additionally, Table 5 shows that a smoothed model substantially outperforms its unsmoothed counterpart across all tasks. For AAV, fitness improves more than threefold compared to the unsmoothed surrogate model (e.g., from 0.12 to 0.46 on harder1). For GFP, without smoothing, the optimization algorithm struggles to find a maximum, resulting in negative fitness; with smoothing, it optimizes properly. These findings indicate that our smoothing strategy enables gradient-based optimization methods to better navigate the search space and identify superior maxima. However, we also observe that smoothing decreases diversity. This can be explained by the optimization process converging, which concentrates the population in certain high-fitness regions. In contrast, without proper optimization, the population diverges and spreads throughout the search space.
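
The gradient-ascent step that benefits from a smooth surrogate can be sketched in a few lines. The example below is a toy stand-in, not the trained model: it uses a randomly initialized two-layer MLP with a tanh activation (the activation is an assumption), omits dropout, and works in 8 latent dimensions rather than 320. Only the learning rate of 0.005 and the 400 iterations are taken from the Implementation Details; the function names are hypothetical.

```python
import numpy as np

def surrogate(z, W1, b1, w2, b2):
    """Toy two-layer MLP: f(z) = w2 . tanh(W1 z + b1) + b2."""
    return float(w2 @ np.tanh(W1 @ z + b1) + b2)

def surrogate_grad(z, W1, b1, w2):
    """Analytic gradient of the MLP output with respect to the latent z."""
    h = np.tanh(W1 @ z + b1)
    return W1.T @ (w2 * (1.0 - h ** 2))

def gradient_ascent(z0, params, lr=0.005, steps=400):
    """Move the latent point uphill on the surrogate landscape."""
    W1, b1, w2, _ = params
    z = z0.copy()
    for _ in range(steps):
        z = z + lr * surrogate_grad(z, W1, b1, w2)
    return z

# Randomly initialized weights stand in for a trained surrogate.
rng = np.random.default_rng(0)
d, hidden = 8, 16
params = (
    rng.standard_normal((hidden, d)) / np.sqrt(d),
    rng.standard_normal(hidden),
    rng.standard_normal(hidden) / np.sqrt(hidden),
    0.0,
)

z0 = rng.standard_normal(d)
z_opt = gradient_ascent(z0, params)
before, after = surrogate(z0, *params), surrogate(z_opt, *params)
```

In the full pipeline, the optimized latent `z_opt` would be decoded back to a sequence by the VAE decoder; that step is not shown here.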

Table 4. Smoothing improves extrapolation on every task. The mean of 5 different runs is reported.
Task  Difficulty  Smoothed  Train MAE ↓  Holdout MAE ↓
AAV   Harder1     No        4.94         8.93
AAV   Harder1     Yes       1.02         5.78
AAV   Harder2     No        4.67         8.91
AAV   Harder2     Yes       1.10         6.11
AAV   Harder3     No        4.13         8.87
AAV   Harder3     Yes       1.44         7.30
GFP   Harder1     No        1.39         2.81
GFP   Harder1     Yes       0.22         1.71
GFP   Harder2     No        1.34         2.81
GFP   Harder2     Yes       0.31         1.85
GFP   Harder3     No        1.33         2.80
GFP   Harder3     Yes       0.51         2.09
Table 5. Smoothing improves performance on every task. The mean of 5 different runs is reported.
Task  Difficulty  Smoothed  Fitness ↑  Diversity  Novelty
AAV   Harder1     No        0.12       20.0       10.0
AAV   Harder1     Yes       0.46       9.8        12.2
AAV   Harder2     No        0.11       20.0       9.6
AAV   Harder2     Yes       0.45       9.9        13.0
AAV   Harder3     No        0.12       20.1       10.0
AAV   Harder3     Yes       0.42       11.0       13.0
GFP   Harder1     No        -0.12      71.0       42.2
GFP   Harder1     Yes       0.88       3.0        7.0
GFP   Harder2     No        -0.18      69.5       41.1
GFP   Harder2     Yes       0.87       3.0        7.5
GFP   Harder3     No        -0.17      64.0       37.0
GFP   Harder3     Yes       0.62       7.6        8.6
Figure 2. Distance from generated nodes outside the convex hull to the set of training nodes. The mean and standard deviation over 5 different runs are reported. We set $\beta=0.5$ to maintain a constant upper bound on the distance.

Empirical analysis of our propositions

In this section, we directly validate Propositions 3.3 and 3.4 through experiments. First, we train multiple VAEs with the same architecture described in Section 3.2, modifying only the latent dimension $d$. For each VAE version, we construct a graph and apply the smoothing technique described in Section 3.3 to generate new nodes. Second, to determine whether the generated nodes lie within the convex hull, we check whether each point can be expressed as a convex combination of the training points. This is formulated as a linear programming problem and can be solved using existing libraries such as SciPy's linprog (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linprog.html). After identifying the nodes lying outside the convex hull, we calculate the Euclidean distance $D(z,\mathbb{X})$ between these outside nodes and the set of training nodes.
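
The convex-hull membership test can be written as a feasibility problem: a point $z$ lies in the hull of the training points if there exist weights $\lambda_i \geq 0$ with $\sum_i \lambda_i = 1$ and $\sum_i \lambda_i x_i = z$. The sketch below is one way to pose this with `scipy.optimize.linprog`. The helper names are hypothetical, and taking $D(z,\mathbb{X})$ to be the distance to the nearest training point is an assumption about the definition.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(z, X):
    """Return True if z is a convex combination of the rows of X."""
    n = X.shape[0]
    # Equality constraints: X^T lambda = z and sum(lambda) = 1.
    A_eq = np.vstack([X.T, np.ones((1, n))])
    b_eq = np.concatenate([z, [1.0]])
    # Zero objective: only feasibility matters; lambda >= 0 via bounds.
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.status == 0  # 0 means a feasible optimum was found

def distance_to_set(z, X):
    """Euclidean distance from z to its nearest point in X (assumed definition)."""
    return float(np.min(np.linalg.norm(X - z, axis=1)))

# Toy check on the unit square in two dimensions.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
inside = in_convex_hull(np.array([0.5, 0.5]), X)
outside = in_convex_hull(np.array([2.0, 2.0]), X)
dist = distance_to_set(np.array([2.0, 2.0]), X)
```

Each query solves one linear program with as many variables as there are training points, so the check is inexpensive at the dataset sizes in Table 2.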

We validate our propositions on the GFP harder3 task, setting $\beta=0.5$ consistently across all runs to maintain a constant upper bound. Our experiments reveal that, regardless of the latent dimension, 100% of the synthetic nodes lie outside the convex hull. Figure 2 reports the distances between the outside nodes and the set of training nodes. Although the distance increases with the latent dimension, none of the generated nodes in any VAE version exceeds the proven upper bound. This demonstrates that our smoothing technique can expand into extrapolation regions while remaining within a suitable range, avoiding points so far from the training set that they could introduce errors in the surrogate model's predictions.

Exploring the limits of the smoothing strategy

In this section, we investigate the smoothing capabilities of our method by testing its limits. We use the GFP harder3 and AAV harder3 tasks as test cases, further subsampling each dataset with ratios $r\in\{0.05,0.1,0.2,0.5,0.7,1.0\}$ in two different ways: random and lowest. In the random setting, data points are randomly subsampled with the given ratio $r$, while in the lowest setting, we use the fraction $r$ of the dataset with the lowest fitness scores.
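
The two subsampling settings can be sketched as follows. The function name and the rounding of the subset size are assumptions made for illustration.

```python
import numpy as np

def subsample_indices(fitness, ratio, mode, rng=None):
    """Return sorted indices of a subset containing a fraction `ratio` of the data."""
    fitness = np.asarray(fitness, dtype=float)
    n_keep = max(1, int(round(ratio * len(fitness))))
    if mode == "random":
        if rng is None:
            rng = np.random.default_rng()
        return np.sort(rng.choice(len(fitness), size=n_keep, replace=False))
    if mode == "lowest":
        # Keep the n_keep points with the smallest fitness values.
        return np.sort(np.argsort(fitness)[:n_keep])
    raise ValueError("mode must be 'random' or 'lowest'")

# Toy fitness values (made up) for ten training points.
fitness = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]

lowest_idx = subsample_indices(fitness, 0.2, "lowest")
random_idx = subsample_indices(fitness, 0.5, "random", np.random.default_rng(0))
```

The lowest setting is the harsher of the two, since the retained points all come from the bottom of an already low-fitness training set.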

Using the same hyperparameters, Figure 3 shows the median performance of GROOT on AAV and GFP with respect to the ratio $r$ in both settings. In the random setting, our method outperforms the best data point in the entire harder3 dataset using only $20\%$ of the labeled data, which amounts to fewer than $100$ samples per set. In the lowest setting, the method requires $50\%$ of the labeled data, approximately $200$ samples per set, to achieve similar performance. This demonstrates that our method is effective even under extreme conditions with limited labeled data.

Figure 3. The performance of GROOT on AAV and GFP harder3 tasks when we vary the labeled data ratio $r$. The mean and standard deviation over 5 different runs are reported.

5. Conclusion

This paper presents a novel method, GROOT, to address the scarcity of labeled data in practical protein sequence design. GROOT embeds protein sequences into a continuous latent space. In this space, we generate additional synthetic latent points by interpolating between existing data points and Gaussian noise, and we estimate their fitness scores using label propagation. We provide theoretical underpinnings based on convex hulls and extrapolation to support GROOT's efficacy. Experimental results confirm the method's effectiveness in protein design when labeled data is extremely limited.

Limitations

Our proposed method, GROOT, has some limitations regarding its hyperparameters. Their values are sensitive to landscape characteristics, and one should search for optimal values when optimizing new landscapes. We have detailed the hyperparameter tuning process in Appendix E and found that, despite some differences in sensitivity between datasets, the optimal hyperparameters remain relatively stable across difficulty levels within the same landscape, thereby reducing the burden of searching for optimal settings. Additionally, we theoretically characterize the relationship between GROOT's hyperparameters and the sampled nodes, which provides control over the entire framework. A promising extension of GROOT is to find better ways to determine the optimal value of $\beta$, for example by analyzing the effect of random mutations on each family of protein sequences.

References

  • Ahn et al. (2020) Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. 2020. ROBEL: Robotics Benchmarks for Learning with Low-Cost Robots. In Proceedings of the Conference on Robot Learning (Proceedings of Machine Learning Research, Vol. 100), Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura (Eds.). PMLR, 1300–1313. https://proceedings.mlr.press/v100/ahn20a.html
  • Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2623–2631. https://doi.org/10.1145/3292500.3330701
  • Arnold (1996) Frances H. Arnold. 1996. Directed evolution: Creating biocatalysts for the future. Chemical Engineering Science 51, 23 (1996), 5091–5102. https://doi.org/10.1016/S0009-2509(96)00288-6
  • Arnold (2018) Frances H. Arnold. 2018. Directed Evolution: Bringing New Chemistry to Life. Angewandte Chemie International Edition 57, 16 (2018), 4143–4148. https://doi.org/10.1002/anie.201708408
  • Balestriero et al. (2021) Randall Balestriero, Jerome Pesenti, and Yann LeCun. 2021. Learning in high dimension always amounts to extrapolation. arXiv preprint arXiv:2110.09485 (2021).
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540 [cs.LG] https://arxiv.org/abs/1606.01540
  • Brookes et al. (2019) David Brookes, Hahnbeom Park, and Jennifer Listgarten. 2019. Conditioning by adaptive sampling for robust design. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 773–782. https://proceedings.mlr.press/v97/brookes19a.html
  • Brookes et al. (2022a) David H. Brookes, Amirali Aghazadeh, and Jennifer Listgarten. 2022a. On the sparsity of fitness functions and implications for learning. Proceedings of the National Academy of Sciences 119, 1 (2022), e2109649118. https://doi.org/10.1073/pnas.2109649118
  • Brookes et al. (2022b) David H Brookes, Amirali Aghazadeh, and Jennifer Listgarten. 2022b. On the sparsity of fitness functions and implications for learning. Proceedings of the National Academy of Sciences 119, 1 (2022), e2109649118.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  • Bryant et al. (2021) Drew H. Bryant, Ali Bashir, Sam Sinai, Nina K. Jain, Pierce J. Ogden, Patrick F. Riley, George M. Church, Lucy J. Colwell, and Eric D. Kelsic. 2021. Deep diversification of an AAV capsid protein by machine learning. Nature Biotechnology 39, 6 (01 Jun 2021), 691–696. https://doi.org/10.1038/s41587-020-00793-4
  • Castro et al. (2022) Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin Givechian, Dhananjay Bhaskar, and Smita Krishnaswamy. 2022. Transformer-based protein generation with regularized latent space optimization. Nature Machine Intelligence 4, 10 (01 Oct 2022), 840–851. https://doi.org/10.1038/s42256-022-00532-1
  • Chandrasekaran et al. (2012) Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. 2012. The convex geometry of linear inverse problems. Foundations of Computational mathematics 12, 6 (2012), 805–849.
  • Chen et al. (2022) Can Chen, Yingxue Zhang, Jie Fu, Xue (Steve) Liu, and Mark Coates. 2022. Bidirectional Learning for Offline Infinite-width Model-based Optimization. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 29454–29467. https://proceedings.neurips.cc/paper_files/paper/2022/file/bd391cf5bdc4b63674d6da3edc1bde0d-Paper-Conference.pdf
  • Chen et al. (2023) Tianlai Chen, Pranay Vure, Rishab Pulugurta, and Pranam Chatterjee. 2023. AMP-Diffusion: Integrating Latent Diffusion with Protein Language Models for Antimicrobial Peptide Generation. In NeurIPS 2023 Generative AI and Biology (GenBio) Workshop. https://openreview.net/forum?id=145TM9VQhx
  • Emami et al. (2023) Patrick Emami, Aidan Perreault, Jeffrey Law, David Biagioni, and Peter St. John. 2023. Plug and play directed evolution of proteins with gradient-based discrete MCMC. Machine Learning: Science and Technology 4, 2 (April 2023), 025014. https://doi.org/10.1088/2632-2153/accacd
  • Fowler and Fields (2014) Douglas M. Fowler and Stanley Fields. 2014. Deep mutational scanning: a new style of protein science. Nature Methods 11, 8 (01 Aug 2014), 801–807. https://doi.org/10.1038/nmeth.3027
  • Frey et al. (2024) Nathan C. Frey, Dan Berenberg, Karina Zadorozhny, Joseph Kleinhenz, Julien Lafrance-Vanasse, Isidro Hotzel, Yan Wu, Stephen Ra, Richard Bonneau, Kyunghyun Cho, Andreas Loukas, Vladimir Gligorijevic, and Saeed Saremi. 2024. Protein Discovery with Discrete Walk-Jump Sampling. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=zMPHKOmQNb
  • Gómez-Bombarelli et al. (2018) Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. 2018. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 4, 2 (2018), 268–276. https://doi.org/10.1021/acscentsci.7b00572 PMID: 29532027.
  • Huang et al. (2016) Po-Ssu Huang, Scott E. Boyken, and David Baker. 2016. The coming of age of de novo protein design. Nature 537, 7620 (01 Sep 2016), 320–327. https://doi.org/10.1038/nature19946
  • Huang et al. (2021) Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin Benson. 2021. Combining Label Propagation and Simple Models out-performs Graph Neural Networks. In International Conference on Learning Representations. https://openreview.net/forum?id=8E1-f3VhX1o
  • Jain et al. (2022) Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, and Yoshua Bengio. 2022. Biological Sequence Design with GFlowNets. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 9786–9801. https://proceedings.mlr.press/v162/jain22a.html
  • Kauffman and Weinberger (1989) Stuart A. Kauffman and Edward D. Weinberger. 1989. The NK model of rugged fitness landscapes and its application to maturation of the immune response. Journal of Theoretical Biology 141, 2 (Nov. 1989), 211–245. https://doi.org/10.1016/s0022-5193(89)80019-0
  • Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. arXiv:2304.02643 (2023).
  • Kirjner et al. (2024) Andrew Kirjner, Jason Yim, Raman Samusevich, Shahar Bracha, Tommi S. Jaakkola, Regina Barzilay, and Ila R Fiete. 2024. Improving protein optimization with smoothed fitness landscapes. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=rxlF2Zv8x0
  • Kumar and Levine (2020) Aviral Kumar and Sergey Levine. 2020. Model Inversion Networks for Model-Based Optimization. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 5126–5137. https://proceedings.neurips.cc/paper_files/paper/2020/file/373e4c5d8edfa8b74fd4b6791d0cf6dc-Paper.pdf
  • Lee et al. (2023) Minji Lee, Luiz Felipe Vecchietti, Hyunkyu Jung, Hyunjoo Ro, Meeyoung Cha, and Ho Min Kim. 2023. Protein Sequence Design in a Latent Space via Model-based Reinforcement Learning. https://openreview.net/forum?id=OhjGzRE5N6o
  • Lee et al. (2024) Minji Lee, Luiz Felipe Vecchietti, Hyunkyu Jung, Hyun Joo Ro, Meeyoung Cha, and Ho Min Kim. 2024. Robust Optimization in Protein Fitness Landscapes Using Reinforcement Learning in Latent Space. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=0zbxwvJqwf
  • Lin et al. (2023) Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 6637 (2023), 1123–1130. https://doi.org/10.1126/science.ade2574
  • Liu and Nocedal (1989) Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 1 (01 Aug 1989), 503–528. https://doi.org/10.1007/BF01589116
  • Mashkaria et al. (2023) Satvik Mehul Mashkaria, Siddarth Krishnamoorthy, and Aditya Grover. 2023. Generative Pretraining for Black-Box Optimization. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 24173–24197. https://proceedings.mlr.press/v202/mashkaria23a.html
  • Maynard Smith (1970) John Maynard Smith. 1970. Natural Selection and the Concept of a Protein Space. Nature 225, 5232 (Feb. 1970), 563–564. https://doi.org/10.1038/225563a0
  • Nguyen et al. (2023) Tung Nguyen, Sudhanshu Agrawal, and Aditya Grover. 2023. ExPT: Synthetic Pretraining for Few-Shot Experimental Design. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 45856–45869. https://proceedings.neurips.cc/paper_files/paper/2023/file/8fab4407e1fe9006b39180525c0d323c-Paper-Conference.pdf
  • Qiu and Wei (2022) Yuchi Qiu and Guo-Wei Wei. 2022. CLADE 2.0: Evolution-Driven Cluster Learning-Assisted Directed Evolution. Journal of Chemical Information and Modeling 62, 19 (Sept. 2022), 4629–4641. https://doi.org/10.1021/acs.jcim.2c01046
  • Raghavan et al. (2007) Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. 2007. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76 (Sep 2007), 036106. Issue 3. https://doi.org/10.1103/PhysRevE.76.036106
  • Ren et al. (2022) Zhizhou Ren, Jiahan Li, Fan Ding, Yuan Zhou, Jianzhu Ma, and Jian Peng. 2022. Proximal Exploration for Model-guided Protein Sequence Design. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 18520–18536. https://proceedings.mlr.press/v162/ren22a.html
  • Sarkisyan et al. (2016) Karen S. Sarkisyan, Dmitry A. Bolotin, Margarita V. Meer, Dinara R. Usmanova, Alexander S. Mishin, George V. Sharonov, Dmitry N. Ivankov, Nina G. Bozhanova, Mikhail S. Baranov, Onuralp Soylemez, Natalya S. Bogatyreva, Peter K. Vlasov, Evgeny S. Egorov, Maria D. Logacheva, Alexey S. Kondrashov, Dmitry M. Chudakov, Ekaterina V. Putintseva, Ilgar Z. Mamedov, Dan S. Tawfik, Konstantin A. Lukyanov, and Fyodor A. Kondrashov. 2016. Local fitness landscape of the green fluorescent protein. Nature 533, 7603 (01 May 2016), 397–401. https://doi.org/10.1038/nature17995
  • Sinai et al. (2020) Sam Sinai, Richard Wang, Alexander Whatley, Stewart Slocum, Elina Locane, and Eric D. Kelsic. 2020. AdaLead: A simple and robust adaptive greedy search algorithm for sequence design. CoRR abs/2010.02141 (2020). arXiv:2010.02141 https://arxiv.org/abs/2010.02141
  • Stanton et al. (2022) Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, and Andrew Gordon Wilson. 2022. Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 20459–20478. https://proceedings.mlr.press/v162/stanton22a.html
  • Trabucco et al. (2022) Brandon Trabucco, Xinyang Geng, Aviral Kumar, and Sergey Levine. 2022. Design-Bench: Benchmarks for Data-Driven Offline Model-Based Optimization. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 21658–21676. https://proceedings.mlr.press/v162/trabucco22a.html
  • Trabucco et al. (2021) Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. 2021. Conservative Objective Models for Effective Offline Model-Based Optimization. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 10358–10368. https://proceedings.mlr.press/v139/trabucco21a.html
  • Tran and Hy (2024) Thanh V. T. Tran and Truong Son Hy. 2024. Protein Design by Directed Evolution Guided by Large Language Models. bioRxiv (2024). https://doi.org/10.1101/2023.11.28.568945
  • Wang et al. (2021) Yajie Wang, Pu Xue, Mingfeng Cao, Tianhao Yu, Stephan T. Lane, and Huimin Zhao. 2021. Directed Evolution: Methodologies and Applications. Chemical Reviews 121, 20 (2021), 12384–12444. https://doi.org/10.1021/acs.chemrev.1c00260 PMID: 34297541.
  • Wilson et al. (2017) James T. Wilson, Riccardo Moriconi, Frank Hutter, and Marc Peter Deisenroth. 2017. The reparameterization trick for acquisition functions. arXiv:1712.00424 [stat.ML] https://arxiv.org/abs/1712.00424
  • Zhou et al. (2003) Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Schölkopf. 2003. Learning with Local and Global Consistency. In Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf (Eds.), Vol. 16. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2003/file/87682805257e619d49b8e0dfdc14affa-Paper.pdf

Appendix A Proofs

A.1. Proof of Proposition 3.3

Proof.

We compute the probability that $z$ belongs to $\text{Conv}(\mathbb{X})$, $\mathbf{P}(z\in\text{Conv}(\mathbb{X}))$. By definition, $z\in\text{Conv}(\mathbb{X})$ means that there exist weights $\lambda_i\geq 0$ with $\sum_{i=1}^{n}\lambda_i=1$ such that

\begin{align*}
z &= \sum_{i=1}^{n}\lambda_{i}x_{i} \\
\beta\overline{x}+(1-\beta)\epsilon &= \sum_{i=1}^{n}\lambda_{i}x_{i} \\
\epsilon &= -\frac{\beta}{1-\beta}\overline{x}+\sum_{i=1}^{n}\frac{\lambda_{i}}{1-\beta}x_{i}
\end{align*}

Since $\overline{x}\in\mathbb{X}_{d}$, without loss of generality, let $\overline{x}=x_{1}$. The other cases are handled similarly, and taking the mean gives the same result.

\[
\epsilon=\frac{\lambda_{1}-\beta}{1-\beta}x_{1}+\sum_{i=2}^{n}\frac{\lambda_{i}}{1-\beta}x_{i}
\]

This will lead to

\begin{align*}
\|\epsilon\|^{2} &= \frac{\lambda_{1}-\beta}{1-\beta}\langle\epsilon,x_{1}\rangle+\sum_{i=2}^{n}\frac{\lambda_{i}}{1-\beta}\langle\epsilon,x_{i}\rangle \\
&\leq \underbrace{\frac{1+\beta}{1-\beta}}_{C_{\beta}}\max_{i=\overline{1,n}}|\langle\epsilon,x_{i}\rangle|.
\end{align*}
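
The constant $C_{\beta}$ in this bound can be recovered with one intermediate step. Assuming, as for any convex combination, that $\lambda_{i}\geq 0$ and $\sum_{i=1}^{n}\lambda_{i}=1$, and that $0<\beta<1$, the triangle inequality gives:

\begin{align*}
\|\epsilon\|^{2} &\leq \frac{|\lambda_{1}-\beta|+\sum_{i=2}^{n}\lambda_{i}}{1-\beta}\max_{i=\overline{1,n}}|\langle\epsilon,x_{i}\rangle| \\
&\leq \frac{(\lambda_{1}+\beta)+(1-\lambda_{1})}{1-\beta}\max_{i=\overline{1,n}}|\langle\epsilon,x_{i}\rangle|
= \frac{1+\beta}{1-\beta}\max_{i=\overline{1,n}}|\langle\epsilon,x_{i}\rangle|.
\end{align*}

Here $|\lambda_{1}-\beta|\leq\lambda_{1}+\beta$ because both quantities are non-negative, and $\sum_{i=2}^{n}\lambda_{i}=1-\lambda_{1}$.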

Since all $x_{i}$ are independent, for $x\sim\mathcal{N}(0,I_{d})$, we have:

\[
\mathbf{P}\left(\|\epsilon\|^{2}\leq C_{\beta}\max_{i=\overline{1,n}}|\langle\epsilon,x_{i}\rangle|\right)=1-\left(\mathbf{P}(\|\epsilon\|^{2}\geq C_{\beta}|\langle\epsilon,x\rangle|)\right)^{n}.
\]

Note that $\left\langle\epsilon,x\right\rangle=\frac{1}{2}\left(\left\|\frac{\epsilon+x}{\sqrt{2}}\right\|^{2}-\left\|\frac{\epsilon-x}{\sqrt{2}}\right\|^{2}\right)=\frac{1}{2}(A-B)$ where $A,B\sim\chi_{d}^{2}$. Since $d$ is large, we approximate the chi-square distributions by normal distributions:

\[
A,B,\|\epsilon\|^{2}\sim\mathcal{N}(d,2d).
\]

We consider the two random variables

\begin{align*}
S_{1} &= \|\epsilon\|^{2}-\frac{C_{\beta}}{2}A+\frac{C_{\beta}}{2}B, \\
S_{2} &= \|\epsilon\|^{2}+\frac{C_{\beta}}{2}A-\frac{C_{\beta}}{2}B.
\end{align*}

Since A,B,ϵ2A,B,\|\epsilon\|^{2} are all normal, S1S_{1} and S2S_{2} are also normal. Moreover,

𝔼S1=𝔼S2=d,\mathbb{E}S_{1}=\mathbb{E}S_{2}=d,

and the variances are

$$
\begin{aligned}
\operatorname{Var}(S_{1}) &= \frac{C_{\beta}^{2}}{4}\operatorname{Var}(A)+\frac{C_{\beta}^{2}}{4}\operatorname{Var}(B)+\operatorname{Var}(\|\epsilon\|^{2})-\frac{C_{\beta}^{2}}{2}\operatorname{Cov}\left(A,B\right) \\
&\quad -C_{\beta}\operatorname{Cov}\left(\|\epsilon\|^{2},A\right)+C_{\beta}\operatorname{Cov}\left(\|\epsilon\|^{2},B\right) \\
&= d(C_{\beta}^{2}+2)-\frac{C_{\beta}^{2}}{2}\operatorname{Cov}\left(A,B\right) \\
&\quad -C_{\beta}\operatorname{Cov}\left(\|\epsilon\|^{2},A\right)+C_{\beta}\operatorname{Cov}\left(\|\epsilon\|^{2},B\right) \\
\operatorname{Var}(S_{2}) &= d(C_{\beta}^{2}+2)-\frac{C_{\beta}^{2}}{2}\operatorname{Cov}\left(A,B\right) \\
&\quad +C_{\beta}\operatorname{Cov}\left(\|\epsilon\|^{2},A\right)-C_{\beta}\operatorname{Cov}\left(\|\epsilon\|^{2},B\right)
\end{aligned}
$$

By the symmetry of the distribution of $x$, we can see that $\operatorname{Cov}\left(\|\epsilon\|^{2},A\right)=\operatorname{Cov}\left(\|\epsilon\|^{2},B\right)$. Moreover, $\epsilon$ and $x$ are independent random vectors, so $\epsilon+x$ and $\epsilon-x$ are also independent, and hence $\operatorname{Cov}(A,B)=0$. From these deductions, we have shown that

$$
S_{1},S_{2}\sim\mathcal{N}\left(d,d(C_{\beta}^{2}+2)\right).
$$

Since $\|\epsilon\|^{2}-\frac{C_{\beta}}{2}|A-B|\geq 0$ iff $S_{1},S_{2}\geq 0$, we derive that

$$
\begin{aligned}
\mathbf{P}(\|\epsilon\|^{2}\geq C_{\beta}|\langle\epsilon,x\rangle|) &= \mathbf{P}(S_{1}>0)\times\mathbf{P}(S_{2}>0) \\
&= \Phi\left(\sqrt{\frac{d}{C^{2}_{\beta}+2}}\right)^{2} \\
&\approx \left(1-\frac{\exp\left(-d/(2(C^{2}_{\beta}+2))\right)}{\sqrt{2\pi d}(C^{2}_{\beta}+2)^{-1}}\right)^{2}
\end{aligned}
$$

Finally, we get that

$$
\begin{aligned}
&\mathbf{P}\left(\|\epsilon\|^{2}\leq C_{\beta}\max_{i=\overline{1,N}}|\langle\epsilon,x\rangle|\right) \\
&\approx 1-\left(1-\frac{\exp\left(-\frac{d}{2(C^{2}_{\beta}+2)}\right)}{\sqrt{2\pi d}(C^{2}_{\beta}+2)^{-1}}\right)^{2N} \\
&= 1-\exp\left(2N\log\left(1-\frac{\exp\left(-\frac{d}{2(C^{2}_{\beta}+2)}\right)}{\sqrt{2\pi d}(C^{2}_{\beta}+2)^{-1}}\right)\right) \\
&\approx 1-\exp\left(-2N\frac{\exp\left(-\frac{d}{2(C^{2}_{\beta}+2)}\right)}{\sqrt{2\pi d}(C^{2}_{\beta}+2)^{-1}}\right)
\end{aligned}
$$

Thus for all $N\ll\exp\left(\frac{d}{2(C^{2}_{\beta}+2)}\right)$ and $d$ sufficiently large, we have

$$
\mathbf{P}\left(\|\epsilon\|^{2}\leq C_{\beta}\max_{i=\overline{1,N}}|\langle\epsilon,x\rangle|\right)\approx 0
$$

In other words,

$$
\lim_{d\to\infty}\mathbf{P}(z\in\operatorname{Conv}(\mathbb{X}_{d}))=0.
$$
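To get a feel for how quickly this bound vanishes, the final approximation can be evaluated numerically. The sketch below is illustrative only: the function name `hull_probability_bound` and the example values ($N=1000$, $\beta=0.5$) are assumptions, and the tail term is coded exactly as it is written in the proof, so it inherits the same normal and logarithmic approximations (which are only meaningful when the tail term is small).

```python
import math


def hull_probability_bound(d: int, n_points: int, beta: float) -> float:
    """Evaluate 1 - exp(-2 N * tail), the last approximation in the proof."""
    c_beta = (1.0 + beta) / (1.0 - beta)
    s = c_beta ** 2 + 2.0
    # tail = exp(-d / (2 s)) / (sqrt(2 pi d) * s^{-1}), as written in the proof.
    tail = math.exp(-d / (2.0 * s)) * s / math.sqrt(2.0 * math.pi * d)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-2.0 * n_points * tail)


# Hypothetical setting: N = 1000 points, beta = 0.5 (so C_beta = 3).
for d in (32, 128, 512, 2048):
    print(d, hull_probability_bound(d, n_points=1000, beta=0.5))
```

With $N$ fixed, the printed values shrink toward zero as $d$ grows, which matches the regime $N\ll\exp\left(\frac{d}{2(C^{2}_{\beta}+2)}\right)$ used above.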

A.2. Proof of Proposition 3.4

Proof.

Since $D(z,\text{Conv}(\mathbb{X}))=\inf_{x\in\text{Conv}(\mathbb{X})}\|z-x\|$, for any $\overline{x}\in\text{Conv}(\mathbb{X})$, we have:

$$
\mathbb{E}[D(z,\text{Conv}(\mathbb{X}))]\leq\mathbb{E}\|z-\overline{x}\|.
$$

Replacing $z$ by $\beta\overline{x}+(1-\beta)\epsilon$, we obtain:

$$
\begin{aligned}
\mathbb{E}\|z-\overline{x}\| &\leq \mathbb{E}\|\beta\overline{x}+(1-\beta)\epsilon-(1-\beta)\overline{x}-\beta\overline{x}\| \\
&\leq \mathbb{E}\|(1-\beta)(\epsilon-\overline{x})\| \\
&\leq (1-\beta)(\mathbb{E}\|\epsilon\|+\mathbb{E}\|\overline{x}\|)
\end{aligned}
$$

We have that $\epsilon\sim\mathcal{N}(0,I_{d})$ and $\overline{x}\sim\mathcal{N}(0,I_{d})$. Chandrasekaran et al. (2012) showed that the expected norm of such a Gaussian vector is bounded above by $\sqrt{d}$. Thus, we have

$$
\begin{aligned}
\mathbb{E}\|z-\overline{x}\| &\leq (1-\beta)\big(\mathbb{E}\|\epsilon\|+\mathbb{E}\|\overline{x}\|\big) \\
&< (1-\beta)\big(\sqrt{d}+\sqrt{d}\big)=2(1-\beta)\sqrt{d}.
\end{aligned}
$$

Therefore,

$$
\mathbb{E}[D(z,\text{Conv}(\mathbb{X}))]<2(1-\beta)\sqrt{d},
$$

which completes the proof. ∎
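The bound can be sanity-checked with a small Monte Carlo simulation. The sketch below is a hedged illustration, not part of the proof: it samples $\overline{x}$ and $\epsilon$ from $\mathcal{N}(0,I_{d})$ as assumed above, forms $z=\beta\overline{x}+(1-\beta)\epsilon$, and estimates $\mathbb{E}\|z-\overline{x}\|$, which upper-bounds the expected distance to the convex hull. The function name and the values $d=64$, $\beta=0.8$ are assumptions chosen for the example.

```python
import math
import random


def mean_distance_to_anchor(d: int, beta: float, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of E||z - x_bar|| with z = beta * x_bar + (1 - beta) * eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x_bar = [rng.gauss(0.0, 1.0) for _ in range(d)]
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = [beta * a + (1.0 - beta) * e for a, e in zip(x_bar, eps)]
        total += math.dist(z, x_bar)
    return total / trials


d, beta = 64, 0.8  # hypothetical values for illustration
estimate = mean_distance_to_anchor(d, beta)
bound = 2.0 * (1.0 - beta) * math.sqrt(d)
print(f"estimate={estimate:.3f}  bound={bound:.3f}")
```

The estimate should fall below $2(1-\beta)\sqrt{d}$, and both quantities shrink as $\beta$ approaches 1.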

Appendix B Evaluation Metrics

We provide mathematical definitions for each metric. Note that $\mathcal{O}_{\psi}$ is the evaluator provided by Kirjner et al. (2024) to predict approximate fitness, serving as a proxy for experimental validation.

  • (Normalized) Fitness = $\text{median}\left(\left\{\xi(\hat{s}_{i};Y^{*})\right\}_{i=1}^{N_{\text{samples}}}\right)$, where $\xi(\hat{s};Y^{*})=\frac{\mathcal{O}_{\psi}(\hat{s})-\min(Y^{*})}{\max(Y^{*})-\min(Y^{*})}$ is the min-max normalized fitness based on the lowest and highest known fitness in $Y^{*}$.

  • Diversity = $\text{median}\left(\left\{\text{dist}(s,\hat{s}):s,\hat{s}\in\hat{S},s\neq\hat{s}\right\}\right)$ is the median pairwise distance between distinct generated samples.

  • Novelty = $\text{median}\left(\left\{\eta(\hat{s}_{i};S)\right\}_{i=1}^{N_{\text{samples}}}\right)$, where
    $\eta(s;S)=\min\left(\left\{\text{dist}(s,\hat{s}):\hat{s}\in S^{*},\hat{s}\neq s\right\}\right)$ is the minimum distance of sample $s$ to any of the starting sequences $S$.
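The three metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: $\text{dist}$ is taken to be the Levenshtein edit distance (the definitions above do not fix the distance), and the `oracle` argument stands in for $\mathcal{O}_{\psi}$. The toy oracle and sequences are hypothetical.

```python
from itertools import combinations
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic program (assumed choice of dist)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalized_fitness(samples, oracle, y_known):
    """Median of min-max normalized oracle scores."""
    lo, hi = min(y_known), max(y_known)
    return median((oracle(s) - lo) / (hi - lo) for s in samples)


def diversity(samples):
    """Median distance over pairs of distinct generated samples."""
    return median(levenshtein(s, t) for s, t in combinations(samples, 2) if s != t)


def novelty(samples, starting):
    """Median over samples of the distance to the closest starting sequence."""
    return median(min(levenshtein(s, t) for t in starting if t != s) for s in samples)


# Toy data; the oracle below is a placeholder, not the evaluator used in the paper.
generated = ["ACDE", "ACDF", "AGDF"]
starting = ["ACDE", "ACCE"]
toy_oracle = lambda s: float(s.count("D"))

print(normalized_fitness(generated, toy_oracle, y_known=[0.0, 2.0]))
print(diversity(generated))
print(novelty(generated, starting))
```

Because each unordered pair is counted once, the diversity median equals the median over ordered pairs used in the set notation above.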

Algorithm 4 kNN: k-Nearest Neighbors
1: Input: current node $x$, all nodes $\mathcal{V}$
2: $D(x)\leftarrow\bigcup_{x^{\prime}\in\mathcal{V}\setminus\{x\}}\|x^{\prime}-x\|$ $\triangleright$ Euclidean distance.
3: $\mathcal{X}^{\prime}\leftarrow\texttt{TopK}(D(x),\mathcal{V})$ $\triangleright$ Retrieve the $K$ closest nodes to $x$.
4: $\mathcal{E}(x)\leftarrow\bigcup_{x^{\prime}\in\mathcal{X}^{\prime}}(x,x^{\prime})$ $\triangleright$ Construct the neighborhood around $x$.
5: return $\mathcal{E}(x)$
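Algorithm 4 translates almost directly into code. The following is a minimal sketch, assuming nodes are stored as a list of coordinate tuples and edges are returned as index pairs; the helper name `knn_edges` is hypothetical. It uses `heapq.nsmallest`, which selects the $K$ closest candidates without sorting the full distance list.

```python
import heapq
import math


def knn_edges(x_index, nodes, k):
    """Return edges (x_index, j) to the k nearest nodes by Euclidean distance."""
    x = nodes[x_index]
    # Distances from x to every other node (step 2 of Algorithm 4).
    candidates = (
        (math.dist(x, other), j) for j, other in enumerate(nodes) if j != x_index
    )
    # Keep the k closest nodes (step 3), then build the edge set (step 4).
    nearest = heapq.nsmallest(k, candidates)
    return [(x_index, j) for _, j in nearest]


nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(knn_edges(0, nodes, k=2))  # [(0, 1), (0, 2)]
```

Calling this function for every node yields the full edge set of the kNN graph.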

Appendix C Numerical Results For Other Protein Tasks

In this section, we present experimental results for the other difficulty levels of the two protein datasets, GFP and AAV. Table 6 provides statistics for each benchmark level. We use the same settings as described in Section 4.1 for this evaluation. Table 7 reports the results of the baselines and GROOT. Although our method is designed for scenarios with limited labeled data, it achieves state-of-the-art (SOTA) fitness across all difficulty levels in both datasets. Additionally, S-ReLSO, the enhanced version of ReLSO that incorporates our smoothing strategy, outperforms the original version in fitness on all tasks, most notably on the GFP medium task, where an 8-fold improvement is observed (0.08 to 0.64). These results demonstrate the effectiveness of our framework.

Table 6. Statistics of the benchmarks.

| Task | Difficulty | Fitness Range (%) | Mutational Gap | Best Fitness | $\lvert\mathcal{D}\rvert$ |
|---|---|---|---|---|---|
| AAV | Easy | 50-60th | 0 | 0.53 | 5609 |
| AAV | Medium | 20-40th | 6 | 0.38 | 2828 |
| AAV | Hard | <30th | 7 | 0.33 | 2426 |
| GFP | Easy | 50-60th | 0 | 0.79 | 4413 |
| GFP | Medium | 20-40th | 6 | 0.62 | 2139 |
| GFP | Hard | <30th | 7 | 0.10 | 3448 |
Table 7. AAV and GFP optimization results for GROOT and baseline methods. The standard deviation of 5 runs with different random seeds is indicated in parentheses.

AAV tasks:

| Method | Easy Fitness ↑ | Easy Diversity | Easy Novelty | Medium Fitness ↑ | Medium Diversity | Medium Novelty | Hard Fitness ↑ | Hard Diversity | Hard Novelty |
|---|---|---|---|---|---|---|---|---|---|
| AdaLead | 0.53 (0.0) | 5.7 (0.4) | 3.0 (0.7) | 0.49 (0.0) | 5.3 (0.7) | 6.3 (0.4) | 0.46 (0.0) | 5.4 (1.7) | 6.9 (0.9) |
| CbAS | 0.03 (0.0) | 23.2 (0.2) | 17.5 (0.5) | 0.02 (0.0) | 23.1 (0.1) | 18.3 (0.4) | 0.02 (0.0) | 23.0 (0.2) | 18.5 (0.5) |
| BO | 0.01 (0.0) | 20.1 (0.5) | 22.3 (0.4) | 0.00 (0.0) | 20.4 (0.2) | 21.5 (0.5) | 0.01 (0.0) | 20.5 (0.2) | 20.8 (0.4) |
| GFN-AL | 0.05 (0.0) | 18.2 (3.3) | 21.0 (0.0) | 0.01 (0.0) | 16.9 (2.4) | 21.4 (1.0) | 0.01 (0.0) | 14.4 (6.2) | 21.6 (0.5) |
| PEX | 0.44 (0.0) | 6.33 (1.1) | 4.0 (0.6) | 0.35 (0.0) | 6.86 (0.9) | 4.2 (1.0) | 0.29 (0.0) | 6.43 (0.5) | 3.8 (0.7) |
| GGS | 0.49 (0.0) | 9.0 (0.2) | 8.0 (0.0) | 0.51 (0.0) | 4.9 (0.2) | 5.4 (0.5) | 0.60 (0.0) | 4.5 (0.5) | 7.0 (0.0) |
| ReLSO | 0.18 (0.0) | 1.4 (0.0) | 5.0 (0.0) | 0.18 (0.0) | 15.7 (0.0) | 11.0 (0.0) | 0.03 (0.0) | 18.6 (0.0) | 15.0 (0.0) |
| S-ReLSO | 0.47 (0.0) | 8.6 (0.0) | 4.0 (0.0) | 0.26 (0.0) | 13.7 (0.0) | 7.0 (0.0) | 0.2 (0.0) | 8.3 (0.0) | 11.0 (0.0) |
| GROOT | 0.65 (0.0) | 2.0 (0.7) | 1.0 (0.0) | 0.59 (0.0) | 5.1 (2.3) | 5.4 (0.6) | 0.61 (0.0) | 5.0 (1.0) | 7.0 (0.0) |

GFP tasks:

| Method | Easy Fitness ↑ | Easy Diversity | Easy Novelty | Medium Fitness ↑ | Medium Diversity | Medium Novelty | Hard Fitness ↑ | Hard Diversity | Hard Novelty |
|---|---|---|---|---|---|---|---|---|---|
| AdaLead | 0.68 (0.0) | 7.0 (0.4) | 2.3 (0.4) | 0.52 (0.0) | 8.6 (3.0) | 4.5 (0.5) | 0.47 (0.0) | 8.3 (1.9) | 9.5 (2.7) |
| CbAS | -0.09 (0.0) | 190.6 (1.5) | 171.5 (4.6) | -0.08 (0.0) | 171.5 (59.3) | 200.8 (2.2) | -0.09 (0.0) | 170.0 (29.2) | 202.3 (0.8) |
| BO | -0.06 (0.0) | 55.7 (1.3) | 198.1 (5.6) | -0.04 (0.0) | 58.1 (2.0) | 199.8 (4.4) | -0.03 (0.1) | 59.4 (2.0) | 197.0 (7.6) |
| GFN-AL | 0.27 (0.1) | 83.7 (83.2) | 222.7 (1.2) | 0.27 (0.1) | 65.5 (12.5) | 223.0 (1.0) | 0.19 (0.1) | 54.4 (32.7) | 224.0 (5.4) |
| PEX | 0.62 (0.0) | 6.6 (0.3) | 3.9 (0.2) | 0.46 (0.0) | 7.8 (0.7) | 4.6 (1.0) | 0.13 (0.0) | 12.6 (1.2) | 7.1 (1.1) |
| GGS | 0.84 (0.0) | 5.6 (0.2) | 3.5 (0.2) | 0.76 (0.0) | 3.7 (0.2) | 5.0 (0.0) | 0.74 (0.0) | 3.0 (0.1) | 8.0 (0.0) |
| ReLSO | 0.40 (0.0) | 3.6 (0.0) | 6.0 (0.0) | 0.08 (0.0) | 29.0 (0.0) | 18.0 (0.0) | 0.60 (0.0) | 38.8 (0.0) | 8.0 (0.0) |
| S-ReLSO | 0.72 (0.0) | 108.3 (0.0) | 2.0 (0.0) | 0.64 (0.0) | 32.8 (0.0) | 6.0 (0.0) | 0.80 (0.0) | 10.5 (0.0) | 7.0 (0.0) |
| GROOT | 0.91 (0.0) | 1.9 (0.3) | 1.0 (0.2) | 0.87 (0.0) | 2.7 (0.1) | 5.0 (0.0) | 0.88 (0.0) | 2.7 (0.4) | 6.0 (0.0) |

Appendix D Additional Analyses

D.1. Efficiency of GROOT

In this part, we analyze the time complexity of our smoothing algorithm, excluding the extraction of embeddings from the training data and the training of the surrogate model on the smoothed data. We divide our method into two phases and analyze them individually:

  • Create graph (Algorithm 1): Initially, sampling $x$ from $\mathcal{V}$ and generating Gaussian noise $\epsilon$ are $O(1)$ and $O(d)$ operations, respectively. The interpolation to create $z$ and adding $z$ to $\mathcal{V}$ are $O(d)$ and $O(1)$. These steps occur in each iteration of the while loop, which runs $N-|\mathcal{V}|$ times, resulting in a total complexity of $O((N-|\mathcal{V}|)\cdot d)$. The most significant part of the complexity arises from the edge construction (Algorithm 4). Calculating Euclidean distances between $x$ and all other nodes in $\mathcal{V}$ involves $d$ operations for each of the $|\mathcal{V}|-1$ distances, leading to $O(d\cdot|\mathcal{V}|)$. Retrieving the $K$ closest nodes is $O(|\mathcal{V}|\log K)$, and constructing the neighborhood around $x$ is $O(K)$. Since these steps must be performed for each node in $\mathcal{V}$, the overall time complexity for edge construction is $O(|\mathcal{V}|\cdot(d\cdot|\mathcal{V}|))+O(|\mathcal{V}|\cdot(|\mathcal{V}|\log K))+O(|\mathcal{V}|\cdot K)$, simplifying to $O(d\cdot|\mathcal{V}|^{2})$ since $\log K\ll d$. Given that $|\mathcal{V}|$ eventually reaches $N$, the overall time complexity is $T_{1}=O(d\cdot N^{2})$.

  • Smooth (Algorithm 2): The algorithm begins by constructing the weighted adjacency matrix $A$ as defined in Equation 5. This construction requires the Euclidean distance between each pair of nodes. Since these distances have already been computed in the previous step, the time complexity of this step is $O(N^{2})$. As shown in Equation 4, a label propagation layer calculates the inverse square root of the degree matrix, $D^{-1/2}$, and multiplies it by the adjacency matrix $A$. Because $D$ is diagonal, this inverse can be computed in linear time, $O(N)$. According to Raghavan et al. (2007), label propagation can be interpreted as label exchange across edges, resulting in a time complexity of $O(|\mathcal{E}|)$, significantly less than the naive $O(N^{3})$ matrix multiplication. Thus, the computation time involves $O(N^{2})$ for constructing each of $A$ and $D$, $O(N)$ for the inversion, and $O(m\cdot|\mathcal{E}|)$ for running label propagation over $m$ layers. Moreover, since kNN graphs exhibit sparse structures with numerous sub-communities, the adjacency matrix $A$ is also sparse, and thus $O(m\cdot|\mathcal{E}|)\ll O(m\cdot N^{2})$ in our work. Therefore, the overall time complexity is $T_{2}=2O(N^{2})+O(N)+O(m\cdot|\mathcal{E}|)\approx O(N^{2})$.

In summary, GROOT has a time complexity of $T=T_{1}+T_{2}=O(d\cdot N^{2})+O(N^{2})$. Since $O(N^{2})$ is dominated by $O(d\cdot N^{2})$, $T$ simplifies to $O(d\cdot N^{2})$.
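The node-creation loop analyzed in the first phase can be sketched as follows. This is an illustrative reading of the description above, not the released implementation: the interpolation $z=\beta x+(1-\beta)\epsilon$ follows the form used in Appendix A, while the function name `create_nodes`, the value $\beta=0.8$, and the toy latents are assumptions. The comments mark the per-step costs quoted in the analysis.

```python
import random


def create_nodes(train_latents, n_total, beta, seed=0):
    """Grow the node set V from the training latents until it holds n_total nodes."""
    rng = random.Random(seed)
    nodes = [list(v) for v in train_latents]
    d = len(nodes[0])
    while len(nodes) < n_total:                         # runs N - |V| times
        x = rng.choice(nodes)                           # O(1): sample x from V
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]   # O(d): Gaussian noise
        z = [beta * xi + (1.0 - beta) * ei for xi, ei in zip(x, eps)]  # O(d)
        nodes.append(z)                                 # O(1): add z to V
    return nodes


# Toy example: two 3-dimensional training latents expanded to 10 nodes.
train = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
nodes = create_nodes(train, n_total=10, beta=0.8)
print(len(nodes), len(nodes[0]))  # 10 3
```

Node creation is linear in $N$, so the quadratic cost comes entirely from computing pairwise distances during edge construction.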

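Table 8. Best selected hyperparameters and the corresponding results for each task.

| Task | Difficulty | $N$ | $m$ | $\alpha$ | $k$ | Algorithm | Fitness | Diversity | Novelty |
|---|---|---|---|---|---|---|---|---|---|
| AAV | Harder1 | 4000 | 1 | 0.6 | 4 | L-BFGS | 0.56 (0.0) | 6.2 (1.0) | 13.0 (0.0) |
| AAV | Harder2 | 4000 | 1 | 0.6 | 4 | L-BFGS | 0.51 (0.0) | 6.2 (1.6) | 12.8 (0.5) |
| AAV | Harder3 | 4000 | 1 | 0.6 | 4 | L-BFGS | 0.45 (0.1) | 9.4 (1.6) | 13.2 (0.5) |
| GFP | Harder1 | 18000 | 4 | 0.2 | 4 | Grad. Ascent | 0.89 (0.0) | 3.0 (0.2) | 7.4 (0.6) |
| GFP | Harder2 | 14000 | 4 | 0.2 | 5 | Grad. Ascent | 0.87 (0.0) | 3.1 (0.2) | 7.6 (0.6) |
| GFP | Harder3 | 14000 | 4 | 0.2 | 4 | Grad. Ascent | 0.78 (0.1) | 5.2 (2.4) | 8.0 (0.0) |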

Appendix E Hyperparameters Tuning Process

For all tasks detailed in the main text, we keep the architecture of the VAE and the surrogate model, as well as the hyperparameters for the optimization algorithms, consistent. We then use the Optuna (Akiba et al., 2019) package to tune the hyperparameters for each task. The range of hyperparameters is listed below:

  • Number of nodes $N$: $\{4000,5000,6000,\ldots,20000\}$;

  • Coefficient $\alpha$: $\{0.1,0.15,0.2,\ldots,0.9\}$;

  • Number of propagation layers $m$: $\{1,2,3,4\}$;

  • Number of neighbors $k$: $\{2,3,4,\ldots,8\}$.
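To make the size of this search space concrete, the ranges can be enumerated directly. The sketch below is plain Python rather than Optuna code. The step sizes (1000 for $N$, 0.05 for $\alpha$) are read off the first listed elements of each range, and the dictionary keys are illustrative names.

```python
import math

# Candidate values for each hyperparameter, following the listed ranges.
search_space = {
    "N": list(range(4000, 20001, 1000)),
    "alpha": [round(0.1 + 0.05 * i, 2) for i in range(17)],
    "m": [1, 2, 3, 4],
    "k": list(range(2, 9)),
}

sizes = {name: len(values) for name, values in search_space.items()}
total = math.prod(sizes.values())
print(sizes)  # {'N': 17, 'alpha': 17, 'm': 4, 'k': 7}
print(total)  # 8092
```

An exhaustive grid would need 8092 runs per task, which is why a sampling-based tuner is a natural fit here.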

Table 8 presents the results of our method with the best corresponding hyperparameters for each main task. The AAV dataset requires a smaller graph, with 4,000 nodes and a single propagation layer ($m=1$), to optimize effectively. Conversely, the GFP dataset achieves better performance with a larger graph, with 14,000 to 18,000 nodes and four propagation layers. The other hyperparameters, $\alpha$ and $k$, remain nearly constant across difficulty levels within the same dataset (only $k$ changes, from 4 to 5, on GFP Harder2). These findings indicate that while hyperparameter selection depends strongly on the landscape characteristics of each dataset, it remains stable across varying difficulty levels within the same dataset, thereby reducing the effort needed to find optimal settings.

Appendix F Design-Bench

Tasks

To demonstrate the domain-agnostic nature of our method, we conduct experiments on three tasks from Design-Bench (Trabucco et al., 2021). We exclude domains with highly inaccurate, noisy oracle functions (ChEMBL, Hopper, Superconductor, and TF Bind 10) or those too expensive to evaluate (NAS). D’Kitty and Ant are continuous tasks with input dimensions of 56 and 60, respectively. The goal is to optimize the morphological structure of two simulated robots: Ant (Brockman et al., 2016) to run as fast as possible, and D’Kitty (Ahn et al., 2020) to reach a fixed target location. TF Bind 8 is a discrete task, where the objective is to find the length-8 DNA sequence with maximum binding affinity to the SIX6_REF_R1 transcription factor. The design space consists of sequences over four categorical values, corresponding to the four types of nucleotides. For each task, Design-Bench provides a public dataset, a larger hidden dataset for score normalization, and an exact oracle to evaluate proposed designs. To simulate label-scarcity scenarios, we randomly subsample 1% of the data points in the public set of each task.

Baselines

We compare GROOT with four state-of-the-art models in the offline setting: MINs (Kumar and Levine, 2020), BONET (Mashkaria et al., 2023), BDI (Chen et al., 2022), and ExPT (Nguyen et al., 2023). All baseline results are sourced from those reported by Nguyen et al. (2023).

Evaluation

For each method, we allow an optimization budget of $Q=256$. We report the median, max, and mean scores among the 256 proposed inputs. Following prior works, scores are normalized to $[0,1]$ using the minimum and maximum function values from a large hidden dataset: $y_{\text{norm}}=\frac{y-y_{\min}}{y_{\max}-y_{\min}}$. The mean and standard deviation of the scores are reported across three independent runs for each method.
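The normalization and the three summary statistics are simple to compute. The following is a minimal sketch with hypothetical helper names (`normalize`, `summarize`) and toy numbers. In the actual protocol, $y_{\min}$ and $y_{\max}$ come from the hidden Design-Bench dataset and the list holds the oracle scores of the 256 proposed designs.

```python
from statistics import mean, median


def normalize(y, y_min, y_max):
    """Min-max normalization: (y - y_min) / (y_max - y_min)."""
    return (y - y_min) / (y_max - y_min)


def summarize(raw_scores, y_min, y_max):
    """Median, max, and mean of the normalized scores of the proposed designs."""
    scores = [normalize(y, y_min, y_max) for y in raw_scores]
    return {"median": median(scores), "max": max(scores), "mean": mean(scores)}


# Toy example with four designs instead of Q = 256.
summary = summarize([2.0, 4.0, 6.0, 8.0], y_min=0.0, y_max=8.0)
print(summary)  # {'median': 0.625, 'max': 1.0, 'mean': 0.625}
```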

Implementation Details

For each task, we use the full public dataset to train the VAE, which consists of a 6-layer Transformer encoder with 8 attention heads and a latent dimension size of 128. The decoder is a deep convolutional neural network, as described in Section 3.2. For model-based optimization, we utilize Gradient Ascent with a learning rate of 0.005 for 500 iterations. For latent graph-based smoothing, we use the same hyperparameter ranges listed in Appendix E and employ Optuna to tune the hyperparameters for each task. Table 9 outlines the hyperparameter selection for each task.
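For intuition about the optimizer settings, here is a hedged sketch of plain gradient ascent with the stated learning rate (0.005) and iteration count (500). The surrogate is replaced by a toy concave function $f(z)=-\|z-t\|^{2}$ with a hand-written gradient. The function names and the target $t$ are assumptions, and the actual method ascends the gradient of the trained surrogate in the VAE latent space.

```python
def gradient_ascent(grad_fn, z0, lr=0.005, steps=500):
    """Repeatedly move z along the gradient of the objective."""
    z = list(z0)
    for _ in range(steps):
        g = grad_fn(z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z


# Toy objective f(z) = -||z - target||^2, whose gradient is -2 (z - target).
target = [1.0, -2.0]


def toy_grad(z):
    return [-2.0 * (zi - ti) for zi, ti in zip(z, target)]


z_final = gradient_ascent(toy_grad, [0.0, 0.0])
print(z_final)  # close to the maximizer [1.0, -2.0]
```

On this toy objective each step shrinks the gap to the maximizer by a factor of 0.99, so 500 steps leave under 1% of the initial gap.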

Table 9. Optimal settings of Design-Bench tasks.

| Task | $N$ | $m$ | $\alpha$ | $k$ |
|---|---|---|---|---|
| D’Kitty | 16000 | 3 | 0.3 | 5 |
| Ant | 16000 | 3 | 0.3 | 5 |
| TF Bind 8 | 14000 | 6 | 0.6 | 2 |